The `warm_start` parameter in scikit-learn's `ExtraTreesClassifier` allows for incremental learning by adding trees to an existing forest.
Extra Trees Classifier is an ensemble method that builds a forest of unpruned decision trees. It’s similar to Random Forest but with two key differences: it splits nodes by choosing cut-points fully at random and uses the whole learning sample to grow the trees.
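To make that contrast concrete, here is a minimal sketch comparing the two ensembles under their scikit-learn defaults: `ExtraTreesClassifier` defaults to `bootstrap=False` (each tree sees the whole learning sample), while `RandomForestClassifier` defaults to `bootstrap=True`. The dataset and the comparison itself are illustrative, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Extra Trees: whole learning sample per tree (bootstrap=False by default),
# split thresholds drawn at random for each candidate feature
et = ExtraTreesClassifier(n_estimators=100, random_state=0)

# Random Forest: bootstrap samples per tree (bootstrap=True by default),
# best threshold searched among the candidate features at each split
rf = RandomForestClassifier(n_estimators=100, random_state=0)

print("Extra Trees CV accuracy:  ", cross_val_score(et, X, y).mean())
print("Random Forest CV accuracy:", cross_val_score(rf, X, y).mean())
```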
The `warm_start` parameter, when set to `True`, allows you to fit additional trees to an existing forest rather than creating a new forest from scratch each time you fit the model.
By default, `warm_start` is set to `False`. It's commonly set to `True` when you want to incrementally train your model on new data without discarding previously learned information.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=0, random_state=42)

# Split into initial training set and new data
X_initial, X_new, y_initial, y_new = train_test_split(X, y, test_size=0.5, random_state=42)

# Create ExtraTreesClassifier with warm_start=False
clf = ExtraTreesClassifier(n_estimators=50, random_state=42, warm_start=False)

# Train on initial data and evaluate on the held-out new data
clf.fit(X_initial, y_initial)
y_pred = clf.predict(X_new)
initial_accuracy = accuracy_score(y_new, y_pred)
initial_n_trees = len(clf.estimators_)
print(f"Initial training - Accuracy: {initial_accuracy:.3f}, Trees: {initial_n_trees}")

# Set warm_start=True and raise n_estimators to the new total (100),
# so the next fit adds 50 trees instead of rebuilding the forest
clf.set_params(warm_start=True, n_estimators=100)
clf.fit(X_new, y_new)

# Note: X_new has now been used for training, so this is in-sample accuracy
y_pred = clf.predict(X_new)
final_accuracy = accuracy_score(y_new, y_pred)
final_n_trees = len(clf.estimators_)
print(f"After incremental learning - Accuracy: {final_accuracy:.3f}, Trees: {final_n_trees}")
```
Running the example gives output like:

```
Initial training - Accuracy: 0.912, Trees: 50
After incremental learning - Accuracy: 1.000, Trees: 100
```

Read the perfect final score with care: after the incremental fit, `X_new` has become training data for the added trees, so the second accuracy is in-sample rather than a held-out estimate.
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into an initial training set and new data
- Train an `ExtraTreesClassifier` with `warm_start=False` on the initial data
- Set `warm_start=True` and train on the new data, adding more trees
- Compare the accuracy and number of trees before and after incremental learning
Some tips for using `warm_start`:

- Use `warm_start=True` when you want to add trees to an existing forest
- Increase `n_estimators` to the new total number of trees; the difference from the previous value is how many trees the next `fit` adds
- Monitor performance to ensure the model is still improving with new data (see the sketch after this list)
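As a minimal sketch of the last two tips, you can grow the forest in stages and check held-out accuracy after each increment. The step size, validation split, and dataset here are illustrative choices, not part of the original example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=42)

clf = ExtraTreesClassifier(n_estimators=25, warm_start=True, random_state=42)

# Grow the forest in stages of 25 trees, checking validation accuracy each time
for total_trees in range(25, 201, 25):
    clf.set_params(n_estimators=total_trees)  # running total, not trees added
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"{total_trees:3d} trees - validation accuracy: {acc:.3f}")
```

Because `n_estimators` is a running total, each `set_params` call adds 25 trees to the existing forest rather than retraining it, and you can stop growing once the validation score plateaus.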
Issues to consider:

- Memory usage increases with the number of trees
- The model may become biased towards more recently seen data
- It's important to shuffle your data when using `warm_start` to avoid order-dependent results (a sketch follows this list)
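Here is a minimal sketch of that shuffling advice, assuming the incremental batches are taken as contiguous slices of one array; the dataset and batch sizes are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.utils import shuffle

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Shuffle once up front so each incremental batch is a representative
# sample rather than a contiguous, possibly ordered, slice of the data
X, y = shuffle(X, y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X[:500], y[:500])          # first batch grows 50 trees

clf.set_params(n_estimators=100)   # raise the total to add 50 more
clf.fit(X[500:], y[500:])          # second batch grows the new trees
print(len(clf.estimators_))        # 100
```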