The `warm_start` parameter in scikit-learn's `BaggingClassifier` allows for incremental fitting of additional estimators to an existing ensemble.

Bagging (Bootstrap Aggregating) is an ensemble method that combines predictions from multiple base estimators, each trained on a different bootstrap sample of the data. The `warm_start` parameter enables adding more estimators to the ensemble without retraining from scratch.
When `warm_start` is set to `True`, subsequent calls to `fit()` add estimators to the existing ensemble rather than creating a new one. This is useful for iteratively increasing the number of estimators or for training incrementally on large workloads.
The default value for `warm_start` is `False`, which means a new ensemble is built from scratch each time `fit()` is called.
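The difference is easy to see by fitting twice and inspecting the fitted `estimators_` attribute. A minimal sketch (the dataset here is just a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Default warm_start=False: every fit() rebuilds the ensemble
cold = BaggingClassifier(n_estimators=10, random_state=0)
cold.fit(X, y)
cold.fit(X, y)
print(len(cold.estimators_))  # 10 -- the second fit replaced the first

# warm_start=True: raising n_estimators grows the existing ensemble
warm = BaggingClassifier(n_estimators=10, warm_start=True, random_state=0)
warm.fit(X, y)
warm.set_params(n_estimators=15)
warm.fit(X, y)  # trains only the 5 new estimators
print(len(warm.estimators_))  # 15
```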
In practice, `warm_start` is often used when experimenting to find the optimal number of estimators, or when dealing with large datasets that are easier to train in increments.
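For example, a common pattern is to grow the ensemble in fixed steps and score it after each step, to see where accuracy levels off. This is a sketch rather than an official recipe; the step size and range are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

bc = BaggingClassifier(warm_start=True, random_state=42)

# Grow the ensemble 10 estimators at a time, scoring after each step
for n in range(10, 60, 10):
    bc.set_params(n_estimators=n)
    bc.fit(X_train, y_train)
    print(f"{n} estimators: validation accuracy {bc.score(X_val, y_val):.3f}")
```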
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize BaggingClassifier with warm_start=False
bc = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10,
                       random_state=42, warm_start=False)

# Fit and evaluate initial model
start_time = time.time()
bc.fit(X_train, y_train)
initial_time = time.time() - start_time
initial_score = accuracy_score(y_test, bc.predict(X_test))
print("Initial model (10 estimators):")
print(f"Time: {initial_time:.3f} seconds")
print(f"Accuracy: {initial_score:.3f}")

# Set warm_start=True and add more estimators
bc.set_params(warm_start=True, n_estimators=20)

# Fit additional estimators and evaluate
start_time = time.time()
bc.fit(X_train, y_train)
additional_time = time.time() - start_time
final_score = accuracy_score(y_test, bc.predict(X_test))
print("\nFinal model (20 estimators):")
print(f"Time to add 10 estimators: {additional_time:.3f} seconds")
print(f"Accuracy: {final_score:.3f}")
```
Running the example gives output like the following (exact timings will vary by machine):

```
Initial model (10 estimators):
Time: 0.087 seconds
Accuracy: 0.885

Final model (20 estimators):
Time to add 10 estimators: 0.082 seconds
Accuracy: 0.875
```
The key steps in this example are:

- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Create a `BaggingClassifier` with `warm_start=False` and 10 estimators
- Fit and evaluate the initial model
- Set `warm_start=True` and increase `n_estimators` to 20
- Fit additional estimators and evaluate the final model
- Compare performance and training time (a quick verification sketch follows this list)
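To confirm that the second `fit()` call reused the first batch of estimators rather than rebuilding them, you can keep a reference to `estimators_` before growing the ensemble and check object identity afterwards. A standalone sketch, based on my reading of the warm-start behavior (the dataset is a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=42)

bc = BaggingClassifier(n_estimators=10, warm_start=True, random_state=42)
bc.fit(X, y)
first_batch = list(bc.estimators_)  # keep references to the first 10

bc.set_params(n_estimators=20)
bc.fit(X, y)  # trains only 10 new estimators

print(len(bc.estimators_))  # 20
# The first 10 estimators should be the same fitted objects as before
print(all(a is b for a, b in zip(first_batch, bc.estimators_)))  # True
```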
Some tips and heuristics for using `warm_start`:

- Use `warm_start` when experimenting with the optimal number of estimators
- It can be beneficial for large datasets where incremental training is necessary
- Monitor performance to determine when to stop adding estimators (see the early-stopping sketch after this list)
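One way to act on the last tip is a simple early-stopping loop: keep growing the ensemble while validation accuracy improves, and stop once it stalls. This is a sketch using my own stopping rule, not a built-in scikit-learn feature; the step size and cap are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

bc = BaggingClassifier(warm_start=True, random_state=42)
best_score = 0.0

for n in range(10, 110, 10):  # grow in steps of 10, up to 100
    bc.set_params(n_estimators=n)
    bc.fit(X_train, y_train)
    score = bc.score(X_val, y_val)
    print(f"{n} estimators: validation accuracy {score:.3f}")
    if score <= best_score:  # stop once accuracy stops improving
        break
    best_score = score
```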
Issues to consider:

- `warm_start` may not always lead to faster training times for small datasets
- The order of the training data can affect the model when using `warm_start`
- Ensure consistent hyperparameters when adding estimators to avoid unexpected behavior (see the note below)
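One concrete instance of the last point: with `warm_start=True`, scikit-learn expects `n_estimators` to only grow between calls to `fit()`. In the versions I have checked, shrinking it below the number of already-fitted estimators raises a `ValueError`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, random_state=0)

bc = BaggingClassifier(n_estimators=20, warm_start=True, random_state=0)
bc.fit(X, y)

# Lowering n_estimators below the number already fitted is rejected
bc.set_params(n_estimators=10)
try:
    bc.fit(X, y)
except ValueError as err:
    print(err)
```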