The warm_start parameter in scikit-learn's HistGradientBoostingClassifier allows for incremental fitting by reusing the solution of the previous call to fit.

HistGradientBoostingClassifier is a fast implementation of gradient boosting trees, using histogram-based algorithms for efficient training. It builds an ensemble of decision trees sequentially, with each tree correcting errors made by the previous ones.
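For context, a minimal fit/predict round trip looks like the sketch below; the small synthetic dataset and variable names are chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Small synthetic dataset, used only to illustrate the basic API
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Defaults: max_iter=100 boosting iterations, learning_rate=0.1
clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(X, y)

print(clf.n_iter_)      # number of boosting iterations actually performed
print(clf.score(X, y))  # mean accuracy on the training data
```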
When warm_start is set to True, subsequent calls to fit() with a larger max_iter add more estimators to the existing ensemble instead of training a new model from scratch. This can be useful for fine-tuning the number of estimators or for updating the model in stages.
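A minimal sketch of that mechanic (the dataset size and variable names here are illustrative, not taken from the example further down): raising max_iter between calls to fit() grows the existing ensemble, which you can verify through the n_iter_ attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative dataset; small enough that built-in early stopping stays off
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

model = HistGradientBoostingClassifier(warm_start=True, max_iter=50, random_state=42)
model.fit(X, y)
print(model.n_iter_)  # 50 boosting iterations after the first fit

# Raise max_iter and refit: only the 50 additional iterations are trained
model.set_params(max_iter=100)
model.fit(X, y)
print(model.n_iter_)  # 100 iterations, built on top of the first 50
```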
The default value for warm_start is False, meaning each call to fit() trains a new ensemble from scratch. Setting it to True is common when you want to incrementally add estimators or to search for a good number of iterations without repeating earlier work. The example below compares training time and accuracy for a cold-start model and a warm-start model as the number of iterations grows.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
model_cold = HistGradientBoostingClassifier(random_state=42)
model_warm = HistGradientBoostingClassifier(warm_start=True, random_state=42)

# Training with incremental fitting
n_estimators_list = [10, 20, 50, 100]
cold_times, warm_times = [], []
cold_scores, warm_scores = [], []

for n_estimators in n_estimators_list:
    # Cold start
    start_time = time.time()
    model_cold.set_params(max_iter=n_estimators)
    model_cold.fit(X_train, y_train)
    cold_times.append(time.time() - start_time)
    cold_scores.append(accuracy_score(y_test, model_cold.predict(X_test)))

    # Warm start
    start_time = time.time()
    model_warm.set_params(max_iter=n_estimators)
    model_warm.fit(X_train, y_train)
    warm_times.append(time.time() - start_time)
    warm_scores.append(accuracy_score(y_test, model_warm.predict(X_test)))

# Print results
for i, n_estimators in enumerate(n_estimators_list):
    print(f"n_estimators={n_estimators}:")
    print(f" Cold start - Time: {cold_times[i]:.3f}s, Accuracy: {cold_scores[i]:.3f}")
    print(f" Warm start - Time: {warm_times[i]:.3f}s, Accuracy: {warm_scores[i]:.3f}")
```
Running the example gives an output like:

```
n_estimators=10:
 Cold start - Time: 0.058s, Accuracy: 0.890
 Warm start - Time: 0.053s, Accuracy: 0.890
n_estimators=20:
 Cold start - Time: 0.084s, Accuracy: 0.915
 Warm start - Time: 0.053s, Accuracy: 0.915
n_estimators=50:
 Cold start - Time: 0.186s, Accuracy: 0.935
 Warm start - Time: 0.116s, Accuracy: 0.935
n_estimators=100:
 Cold start - Time: 0.312s, Accuracy: 0.943
 Warm start - Time: 0.168s, Accuracy: 0.943
```
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into train and test sets
- Create two HistGradientBoostingClassifier models, one with the default warm_start=False and one with warm_start=True
- Train both models incrementally, increasing the number of estimators in stages
- Compare training times and accuracy scores for each stage
Some tips for using warm_start:
- Use warm_start=True when you want to fine-tune the number of estimators without retraining from scratch
- Combine it with early stopping to find a good number of iterations automatically (see the sketch after this list)
- It can help when you periodically refit the model in stages, although HistGradientBoostingClassifier does not offer true incremental learning via partial_fit
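As mentioned in the second tip, one way to pair warm_start with early stopping is a manual staged loop: grow the ensemble in fixed chunks and stop once the held-out score stops improving. The chunk size, patience, and dataset below are assumptions for the sketch, and this is separate from the estimator's built-in early_stopping option.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = HistGradientBoostingClassifier(warm_start=True, random_state=42)

best_score, best_iter = -1.0, 0
patience, stalls = 2, 0
for max_iter in range(25, 301, 25):   # grow the ensemble in chunks of 25 iterations
    model.set_params(max_iter=max_iter)
    model.fit(X_train, y_train)       # warm start: only the new chunk is trained
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_iter, stalls = score, model.n_iter_, 0
    else:
        stalls += 1
        if stalls >= patience:        # no improvement for `patience` stages: stop
            break

print(f"Best validation accuracy {best_score:.3f} at about {best_iter} iterations")
```

HistGradientBoostingClassifier also has a built-in early_stopping parameter (with validation_fraction and n_iter_no_change); the manual loop is mainly useful when you want direct control over the validation split, metric, or schedule.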
Issues to consider:
- Models with warm_start=True retain the previous solution, which may increase memory usage
- Changing certain parameters (e.g., learning_rate) between fit() calls can lead to unexpected results; a defensive pattern for this is sketched below
- Warm starting may not always converge to the same solution as a single fit with the total number of iterations
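For the second issue, one defensive pattern is to change only max_iter between warm-started fits and to start from a fresh, unfitted copy whenever another hyperparameter such as learning_rate must change. A rough sketch with assumed values:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

model = HistGradientBoostingClassifier(warm_start=True, max_iter=50,
                                       learning_rate=0.1, random_state=0)
model.fit(X, y)

# Safe: only max_iter changes, so the existing trees are simply extended
model.set_params(max_iter=100)
model.fit(X, y)

# Changing learning_rate mid-stream would mix trees built under different
# settings; clone() instead gives an unfitted copy to retrain from scratch
fresh = clone(model).set_params(warm_start=False, learning_rate=0.05)
fresh.fit(X, y)
```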