The warm_start parameter in scikit-learn's HistGradientBoostingClassifier allows for incremental fitting by reusing the solution of the previous call to fit.

HistGradientBoostingClassifier is a fast implementation of gradient boosting trees, using histogram-based algorithms for efficient training. It builds an ensemble of decision trees sequentially, with each tree correcting errors made by the previous ones.
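For context, a minimal fit/predict round trip looks like the sketch below; the small synthetic dataset and variable names are chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Small synthetic dataset, used only to illustrate the basic API
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Defaults: max_iter=100 boosting iterations, learning_rate=0.1
clf = HistGradientBoostingClassifier(random_state=0)
clf.fit(X, y)

print(clf.n_iter_)      # number of boosting iterations actually performed
print(clf.score(X, y))  # mean accuracy on the training data
```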
When warm_start is set to True, subsequent calls to fit() with a larger max_iter add more estimators to the existing ensemble instead of training a new model from scratch. This can be useful for fine-tuning the number of estimators or for updating the model in stages.
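A minimal sketch of that mechanic (the dataset size and variable names here are illustrative, not taken from the example further down): raising max_iter between calls to fit() grows the existing ensemble, which you can verify through the n_iter_ attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative dataset; small enough that built-in early stopping stays off
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

model = HistGradientBoostingClassifier(warm_start=True, max_iter=50, random_state=42)
model.fit(X, y)
print(model.n_iter_)  # 50 boosting iterations after the first fit

# Raise max_iter and refit: only the 50 additional iterations are trained
model.set_params(max_iter=100)
model.fit(X, y)
print(model.n_iter_)  # 100 iterations, built on top of the first 50
```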
The default value for warm_start is False, meaning each call to fit() trains a new ensemble from scratch. Setting it to True is common when you want to incrementally add estimators or to search for a good number of iterations without repeating earlier work. The example below compares training time and accuracy for a cold-start model and a warm-start model as the number of iterations grows.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
model_cold = HistGradientBoostingClassifier(random_state=42)
model_warm = HistGradientBoostingClassifier(warm_start=True, random_state=42)

# Training with incremental fitting
n_estimators_list = [10, 20, 50, 100]
cold_times, warm_times = [], []
cold_scores, warm_scores = [], []

for n_estimators in n_estimators_list:
    # Cold start
    start_time = time.time()
    model_cold.set_params(max_iter=n_estimators)
    model_cold.fit(X_train, y_train)
    cold_times.append(time.time() - start_time)
    cold_scores.append(accuracy_score(y_test, model_cold.predict(X_test)))

    # Warm start
    start_time = time.time()
    model_warm.set_params(max_iter=n_estimators)
    model_warm.fit(X_train, y_train)
    warm_times.append(time.time() - start_time)
    warm_scores.append(accuracy_score(y_test, model_warm.predict(X_test)))

# Print results
for i, n_estimators in enumerate(n_estimators_list):
    print(f"n_estimators={n_estimators}:")
    print(f" Cold start - Time: {cold_times[i]:.3f}s, Accuracy: {cold_scores[i]:.3f}")
    print(f" Warm start - Time: {warm_times[i]:.3f}s, Accuracy: {warm_scores[i]:.3f}")
```
Running the example gives an output like:

```
n_estimators=10:
 Cold start - Time: 0.058s, Accuracy: 0.890
 Warm start - Time: 0.053s, Accuracy: 0.890
n_estimators=20:
 Cold start - Time: 0.084s, Accuracy: 0.915
 Warm start - Time: 0.053s, Accuracy: 0.915
n_estimators=50:
 Cold start - Time: 0.186s, Accuracy: 0.935
 Warm start - Time: 0.116s, Accuracy: 0.935
n_estimators=100:
 Cold start - Time: 0.312s, Accuracy: 0.943
 Warm start - Time: 0.168s, Accuracy: 0.943
```
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into train and test sets
- Create two HistGradientBoostingClassifier models, one with the default warm_start=False and one with warm_start=True
- Train both models incrementally, increasing the number of estimators in stages
- Compare training times and accuracy scores for each stage
Some tips for using warm_start:
- Use warm_start=True when you want to fine-tune the number of estimators without retraining from scratch
- Combine it with early stopping to find a good number of iterations automatically (see the sketch after this list)
- It can help when you periodically refit the model in stages, although HistGradientBoostingClassifier does not offer true incremental learning via partial_fit
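As mentioned in the second tip, one way to pair warm_start with early stopping is a manual staged loop: grow the ensemble in fixed chunks and stop once the held-out score stops improving. The chunk size, patience, and dataset below are assumptions for the sketch, and this is separate from the estimator's built-in early_stopping option.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = HistGradientBoostingClassifier(warm_start=True, random_state=42)

best_score, best_iter = -1.0, 0
patience, stalls = 2, 0
for max_iter in range(25, 301, 25):   # grow the ensemble in chunks of 25 iterations
    model.set_params(max_iter=max_iter)
    model.fit(X_train, y_train)       # warm start: only the new chunk is trained
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_iter, stalls = score, model.n_iter_, 0
    else:
        stalls += 1
        if stalls >= patience:        # no improvement for `patience` stages: stop
            break

print(f"Best validation accuracy {best_score:.3f} at about {best_iter} iterations")
```

HistGradientBoostingClassifier also has a built-in early_stopping parameter (with validation_fraction and n_iter_no_change); the manual loop is mainly useful when you want direct control over the validation split, metric, or schedule.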
Issues to consider:
- Models with warm_start=True retain the previous solution, which may increase memory usage
- Changing certain parameters (e.g., learning_rate) between fit() calls can lead to unexpected results; a defensive pattern for this is sketched below
- Warm starting may not always converge to the same solution as a single fit with the total number of iterations
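For the second issue, one defensive pattern is to change only max_iter between warm-started fits and to start from a fresh, unfitted copy whenever another hyperparameter such as learning_rate must change. A rough sketch with assumed values:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

model = HistGradientBoostingClassifier(warm_start=True, max_iter=50,
                                       learning_rate=0.1, random_state=0)
model.fit(X, y)

# Safe: only max_iter changes, so the existing trees are simply extended
model.set_params(max_iter=100)
model.fit(X, y)

# Changing learning_rate mid-stream would mix trees built under different
# settings; clone() instead gives an unfitted copy to retrain from scratch
fresh = clone(model).set_params(warm_start=False, learning_rate=0.05)
fresh.fit(X, y)
```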