The `warm_start` parameter in scikit-learn's `HistGradientBoostingRegressor` allows incremental fitting of the model by reusing the solution of the previous call to `fit`.
`HistGradientBoostingRegressor` is a gradient boosting algorithm built on histogram-based decision trees. It is designed for efficiency and can handle large datasets.
When `warm_start` is set to `True`, the model can be trained incrementally, adding more boosting iterations to an already fitted model. This is particularly useful when you want to grow the ensemble in stages, for example to monitor validation error as iterations are added, or in online-style workflows where training proceeds in batches.
The default value for `warm_start` is `False`. It is commonly set to `True` when you want to continue training a model, increasing the number of iterations without starting from scratch.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create models with warm_start=False (cold) and warm_start=True (warm)
model_cold = HistGradientBoostingRegressor(max_iter=10, random_state=42)
model_warm = HistGradientBoostingRegressor(max_iter=10, warm_start=True, random_state=42)

# Train models incrementally, evaluating at each step
n_iterations = [10, 20, 30, 40, 50]
for n in n_iterations:
    model_cold.set_params(max_iter=n)
    model_cold.fit(X_train, y_train)  # refits from scratch each time
    y_pred_cold = model_cold.predict(X_test)
    mse_cold = mean_squared_error(y_test, y_pred_cold)

    model_warm.set_params(max_iter=n)
    model_warm.fit(X_train, y_train)  # adds iterations to the existing model
    y_pred_warm = model_warm.predict(X_test)
    mse_warm = mean_squared_error(y_test, y_pred_warm)

    print(f"Iterations: {n}")
    print(f"Cold MSE: {mse_cold:.4f}")
    print(f"Warm MSE: {mse_warm:.4f}")
    print()
```
Running the example gives an output like:
```
Iterations: 10
Cold MSE: 6192.5739
Warm MSE: 6192.5739

Iterations: 20
Cold MSE: 3390.2090
Warm MSE: 3390.2090

Iterations: 30
Cold MSE: 2137.6786
Warm MSE: 2137.6786

Iterations: 40
Cold MSE: 1547.2405
Warm MSE: 1547.2405

Iterations: 50
Cold MSE: 1306.1602
Warm MSE: 1306.1602
```
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Create two `HistGradientBoostingRegressor` models: one with `warm_start=False`, one with `warm_start=True`
- Train both models incrementally, increasing the number of iterations
- Evaluate the models’ performance using mean squared error at each step
Tips and heuristics for using `warm_start`:
- Use `warm_start=True` when you want to continue training an already fitted model without starting over
- It suits online-style workflows where training proceeds in stages as compute or time becomes available
- It can be combined with early stopping to find the optimal number of iterations
Issues to consider:
- With `warm_start=True`, changing other hyperparameters between calls to `fit` may reset or invalidate the previously fitted model
- Incremental fitting may lead to different results compared to fitting all at once; the scikit-learn documentation notes that, for results to be valid, the estimator should be re-trained on the same data only
- Memory usage increases with the number of iterations, so monitor resource usage