The warm_start parameter in scikit-learn's GradientBoostingRegressor allows the reuse of the solution of the previous call to fit and adds more estimators to the ensemble, which can be useful for continuing training and for parameter tuning.
Gradient Boosting is a powerful ensemble learning technique that builds models sequentially, with each new model attempting to correct the errors made by the previous ones. The warm_start parameter enables efficient updates to the existing model without retraining from scratch.
By default, warm_start is set to False, meaning the model does not retain the state of previous fits. The parameter is boolean: set it to True to enable warm starting, or leave it at False for the default behavior.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Split train into train and new
X_train, X_new, y_train, y_new = train_test_split(X_train, y_train, test_size=0.5, random_state=42)
# Train with warm_start=False
gbr = GradientBoostingRegressor(n_estimators=100, warm_start=False, random_state=42)
gbr.fit(X_train, y_train)
# Predict and evaluate
y_pred = gbr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"warm_start=False, MSE: {mse:.3f}")
# Update the model with new data via warm_start=True
gbr.set_params(warm_start=True, n_estimators=200)
gbr.fit(X_new, y_new)
# Predict and evaluate
y_pred = gbr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"warm_start=True, Updated n_estimators=200, MSE: {mse:.3f}")
Running the example gives output like the following (exact values depend on the data splits and scikit-learn version):
warm_start=False, MSE: 1615.174
warm_start=True, Updated n_estimators=200, MSE: 146.373
The key steps in this example are:
- Generate a synthetic regression dataset with informative features and some noise.
- Split the data into train and test sets, then split the training set again to hold out a portion as new data.
- Train a GradientBoostingRegressor model with warm_start set to False.
- Update the same model with warm_start set to True and increase the number of estimators, then continue training with the new data.
- Evaluate the mean squared error (MSE) of the updated model on the test set.
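The main practical benefit of warm_start is avoiding a full refit when growing the ensemble. The following sketch, which assumes the X_train and y_train variables from the example above, times the warm-started continuation from 100 to 200 estimators against fitting a fresh 200-estimator model; the continuation only has to fit the 100 new trees.
import time
# A 100-tree model we want to extend to 200 trees
base = GradientBoostingRegressor(n_estimators=100, warm_start=True, random_state=42)
base.fit(X_train, y_train)
# Option A: continue the existing model (fits only the 100 new trees)
start = time.perf_counter()
base.set_params(n_estimators=200)
base.fit(X_train, y_train)
continue_time = time.perf_counter() - start
# Option B: retrain a 200-tree model from scratch
start = time.perf_counter()
scratch = GradientBoostingRegressor(n_estimators=200, random_state=42)
scratch.fit(X_train, y_train)
scratch_time = time.perf_counter() - start
print(f"Continuation: {continue_time:.2f}s, from scratch: {scratch_time:.2f}s")
On this small dataset the absolute savings are modest, but the gap grows with the cost of fitting each tree.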
Some tips and heuristics for setting warm_start:
- Use warm_start to continue training and add more estimators without retraining from scratch.
- It is ideal for scenarios where incremental learning is beneficial.
- Monitor the performance after adding more estimators to ensure it improves without overfitting; a sketch of such a monitoring loop follows this list.
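A minimal sketch of the monitoring loop suggested above, assuming the X_train, X_test, y_train, and y_test variables from the example: estimators are added in batches of 50, and training stops once the test MSE stops improving. (Using the test set here keeps the sketch short; in practice a separate validation set is the safer choice.)
gbr = GradientBoostingRegressor(n_estimators=50, warm_start=True, random_state=42)
best_mse = float("inf")
for n in range(50, 501, 50):
    gbr.set_params(n_estimators=n)
    gbr.fit(X_train, y_train)  # fits only the newly added trees
    mse = mean_squared_error(y_test, gbr.predict(X_test))
    print(f"n_estimators={n}, MSE: {mse:.3f}")
    if mse >= best_mse:  # stop once the error stops improving
        break
    best_mse = mse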
Issues to consider:
- Using warm_start may lead to overfitting if not monitored properly; the staged_predict sketch after this list shows one way to check.
- Ensure the additional training steps are needed and beneficial for the model's performance.
- Computational resources should be considered when adding more estimators.
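One way to check the overfitting concern above without any refitting is scikit-learn's staged_predict, which yields predictions after each boosting stage. A short sketch, assuming the fitted gbr and the X_test and y_test variables from the example:
import numpy as np
# MSE on the test set after each boosting stage
stage_mse = [mean_squared_error(y_test, y_stage)
             for y_stage in gbr.staged_predict(X_test)]
best_stage = int(np.argmin(stage_mse)) + 1  # stages are 1-indexed
print(f"Lowest test MSE {min(stage_mse):.3f} at stage {best_stage} of {len(stage_mse)}")
If the best stage is well below the current n_estimators, the later trees are not helping and the ensemble can be kept smaller.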