The early_stopping parameter in scikit-learn’s HistGradientBoostingRegressor controls whether training stops early based on a validation score.
Histogram-based Gradient Boosting is an efficient implementation of gradient boosting that uses binning to speed up training. Early stopping can help prevent overfitting and reduce training time by halting the addition of new trees when the validation score stops improving.
When early_stopping is True, a fraction of the training data is set aside as a validation set. The model stops training when the validation score hasn’t improved for a specified number of iterations.
The default value for early_stopping is ‘auto’, which enables early stopping only if the dataset has more than 10,000 samples. Setting it to True always enables early stopping, while False disables it.
In practice, early stopping is often beneficial for large datasets or when training time is a concern.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=20000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different early_stopping settings
early_stopping_values = [True, False, 'auto']
results = []

for early_stop in early_stopping_values:
    start_time = time.time()
    model = HistGradientBoostingRegressor(max_iter=1000, early_stopping=early_stop, random_state=42)
    model.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    n_iter = model.n_iter_
    results.append((early_stop, mse, train_time, n_iter))
    print(f"early_stopping={early_stop}, MSE: {mse:.3f}, Training Time: {train_time:.2f}s, Iterations: {n_iter}")
```
Running the example gives an output like:

```
early_stopping=True, MSE: 622.397, Training Time: 1.50s, Iterations: 479
early_stopping=False, MSE: 560.420, Training Time: 2.89s, Iterations: 1000
early_stopping=auto, MSE: 622.397, Training Time: 1.52s, Iterations: 479
```
The key steps in this example are:

- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different early_stopping settings
- Evaluate the mean squared error of each model on the test set
- Compare training time and number of iterations for each setting
Some tips for configuring early_stopping:

- Use early stopping for large datasets to reduce training time and prevent overfitting
- Monitor the number of iterations to ensure the model isn’t stopping too early
- Consider increasing n_iter_no_change if early stopping occurs too soon
- Use validation_fraction to control the size of the validation set
Issues to consider:
- Early stopping may result in underfitting if stopped too soon
- The effectiveness depends on the quality of the validation set
- Disabling early stopping may lead to overfitting and longer training times
- The ‘auto’ setting may not always be optimal for your specific use case