
Configure HistGradientBoostingRegressor "early_stopping" Parameter

The early_stopping parameter in scikit-learn’s HistGradientBoostingRegressor controls whether training is stopped early based on a validation score.

Histogram-based Gradient Boosting is an efficient implementation of gradient boosting that uses binning to speed up training. Early stopping can help prevent overfitting and reduce training time by halting the addition of new trees when the validation score stops improving.

When early_stopping is True, a fraction of the training data is set aside as a validation set. The model stops training when the validation score hasn’t improved for a specified number of iterations.
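As a rough sketch (the parameters below are real HistGradientBoostingRegressor options, but the specific values are illustrative rather than recommended), the size of the held-out validation set and the patience of the stopping rule are controlled by validation_fraction, n_iter_no_change, and tol:

from sklearn.ensemble import HistGradientBoostingRegressor

# Illustrative configuration: hold out 10% of the training data for validation
# and stop once the validation score has not improved by more than tol
# for 15 consecutive iterations
model = HistGradientBoostingRegressor(
    early_stopping=True,
    validation_fraction=0.1,  # fraction of training data held out for validation
    n_iter_no_change=15,      # non-improving iterations tolerated before stopping
    tol=1e-7,                 # minimum change in score counted as an improvement
    random_state=42,
)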

The default value for early_stopping is ‘auto’, which enables early stopping if the dataset has more than 10,000 samples. Setting it to True always enables early stopping, while False disables it.
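As a minimal sketch of the 'auto' behavior (the 1,000-sample dataset below is deliberately under the 10,000-sample threshold, and checking the documented validation_score_ attribute is one way to tell whether a validation set was actually used):

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

# Small dataset: with early_stopping='auto', early stopping stays disabled
X_small, y_small = make_regression(n_samples=1000, n_features=20, random_state=42)
model = HistGradientBoostingRegressor(early_stopping='auto', random_state=42)
model.fit(X_small, y_small)

# validation_score_ is empty when no early stopping was performed
print(len(model.validation_score_) > 0)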

In practice, early stopping is often beneficial for large datasets or when training time is a concern.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=20000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different early_stopping settings
early_stopping_values = [True, False, 'auto']
results = []

for early_stop in early_stopping_values:
    start_time = time.time()
    model = HistGradientBoostingRegressor(max_iter=1000, early_stopping=early_stop, random_state=42)
    model.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    n_iter = model.n_iter_

    results.append((early_stop, mse, train_time, n_iter))
    print(f"early_stopping={early_stop}, MSE: {mse:.3f}, Training Time: {train_time:.2f}s, Iterations: {n_iter}")

Running the example gives an output like:

early_stopping=True, MSE: 622.397, Training Time: 1.50s, Iterations: 479
early_stopping=False, MSE: 560.420, Training Time: 2.89s, Iterations: 1000
early_stopping=auto, MSE: 622.397, Training Time: 1.52s, Iterations: 479

The key steps in this example are:

  1. Generate a synthetic regression dataset
  2. Split the data into train and test sets
  3. Train HistGradientBoostingRegressor models with different early_stopping settings
  4. Evaluate the mean squared error of each model on the test set
  5. Compare training time and number of iterations for each setting
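To see where the early-stopped runs decided to halt, the per-iteration scores recorded on the held-out validation set can be inspected through the validation_score_ attribute (a minimal sketch that refits the early_stopping=True model from the example above):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=20000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = HistGradientBoostingRegressor(max_iter=1000, early_stopping=True, random_state=42)
model.fit(X_train, y_train)

# validation_score_ stores the score on the held-out validation set at each
# iteration; the curve flattens out around the point where training stopped
print(f"Stopped after {model.n_iter_} of 1000 iterations")
print(f"Last 5 recorded validation scores: {model.validation_score_[-5:]}")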

Some tips for configuring early_stopping:

  - Leave early_stopping at 'auto' unless you have a reason not to; early stopping then kicks in automatically once the dataset has more than 10,000 samples.
  - Tune validation_fraction (default 0.1) to control how much training data is held out for the validation check.
  - Increase n_iter_no_change (default 10) or lower tol if training stops before the model has converged.
  - Use the scoring parameter to choose the metric used for the early stopping check; by default the model's loss is used.

Issues to consider:

  - Early stopping reserves part of the training data for validation, so slightly less data is available for fitting the trees.
  - Stopping too early can cost some accuracy, as the higher MSE for early_stopping=True in the output above shows.
  - The validation split is random, so set random_state for reproducible results.


See Also