The early_stopping parameter in scikit-learn’s HistGradientBoostingRegressor controls whether training stops early based on a validation score.
Histogram-based Gradient Boosting is an efficient implementation of gradient boosting that uses binning to speed up training. Early stopping can help prevent overfitting and reduce training time by halting the addition of new trees when the validation score stops improving.
When early_stopping is True, a fraction of the training data is set aside as a validation set. The model stops training when the validation score hasn’t improved for a specified number of iterations.
The default value for early_stopping is ‘auto’, which enables early stopping only if the dataset has more than 10,000 samples. Setting it to True always enables early stopping, while False disables it.
In practice, early stopping is often beneficial for large datasets or when training time is a concern.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=20000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different early_stopping settings
early_stopping_values = [True, False, 'auto']
results = []

for early_stop in early_stopping_values:
    start_time = time.time()
    model = HistGradientBoostingRegressor(max_iter=1000, early_stopping=early_stop, random_state=42)
    model.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    n_iter = model.n_iter_
    results.append((early_stop, mse, train_time, n_iter))
    print(f"early_stopping={early_stop}, MSE: {mse:.3f}, Training Time: {train_time:.2f}s, Iterations: {n_iter}")
```
Running the example gives an output like:

```
early_stopping=True, MSE: 622.397, Training Time: 1.50s, Iterations: 479
early_stopping=False, MSE: 560.420, Training Time: 2.89s, Iterations: 1000
early_stopping=auto, MSE: 622.397, Training Time: 1.52s, Iterations: 479
```
The key steps in this example are:

- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different early_stopping settings
- Evaluate the mean squared error of each model on the test set
- Compare training time and number of iterations for each setting
Some tips for configuring early_stopping:

- Use early stopping for large datasets to reduce training time and prevent overfitting
- Monitor the number of iterations to ensure the model isn’t stopping too early
- Consider increasing n_iter_no_change if early stopping occurs too soon
- Use validation_fraction to control the size of the validation set
Issues to consider:
- Early stopping may result in underfitting if stopped too soon
- The effectiveness depends on the quality of the validation set
- Disabling early stopping may lead to overfitting and longer training times
- The ‘auto’ setting may not always be optimal for your specific use case