The early_stopping parameter in scikit-learn’s SGDRegressor determines whether to use early stopping to terminate training when the validation score is not improving.
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm. Early stopping can prevent overfitting by halting training when the model’s performance on a validation set stops improving.
When early_stopping is set to True, the algorithm sets aside a fraction of the training data as a validation set and stops training if the validation score does not improve for a number of consecutive epochs.
The default value for early_stopping is False. Common values are True with the default settings, or True with custom validation_fraction and n_iter_no_change values.
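As a minimal sketch of how these parameters fit together (the values below are illustrative, not recommendations):

from sklearn.linear_model import SGDRegressor

# With early_stopping=True, SGDRegressor holds out validation_fraction
# of the training data and stops once the validation score fails to
# improve by at least tol for n_iter_no_change consecutive epochs.
model = SGDRegressor(
    early_stopping=True,
    validation_fraction=0.1,  # fraction of training data held out for validation
    n_iter_no_change=5,       # patience, in epochs
    tol=1e-3,
    max_iter=1000,
    random_state=42,
)

The complete example below compares several early_stopping configurations on a synthetic dataset: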
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different early_stopping configurations
configs = [
    {'early_stopping': False},
    {'early_stopping': True},
    {'early_stopping': True, 'validation_fraction': 0.1, 'n_iter_no_change': 5}
]
for config in configs:
    sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42, **config)
    sgd.fit(X_train, y_train)
    y_pred = sgd.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Config: {config}")
    print(f"Number of iterations: {sgd.n_iter_}")
    print(f"MSE: {mse:.4f}\n")
Running the example gives an output like:
Config: {'early_stopping': False}
Number of iterations: 7
MSE: 0.0106
Config: {'early_stopping': True}
Number of iterations: 6
MSE: 0.0107
Config: {'early_stopping': True, 'validation_fraction': 0.1, 'n_iter_no_change': 5}
Number of iterations: 6
MSE: 0.0107
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Create SGDRegressor instances with different early_stopping configurations
- Train the models and evaluate their performance on the test set
- Compare the number of iterations and mean squared error for each configuration
Tips for setting early_stopping:
- Enable early stopping when dealing with large datasets or when training time is a concern
- Adjust validation_fraction based on your dataset size; smaller datasets may need a larger fraction
- Tune n_iter_no_change to control the patience of the early stopping criterion (a tuning sketch follows this list)
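One way to choose these settings is to treat them as ordinary hyperparameters and search over them with cross-validation. The sketch below assumes a small, illustrative parameter grid; the specific values are not recommendations:

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=5000, n_features=20, noise=0.1, random_state=42)

# Search over early-stopping settings with early_stopping enabled;
# cross-validation picks the combination with the best score.
param_grid = {
    'validation_fraction': [0.05, 0.1, 0.2],
    'n_iter_no_change': [3, 5, 10],
}
search = GridSearchCV(
    SGDRegressor(early_stopping=True, max_iter=1000, tol=1e-3, random_state=42),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)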
Issues to consider:
- Early stopping may prevent the model from finding the optimal solution if stopped too early
- The effectiveness of early stopping depends on the quality of the internal validation split; for full control over the validation data, you can implement the stopping loop yourself, as sketched below
- Because a portion of the training data is held out for validation, slightly less data is available for fitting the model
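A minimal sketch of manual early stopping using partial_fit with an explicit validation set; the patience value and the use of R^2 as the stopping metric are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

sgd = SGDRegressor(random_state=42)
best_score, patience, epochs_without_improvement = -np.inf, 5, 0

for epoch in range(1000):
    # One pass over the training data, then score on our own validation set
    sgd.partial_fit(X_train, y_train)
    score = sgd.score(X_val, y_val)  # R^2 on the held-out validation data
    if score > best_score:
        best_score, epochs_without_improvement = score, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        print(f"Stopping at epoch {epoch}, best validation R^2: {best_score:.4f}")
        break

This gives you full control over which samples are used for validation, at the cost of writing the training loop yourself.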