Configure SGDRegressor "validation_fraction" Parameter

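The validation_fraction parameter in scikit-learn’s SGDRegressor sets the proportion of the training data held out as a validation set for early stopping. It takes effect only when early_stopping=True; with the default early_stopping=False it is ignored.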

Stochastic Gradient Descent (SGD) is an efficient method for fitting linear models, especially on large datasets. Early stopping helps prevent overfitting by monitoring the model’s performance on a validation set during training.

The validation_fraction parameter controls how much of the training data is used for validation. A larger fraction provides a more reliable estimate of generalization performance but reduces the amount of data available for training.

The default value for validation_fraction is 0.1 (10% of the training data). The value must be between 0 and 1.
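As a quick check, the defaults can be read directly from an unfitted estimator. This is a minimal sketch that only inspects parameters. It shows that early stopping is off by default, so validation_fraction has no effect until early_stopping is switched on.

```python
from sklearn.linear_model import SGDRegressor

# Inspect the default early-stopping settings of an unfitted estimator
params = SGDRegressor().get_params()
print("validation_fraction:", params["validation_fraction"])
print("early_stopping:", params["early_stopping"])
print("n_iter_no_change:", params["n_iter_no_change"])

# Enabling early stopping is what makes validation_fraction matter
model = SGDRegressor(early_stopping=True, validation_fraction=0.2)
```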

In practice, values between 0.1 and 0.3 are commonly used, depending on the size of the dataset and the stability of the learning curves.
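To make the trade-off concrete, the sketch below shows how many rows each fraction would hold out from a hypothetical training set of 8000 samples. It uses train_test_split as a stand-in purely to illustrate the arithmetic; SGDRegressor performs its own internal split when early_stopping=True, and the size 8000 is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_train = 8000  # hypothetical training-set size
rows = np.arange(n_train)

split_sizes = {}
for fraction in (0.1, 0.2, 0.3):
    # Hold out `fraction` of the rows, as a validation split would
    fit_rows, val_rows = train_test_split(rows, test_size=fraction, random_state=0)
    split_sizes[fraction] = (len(fit_rows), len(val_rows))
    print(f"fraction={fraction}: {len(fit_rows)} rows for fitting, "
          f"{len(val_rows)} rows for validation")
```

Every row moved into the validation set is a row the optimizer never sees, which is why small datasets feel the cost of a large fraction more sharply.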

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different validation_fraction values.
# early_stopping=True is required; otherwise validation_fraction is ignored.
validation_fractions = [0.1, 0.2, 0.3]
mse_scores = []

for fraction in validation_fractions:
    sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42,
                       early_stopping=True, validation_fraction=fraction)
    sgd.fit(X_train, y_train)
    y_pred = sgd.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"validation_fraction={fraction}, MSE: {mse:.3f}, n_iter_: {sgd.n_iter_}")

# Pick the fraction with the lowest test MSE (np.argmin returns the first index on ties)
best_fraction = validation_fractions[np.argmin(mse_scores)]
print(f"Best validation_fraction: {best_fraction}")

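With early_stopping left at its default of False, running the example gives an output like the one below. All three runs are identical because validation_fraction is ignored in that case; with early_stopping=True, the MSE and n_iter_ values can differ between fractions: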

validation_fraction=0.1, MSE: 0.011, n_iter_: 7
validation_fraction=0.2, MSE: 0.011, n_iter_: 7
validation_fraction=0.3, MSE: 0.011, n_iter_: 7
Best validation_fraction: 0.1

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train SGDRegressor models with different validation_fraction values
Evaluate the mean squared error (MSE) of each model on the test set
Compare the number of iterations and performance for different fractions
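One way to check whether validation_fraction is actually influencing training is to compare fits with early stopping off and on. The sketch below is an illustration on a smaller synthetic dataset (the helper fit_coef is hypothetical, not part of scikit-learn). With early stopping off, the fraction is unused, so two fits with the same random_state should produce the same coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=42)

def fit_coef(fraction, early_stopping):
    """Fit a model and return its coefficients and number of epochs."""
    model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42,
                         early_stopping=early_stopping,
                         validation_fraction=fraction)
    model.fit(X, y)
    return model.coef_, model.n_iter_

# Early stopping off: the fraction is ignored, so the fits match
coef_a, _ = fit_coef(0.1, early_stopping=False)
coef_b, _ = fit_coef(0.3, early_stopping=False)
same_without_es = np.allclose(coef_a, coef_b)
print("identical without early stopping:", same_without_es)

# Early stopping on: part of the data is held out, so the fits can differ
coef_c, n_iter_c = fit_coef(0.1, early_stopping=True)
coef_d, n_iter_d = fit_coef(0.3, early_stopping=True)
print("identical with early stopping:", np.allclose(coef_c, coef_d))
print("epochs with early stopping:", n_iter_c, n_iter_d)
```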

Some tips and heuristics for setting validation_fraction:

Start with the default value of 0.1 and adjust based on dataset size and model stability
Use a larger fraction for smaller datasets to get a more reliable estimate of generalization performance
Consider using cross-validation instead for very small datasets
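For the small-dataset case, a minimal sketch of the cross-validation alternative might look like the following. The dataset size of 200 and the 5-fold setup are assumptions chosen for illustration. Early stopping is left off here so that no additional data is held out inside each fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score

# A deliberately small synthetic dataset (hypothetical size)
X_small, y_small = make_regression(n_samples=200, n_features=20, noise=0.1,
                                   random_state=42)

model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)

# scikit-learn reports negated MSE so that higher is always better
neg_mse = cross_val_score(model, X_small, y_small, cv=5,
                          scoring="neg_mean_squared_error")
mse_per_fold = -neg_mse
print("MSE per fold:", mse_per_fold)
print("Mean MSE:", mse_per_fold.mean())
```

Every sample is used for both fitting and evaluation across the folds, which tends to give a steadier estimate than a single small validation split.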

Issues to consider:

A larger validation fraction reduces the amount of data available for training
The optimal fraction depends on the dataset size, noise level, and model complexity
Early stopping with a validation set may not always outperform other regularization methods
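To compare early stopping against another form of regularization on a given dataset, both can be fitted and scored on the same held-out test set. The sketch below is one possible setup, not a benchmark: the noise level, the alpha value of 1e-2, and the dataset size are all assumptions, and which option scores better depends on the data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

candidates = {
    # Hold out 20% of the training data and stop when validation stalls
    "early stopping": SGDRegressor(early_stopping=True, validation_fraction=0.2,
                                   random_state=0),
    # Train on all the data with a stronger L2 penalty than the default
    "stronger L2 penalty": SGDRegressor(alpha=1e-2, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {results[name]:.3f}, n_iter_ = {model.n_iter_}")
```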

See Also