The eta0 parameter in scikit-learn's SGDRegressor sets the initial learning rate for the model's gradient descent optimization.
Stochastic Gradient Descent (SGD) is an iterative optimization algorithm used for fitting linear models. It updates the model’s parameters based on the gradient of the loss function with respect to a single training example at each iteration.
The eta0 parameter controls the step size taken during each update. A larger value can lead to faster initial convergence but may overshoot the optimal solution, while a smaller value provides more precise updates but may require more iterations to converge.
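To make the role of the step size concrete, the sketch below shows a single SGD update for squared-error loss on one training sample. It is only an illustration of the update rule, not scikit-learn's internal implementation:
import numpy as np
# Illustrative single SGD step for squared-error loss on one sample (x_i, y_i).
# The learning rate eta scales how far the weights move along the negative gradient.
def sgd_step(w, b, x_i, y_i, eta):
    error = (np.dot(w, x_i) + b) - y_i               # prediction error for this sample
    return w - eta * error * x_i, b - eta * error    # gradient step for weights and intercept
w, b = np.zeros(3), 0.0
w, b = sgd_step(w, b, np.array([1.0, 2.0, 3.0]), 4.0, eta=0.01)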
The default value for eta0 is 0.01.
In practice, values between 0.1 and 0.0001 are commonly used, depending on the specific problem and dataset characteristics.
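Note that with SGDRegressor's default learning_rate='invscaling' schedule, eta0 is only the starting point: the effective step size decays over updates as eta0 / t^power_t, where power_t defaults to 0.25. A quick illustration of that decay:
eta0, power_t = 0.01, 0.25              # scikit-learn defaults for SGDRegressor
for t in [1, 10, 100, 1000]:
    eta = eta0 / (t ** power_t)         # 'invscaling': eta = eta0 / pow(t, power_t)
    print(f"t={t:>4}, effective learning rate = {eta:.5f}")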
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different eta0 values
eta0_values = [0.1, 0.01, 0.001, 0.0001]
mse_scores = []
for eta0 in eta0_values:
    sgd = SGDRegressor(eta0=eta0, random_state=42, max_iter=1000, tol=1e-3)
    sgd.fit(X_train, y_train)
    y_pred = sgd.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"eta0={eta0}, MSE: {mse:.3f}")
Running the example gives an output like:
eta0=0.1, MSE: 0.010
eta0=0.01, MSE: 0.010
eta0=0.001, MSE: 0.026
eta0=0.0001, MSE: 22.820
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train SGDRegressor models with different eta0 values
- Evaluate the mean squared error of each model on the test set
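As an alternative to the manual loop above, eta0 can also be tuned with a cross-validated grid search over the same candidate values. A minimal sketch, reusing X_train and y_train from the example (the grid, scoring, and cv settings here are illustrative choices):
from sklearn.model_selection import GridSearchCV
# Sketch: cross-validated search over eta0 on the training split from the example above.
param_grid = {"eta0": [0.1, 0.01, 0.001, 0.0001]}
grid = GridSearchCV(
    SGDRegressor(random_state=42, max_iter=1000, tol=1e-3),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)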
Some tips and heuristics for setting eta0:
- Start with the default value of 0.01 and adjust based on model performance
- Use larger values (e.g., 0.1) for faster initial convergence on simple problems
- Use smaller values (e.g., 0.001 or 0.0001) for more complex problems or when fine-tuning is needed
- Consider using a learning rate schedule that adjusts the step size over time (e.g., learning_rate='invscaling', 'adaptive', or 'optimal') rather than a constant rate
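As a rough sketch of that last tip, the schedule is selected with the learning_rate parameter, and eta0 seeds the 'constant', 'invscaling', and 'adaptive' schedules (the values below are illustrative and reuse the train/test split from the example):
# Sketch: pairing eta0 with different learning-rate schedules.
for schedule in ["constant", "invscaling", "adaptive"]:
    model = SGDRegressor(learning_rate=schedule, eta0=0.01, random_state=42, max_iter=1000, tol=1e-3)
    model.fit(X_train, y_train)
    # 'constant' keeps eta fixed, 'invscaling' decays it, 'adaptive' divides it by 5 when the loss stops improving
    print(schedule, mean_squared_error(y_test, model.predict(X_test)))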
Issues to consider:
- An eta0 that is too large can cause overshooting and unstable or divergent training
- An eta0 that is too small may result in very slow convergence or stopping (via max_iter) before a good solution is reached
- The optimal eta0 depends on the scale of the features and the complexity of the problem
- Consider combining eta0 tuning with feature scaling for better results