The power_t parameter in scikit-learn’s SGDRegressor controls the learning rate decay during training.
Stochastic Gradient Descent (SGD) is an optimization method used for training linear models. The power_t parameter determines how quickly the learning rate decreases over time, affecting the model’s convergence and performance.
A higher power_t value leads to faster decay, potentially speeding up convergence but risking premature convergence. A lower value results in slower decay, allowing for more exploration but potentially slowing down convergence.
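To make the decay rate concrete: SGDRegressor uses the invscaling learning rate schedule by default, where the learning rate at update t is eta0 / t**power_t. The short sketch below is an illustration of that formula only (it is not part of the later example) and prints the schedule for a few power_t values:
eta0 = 0.01  # SGDRegressor's default initial learning rate
# Learning rate under the 'invscaling' schedule: eta = eta0 / t**power_t
for power_t in [0.1, 0.25, 0.5, 1.0]:
    etas = [eta0 / t ** power_t for t in (1, 10, 100, 1000)]
    print(f"power_t={power_t}: " + ", ".join(f"{eta:.6f}" for eta in etas))
Larger power_t values shrink the learning rate much faster, which is the behaviour explored in the example below.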
The default value for power_t is 0.25. Common values range from 0 to 1, with 0.5 being another popular choice.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different power_t values
power_t_values = [0, 0.25, 0.5, 1]
# Evaluate final performance
for power_t in power_t_values:
    sgd = SGDRegressor(power_t=power_t, random_state=42, max_iter=1000, tol=1e-3)
    sgd.fit(X_train, y_train)
    y_pred = sgd.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"power_t={power_t}, MSE: {mse:.4f}")
Running the example gives an output like:
power_t=0, MSE: 0.0101
power_t=0.25, MSE: 0.0096
power_t=0.5, MSE: 0.0499
power_t=1, MSE: 13521.9053
Notice that power_t=1 performs far worse: with such fast decay, the learning rate becomes too small for the model to keep making meaningful progress (see the issues listed below). The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train SGDRegressor models with different power_t values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting power_t:
- Start with the default value of 0.25 and adjust based on model performance (a grid-search sketch follows this list)
- Use lower values (0.1-0.3) for complex datasets to allow more exploration
- Use higher values (0.4-0.7) for simpler datasets to speed up convergence
- Monitor learning curves to detect issues like slow convergence or instability
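Following the first tip, one simple way to adjust power_t is a small cross-validated grid search. The sketch below is illustrative only (the grid values are arbitrary, and it reuses X_train and y_train from the example above):
from sklearn.model_selection import GridSearchCV
# Cross-validated search over a few power_t values (illustrative grid)
param_grid = {"power_t": [0.1, 0.25, 0.4, 0.5]}
search = GridSearchCV(
    SGDRegressor(random_state=42, max_iter=1000, tol=1e-3),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)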
Issues to consider:
- The optimal power_t value depends on the dataset’s complexity and noise level
- Very high values (close to 1) may cause the learning rate to decay too quickly
- Very low values (close to 0) may result in slow convergence
- The effect of power_t interacts with other parameters like eta0 (initial learning rate); a joint sweep is sketched below
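To see that interaction, the sketch below (illustrative values, reusing the train/test split from the example above) sweeps eta0 and power_t together; since the invscaling schedule is eta0 / t**power_t, a larger eta0 shifts the whole schedule up and can partly offset fast decay:
# Joint sweep over eta0 and power_t (illustrative values)
for eta0 in [0.01, 0.1]:
    for power_t in [0.25, 0.5, 1.0]:
        sgd = SGDRegressor(eta0=eta0, power_t=power_t,
                           random_state=42, max_iter=1000, tol=1e-3)
        sgd.fit(X_train, y_train)
        mse = mean_squared_error(y_test, sgd.predict(X_test))
        print(f"eta0={eta0}, power_t={power_t}, MSE: {mse:.4f}")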