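The epsilon parameter in scikit-learn’s SGDRegressor defines the margin of tolerance in the epsilon-insensitive loss function. It only takes effect when loss is set to 'epsilon_insensitive', 'squared_epsilon_insensitive', or 'huber'; the default loss, 'squared_error', ignores it.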
SGDRegressor is a linear model that uses Stochastic Gradient Descent for optimization. The epsilon-insensitive loss function ignores errors within epsilon distance of the true value, making the model less sensitive to small fluctuations.
epsilon controls the width of the insensitive region. A larger value tolerates larger errors, which can yield a simpler model but risks underfitting. A smaller value penalizes more of the residuals, which can capture finer structure but risks fitting noise.
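To make the insensitive region concrete, here is a minimal NumPy sketch of the per-sample epsilon-insensitive loss, max(0, |y - prediction| - epsilon). The function name epsilon_insensitive_loss and the sample values are illustrative assumptions, not part of scikit-learn's API.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Illustrative values: absolute errors of 0.05, 0.5 and 1.0
y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.5, 0.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
# The first error (0.05) is inside the tube, so it contributes no loss;
# the other two are reduced by epsilon.
print(losses)
```

Widening epsilon moves more samples into the zero-loss region, which is why larger values make the fit less responsive to small residuals.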
The default value for epsilon is 0.1. Because epsilon is expressed in the units of the target variable, values between 0.01 and 1.0 are commonly used, depending on the scale and noise level of the target.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different epsilon values
epsilon_values = [0.01, 0.1, 0.5, 1.0]
mse_scores = []
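for eps in epsilon_values:
    # epsilon only affects the fit when the loss depends on it;
    # the default loss ("squared_error") ignores it
    sgd = SGDRegressor(loss="epsilon_insensitive", epsilon=eps, random_state=42)
    sgd.fit(X_train, y_train)
    y_pred = sgd.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"epsilon={eps}, MSE: {mse:.3f}")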
# Find best epsilon
best_epsilon = epsilon_values[np.argmin(mse_scores)]
print(f"Best epsilon: {best_epsilon}")
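Running the example prints output in the following form (exact values depend on the loss setting, the scikit-learn version, and the generated data):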
epsilon=0.01, MSE: 0.010
epsilon=0.1, MSE: 0.010
epsilon=0.5, MSE: 0.010
epsilon=1.0, MSE: 0.010
Best epsilon: 0.01
The key steps in this example are:
- Generate a synthetic regression dataset with some noise
- Split the data into train and test sets
- Train SGDRegressor models with different epsilon values
- Evaluate the mean squared error of each model on the test set
- Identify the best epsilon value based on the lowest MSE
Some tips and heuristics for setting epsilon:
- Start with the default value of 0.1 and adjust based on model performance
- For noisy data, a larger epsilon may help to avoid overfitting
- Consider the scale of your target variable when setting epsilon
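One way to act on the scale tip is to standardize the target so that epsilon is expressed in standard deviations rather than raw units. The sketch below is one possible approach, not a recommendation from scikit-learn: it wraps SGDRegressor in TransformedTargetRegressor with a StandardScaler on the target. The dataset and the epsilon of 0.1 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Assumed synthetic data whose target spans a wide numeric range
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# The regressor sees a standardized target, so epsilon=0.1 means
# "0.1 standard deviations of y" instead of 0.1 raw units
model = TransformedTargetRegressor(
    regressor=SGDRegressor(loss="epsilon_insensitive", epsilon=0.1, random_state=0),
    transformer=StandardScaler(),
)
model.fit(X, y)

# Width of the insensitive region translated back to the original units
tube_in_original_units = 0.1 * model.transformer_.scale_[0]
print(f"target std: {np.std(y):.2f}, tube half-width: {tube_in_original_units:.2f}")

# Predictions are returned on the original target scale
preds = model.predict(X[:5])
print(preds)
```

With this setup the same epsilon grid can be reused across datasets whose targets differ in magnitude.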
Issues to consider:
- A too-large epsilon may cause underfitting, while a too-small epsilon may lead to overfitting
- The optimal epsilon value often depends on the noise level in your data
- Epsilon interacts with other parameters such as the learning rate (learning_rate, eta0) and the regularization strength (alpha), so it is best tuned together with them
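Because of these interactions, it can help to search over epsilon jointly with the learning rate and regularization strength rather than one at a time. The following is a minimal sketch using GridSearchCV on an assumed synthetic dataset; the grid values are illustrative guesses, not recommended defaults.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed synthetic data with noticeable noise in the target
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Scale features first, since SGD is sensitive to feature scale
pipe = make_pipeline(
    StandardScaler(),
    SGDRegressor(loss="epsilon_insensitive", random_state=0),
)

# Illustrative grid: tube width, L2 penalty strength, initial learning rate
param_grid = {
    "sgdregressor__epsilon": [0.1, 1.0, 10.0],
    "sgdregressor__alpha": [1e-4, 1e-2],
    "sgdregressor__eta0": [0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated MSE:", -search.best_score_)
```

Cross-validation is used here instead of a single train/test split so that the chosen combination is less dependent on one particular partition of the data.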