The `epsilon` parameter in scikit-learn's `SVR` (Support Vector Regression) class controls the width of the insensitive region around the regression line, within which errors are ignored.

SVR tries to find a function that approximates the training data, with a tolerance for errors specified by `epsilon`. Points within the `epsilon`-insensitive tube do not contribute to the loss function or the optimization process.
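To make the tube concrete, here is a minimal sketch of the epsilon-insensitive loss; the helper name `epsilon_insensitive_loss` is illustrative only and not part of scikit-learn. Residuals smaller than `epsilon` cost nothing, and larger residuals are penalized only by the amount they exceed `epsilon`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    # Residuals inside the epsilon tube incur zero loss;
    # outside the tube, the loss grows linearly with the excess.
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - epsilon)

# Residuals of 0.05 and 0.08 fall inside the tube (epsilon=0.1) and cost nothing;
# the residual of 0.40 costs 0.40 - 0.10 = 0.30.
print(epsilon_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                               np.array([1.05, 1.92, 3.40])))
# [0.   0.   0.3]
```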
Smaller values of `epsilon` lead to more support vectors being used, potentially causing overfitting. Larger values result in fewer support vectors and possibly underfitting.
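A quick way to see this effect is to count support vectors at different `epsilon` values. The sketch below uses an arbitrary synthetic dataset with a standardized target; the exact counts will depend on your data, kernel, and `C`, but the downward trend as `epsilon` grows should hold.

```python
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)
y = (y - y.mean()) / y.std()  # put the target on a unit scale so epsilon values are comparable

# A wider tube (larger epsilon) leaves fewer points outside it,
# so fewer training points become support vectors.
for eps in [0.01, 0.1, 0.5, 1.0]:
    svr = SVR(epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(svr.support_)} support vectors")
```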
The default value for `epsilon` is 0.1. In practice, values between 0.01 and 1.0 are commonly used, depending on the scale and distribution of the target variable.
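Because `epsilon` is an absolute tolerance, the default of 0.1 is most meaningful when the target is on roughly unit scale. One common approach, sketched below as an assumption rather than part of the original example, is to standardize both the features and the target so that `epsilon=0.1` corresponds to a tenth of a standard deviation of y.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Standardize X inside a pipeline and y via TransformedTargetRegressor,
# so the default epsilon=0.1 is relative to a unit-variance target.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(epsilon=0.1)),
    transformer=StandardScaler(),
)
```

The complete example below instead compares several fixed `epsilon` values on an unscaled synthetic target.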
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different epsilon values
epsilon_values = [0.01, 0.1, 0.5, 1.0]
mse_scores = []
for eps in epsilon_values:
    svr = SVR(epsilon=eps)
    svr.fit(X_train, y_train)
    y_pred = svr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"epsilon={eps}, MSE: {mse:.3f}")
Running the example gives an output like:
epsilon=0.01, MSE: 12756.517
epsilon=0.1, MSE: 12758.146
epsilon=0.5, MSE: 12762.300
epsilon=1.0, MSE: 12761.253
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train `SVR` models with different `epsilon` values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting `epsilon`:
- Start with the default value of 0.1 and adjust based on the model’s performance
- Smaller values may lead to overfitting, while larger values may cause underfitting
- Consider the scale and distribution of the target variable when choosing `epsilon` (see the sketch after this list)
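One scale-aware starting point is sketched below, reusing `X_train`, `y_train`, and the imports from the example above; the 0.1 fraction is a heuristic assumption, not a scikit-learn recommendation.

```python
import numpy as np

# Tie epsilon to the spread of the target rather than using an absolute value
eps = 0.1 * np.std(y_train)
svr = SVR(epsilon=eps).fit(X_train, y_train)
print(f"epsilon={eps:.3f}, test MSE: {mean_squared_error(y_test, svr.predict(X_test)):.3f}")
```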
Issues to consider:
- The choice of `epsilon` is related to the choice of `C` (regularization parameter)
- Cross-validation can be used to find the best combination of `epsilon` and `C` (a sketch follows this list)
- The impact of `epsilon` may vary depending on the kernel used (e.g., linear, RBF)
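As a sketch of the cross-validation approach, reusing `X_train` and `y_train` from the example above (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Search jointly over epsilon and C, scoring by (negated) mean squared error
param_grid = {
    "epsilon": [0.01, 0.1, 0.5, 1.0],
    "C": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(SVR(), param_grid, scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```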