The l2_regularization parameter in scikit-learn’s HistGradientBoostingClassifier controls the strength of L2 regularization applied to the model’s leaf values. L2 regularization, also known as ridge regularization, adds a penalty term to the loss function that is proportional to the square of the leaf values (a tree ensemble has no feature weights to penalize). This helps prevent overfitting by shrinking leaf values toward zero, so no single leaf can contribute an extreme prediction.
Increasing l2_regularization
makes the model more conservative, potentially reducing overfitting at the cost of underfitting if set too high. Decreasing it allows the model to fit the training data more closely, which may improve performance on simple datasets but risks overfitting on complex ones.
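To see why this happens, it helps to look at how an L2 penalty typically enters a gradient-boosted tree. The sketch below uses the XGBoost-style leaf-value formula, which scikit-learn’s histogram-based implementation broadly follows; the gradient and hessian numbers are made up purely for illustration:

import numpy as np

# Illustrative only (hypothetical numbers, not scikit-learn internals):
# in XGBoost-style gradient boosting, a leaf's raw value is roughly
# -sum(gradients) / (sum(hessians) + l2), so a larger l2 shrinks leaf
# values toward zero and makes each tree's contribution more conservative.
gradients = np.array([-0.4, -0.3, -0.2])  # hypothetical per-sample gradients in one leaf
hessians = np.array([0.24, 0.21, 0.16])   # hypothetical per-sample hessians in the same leaf

for l2 in [0.0, 0.1, 1.0, 10.0]:
    leaf_value = -gradients.sum() / (hessians.sum() + l2)
    print(f"l2={l2}: leaf value = {leaf_value:.3f}")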
The default value for l2_regularization
is 0.0, which means no regularization is applied.
In practice, values between 0.01 and 10 are commonly used, depending on the dataset’s characteristics and the desired trade-off between bias and variance.
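You can confirm the default in your installed scikit-learn version and lay out a log-spaced grid of candidate values over that commonly used range before any tuning; the grid below is just one reasonable choice:

from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np

# The default applies no regularization
print(HistGradientBoostingClassifier().l2_regularization)  # 0.0

# A log-spaced grid covering the commonly used 0.01-10 range
candidate_values = np.logspace(-2, 1, 7)
print(candidate_values)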
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different l2_regularization values
l2_values = [0.0, 0.1, 1.0, 10.0]
results = []
for l2 in l2_values:
    hgbc = HistGradientBoostingClassifier(l2_regularization=l2, random_state=42)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    y_pred_proba = hgbc.predict_proba(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    logloss = log_loss(y_test, y_pred_proba)
    results.append((l2, accuracy, logloss))
    print(f"l2_regularization={l2}, Accuracy: {accuracy:.3f}, Log Loss: {logloss:.3f}")
Running the example gives an output like:
l2_regularization=0.0, Accuracy: 0.925, Log Loss: 0.170
l2_regularization=0.1, Accuracy: 0.925, Log Loss: 0.180
l2_regularization=1.0, Accuracy: 0.920, Log Loss: 0.182
l2_regularization=10.0, Accuracy: 0.905, Log Loss: 0.229
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different l2_regularization values
- Evaluate each model’s accuracy and log loss on the test set
Some tips and heuristics for setting l2_regularization:
- Start with the default value of 0.0 and gradually increase it if overfitting is observed
- Use cross-validation to find the optimal value for your specific dataset, as in the grid-search sketch after this list
- Consider the trade-off between model complexity and regularization strength
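A minimal way to apply the cross-validation tip is a small grid search. The sketch below uses GridSearchCV with neg_log_loss scoring over a handful of candidate values; the grid, scoring metric, and fold count are reasonable starting points rather than prescriptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Reuse the same kind of synthetic data as in the main example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Candidate values spanning the commonly used range
param_grid = {"l2_regularization": [0.0, 0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(
    HistGradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X, y)
print("Best l2_regularization:", grid.best_params_["l2_regularization"])
print("Best CV log loss:", -grid.best_score_)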
Issues to consider:
- The optimal regularization strength depends on the dataset’s size, complexity, and noise level
- Too little regularization may lead to overfitting, while too much can cause underfitting; comparing train and test log loss, as in the sketch after this list, helps diagnose which
- The effect of L2 regularization may be less pronounced for datasets with few features or large sample sizes
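One way to check which side of this trade-off a given setting falls on is to compare train and test log loss over a range of values, as in the following sketch (the value grid is arbitrary, and the pattern will vary by dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Same synthetic setup as the main example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A large train/test gap suggests overfitting; both losses rising together suggests underfitting
for l2 in [0.0, 0.1, 1.0, 10.0, 100.0]:
    model = HistGradientBoostingClassifier(l2_regularization=l2, random_state=42)
    model.fit(X_train, y_train)
    train_loss = log_loss(y_train, model.predict_proba(X_train))
    test_loss = log_loss(y_test, model.predict_proba(X_test))
    print(f"l2={l2}: train log loss={train_loss:.3f}, test log loss={test_loss:.3f}")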