The l2_regularization parameter in scikit-learn’s HistGradientBoostingRegressor controls the strength of L2 regularization applied to the model’s leaf values.
HistGradientBoostingRegressor is a histogram-based gradient boosting algorithm that builds an ensemble of decision trees sequentially. It’s designed for efficiency and can handle large datasets.
The l2_regularization parameter adds a penalty term to the loss function, discouraging large leaf values. This helps prevent overfitting by reducing the model’s complexity.
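To make the mechanism concrete, here is a minimal sketch of how the penalty shrinks a leaf value in histogram-based gradient boosting, where a leaf’s value is typically computed as -sum(gradients) / (sum(hessians) + l2_regularization). The gradient and hessian sums below are made-up numbers for illustration:

# Hypothetical per-leaf gradient and hessian sums (illustration only)
sum_gradients = -30.0
sum_hessians = 10.0
for l2 in [0.0, 1.0, 10.0]:
    # Larger penalties pull the leaf value toward zero
    leaf_value = -sum_gradients / (sum_hessians + l2)
    print(f"l2_regularization={l2}: leaf value = {leaf_value:.3f}")

With l2_regularization=0 the leaf value is 3.0; with 10 it shrinks to 1.5.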
The default value for l2_regularization is 0, which means no regularization is applied. Typical values range from 0 to 1, but can be higher depending on the dataset and problem. The example below compares several values on a synthetic regression task.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different l2_regularization values
l2_values = [0, 0.1, 1, 10]
mse_scores = []
for l2 in l2_values:
    hgbr = HistGradientBoostingRegressor(l2_regularization=l2, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"l2_regularization={l2}, MSE: {mse:.3f}")
# Find best l2_regularization value
best_l2 = l2_values[np.argmin(mse_scores)]
print(f"Best l2_regularization value: {best_l2}")
Running the example gives an output like:
l2_regularization=0, MSE: 3073.589
l2_regularization=0.1, MSE: 3053.149
l2_regularization=1, MSE: 3327.394
l2_regularization=10, MSE: 3430.270
Best l2_regularization value: 0.1
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different l2_regularization values
- Evaluate the mean squared error of each model on the test set
- Identify the best l2_regularization value based on the lowest MSE
Tips for setting l2_regularization:
- Start with the default value of 0 and gradually increase it if overfitting occurs
- Use cross-validation to find the optimal value for your specific dataset (see the sketch after this list)
- Consider the trade-off between model complexity and performance
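As a minimal sketch of the cross-validation tip, the snippet below reuses X_train and y_train from the example above and searches a hypothetical grid of l2_regularization values with GridSearchCV:

from sklearn.model_selection import GridSearchCV

# Hypothetical grid of candidate values (adjust to your problem)
param_grid = {"l2_regularization": [0, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    HistGradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, so MSE is negated
    cv=5,
)
grid.fit(X_train, y_train)
print(f"Best l2_regularization: {grid.best_params_['l2_regularization']}")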
Issues to consider:
- Higher values of l2_regularization may lead to underfitting if set too high
- The optimal value depends on the characteristics of your dataset and the complexity of the underlying relationship
- Monitor both training and validation performance to ensure a good balance (a monitoring sketch follows this list)
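One way to monitor both curves is the estimator’s built-in validation split. A minimal sketch, reusing X_train and y_train from the example above with an arbitrary l2_regularization value:

# Enable early stopping so the model records per-iteration scores on both
# the training data and an internal validation split
model = HistGradientBoostingRegressor(
    l2_regularization=0.1,
    early_stopping=True,
    validation_fraction=0.1,  # hold out 10% of the training data
    scoring="neg_mean_squared_error",
    random_state=42,
)
model.fit(X_train, y_train)
# train_score_ and validation_score_ hold one score per boosting iteration
print(f"Final train score: {model.train_score_[-1]:.3f}")
print(f"Final validation score: {model.validation_score_[-1]:.3f}")

A large gap between the two final scores suggests overfitting, which raising l2_regularization can help reduce.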