The l2_regularization parameter in scikit-learn’s HistGradientBoostingRegressor controls the strength of L2 regularization applied to the model’s leaf values.
HistGradientBoostingRegressor is a histogram-based gradient boosting algorithm that builds an ensemble of decision trees sequentially. It’s designed for efficiency and can handle large datasets.
The l2_regularization parameter adds a penalty term to the loss function, discouraging large leaf values. This helps prevent overfitting by reducing the model’s complexity.
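To make the mechanism concrete, here is a minimal sketch of how the penalty shrinks a leaf value in histogram-based gradient boosting, where a leaf’s value is typically computed as -sum(gradients) / (sum(hessians) + l2_regularization). The gradient and hessian sums below are made-up numbers for illustration:

# Hypothetical per-leaf gradient and hessian sums (illustration only)
sum_gradients = -30.0
sum_hessians = 10.0
for l2 in [0.0, 1.0, 10.0]:
    # Larger penalties pull the leaf value toward zero
    leaf_value = -sum_gradients / (sum_hessians + l2)
    print(f"l2_regularization={l2}: leaf value = {leaf_value:.3f}")

With l2_regularization=0 the leaf value is 3.0; with 10 it shrinks to 1.5.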
The default value for l2_regularization is 0, which means no regularization is applied. Typical values range from 0 to 1, but can be higher depending on the dataset and problem. The example below compares several values on a synthetic regression task.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different l2_regularization values
l2_values = [0, 0.1, 1, 10]
mse_scores = []
for l2 in l2_values:
    hgbr = HistGradientBoostingRegressor(l2_regularization=l2, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"l2_regularization={l2}, MSE: {mse:.3f}")
# Find best l2_regularization value
best_l2 = l2_values[np.argmin(mse_scores)]
print(f"Best l2_regularization value: {best_l2}")
Running the example gives an output like:
l2_regularization=0, MSE: 3073.589
l2_regularization=0.1, MSE: 3053.149
l2_regularization=1, MSE: 3327.394
l2_regularization=10, MSE: 3430.270
Best l2_regularization value: 0.1
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different l2_regularization values
- Evaluate the mean squared error of each model on the test set
- Identify the best l2_regularization value based on the lowest MSE
Tips for setting l2_regularization:
- Start with the default value of 0 and gradually increase it if overfitting occurs
- Use cross-validation to find the optimal value for your specific dataset (see the sketch after this list)
- Consider the trade-off between model complexity and performance
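As a minimal sketch of the cross-validation tip, the snippet below reuses X_train and y_train from the example above and searches a hypothetical grid of l2_regularization values with GridSearchCV:

from sklearn.model_selection import GridSearchCV

# Hypothetical grid of candidate values (adjust to your problem)
param_grid = {"l2_regularization": [0, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    HistGradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, so MSE is negated
    cv=5,
)
grid.fit(X_train, y_train)
print(f"Best l2_regularization: {grid.best_params_['l2_regularization']}")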
Issues to consider:
- Higher values of l2_regularization may lead to underfitting if set too high
- The optimal value depends on the characteristics of your dataset and the complexity of the underlying relationship
- Monitor both training and validation performance to ensure a good balance (a monitoring sketch follows this list)
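One way to monitor both curves is the estimator’s built-in validation split. A minimal sketch, reusing X_train and y_train from the example above with an arbitrary l2_regularization value:

# Enable early stopping so the model records per-iteration scores on both
# the training data and an internal validation split
model = HistGradientBoostingRegressor(
    l2_regularization=0.1,
    early_stopping=True,
    validation_fraction=0.1,  # hold out 10% of the training data
    scoring="neg_mean_squared_error",
    random_state=42,
)
model.fit(X_train, y_train)
# train_score_ and validation_score_ hold one score per boosting iteration
print(f"Final train score: {model.train_score_[-1]:.3f}")
print(f"Final validation score: {model.validation_score_[-1]:.3f}")

A large gap between the two final scores suggests overfitting, which raising l2_regularization can help reduce.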