The `min_samples_leaf` parameter in scikit-learn's `HistGradientBoostingRegressor` controls the minimum number of samples required to be at a leaf node.
`HistGradientBoostingRegressor` is a gradient boosting algorithm that uses histogram-based techniques for faster training on large datasets. It builds an ensemble of decision trees sequentially, with each tree correcting the errors of the previous ones.
The `min_samples_leaf` parameter affects the complexity of the individual trees in the ensemble. A smaller value allows for more complex trees, potentially leading to overfitting, while a larger value results in simpler trees, which may underfit.
The default value for `min_samples_leaf` is 20. In practice, values between 1 and 50 are commonly used, depending on the dataset size and complexity.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 20, 50]
mse_scores = []

for min_samples in min_samples_leaf_values:
    hgbr = HistGradientBoostingRegressor(min_samples_leaf=min_samples, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")

# Find best min_samples_leaf
best_min_samples = min_samples_leaf_values[np.argmin(mse_scores)]
print(f"Best min_samples_leaf: {best_min_samples}")
```
Running the example gives an output like:
```
min_samples_leaf=1, MSE: 1294.047
min_samples_leaf=5, MSE: 1176.680
min_samples_leaf=20, MSE: 1023.074
min_samples_leaf=50, MSE: 999.664
Best min_samples_leaf: 50
```
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `HistGradientBoostingRegressor` models with different `min_samples_leaf` values
- Evaluate the mean squared error of each model on the test set
- Identify the best `min_samples_leaf` value based on the lowest MSE
Some tips and heuristics for setting `min_samples_leaf`:
- Start with the default value of 20 and adjust based on model performance
- Smaller values allow for more complex trees, which may be beneficial for large datasets
- Larger values create simpler trees, which can help prevent overfitting on smaller datasets
Issues to consider:
- The optimal value depends on the size and complexity of your dataset
- Very small values may lead to overfitting, especially on noisy data
- Large values may cause underfitting if the trees are too simple to capture the underlying patterns
- There’s often a trade-off between model complexity and generalization ability