The `min_samples_leaf` parameter in scikit-learn's `HistGradientBoostingRegressor` controls the minimum number of samples required to be at a leaf node.
`HistGradientBoostingRegressor` is a gradient boosting algorithm that uses histogram-based techniques for faster training on large datasets. It builds an ensemble of decision trees sequentially, with each tree correcting the errors of the previous ones.
The `min_samples_leaf` parameter affects the complexity of the individual trees in the ensemble. A smaller value allows for more complex trees, potentially leading to overfitting, while a larger value results in simpler trees, which may underfit.
The default value for `min_samples_leaf` is 20. In practice, values between 1 and 50 are commonly used, depending on the dataset size and complexity.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 20, 50]
mse_scores = []

for min_samples in min_samples_leaf_values:
    hgbr = HistGradientBoostingRegressor(min_samples_leaf=min_samples, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")

# Find best min_samples_leaf
best_min_samples = min_samples_leaf_values[np.argmin(mse_scores)]
print(f"Best min_samples_leaf: {best_min_samples}")
```
Running the example gives an output like:
```
min_samples_leaf=1, MSE: 1294.047
min_samples_leaf=5, MSE: 1176.680
min_samples_leaf=20, MSE: 1023.074
min_samples_leaf=50, MSE: 999.664
Best min_samples_leaf: 50
```
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `HistGradientBoostingRegressor` models with different `min_samples_leaf` values
- Evaluate the mean squared error of each model on the test set
- Identify the best `min_samples_leaf` value based on the lowest MSE
Some tips and heuristics for setting `min_samples_leaf`:
- Start with the default value of 20 and adjust based on model performance
- Smaller values allow for more complex trees, which may be beneficial for large datasets
- Larger values create simpler trees, which can help prevent overfitting on smaller datasets
Issues to consider:
- The optimal value depends on the size and complexity of your dataset
- Very small values may lead to overfitting, especially on noisy data
- Large values may cause underfitting if the trees are too simple to capture the underlying patterns
- There’s often a trade-off between model complexity and generalization ability