Configure GradientBoostingRegressor "min_samples_leaf" Parameter

The min_samples_leaf parameter in scikit-learn’s GradientBoostingRegressor controls the minimum number of samples required to be at a leaf node.

Gradient boosting is an ensemble method that builds decision trees sequentially, with each new tree fit to correct the errors of the trees before it. The final prediction combines the contributions of all trees in the sequence.

During tree construction, a split is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. The parameter therefore controls the granularity of each tree: larger values produce fewer, larger leaves and smoother predictions.

Generally, setting min_samples_leaf to higher values can prevent overfitting by ensuring that leaf nodes have a sufficient number of samples, which helps in generalizing better. However, too high a value can lead to underfitting by making the model too simplistic.
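The constraint described above can be checked directly on a fitted model. The following is a minimal sketch, not part of the original example: the dataset size, the value 20, and the helper name leaf_stats are illustrative assumptions. It reads the per-node sample counts from each fitted tree and reports the smallest leaf found in the ensemble, along with the average number of leaves per tree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data; the sizes used here are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

def leaf_stats(model):
    """Hypothetical helper: return (smallest leaf size, mean leaves per tree)."""
    smallest_sizes, leaf_counts = [], []
    for tree in model.estimators_.ravel():
        t = tree.tree_
        is_leaf = t.children_left == -1  # leaf nodes have no children
        sizes = t.n_node_samples[is_leaf]  # training samples in each leaf
        smallest_sizes.append(int(sizes.min()))
        leaf_counts.append(int(sizes.size))
    return min(smallest_sizes), float(np.mean(leaf_counts))

results = {}
for value in (1, 20):
    gbr = GradientBoostingRegressor(min_samples_leaf=value, random_state=42)
    gbr.fit(X, y)
    results[value] = leaf_stats(gbr)
    smallest, avg_leaves = results[value]
    print(f"min_samples_leaf={value}: smallest leaf={smallest}, "
          f"avg leaves per tree={avg_leaves:.1f}")
```

With the default subsample=1.0, every tree is trained on the full training set, so no leaf should hold fewer samples than the configured minimum. Larger values will typically also reduce the number of leaves per tree, although the exact counts depend on the data.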

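The default value for min_samples_leaf is 1. The parameter also accepts a float, which is interpreted as a fraction of the training samples: the minimum leaf size becomes ceil(min_samples_leaf * n_samples).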

In practice, values between 1 and 10 are commonly used, balancing model complexity and generalization.
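Besides integers, the scikit-learn documentation describes a float form of min_samples_leaf, where the value is treated as a fraction of the training samples and the minimum leaf size is ceil(min_samples_leaf * n_samples). The sketch below illustrates this under assumed settings (400 samples and a fraction of 0.1, both chosen only for illustration) by comparing the expected minimum with the smallest leaf actually found.

```python
import math
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data; the size is an illustrative assumption
X, y = make_regression(n_samples=400, n_features=10, noise=0.1, random_state=42)

fraction = 0.1  # illustrative value, not a recommendation
expected_min = math.ceil(fraction * X.shape[0])

gbr = GradientBoostingRegressor(min_samples_leaf=fraction, random_state=42)
gbr.fit(X, y)

# Smallest leaf across all trees in the ensemble
smallest_leaf = min(
    int(tree.tree_.n_node_samples[tree.tree_.children_left == -1].min())
    for tree in gbr.estimators_.ravel()
)
print(f"expected minimum: {expected_min}, smallest leaf observed: {smallest_leaf}")
```

The fractional form can be convenient when the same configuration is reused on datasets of different sizes, since the minimum leaf size then scales with the amount of training data.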

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10]
errors = []

for min_samples in min_samples_leaf_values:
    gbr = GradientBoostingRegressor(min_samples_leaf=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"min_samples_leaf={min_samples}, MSE: {error:.3f}")

Running the example gives an output like:

min_samples_leaf=1, MSE: 1234.753
min_samples_leaf=5, MSE: 1185.637
min_samples_leaf=10, MSE: 1154.572

The key steps in this example are:

Generate a synthetic regression dataset.
Split the data into training and test sets.
Train GradientBoostingRegressor models with different min_samples_leaf values.
Evaluate the Mean Squared Error (MSE) of each model on the test set.

Some tips and heuristics for setting min_samples_leaf:

Start with the default value of 1 and adjust based on model performance and complexity.
Lower values may lead to overfitting, while higher values may cause underfitting.
Compare several candidate values with cross-validation rather than a single train/test split, since the best setting is a trade-off between bias and variance.
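One way to apply these tips is to compare candidate values with cross-validation. The following is a minimal sketch using GridSearchCV; the candidate grid, the reduced n_estimators=50 (chosen to keep the run short), and the 5-fold setting are assumptions for illustration rather than recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data; the size is an illustrative assumption
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate values are assumptions; adjust them to the dataset at hand
param_grid = {"min_samples_leaf": [1, 5, 10, 20]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X, y)

# Report the mean cross-validated MSE for each candidate
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(f"min_samples_leaf={params['min_samples_leaf']}, CV MSE: {-score:.3f}")
print("Best:", search.best_params_)
```

Averaging over folds gives a less noisy comparison than a single train/test split, which matters when the differences between neighboring values are small.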

Issues to consider:

The optimal value depends on the dataset size and distribution.
Too small min_samples_leaf can lead to overly complex models.
Too large min_samples_leaf can lead to overly simplified models.

See Also