
Configure GradientBoostingRegressor "min_samples_leaf" Parameter

The min_samples_leaf parameter in scikit-learn’s GradientBoostingRegressor controls the minimum number of samples required to be at a leaf node.

Gradient Boosting is a powerful ensemble method that builds models sequentially to correct errors from previous models. It combines the strengths of multiple models to improve overall performance.

The min_samples_leaf parameter determines the minimum number of samples required to form a leaf in each decision tree of the ensemble. Smaller values let the trees carve out finer partitions of the training data, while larger values force each leaf to average over more samples, producing smoother predictions.

Generally, setting min_samples_leaf to higher values can prevent overfitting by ensuring that leaf nodes have a sufficient number of samples, which helps in generalizing better. However, too high a value can lead to underfitting by making the model too simplistic.

The default value for min_samples_leaf is 1.

In practice, values between 1 and 10 are commonly used, balancing model complexity and generalization.
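Besides integer counts, min_samples_leaf also accepts a float in (0.0, 1.0), which scikit-learn interprets as a fraction of the training samples (each leaf must contain at least ceil(fraction * n_samples) samples). This keeps the constraint proportional when the dataset grows or shrinks. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# A float in (0.0, 1.0) is read as a fraction of n_samples:
# here each leaf must hold at least ceil(0.01 * 1000) = 10 samples.
gbr = GradientBoostingRegressor(min_samples_leaf=0.01, random_state=42)
gbr.fit(X, y)
print(gbr.min_samples_leaf)  # 0.01
```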

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10]
errors = []

for min_samples in min_samples_leaf_values:
    gbr = GradientBoostingRegressor(min_samples_leaf=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"min_samples_leaf={min_samples}, MSE: {error:.3f}")

Running the example gives an output like:

min_samples_leaf=1, MSE: 1234.753
min_samples_leaf=5, MSE: 1185.637
min_samples_leaf=10, MSE: 1154.572

The key steps in this example are:

  1. Generate a synthetic regression dataset.
  2. Split the data into training and test sets.
  3. Train GradientBoostingRegressor models with different min_samples_leaf values.
  4. Evaluate the Mean Squared Error (MSE) of each model on the test set.
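The example above compares a few hand-picked values; in practice, min_samples_leaf is usually tuned with cross-validation rather than a manual loop. A sketch using GridSearchCV on the same synthetic data (the candidate grid here is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Search over candidate leaf sizes with 5-fold cross-validation,
# scoring by negated MSE (higher is better for GridSearchCV).
param_grid = {"min_samples_leaf": [1, 2, 5, 10, 20]}
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(f"Best min_samples_leaf: {grid.best_params_['min_samples_leaf']}")
print(f"Best CV MSE: {-grid.best_score_:.3f}")
```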

Some tips and heuristics for setting min_samples_leaf:

  - Start with the default of 1 and increase it only if the model overfits (training error far below test error).
  - Tune it jointly with related parameters such as max_depth, min_samples_split, and learning_rate, ideally via cross-validation.
  - Pass a float to express the constraint as a fraction of the training set, which scales naturally with dataset size.

Issues to consider:

  - Higher values reduce variance but increase bias; too high a value makes the model underfit.
  - The best value depends on dataset size and noise level: small or noisy datasets often benefit from larger leaves.
  - min_samples_leaf only constrains leaf nodes; tree growth is also limited by min_samples_split, so the two parameters interact.

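A quick way to see the bias-variance trade-off in action is to compare training and test error as min_samples_leaf grows: a large gap between the two suggests overfitting, and a shrinking gap suggests the constraint is helping. A sketch on the same synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Report train MSE, test MSE, and their gap for each leaf size.
for min_samples in [1, 10, 50]:
    gbr = GradientBoostingRegressor(min_samples_leaf=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, gbr.predict(X_train))
    test_mse = mean_squared_error(y_test, gbr.predict(X_test))
    print(f"min_samples_leaf={min_samples}: train MSE={train_mse:.1f}, "
          f"test MSE={test_mse:.1f}, gap={test_mse - train_mse:.1f}")
```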
