The `min_samples_leaf` parameter in scikit-learn's `GradientBoostingClassifier` controls the minimum number of samples required to be at a leaf node.

Gradient boosting builds an ensemble of shallow decision trees sequentially, where each tree tries to correct the errors made by the previous trees. The `min_samples_leaf` parameter constrains how each tree grows: a candidate split is only considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches.
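To see this constraint in action, you can inspect the leaves of a fitted tree directly. The snippet below is a minimal sketch that checks the `n_node_samples` counts of every leaf against the threshold; it assumes the default `subsample=1.0`, so node counts refer to the full training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)
gb = GradientBoostingClassifier(min_samples_leaf=20, random_state=42)
gb.fit(X, y)

# Each boosting stage is a DecisionTreeRegressor; leaves are the nodes
# with no children (children_left == -1)
tree = gb.estimators_[0, 0].tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print(leaf_sizes.min())  # >= 20; every leaf respects min_samples_leaf
```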
Smaller values of `min_samples_leaf` allow the model to capture more fine-grained patterns, potentially leading to overfitting. Larger values create simpler, more generalizable models by requiring a higher number of samples in each leaf, potentially leading to underfitting.
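One way to observe this tradeoff directly is to compare train and test accuracy at the two extremes. The sketch below (the values 1 and 50 are illustrative, not recommendations) shows how a shrinking train-test gap signals reduced overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for leaf in (1, 50):
    gb = GradientBoostingClassifier(min_samples_leaf=leaf, random_state=42)
    gb.fit(X_train, y_train)
    # A large gap between train and test accuracy suggests overfitting
    print(f"min_samples_leaf={leaf}: "
          f"train={gb.score(X_train, y_train):.3f}, "
          f"test={gb.score(X_test, y_test):.3f}")
```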
The default value for `min_samples_leaf` is 1. In practice, integer values typically range from 1 to 100, depending on the size and complexity of the dataset. The parameter also accepts a float in (0.0, 1.0), which is interpreted as a fraction of the training samples: the minimum per leaf becomes `ceil(min_samples_leaf * n_samples)`.
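A fractional setting scales with the dataset rather than being an absolute count. This short sketch shows two roughly equivalent ways to require 10 samples per leaf on a 1000-sample training set:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Absolute count: at least 10 samples per leaf
gb_int = GradientBoostingClassifier(min_samples_leaf=10)

# Fraction: at least ceil(0.01 * n_samples) samples per leaf,
# i.e. 10 samples when training on 1000 samples
gb_frac = GradientBoostingClassifier(min_samples_leaf=0.01)
```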
The following example trains `GradientBoostingClassifier` models with several `min_samples_leaf` values on a synthetic multiclass dataset and compares their test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 20, 50]
accuracies = []

for min_samples in min_samples_leaf_values:
    gb = GradientBoostingClassifier(min_samples_leaf=min_samples,
                                    random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_leaf={min_samples}, Accuracy: {accuracy:.3f}")
```
The output of running this example would look like:

```
min_samples_leaf=1, Accuracy: 0.785
min_samples_leaf=5, Accuracy: 0.780
min_samples_leaf=20, Accuracy: 0.805
min_samples_leaf=50, Accuracy: 0.820
```
The key steps in this example are:

- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `min_samples_leaf` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `min_samples_leaf`:

- Smaller values (e.g., 1) allow the model to create complex decision boundaries, but may overfit
- Larger values (e.g., 50) create simpler, more generalizable models, but may underfit
- The optimal value depends on the dataset; experiment with different values and use cross-validation to find the best setting, as sketched after this list
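A minimal cross-validation sketch for this, assuming the `X_train` and `y_train` variables from the example above:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Compare candidate values by mean cross-validated accuracy
for leaf in [1, 5, 20, 50]:
    gb = GradientBoostingClassifier(min_samples_leaf=leaf, random_state=42)
    scores = cross_val_score(gb, X_train, y_train, cv=5)
    print(f"min_samples_leaf={leaf}: CV accuracy {scores.mean():.3f} "
          f"(+/- {scores.std():.3f})")
```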
Issues to consider:

- `min_samples_leaf` interacts with other parameters like `learning_rate` and `n_estimators`, so tune them together (see the sketch after this list)
- Higher values of `min_samples_leaf` are more computationally efficient, as they result in smaller trees
- There is no universally optimal value; the best setting depends on the specific dataset and problem
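A minimal joint-tuning sketch using `GridSearchCV` (the grid values are illustrative, not recommendations; it again assumes `X_train` and `y_train` from the example above):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Tune min_samples_leaf together with the parameters it interacts with
param_grid = {
    "min_samples_leaf": [1, 20, 50],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```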