The `min_samples_leaf` parameter in scikit-learn's `GradientBoostingClassifier` controls the minimum number of samples required to be at a leaf node.

Gradient boosting builds an ensemble of shallow decision trees sequentially, where each tree tries to correct the errors made by the previous trees. The `min_samples_leaf` parameter constrains how each tree grows: a candidate split is only considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches.
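To see this constraint in action, you can inspect the leaves of a fitted tree directly. The snippet below is a minimal sketch that checks the `n_node_samples` counts of every leaf against the threshold; it assumes the default `subsample=1.0`, so node counts refer to the full training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)
gb = GradientBoostingClassifier(min_samples_leaf=20, random_state=42)
gb.fit(X, y)

# Each boosting stage is a DecisionTreeRegressor; leaves are the nodes
# with no children (children_left == -1)
tree = gb.estimators_[0, 0].tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print(leaf_sizes.min())  # >= 20; every leaf respects min_samples_leaf
```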
Smaller values of `min_samples_leaf` allow the model to capture more fine-grained patterns, potentially leading to overfitting. Larger values create simpler, more generalizable models by requiring a higher number of samples in each leaf, potentially leading to underfitting.
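One way to observe this tradeoff directly is to compare train and test accuracy at the two extremes. The sketch below (the values 1 and 50 are illustrative, not recommendations) shows how a shrinking train-test gap signals reduced overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for leaf in (1, 50):
    gb = GradientBoostingClassifier(min_samples_leaf=leaf, random_state=42)
    gb.fit(X_train, y_train)
    # A large gap between train and test accuracy suggests overfitting
    print(f"min_samples_leaf={leaf}: "
          f"train={gb.score(X_train, y_train):.3f}, "
          f"test={gb.score(X_test, y_test):.3f}")
```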
The default value for `min_samples_leaf` is 1. In practice, integer values typically range from 1 to 100, depending on the size and complexity of the dataset. The parameter also accepts a float in (0.0, 1.0), which is interpreted as a fraction of the training samples: the minimum per leaf becomes `ceil(min_samples_leaf * n_samples)`.
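A fractional setting scales with the dataset rather than being an absolute count. This short sketch shows two roughly equivalent ways to require 10 samples per leaf on a 1000-sample training set:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Absolute count: at least 10 samples per leaf
gb_int = GradientBoostingClassifier(min_samples_leaf=10)

# Fraction: at least ceil(0.01 * n_samples) samples per leaf,
# i.e. 10 samples when training on 1000 samples
gb_frac = GradientBoostingClassifier(min_samples_leaf=0.01)
```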
The following example trains `GradientBoostingClassifier` models with several `min_samples_leaf` values on a synthetic multiclass dataset and compares their test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 20, 50]
accuracies = []

for min_samples in min_samples_leaf_values:
    gb = GradientBoostingClassifier(min_samples_leaf=min_samples,
                                    random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_leaf={min_samples}, Accuracy: {accuracy:.3f}")
```
The output of running this example would look like:

```
min_samples_leaf=1, Accuracy: 0.785
min_samples_leaf=5, Accuracy: 0.780
min_samples_leaf=20, Accuracy: 0.805
min_samples_leaf=50, Accuracy: 0.820
```
The key steps in this example are:

- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `min_samples_leaf` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `min_samples_leaf`:

- Smaller values (e.g., 1) allow the model to create complex decision boundaries, but may overfit
- Larger values (e.g., 50) create simpler, more generalizable models, but may underfit
- The optimal value depends on the dataset; experiment with different values and use cross-validation to find the best setting, as sketched after this list
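A minimal cross-validation sketch for this, assuming the `X_train` and `y_train` variables from the example above:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Compare candidate values by mean cross-validated accuracy
for leaf in [1, 5, 20, 50]:
    gb = GradientBoostingClassifier(min_samples_leaf=leaf, random_state=42)
    scores = cross_val_score(gb, X_train, y_train, cv=5)
    print(f"min_samples_leaf={leaf}: CV accuracy {scores.mean():.3f} "
          f"(+/- {scores.std():.3f})")
```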
Issues to consider:

- `min_samples_leaf` interacts with other parameters like `learning_rate` and `n_estimators`, so tune them together (see the sketch after this list)
- Higher values of `min_samples_leaf` are more computationally efficient, as they result in smaller trees
- There is no universally optimal value; the best setting depends on the specific dataset and problem
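A minimal joint-tuning sketch using `GridSearchCV` (the grid values are illustrative, not recommendations; it again assumes `X_train` and `y_train` from the example above):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Tune min_samples_leaf together with the parameters it interacts with
param_grid = {
    "min_samples_leaf": [1, 20, 50],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```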