
Configure HistGradientBoostingClassifier "min_samples_leaf" Parameter

The min_samples_leaf parameter in scikit-learn’s HistGradientBoostingClassifier controls the minimum number of samples required to be at a leaf node.

HistGradientBoostingClassifier is a histogram-based gradient boosting algorithm that offers faster training on large datasets compared to traditional gradient boosting methods. It builds an ensemble of decision trees sequentially, where each tree corrects the errors of the previous ones.

The min_samples_leaf parameter affects the complexity of the individual trees in the ensemble. A smaller value allows for more complex trees, potentially leading to overfitting, while a larger value results in simpler trees, which may underfit.

The default value for min_samples_leaf is 20. In practice, values between 1 and 50 are commonly used, depending on the dataset size and the desired trade-off between bias and variance.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import f1_score

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 20, 50, 100]
f1_scores = []

for msl in min_samples_leaf_values:
    hgbc = HistGradientBoostingClassifier(min_samples_leaf=msl, random_state=42)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='weighted')
    f1_scores.append(f1)
    print(f"min_samples_leaf={msl}, F1-score: {f1:.3f}")

Running the example gives an output like:

min_samples_leaf=1, F1-score: 0.915
min_samples_leaf=5, F1-score: 0.914
min_samples_leaf=20, F1-score: 0.912
min_samples_leaf=50, F1-score: 0.914
min_samples_leaf=100, F1-score: 0.917

The key steps in this example are:

  1. Generate a synthetic multi-class classification dataset
  2. Split the data into train and test sets
  3. Train HistGradientBoostingClassifier models with different min_samples_leaf values
  4. Evaluate the F1-score of each model on the test set

Some tips and heuristics for setting min_samples_leaf:

  1. Start with the default of 20 and adjust based on validation performance.
  2. Use smaller values (e.g. 1 to 10) on small datasets, where a large minimum leaf size can quickly lead to underfitting.
  3. Use larger values (e.g. 50 or more) on large or noisy datasets to regularize the individual trees.
  4. Tune min_samples_leaf together with max_leaf_nodes, max_depth, and learning_rate, since they jointly control model complexity.

Issues to consider:

  1. The appropriate value scales with dataset size: 20 samples per leaf is restrictive on 1,000 rows but barely constrains a model trained on 1,000,000.
  2. Very small values allow deep, complex trees that can overfit and slow down training; very large values can underfit.
  3. Evaluate candidate values with cross-validation rather than a single train/test split, since small score differences (as in the output above) may be noise.


