The min_samples_leaf parameter in scikit-learn’s ExtraTreesClassifier controls the minimum number of samples required to be at a leaf node.
ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees and averages their predictions to improve accuracy and control over-fitting. Unlike a random forest, split thresholds are drawn at random rather than searched for exhaustively, and by default each tree is trained on the whole dataset rather than a bootstrap sample.
The min_samples_leaf parameter sets a lower bound on the number of samples in a leaf: a split point is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This can have a significant impact on the model’s ability to capture the underlying patterns in the data without overfitting.
By default, min_samples_leaf is set to 1, which allows leaves with only one sample. In practice, values between 1 and 10 are commonly used, depending on the dataset size and complexity. The example below compares several min_samples_leaf values on a synthetic binary classification problem:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_leaf values
leaf_samples = [1, 5, 10, 20]
f1_scores = []
for samples in leaf_samples:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=samples, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    f1_scores.append(f1)
    print(f"min_samples_leaf={samples}, F1-score: {f1:.3f}")
Running the example gives an output like:
min_samples_leaf=1, F1-score: 0.925
min_samples_leaf=5, F1-score: 0.896
min_samples_leaf=10, F1-score: 0.890
min_samples_leaf=20, F1-score: 0.868
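The drop in F1 as min_samples_leaf grows goes hand in hand with smaller trees. As a quick check (a minimal sketch that reuses X_train and y_train from the example above, not part of the original script), the average number of leaves per tree can be read off the fitted ensemble with get_n_leaves():
import numpy as np
# Inspect how min_samples_leaf shrinks the individual trees
for samples in [1, 5, 10, 20]:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=samples, random_state=42)
    etc.fit(X_train, y_train)
    # estimators_ holds the fitted per-tree ExtraTreeClassifier objects
    mean_leaves = np.mean([tree.get_n_leaves() for tree in etc.estimators_])
    print(f"min_samples_leaf={samples}, mean leaves per tree: {mean_leaves:.1f}")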
The key steps in this example are:
- Generate a synthetic classification dataset with informative and noisy features
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different min_samples_leaf values
- Evaluate the F1-score of each model on the test set
Some tips and heuristics for setting min_samples_leaf:
- Start with the default value of 1 and increase it if the model is overfitting
- For smaller datasets, using larger values can help prevent overfitting
- For larger datasets, smaller values may be appropriate to capture fine-grained patterns; passing a fraction rather than an absolute count (see the sketch after this list) keeps the threshold proportional to dataset size
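min_samples_leaf also accepts a float, which scikit-learn interprets as a fraction of the training samples (the effective minimum is ceil(fraction * n_samples)). A minimal sketch, reusing the train/test split and imports from the example above, with an arbitrary illustrative value of 1%:
# min_samples_leaf=0.01 means each leaf must hold at least 1% of the training samples
etc_frac = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=0.01, random_state=42)
etc_frac.fit(X_train, y_train)
print(f"min_samples_leaf=0.01, F1-score: {f1_score(y_test, etc_frac.predict(X_test)):.3f}")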
Issues to consider:
- Increasing min_samples_leaf reduces model complexity but may lead to underfitting
- The optimal value depends on the noise level and complexity of your data
- Consider using cross-validation to find the best value for your specific dataset, as sketched below
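For instance, a minimal cross-validation sketch using GridSearchCV, reusing X_train and y_train from the example above (the parameter grid is illustrative, not a recommendation):
from sklearn.model_selection import GridSearchCV
# Search a small grid of min_samples_leaf values with 5-fold cross-validation
param_grid = {"min_samples_leaf": [1, 2, 5, 10, 20]}
search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
print(f"Best min_samples_leaf: {search.best_params_['min_samples_leaf']}")
print(f"Best cross-validated F1-score: {search.best_score_:.3f}")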