The `min_samples_leaf` parameter in scikit-learn's `RandomForestClassifier` controls the minimum number of samples required at a leaf node in each decision tree. This parameter affects the complexity and generalization ability of the individual trees in the ensemble.
Random Forest is an ensemble learning method that trains multiple decision trees on different subsets of the data and features. The final prediction is made by aggregating the predictions of all the trees, which helps to reduce overfitting and improve generalization performance.
The `min_samples_leaf` parameter determines the minimum number of training samples that must end up in each leaf: a split point is only considered if it leaves at least `min_samples_leaf` samples in both the left and right branches. A smaller value allows more complex, deeper trees, while a larger value limits how far trees can grow and forces more samples into each leaf.
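To see this effect directly, here is a minimal sketch (not part of the original example; the dataset and parameter values are illustrative) that fits a single `DecisionTreeClassifier` with different leaf sizes and reports the resulting tree size:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data chosen only for illustration
X, y = make_classification(n_samples=500, random_state=0)

for leaf in (1, 5, 20):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X, y)
    # Larger leaf sizes produce fewer leaves and shallower trees
    print(f"min_samples_leaf={leaf}: {tree.get_n_leaves()} leaves, "
          f"depth {tree.get_depth()}")
```

The same constraint applies independently to every tree in a random forest.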
The default value for `min_samples_leaf` is 1, meaning that leaves may contain as few as a single sample; splitting continues until leaves are pure or another stopping criterion applies. In practice, common values for this parameter include 1, 5, 10, and 20, depending on the size and complexity of the dataset.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20]
accuracies = []
for min_samples in min_samples_leaf_values:
    rf = RandomForestClassifier(min_samples_leaf=min_samples, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_leaf={min_samples}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
min_samples_leaf=1, Accuracy: 0.855
min_samples_leaf=5, Accuracy: 0.840
min_samples_leaf=10, Accuracy: 0.810
min_samples_leaf=20, Accuracy: 0.805
```
The key steps in this example are:
- Generate a synthetic multiclass classification dataset with informative features
- Split the data into train and test sets
- Train `RandomForestClassifier` models with different `min_samples_leaf` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `min_samples_leaf`:
- Start with the default value of 1 and increase it to reduce overfitting
- Smaller values allow for more complex trees, while larger values constrain tree depth
- Consider the size of the dataset when setting this parameter, as larger datasets can afford more complex trees
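One way to make the setting scale with dataset size: scikit-learn also accepts a float for `min_samples_leaf`, interpreted as a fraction of the training samples (`ceil(fraction * n_samples)`). A minimal sketch, with illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# A float is treated as ceil(min_samples_leaf * n_samples),
# so 0.01 here requires at least 10 samples per leaf
rf = RandomForestClassifier(min_samples_leaf=0.01, random_state=42)
rf.fit(X, y)
print(f"Training accuracy: {rf.score(X, y):.3f}")
```

This keeps the leaf-size constraint proportional as the training set grows, rather than fixed at an absolute count.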
Issues to consider:
- Setting `min_samples_leaf` too small can lead to overfitting, while too large a value may cause underfitting
- Large values may produce overly simple trees that fail to capture important patterns in the data
- The optimal value depends on the specific dataset and problem at hand, so it’s important to tune this parameter for each case
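Tuning for a specific dataset can be done with cross-validation; here is a hedged sketch using `GridSearchCV` (the grid and fold count are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Search candidate leaf sizes with 5-fold cross-validation
param_grid = {"min_samples_leaf": [1, 5, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```

`search.best_estimator_` is then a forest refit on the full data with the selected value.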