
Configure RandomForestClassifier "min_samples_leaf" Parameter

The min_samples_leaf parameter in scikit-learn’s RandomForestClassifier controls the minimum number of samples required to be at a leaf node in each decision tree. This parameter affects the complexity and generalization ability of the individual trees in the ensemble.

Random Forest is an ensemble learning method that trains multiple decision trees, each on a bootstrap sample of the training data and with a random subset of features considered at each split. The final prediction is made by aggregating the predictions of all the trees, which helps to reduce overfitting and improve generalization performance.
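
To make the aggregation step concrete, the short sketch below (dataset size and tree count are arbitrary choices for illustration) averages the class-probability predictions of the individual trees exposed through estimators_ and checks that the result matches the forest's own predict output:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Small illustrative dataset and forest (sizes chosen arbitrarily)
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Average the per-tree class probabilities, then take the most probable class.
# For classification, this soft-voting average mirrors how the forest aggregates its trees.
per_tree_proba = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
manual_pred = rf.classes_[per_tree_proba.mean(axis=0).argmax(axis=1)]

print((manual_pred == rf.predict(X)).all())  # expected: True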

The min_samples_leaf parameter sets the minimum number of training samples that must end up in each leaf: a split point is only considered if it leaves at least this many samples in both the left and right branches. A smaller value allows deeper, more complex trees, while a larger value limits tree growth and forces more samples into each leaf.
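
One way to see this constraint in action is to inspect the structure of the fitted trees. The sketch below (using the same kind of synthetic data as the example that follows) reports the depth and leaf count of the first tree in the forest for several min_samples_leaf values; both typically shrink as the constraint tightens:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Inspect the first tree of each forest to see how the leaf constraint limits growth
for leaf in [1, 5, 10, 20]:
    rf = RandomForestClassifier(min_samples_leaf=leaf, random_state=42).fit(X, y)
    tree = rf.estimators_[0]
    print(f"min_samples_leaf={leaf}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")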

The default value for min_samples_leaf is 1, meaning any split that leaves at least one sample in each branch is allowed, so trees can grow until their leaves are pure. In practice, common values for this parameter include 1, 5, 10, and 20, depending on the size and complexity of the dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20]
accuracies = []

for min_samples in min_samples_leaf_values:
    rf = RandomForestClassifier(min_samples_leaf=min_samples, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_leaf={min_samples}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

min_samples_leaf=1, Accuracy: 0.855
min_samples_leaf=5, Accuracy: 0.840
min_samples_leaf=10, Accuracy: 0.810
min_samples_leaf=20, Accuracy: 0.805

The key steps in this example are:

  1. Generate a synthetic multiclass classification dataset with informative features
  2. Split the data into train and test sets
  3. Train RandomForestClassifier models with different min_samples_leaf values
  4. Evaluate the accuracy of each model on the test set

Some tips and heuristics for setting min_samples_leaf:

  * Start with the default of 1 and increase it if the forest overfits (training accuracy far above test accuracy).
  * Larger values act as a regularizer: they smooth each tree's predictions and often help on small or noisy datasets.
  * Tune the value with cross-validation, ideally together with related parameters such as max_depth and min_samples_split; see the tuning sketch below.

Issues to consider:

  * Setting min_samples_leaf too high can underfit, as the gradually declining test accuracy in the example above suggests.
  * The right value depends on dataset size: a leaf size of 20 is a mild constraint on millions of samples but a strong one on a few hundred.
  * min_samples_leaf also accepts a float, which is interpreted as a fraction of the training samples.

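As a minimal tuning sketch (the grid of candidate values below is an arbitrary choice), min_samples_leaf can be selected with GridSearchCV and 5-fold cross-validation on the same synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Cross-validated search over candidate min_samples_leaf values
param_grid = {"min_samples_leaf": [1, 2, 5, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")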
