Configure DecisionTreeClassifier "min_samples_leaf" Parameter

The min_samples_leaf parameter in scikit-learn’s DecisionTreeClassifier controls the minimum number of samples required to be at a leaf node.

DecisionTreeClassifier is a non-parametric supervised learning algorithm that learns decision rules from feature values to predict a target class. The min_samples_leaf parameter regularizes the tree: a split is only considered if it leaves at least min_samples_leaf training samples in each of the two resulting branches.

Smaller values allow the tree to capture more fine-grained patterns, potentially leading to overfitting. Larger values rule out more candidate splits, which produces a smaller tree with coarser decision regions and can lead to underfitting.

The default value for min_samples_leaf is 1, allowing leaves with just one sample. In practice, values between 1 and 20 are common, depending on the dataset’s size and complexity.
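Besides an integer count, min_samples_leaf also accepts a float, which scikit-learn interprets as a fraction of the training samples (rounded up). The sketch below is a minimal illustration of that behavior; the fraction 0.02 is an arbitrary choice for demonstration, not a recommended setting, and the leaf sizes are read from the fitted tree's tree_ attribute.

```python
import math

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Same kind of synthetic data as the main example in this article
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Illustrative assumption: each leaf must hold at least 2% of the samples
fraction = 0.02
expected_min = math.ceil(fraction * X.shape[0])

dt = DecisionTreeClassifier(min_samples_leaf=fraction, random_state=42)
dt.fit(X, y)

# Leaves are the nodes without children (children_left == -1)
tree = dt.tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
smallest_leaf = int(leaf_sizes.min())

print(f"Required minimum: {expected_min}, smallest leaf: {smallest_leaf}")
```

A fractional value can be convenient when the same configuration is reused on datasets of different sizes, since the minimum leaf size scales with the number of training samples.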

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20]
accuracies = []

for min_samples in min_samples_leaf_values:
    dt = DecisionTreeClassifier(min_samples_leaf=min_samples, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_leaf={min_samples}, Accuracy: {accuracy:.3f}")

Running the example produces output similar to the following:

min_samples_leaf=1, Accuracy: 0.740
min_samples_leaf=5, Accuracy: 0.740
min_samples_leaf=10, Accuracy: 0.755
min_samples_leaf=20, Accuracy: 0.765

The key steps in this example are:

Generate a synthetic multi-class classification dataset
Split the data into train and test sets
Train decision tree models with different min_samples_leaf values
Evaluate the accuracy of each model on the test set
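To see how the parameter changes the tree itself, and not only its accuracy, one option is to inspect the depth and leaf count of each fitted model. The following is a minimal sketch that reuses the same dataset and split as the example above; the sizes dictionary is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset and split as the main example
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Record (depth, number of leaves) for each setting
sizes = {}
for min_samples in [1, 5, 10, 20]:
    dt = DecisionTreeClassifier(min_samples_leaf=min_samples, random_state=42)
    dt.fit(X_train, y_train)
    sizes[min_samples] = (dt.get_depth(), dt.get_n_leaves())
    print(f"min_samples_leaf={min_samples}, depth={sizes[min_samples][0]}, "
          f"leaves={sizes[min_samples][1]}")
```

Because every leaf must contain at least min_samples_leaf training samples, a tree fitted on 800 samples with min_samples_leaf=20 can have at most 40 leaves, while the default setting lets the tree keep splitting until its leaves are pure.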

Tips and heuristics for tuning min_samples_leaf:

Start with the default value of 1 and increase it to create smaller trees and reduce overfitting
Typical values range from 1 to 20, but the optimal value depends on the dataset
Increasing min_samples_leaf reduces model complexity, which can improve generalization up to the point where the tree starts to underfit
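Rather than picking a value by hand, the tips above can be combined with cross-validation. The sketch below uses GridSearchCV over a small set of candidate values; the grid, the 5-fold setting, and the accuracy scoring are assumptions chosen for illustration, not values prescribed by this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset and split as the main example
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values (illustrative); evaluated with 5-fold cross-validation
param_grid = {"min_samples_leaf": [1, 2, 5, 10, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_leaf = search.best_params_["min_samples_leaf"]
print(f"Best min_samples_leaf: {best_leaf}")
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")

# By default the best model is refit on the full training set and can be scored directly
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

Selecting the value on cross-validation folds keeps the test set untouched until the final evaluation, which gives a less optimistic estimate than choosing the value that scores best on the test set.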

Issues to consider when setting min_samples_leaf:

Setting the value too high can lead to underfitting and poor performance
Setting it too low allows the tree to overfit the training data
The best value strikes a balance and depends on the dataset’s size and complexity
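One way to check which side of this balance a given setting falls on is to compare training and test accuracy. The sketch below does this for three values; 200 is a deliberately extreme value chosen here to illustrate underfitting, and the results list is a name introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset and split as the main example
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each entry is (min_samples_leaf, train accuracy, test accuracy)
results = []
for min_samples in [1, 20, 200]:
    dt = DecisionTreeClassifier(min_samples_leaf=min_samples, random_state=42)
    dt.fit(X_train, y_train)
    train_acc = dt.score(X_train, y_train)
    test_acc = dt.score(X_test, y_test)
    results.append((min_samples, train_acc, test_acc))
    print(f"min_samples_leaf={min_samples}, train={train_acc:.3f}, "
          f"test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

A large gap between training and test accuracy suggests overfitting, whereas low accuracy on both sets suggests the leaves are too large for the tree to represent the class boundaries.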
