The min_samples_split parameter in scikit-learn's RandomForestClassifier controls the minimum number of samples required to split an internal node in the decision trees. This parameter affects the growth of the trees in the ensemble. A higher value of min_samples_split requires more samples at a node to consider splitting further, leading to shallower trees. Conversely, a lower value allows the trees to grow deeper by considering splits for nodes with fewer samples.

The default value of min_samples_split is 2. In practice, values between 2 and 20 are commonly used, depending on the size and complexity of the dataset.
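To make the depth effect concrete, here is a minimal sketch (an illustration, not part of the main example below) that trains two small forests on synthetic data and compares the average depth of their trees; exact depths will vary, but a larger min_samples_split should yield shallower trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

for min_split in [2, 50]:
    rf = RandomForestClassifier(min_samples_split=min_split, random_state=42)
    rf.fit(X, y)
    # Average depth of the individual decision trees in the forest
    mean_depth = np.mean([tree.get_depth() for tree in rf.estimators_])
    print(f"min_samples_split={min_split}, mean tree depth: {mean_depth:.1f}")

The complete example below goes a step further and evaluates test-set accuracy for several min_samples_split values.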
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []
for min_split in min_samples_split_values:
    rf = RandomForestClassifier(min_samples_split=min_split, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={min_split}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
min_samples_split=2, Accuracy: 0.860
min_samples_split=5, Accuracy: 0.840
min_samples_split=10, Accuracy: 0.845
min_samples_split=20, Accuracy: 0.845
The key steps in this example are:
- Generate a synthetic multiclass classification dataset with informative and noise features
- Split the data into train and test sets
- Train RandomForestClassifier models with different min_samples_split values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting min_samples_split:

- Start with the default value of 2 and increase it if the model appears to be overfitting
- Increasing min_samples_split can reduce overfitting by requiring more samples to make a split, but setting it too high may lead to underfitting
- Adjust the value based on the model's performance on a validation set or using cross-validation (a minimal sketch follows this list)
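As a sketch of the cross-validation approach, min_samples_split can be tuned with GridSearchCV (this reuses the X_train and y_train arrays from the example above; the candidate values are just the ones tried earlier, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search over candidate min_samples_split values with 5-fold cross-validation
param_grid = {"min_samples_split": [2, 5, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)  # X_train, y_train from the example above
print(f"Best min_samples_split: {grid.best_params_['min_samples_split']}, "
      f"CV accuracy: {grid.best_score_:.3f}")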
Issues to consider:

- The optimal value of min_samples_split depends on the size and complexity of the dataset
- Setting min_samples_split too low can result in overfitting by growing deep, complex trees
- Setting min_samples_split too high can cause underfitting by creating shallow, simplistic trees (see the sketch after this list)
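One rough way to check where a given dataset sits between these extremes (a sketch reusing the train/test split from the example above, not a definitive diagnostic) is to compare train and test accuracy across a range of values: a large gap between the two suggests overfitting, while low accuracy on both sets suggests underfitting.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Compare train vs. test accuracy for a range of min_samples_split values
for min_split in [2, 10, 50, 200]:
    rf = RandomForestClassifier(min_samples_split=min_split, random_state=42)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"min_samples_split={min_split}, train: {train_acc:.3f}, test: {test_acc:.3f}")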