The `min_samples_split` parameter in scikit-learn's `ExtraTreesClassifier` determines the minimum number of samples required to split an internal node.

ExtraTreesClassifier is an ensemble method that builds multiple randomized decision trees and uses averaging to improve predictive accuracy and control over-fitting. The `min_samples_split` parameter affects the growth of the individual trees in the ensemble.
This parameter controls the trade-off between underfitting and overfitting. Lower values allow for more complex trees, potentially leading to overfitting, while higher values result in shallower trees, which may underfit.
The default value for `min_samples_split` is 2, which allows the algorithm to grow very detailed trees.
In practice, values between 2 and 20 are commonly used, depending on the dataset size and complexity.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []
for mss in min_samples_split_values:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=mss, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={mss}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
min_samples_split=2, Accuracy: 0.925
min_samples_split=5, Accuracy: 0.920
min_samples_split=10, Accuracy: 0.890
min_samples_split=20, Accuracy: 0.910
```
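Note that `min_samples_split` also accepts a float in the interval (0.0, 1.0], which scikit-learn interprets as a fraction of the training samples: the effective minimum becomes `ceil(min_samples_split * n_samples)`. A minimal sketch on the same kind of synthetic data:

```python
from math import ceil

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A float is a fraction of the training samples:
# ceil(0.01 * 1000) = 10, so this behaves like min_samples_split=10 here
etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=0.01,
                           random_state=42)
etc.fit(X, y)
print(ceil(0.01 * X.shape[0]))  # effective minimum samples per split
```

The fractional form is convenient when the same configuration is reused across datasets of different sizes.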
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and redundant features
- Split the data into train and test sets
- Train `ExtraTreesClassifier` models with different `min_samples_split` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `min_samples_split`:
- Start with the default value of 2 and increase it if overfitting is observed
- Use cross-validation to find the optimal value for your specific dataset
- Consider the size of your dataset; larger datasets may benefit from higher values
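The cross-validation tip above can be sketched with `GridSearchCV`, reusing the same synthetic data as the main example (the candidate values in the grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Search candidate min_samples_split values with 5-fold cross-validation
param_grid = {"min_samples_split": [2, 5, 10, 20]}
search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```

Because the grid is one-dimensional and small, the search is cheap; for larger grids or slower models, `RandomizedSearchCV` is a common alternative.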
Issues to consider:
- Lower values lead to more complex trees, which may overfit but can capture intricate patterns
- Higher values result in simpler trees, which may underfit but are less prone to noise
- The optimal value depends on the noise level and complexity of your data
- Computational cost increases with lower values, since the trees grow deeper and contain more nodes
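To see the depth effect behind the last point, one rough check (on synthetic data, with illustrative parameter values) is to compare the average tree depth of two ensembles fit with different `min_samples_split` values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

avg_depth = {}
for mss in [2, 20]:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=mss,
                               random_state=42).fit(X, y)
    # Average tree depth across the 100 trees in the ensemble
    avg_depth[mss] = sum(t.get_depth() for t in etc.estimators_) / len(etc.estimators_)
    print(f"min_samples_split={mss}: average tree depth = {avg_depth[mss]:.1f}")
```

The ensemble fit with `min_samples_split=2` should report deeper trees on average, which is where the extra training and prediction cost comes from.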