
Configure ExtraTreesClassifier "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s ExtraTreesClassifier determines the minimum number of samples required to split an internal node.
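According to the scikit-learn documentation, the parameter can be given either as an integer count or as a float fraction of the training samples. The sketch below illustrates how the two forms relate. The function `effective_min_samples_split` is a hypothetical helper written for this illustration, not part of scikit-learn, and it assumes a float is converted with `ceil(fraction * n_samples)` and floored at 2.

```python
from math import ceil

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def effective_min_samples_split(value, n_samples):
    """Hypothetical helper: convert a min_samples_split setting to a sample count.

    Assumes an int is used as-is and a float is treated as a fraction of
    n_samples, rounded up, with a floor of 2.
    """
    if isinstance(value, int):
        return value
    return max(2, ceil(value * n_samples))


X, y = make_classification(n_samples=800, n_features=20, random_state=42)

for setting in (2, 10, 0.25, 0.001):
    threshold = effective_min_samples_split(setting, n_samples=len(X))
    # Both the integer and the float form are accepted by the estimator
    etc = ExtraTreesClassifier(n_estimators=10, min_samples_split=setting, random_state=42)
    etc.fit(X, y)
    print(f"min_samples_split={setting!r}: a node needs at least {threshold} samples to be split")
```

The float form can be convenient when the same configuration is reused on datasets of different sizes, since the threshold then scales with the number of training samples.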

ExtraTreesClassifier is an ensemble method that builds multiple randomized decision trees and uses averaging to improve predictive accuracy and control over-fitting. The min_samples_split parameter affects the growth of individual trees in the ensemble.

This parameter is one of the controls on the trade-off between underfitting and overfitting. Lower values let nodes keep splitting on small groups of samples, which produces larger, more complex trees that may overfit. Higher values stop splitting earlier, which produces smaller, shallower trees that may underfit.
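One way to see this effect is to inspect the fitted trees directly. The following is a minimal sketch, assuming a synthetic dataset and an arbitrary set of candidate values chosen for illustration; it reports the average node count and depth of the trees next to train and test accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not a real benchmark)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

mean_nodes = {}
for mss in (2, 10, 50, 200):
    etc = ExtraTreesClassifier(n_estimators=50, min_samples_split=mss, random_state=42)
    etc.fit(X_train, y_train)
    # Average size and depth of the individual trees in the ensemble
    mean_nodes[mss] = np.mean([tree.tree_.node_count for tree in etc.estimators_])
    mean_depth = np.mean([tree.get_depth() for tree in etc.estimators_])
    print(f"min_samples_split={mss:>3}: "
          f"nodes={mean_nodes[mss]:.0f}, depth={mean_depth:.1f}, "
          f"train acc={etc.score(X_train, y_train):.3f}, "
          f"test acc={etc.score(X_test, y_test):.3f}")
```

A large gap between train and test accuracy at small values would suggest overfitting, while low accuracy on both at large values would suggest underfitting.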

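The default value for min_samples_split is 2, the smallest valid integer setting. With the other defaults left unchanged (for example max_depth=None), trees keep splitting until their leaves are pure or too small to split, so they can become very detailed.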

In practice, values between 2 and 20 are commonly used, depending on the dataset size and complexity.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []

for mss in min_samples_split_values:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=mss, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={mss}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

min_samples_split=2, Accuracy: 0.925
min_samples_split=5, Accuracy: 0.920
min_samples_split=10, Accuracy: 0.890
min_samples_split=20, Accuracy: 0.910

The key steps in this example are:

  1. Generate a synthetic binary classification dataset with informative and redundant features
  2. Split the data into train and test sets
  3. Train ExtraTreesClassifier models with different min_samples_split values
  4. Evaluate the accuracy of each model on the test set
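With a test set of 200 samples, each sample accounts for 0.005 of accuracy, so small differences between settings on a single split may be noise. The steps above can be extended with cross-validation on the training data. The following is a sketch using GridSearchCV, assuming 5 folds and the same candidate values as the example; these choices are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values taken from the example above
param_grid = {"min_samples_split": [2, 5, 10, 20]}
search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
# Search on the training data only, so the test set stays held out
search.fit(X_train, y_train)

results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(f"min_samples_split={params['min_samples_split']:>2}: "
          f"CV accuracy {mean:.3f} +/- {std:.3f}")

print("Best setting:", search.best_params_)
# The best estimator is refit on all training data by default
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

Comparing the standard deviation across folds with the gaps between mean scores gives a rough sense of whether one setting is actually better than another.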

Some tips and heuristics for setting min_samples_split:

Issues to consider:


