SKLearner

Configure ExtraTreesClassifier "min_impurity_decrease" Parameter

The min_impurity_decrease parameter in scikit-learn’s ExtraTreesClassifier controls the threshold for node splitting based on the decrease in impurity.

ExtraTreesClassifier is an ensemble method that builds multiple decision trees, by default on the whole training set (bootstrap=False) with random feature subsets. It differs from Random Forest in how it selects split points: thresholds are drawn at random rather than optimized, which increases randomness and often improves generalization.
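The two ensembles can be compared side by side. This is an illustrative sketch on a synthetic dataset; the exact scores depend on the data and are not a general benchmark:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=0)

# Same settings for both ensembles; only the split strategy differs
for cls in (RandomForestClassifier, ExtraTreesClassifier):
    clf = cls(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{cls.__name__}: mean accuracy = {scores.mean():.3f}")
```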

The min_impurity_decrease parameter sets a minimum threshold on the weighted impurity decrease required to split a node: a node is split only if the split induces a decrease greater than or equal to this value, effectively pre-pruning the tree as it grows.
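Per the scikit-learn documentation, the quantity compared against the threshold is N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), where N is the total number of samples and N_t, N_t_L, N_t_R are the sample counts at the node and its children. A minimal sketch of that formula (the helper names here are ours, not part of the library):

```python
import numpy as np

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_impurity_decrease(y_parent, y_left, y_right, n_total):
    # Mirrors the documented formula:
    #   N_t / N * (impurity - N_t_R / N_t * right_impurity
    #                       - N_t_L / N_t * left_impurity)
    n_t = len(y_parent)
    return (n_t / n_total) * (
        gini(y_parent)
        - len(y_right) / n_t * gini(y_right)
        - len(y_left) / n_t * gini(y_left)
    )

# A perfectly separating split of a balanced root node
y_parent = np.array([0, 0, 1, 1])
decrease = weighted_impurity_decrease(y_parent, y_parent[:2], y_parent[2:], n_total=4)
print(decrease)  # 0.5: Gini drops from 0.5 to 0 in both children
```

Any min_impurity_decrease value above 0.5 would block this split; anything at or below it lets the split happen.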

The default value for min_impurity_decrease is 0.0, which means no early stopping based on impurity decrease. In practice, small positive values (e.g., 1e-7 to 1e-3) are often used to control tree growth and prevent overfitting.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_values = [0.0, 1e-5, 1e-4, 1e-3, 1e-2]
accuracies = []

for value in min_impurity_values:
    etc = ExtraTreesClassifier(n_estimators=100, min_impurity_decrease=value, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_impurity_decrease={value}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

min_impurity_decrease=0.0, Accuracy: 0.925
min_impurity_decrease=1e-05, Accuracy: 0.915
min_impurity_decrease=0.0001, Accuracy: 0.920
min_impurity_decrease=0.001, Accuracy: 0.905
min_impurity_decrease=0.01, Accuracy: 0.800
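Rather than looping manually, the same sweep can be run with cross-validation. A sketch using GridSearchCV; the grid values and fold count are our choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Same synthetic dataset as the main example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Cross-validated search over candidate thresholds
param_grid = {"min_impurity_decrease": [0.0, 1e-5, 1e-4, 1e-3, 1e-2]}
search = GridSearchCV(ExtraTreesClassifier(n_estimators=100, random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(f"Best min_impurity_decrease: {search.best_params_['min_impurity_decrease']}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```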

The key steps in this example are:

  1. Generate a synthetic binary classification dataset with informative, redundant, and noise features
  2. Split the data into train and test sets
  3. Train ExtraTreesClassifier models with different min_impurity_decrease values
  4. Evaluate the accuracy of each model on the test set
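The pruning effect behind these accuracy numbers can be observed directly by comparing average tree sizes. A sketch reusing the same synthetic data (tree_.node_count is the fitted tree's total node count):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Larger thresholds should yield smaller trees
for value in [0.0, 1e-3, 1e-2]:
    etc = ExtraTreesClassifier(n_estimators=100, min_impurity_decrease=value,
                               random_state=42).fit(X, y)
    avg_nodes = np.mean([est.tree_.node_count for est in etc.estimators_])
    print(f"min_impurity_decrease={value}: average nodes per tree = {avg_nodes:.0f}")
```

The average node count shrinks as the threshold rises, which is the pruning that eventually costs accuracy at 0.01.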

Some tips and heuristics for setting min_impurity_decrease:

  - Start with the default of 0.0 and raise it only if the model overfits
  - Search candidate values on a log scale (e.g. 1e-5 to 1e-2) rather than linearly
  - Tune it with cross-validation, ideally alongside n_estimators and max_features
  - Prefer smaller values for large, clean datasets and larger values for small or noisy ones

Issues to consider:

  - Values that are too large cause underfitting, as the accuracy drop at 0.01 in the example shows
  - The decrease is weighted by the fraction of samples reaching a node, so the same value behaves differently on datasets of different sizes
  - Other growth controls such as max_depth, min_samples_split, and min_samples_leaf address the same problem and interact with this parameter


See Also