The `criterion` parameter in scikit-learn's `ExtraTreesClassifier` determines the function used to measure the quality of a split.
`ExtraTreesClassifier` is an ensemble method that fits a number of randomized decision trees (extra-trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
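As a point of reference, the minimal sketch below just instantiates the classifier with its defaults. Note that, unlike `RandomForestClassifier`, extra-trees trains each tree on the full training set by default (`bootstrap=False`) and injects randomness through randomly drawn split thresholds instead:

```python
from sklearn.ensemble import ExtraTreesClassifier

# Minimal sketch: extra-trees draws candidate split thresholds at random
# for each feature, rather than searching for the locally optimal cut.
clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

# By default each tree is trained on the full dataset, not a bootstrap sample.
print(clf.get_params()["bootstrap"])  # False
```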
The `criterion` parameter affects how the algorithm decides on the best split at each node. It influences the tree structure and, consequently, the model's performance and generalization ability.
The default value for `criterion` is “gini”. The main alternative is “entropy”, which uses information gain instead of the Gini impurity (recent scikit-learn versions also accept “log_loss” as an equivalent alias for “entropy”).
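To make the difference concrete, both measures can be computed by hand from the class proportions at a node. This small sketch (with made-up proportions) evaluates the two formulas directly:

```python
import numpy as np

# Hypothetical class proportions at a node with three classes.
p = np.array([0.5, 0.3, 0.2])

gini = 1.0 - np.sum(p ** 2)        # Gini impurity: 1 - sum(p_k^2)
entropy = -np.sum(p * np.log2(p))  # Entropy: -sum(p_k * log2(p_k))

print(f"Gini impurity: {gini:.3f}")  # 0.620
print(f"Entropy: {entropy:.3f}")     # 1.485
```

Both measures are zero for a pure node and largest when the classes are evenly mixed; they mostly disagree only on how strongly they penalize intermediate mixtures.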
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic multi-class dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train and evaluate a model for each criterion value
criterion_values = ['gini', 'entropy']
accuracies = []
for criterion in criterion_values:
    etc = ExtraTreesClassifier(n_estimators=100, criterion=criterion,
                               random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion={criterion}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
criterion=gini, Accuracy: 0.845
criterion=entropy, Accuracy: 0.870
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `ExtraTreesClassifier` models with different `criterion` values
- Evaluate the accuracy of each model on the test set
Tips and heuristics for setting `criterion`:

- “gini” is often the default choice due to its simplicity and efficiency
- “entropy” might perform better on some datasets, especially with multi-class problems
- Experiment with both criteria and choose the one that yields better performance (a cross-validated comparison is sketched after this list)
- Consider the computational cost, as “entropy” is generally slower to compute
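Since a single train/test split can be noisy, one way to compare the two criteria more reliably is cross-validation. The sketch below is illustrative only and reuses the `X`, `y` arrays generated above:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Compare criteria with 5-fold cross-validation instead of a single split.
for criterion in ['gini', 'entropy']:
    etc = ExtraTreesClassifier(n_estimators=100, criterion=criterion,
                               random_state=42)
    scores = cross_val_score(etc, X, y, cv=5)
    print(f"criterion={criterion}: "
          f"mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```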
Issues to consider:

- The choice between “gini” and “entropy” often leads to similar results
- The impact of `criterion` may vary depending on other hyperparameters and dataset characteristics
- In practice, the difference in performance between criteria is usually small
- Other parameters like `n_estimators` or `max_depth` often have a more significant impact on model performance (see the grid-search sketch after this list)
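To put `criterion` in context with those more influential parameters, a small grid search can tune them jointly. This is a sketch that again reuses `X` and `y` from above; the grid values are arbitrary choices:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Tune criterion alongside the usually more impactful parameters.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Inspecting `search.cv_results_` will typically show far more score variation across `n_estimators` and `max_depth` than across `criterion`.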