The `criterion` parameter in scikit-learn's `ExtraTreesClassifier` determines the function used to measure the quality of a split.
`ExtraTreesClassifier` is an ensemble method that fits a number of randomized decision trees (extra-trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
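As a point of reference, the minimal sketch below just instantiates the classifier with its defaults. Note that, unlike `RandomForestClassifier`, extra-trees trains each tree on the full training set by default (`bootstrap=False`) and injects randomness through randomly drawn split thresholds instead:

```python
from sklearn.ensemble import ExtraTreesClassifier

# Minimal sketch: extra-trees draws candidate split thresholds at random
# for each feature, rather than searching for the locally optimal cut.
clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

# By default each tree is trained on the full dataset, not a bootstrap sample.
print(clf.get_params()["bootstrap"])  # False
```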
The `criterion` parameter affects how the algorithm decides on the best split at each node. It influences the tree structure and, consequently, the model's performance and generalization ability.
The default value for `criterion` is “gini”. The main alternative is “entropy”, which uses information gain instead of the Gini impurity (recent scikit-learn versions also accept “log_loss” as an equivalent alias for “entropy”).
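To make the difference concrete, both measures can be computed by hand from the class proportions at a node. This small sketch (with made-up proportions) evaluates the two formulas directly:

```python
import numpy as np

# Hypothetical class proportions at a node with three classes.
p = np.array([0.5, 0.3, 0.2])

gini = 1.0 - np.sum(p ** 2)        # Gini impurity: 1 - sum(p_k^2)
entropy = -np.sum(p * np.log2(p))  # Entropy: -sum(p_k * log2(p_k))

print(f"Gini impurity: {gini:.3f}")  # 0.620
print(f"Entropy: {entropy:.3f}")     # 1.485
```

Both measures are zero for a pure node and largest when the classes are evenly mixed; they mostly disagree only on how strongly they penalize intermediate mixtures.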
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic multi-class dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train and evaluate a model for each criterion value
criterion_values = ['gini', 'entropy']
accuracies = []
for criterion in criterion_values:
    etc = ExtraTreesClassifier(n_estimators=100, criterion=criterion,
                               random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion={criterion}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
criterion=gini, Accuracy: 0.845
criterion=entropy, Accuracy: 0.870
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `ExtraTreesClassifier` models with different `criterion` values
- Evaluate the accuracy of each model on the test set
Tips and heuristics for setting `criterion`:

- “gini” is often the default choice due to its simplicity and efficiency
- “entropy” might perform better on some datasets, especially with multi-class problems
- Experiment with both criteria and choose the one that yields better performance (a cross-validated comparison is sketched after this list)
- Consider the computational cost, as “entropy” is generally slower to compute
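Since a single train/test split can be noisy, one way to compare the two criteria more reliably is cross-validation. The sketch below is illustrative only and reuses the `X`, `y` arrays generated above:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Compare criteria with 5-fold cross-validation instead of a single split.
for criterion in ['gini', 'entropy']:
    etc = ExtraTreesClassifier(n_estimators=100, criterion=criterion,
                               random_state=42)
    scores = cross_val_score(etc, X, y, cv=5)
    print(f"criterion={criterion}: "
          f"mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```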
Issues to consider:

- The choice between “gini” and “entropy” often leads to similar results
- The impact of `criterion` may vary depending on other hyperparameters and dataset characteristics
- In practice, the difference in performance between criteria is usually small
- Other parameters like `n_estimators` or `max_depth` often have a more significant impact on model performance (see the grid-search sketch after this list)
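To put `criterion` in context with those more influential parameters, a small grid search can tune them jointly. This is a sketch that again reuses `X` and `y` from above; the grid values are arbitrary choices:

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Tune criterion alongside the usually more impactful parameters.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Inspecting `search.cv_results_` will typically show far more score variation across `n_estimators` and `max_depth` than across `criterion`.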