The criterion parameter in scikit-learn’s RandomForestClassifier determines the impurity measure used to split nodes when building the decision trees in the forest.
There are two options for this parameter: “gini” for Gini impurity and “entropy” for information gain (recent scikit-learn versions also accept “log_loss”, which is equivalent to “entropy”). Gini impurity measures the probability of misclassifying a randomly chosen element if it were labeled randomly according to the class distribution at the node. Information gain measures the decrease in entropy after splitting a node on an attribute.
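To make the two measures concrete, here is a small sketch that computes both quantities for the class labels at a single node (the helper functions gini_impurity and entropy below are illustrative, not part of scikit-learn):

import numpy as np

def gini_impurity(labels):
    # Class proportions at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Gini impurity: 1 - sum_k p_k^2
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Class proportions at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Shannon entropy: -sum_k p_k * log2(p_k)
    return -np.sum(p * np.log2(p))

node_labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # 3 samples of class 0, 5 of class 1
print(f"Gini impurity: {gini_impurity(node_labels):.3f}")  # 0.469
print(f"Entropy: {entropy(node_labels):.3f}")              # 0.954 bits

In both cases, a candidate split is evaluated by how much it reduces the weighted average of the impurity across the resulting child nodes.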
The default value for criterion is “gini”.
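A quick way to confirm the default on your installed scikit-learn version:

from sklearn.ensemble import RandomForestClassifier

# criterion is not specified, so the default is used
rf = RandomForestClassifier()
print(rf.get_params()['criterion'])  # gini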
In practice, there is often little difference between the two criteria in terms of model performance. Gini impurity is slightly faster to compute, while entropy may create trees that are slightly shorter and easier to interpret.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate a model for each criterion value
criteria = ['gini', 'entropy']
accuracies = []
for criterion in criteria:
    rf = RandomForestClassifier(criterion=criterion, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion={criterion}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
criterion=gini, Accuracy: 0.920
criterion=entropy, Accuracy: 0.920
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train RandomForestClassifier models with both “gini” and “entropy” criteria
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for choosing the criterion:
- “gini” is slightly faster to compute and is a good default choice (see the timing sketch after this list)
- “entropy” may create slightly shorter and more interpretable trees
- In most cases, the difference in performance is negligible
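To get a feel for the speed difference on a given dataset, here is a minimal timing sketch (the dataset size is arbitrary and the measured times are only illustrative; on small datasets the gap is often negligible):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A somewhat larger dataset so the fit-time difference is visible
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           random_state=42)

for criterion in ['gini', 'entropy']:
    start = time.perf_counter()
    RandomForestClassifier(criterion=criterion, random_state=42).fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"criterion={criterion}, fit time: {elapsed:.2f} seconds")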
Issues to consider:
- The optimal choice of criterion may depend on the specific dataset and problem, so it is worth comparing both options with cross-validation (see the sketch after this list)
- It is not well established whether there are situations where one criterion reliably outperforms the other
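As a minimal sketch of comparing the two criteria on your own data, the snippet below scores each option with 5-fold cross-validation (the synthetic dataset here simply stands in for your problem):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Replace this synthetic dataset with your own X and y
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

for criterion in ['gini', 'entropy']:
    rf = RandomForestClassifier(criterion=criterion, random_state=42)
    scores = cross_val_score(rf, X, y, cv=5)
    print(f"criterion={criterion}, CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")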