
Configure RandomForestClassifier "criterion" Parameter

The criterion parameter in scikit-learn’s RandomForestClassifier determines the impurity measure used to split nodes when building the decision trees in the forest.

The two main options for this parameter are "gini" for Gini impurity and "entropy" for information gain (recent scikit-learn versions also accept "log_loss", which is equivalent to "entropy"). Gini impurity measures the probability of misclassifying a randomly chosen element if it were labeled randomly according to the class distribution at the node. Information gain measures the decrease in entropy after splitting a node on an attribute.
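To make the two measures concrete, here is a small sketch that computes Gini impurity and entropy by hand for a node's class-probability vector (the helper names `gini_impurity` and `entropy` are our own, not scikit-learn functions):

```python
import numpy as np

def gini_impurity(p):
    """Gini impurity: 1 - sum(p_k^2) over class probabilities p_k."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (base 2): -sum(p_k * log2(p_k))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # skip zero probabilities to avoid log2(0)
    return -np.sum(p * np.log2(p))

# A pure node (all samples in one class) has zero impurity under both
print(gini_impurity([1.0, 0.0]))  # 0.0
# A maximally mixed binary node is the worst case for both
print(gini_impurity([0.5, 0.5]))  # 0.5
print(entropy([0.5, 0.5]))        # 1.0
```

Both measures are zero for pure nodes and largest for evenly mixed nodes, which is why they usually rank candidate splits similarly.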

The default value for criterion is “gini”.
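The default can be confirmed directly by inspecting an unconfigured estimator's parameters:

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the estimator's current hyperparameter settings
clf = RandomForestClassifier()
print(clf.get_params()['criterion'])  # gini
```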

In practice, there is often little difference between the two criteria in terms of model performance. Gini impurity is slightly faster to compute, while entropy may create trees that are slightly shorter and easier to interpret.
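The speed difference can be checked with a rough timing comparison; absolute times are machine-dependent, so the sketch below (on an assumed synthetic dataset) is illustrative only:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset large enough for timing differences to show up
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=42)

fit_times = {}
for criterion in ['gini', 'entropy']:
    start = time.perf_counter()
    RandomForestClassifier(criterion=criterion, random_state=42).fit(X, y)
    fit_times[criterion] = time.perf_counter() - start
    print(f"criterion={criterion}: fit in {fit_times[criterion]:.2f}s")
```

Entropy requires a logarithm per candidate split where Gini needs only squares, which accounts for the typically small gap in training time.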

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different criterion values
criteria = ['gini', 'entropy']
accuracies = []

for criterion in criteria:
    rf = RandomForestClassifier(criterion=criterion, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion={criterion}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

criterion=gini, Accuracy: 0.920
criterion=entropy, Accuracy: 0.920

The key steps in this example are:

  1. Generate a synthetic binary classification dataset
  2. Split the data into train and test sets
  3. Train RandomForestClassifier models with both “gini” and “entropy” criteria
  4. Evaluate the accuracy of each model on the test set
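Rather than comparing criteria on a single train/test split, criterion can also be treated as a hyperparameter and compared with cross-validation; a minimal sketch using GridSearchCV on the same kind of synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# 5-fold cross-validated comparison of the two criteria
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'criterion': ['gini', 'entropy']},
    cv=5,
    scoring='accuracy',
)
grid.fit(X, y)
print(grid.best_params_)
print(f"best CV accuracy: {grid.best_score_:.3f}")
```

Cross-validation averages out split-specific noise, which matters here because the gap between the two criteria is usually within that noise.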

Some tips and heuristics for choosing the criterion:

  - Start with the default "gini"; it is slightly cheaper to compute and usually matches entropy's accuracy.
  - If you want to compare the two, treat criterion as a hyperparameter and evaluate both with cross-validation rather than a single train/test split.
  - Differences between criteria tend to matter less in a forest than in a single decision tree, since averaging over many trees smooths out individual split choices.

Issues to consider:

  - Entropy involves a logarithm per candidate split, so training can be marginally slower on large datasets.
  - The criterion only affects how individual trees choose splits, not how the forest aggregates predictions, so its impact on overall performance is usually minor.
