The DecisionTreeClassifier is a non-parametric supervised learning algorithm used for classification tasks. It learns decision rules from features to predict the target class.
The criterion parameter determines the function used to measure the quality of a split at each node of the tree. It influences how the tree is built and can impact the model’s performance.
The default value for criterion is "gini", which refers to the Gini impurity. An alternative option is "entropy", which uses information gain as the splitting criterion.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criteria = ['gini', 'entropy']
accuracies = []
for criterion in criteria:
    dt = DecisionTreeClassifier(criterion=criterion, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion={criterion}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
criterion=gini, Accuracy: 0.785
criterion=entropy, Accuracy: 0.775
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train DecisionTreeClassifier models with "gini" and "entropy" criteria
- Evaluate the accuracy of each model on the test set
Tips and heuristics for choosing between "gini" and "entropy":
- "gini" is generally faster to compute and often has similar performance to "entropy" (see the timing sketch after this list)
- "entropy" may perform better on datasets with complex interactions between features
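As a rough illustration of the speed difference, the following sketch times model fitting with each criterion (it reuses X_train, y_train, and the imports from the example above; actual timings will vary with dataset size and hardware):

import time

for criterion in ['gini', 'entropy']:
    start = time.perf_counter()
    DecisionTreeClassifier(criterion=criterion, random_state=42).fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"criterion={criterion}, fit time: {elapsed:.4f} seconds")

Because "entropy" evaluates logarithms at each candidate split, it tends to be slightly slower than the Gini impurity, which only squares class probabilities.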
Issues to consider:
- The choice of criterion may have a smaller impact on performance compared to other parameters like max_depth or min_samples_split (these can be tuned together, as in the grid-search sketch after this list)
- The optimal criterion can depend on the specific characteristics of the dataset and problem at hand
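One practical way to handle this is to treat criterion as just another hyperparameter and tune it jointly with max_depth and min_samples_split. The sketch below does this with GridSearchCV (the parameter ranges are arbitrary placeholders, and it reuses X_train and y_train from the example above):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation accuracy: {grid.best_score_:.3f}")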