The DecisionTreeClassifier is a non-parametric supervised learning algorithm used for classification tasks. It learns decision rules from features to predict the target class.
The criterion parameter determines the function used to measure the quality of a split at each node of the tree. It influences how the tree is built and can impact the model’s performance.
The default value for criterion is "gini", which refers to the Gini impurity. An alternative option is "entropy", which uses information gain as the splitting criterion.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criteria = ['gini', 'entropy']
accuracies = []
for criterion in criteria:
    dt = DecisionTreeClassifier(criterion=criterion, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion={criterion}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
criterion=gini, Accuracy: 0.785
criterion=entropy, Accuracy: 0.775
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train DecisionTreeClassifier models with "gini" and "entropy" criteria
- Evaluate the accuracy of each model on the test set
Tips and heuristics for choosing between "gini" and "entropy":
- "gini" is generally faster to compute and often has similar performance to "entropy" (see the timing sketch after this list)
- "entropy" may perform better on datasets with complex interactions between features
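As a rough illustration of the speed difference, the following sketch times model fitting with each criterion (it reuses X_train, y_train, and the imports from the example above; actual timings will vary with dataset size and hardware):

import time

for criterion in ['gini', 'entropy']:
    start = time.perf_counter()
    DecisionTreeClassifier(criterion=criterion, random_state=42).fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"criterion={criterion}, fit time: {elapsed:.4f} seconds")

Because "entropy" evaluates logarithms at each candidate split, it tends to be slightly slower than the Gini impurity, which only squares class probabilities.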
Issues to consider:
- The choice of criterion may have a smaller impact on performance compared to other parameters like max_depth or min_samples_split (these can be tuned together, as in the grid-search sketch after this list)
- The optimal criterion can depend on the specific characteristics of the dataset and problem at hand
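One practical way to handle this is to treat criterion as just another hyperparameter and tune it jointly with max_depth and min_samples_split. The sketch below does this with GridSearchCV (the parameter ranges are arbitrary placeholders, and it reuses X_train and y_train from the example above):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation accuracy: {grid.best_score_:.3f}")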