The min_impurity_decrease parameter in scikit-learn’s DecisionTreeClassifier controls when a node is split, based on the decrease in impurity that the split would achieve.
Decision trees recursively split nodes based on feature values to create homogeneous subsets. The splits are chosen to maximize the decrease in impurity, which is measured by Gini impurity or entropy.
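Concretely, scikit-learn splits a node only if the weighted impurity decrease is at least min_impurity_decrease, where the decrease is weighted by the fraction of samples reaching the node. A minimal sketch of that computation, following the formula given in the scikit-learn documentation (the function name here is illustrative):

# Weighted impurity decrease, as defined in the scikit-learn docs:
# N_t / N * (impurity - N_t_R / N_t * right_impurity
#                     - N_t_L / N_t * left_impurity)
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

# A node holding 200 of 1000 samples (Gini 0.5), split into two
# children of 100 samples each with Gini 0.3 and 0.1
print(weighted_impurity_decrease(1000, 200, 100, 100, 0.5, 0.3, 0.1))
# prints 0.06; this split is made only if min_impurity_decrease <= 0.06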
A higher value of min_impurity_decrease requires a larger decrease in impurity for a split to occur, leading to smaller trees. This can help prevent overfitting by avoiding splits that only marginally reduce impurity.
The default value for min_impurity_decrease is 0.0, allowing nodes to be split as long as there is any decrease in impurity. In practice, values between 0.0 and 0.5 are commonly explored, depending on the complexity of the dataset and the desired tree size; effective values usually sit near the low end of that range, because the decrease is weighted by the fraction of samples reaching the node.
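To see why 0.5 is the upper limit for a binary problem, note that Gini impurity peaks at 0.5 for a perfectly balanced node, so no split can reduce impurity by more than that:

# Gini impurity of a binary node with positive-class proportion p
def gini(p):
    return 1.0 - (p**2 + (1.0 - p)**2)

print(gini(0.5))  # 0.5, the maximum possible for two classes
print(gini(0.9))  # 0.18, purer nodes leave less impurity to remove

The complete example below trains trees with several thresholds on a synthetic dataset and compares accuracy and tree size.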
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.1, 0.2, 0.5]
train_accuracies = []
test_accuracies = []

for min_imp_dec in min_impurity_decrease_values:
    dt = DecisionTreeClassifier(min_impurity_decrease=min_imp_dec, random_state=42)
    dt.fit(X_train, y_train)

    # Evaluate on both splits to expose overfitting
    train_pred = dt.predict(X_train)
    test_pred = dt.predict(X_test)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

    # Report accuracy alongside tree size
    print(f"min_impurity_decrease={min_imp_dec}")
    print(f"Train Accuracy: {train_acc:.3f}, Test Accuracy: {test_acc:.3f}")
    print(f"Tree Depth: {dt.get_depth()}, Number of Leaves: {dt.get_n_leaves()}")
    print()
Running the example gives an output like:
min_impurity_decrease=0.0
Train Accuracy: 1.000, Test Accuracy: 0.885
Tree Depth: 11, Number of Leaves: 59

min_impurity_decrease=0.1
Train Accuracy: 0.828, Test Accuracy: 0.805
Tree Depth: 1, Number of Leaves: 2

min_impurity_decrease=0.2
Train Accuracy: 0.828, Test Accuracy: 0.805
Tree Depth: 1, Number of Leaves: 2

min_impurity_decrease=0.5
Train Accuracy: 0.500, Test Accuracy: 0.500
Tree Depth: 0, Number of Leaves: 1
With the default of 0.0, the tree grows to 59 leaves and overfits, scoring perfectly on the training set but lower on the test set. At 0.1, only the root split clears the threshold, producing a depth-1 stump; at 0.5, no split does, leaving a single leaf. The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train decision trees with different min_impurity_decrease thresholds
- Evaluate training and test accuracy to show the impact on overfitting and tree size
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default of 0.0 and increase if the tree shows signs of overfitting
- Values explored in practice typically fall between 0.0 and 0.5, though effective thresholds are often much smaller
- Higher values produce smaller trees that are less prone to overfitting noisy data
- Setting the value too high can lead to underfitting by preventing useful splits
Issues to consider when tuning min_impurity_decrease:
- The optimal value depends on the specific dataset and problem
- It should be tuned in conjunction with other tree parameters like max_depth (see the grid-search sketch after this list)
- Very small impurity decreases may not be meaningful, so a small non-zero value is often best
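As a practical illustration of joint tuning, the sketch below searches min_impurity_decrease together with max_depth by cross-validation, reusing X_train and y_train from the example above. The grid values are illustrative, not recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search small thresholds jointly with tree depth
param_grid = {
    "min_impurity_decrease": [0.0, 0.001, 0.01, 0.05],
    "max_depth": [None, 3, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")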