
Configure DecisionTreeClassifier "min_impurity_decrease" Parameter

The min_impurity_decrease parameter in scikit-learn's DecisionTreeClassifier sets the minimum decrease in impurity a candidate split must achieve before the node is split.

Decision trees recursively split nodes based on feature values to create homogeneous subsets. The splits are chosen to maximize the decrease in impurity, which is measured by Gini impurity or entropy.
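To make these impurity measures concrete, here is a minimal sketch (not part of the original example) that computes Gini impurity and entropy directly from a node's class labels:

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    """Shannon entropy: -sum(p_k * log2(p_k)) over class proportions p_k."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity; a 50/50 binary node is maximally impure
print(gini([0, 0, 0, 0]))      # 0.0
print(gini([0, 0, 1, 1]))      # 0.5
print(entropy([0, 0, 1, 1]))   # 1.0
```

A split is attractive when it moves the children's impurity toward zero relative to the parent's.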

A higher value of min_impurity_decrease requires a larger decrease in impurity for a split to occur, leading to smaller trees. This can help prevent overfitting by avoiding splits that only marginally reduce impurity.

The default value for min_impurity_decrease is 0.0, allowing nodes to be split as long as there is any decrease in impurity.

In practice, because the impurity decrease is weighted by the fraction of samples reaching the node, effective values are often small; grids spanning 0.0 up to around 0.5 cover the full range from no pruning to collapsing the tree to a single leaf, with the useful region depending on the dataset and the desired tree size.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.1, 0.2, 0.5]
train_accuracies = []
test_accuracies = []

for min_imp_dec in min_impurity_decrease_values:
    dt = DecisionTreeClassifier(min_impurity_decrease=min_imp_dec, random_state=42)
    dt.fit(X_train, y_train)

    train_pred = dt.predict(X_train)
    test_pred = dt.predict(X_test)

    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)

    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

    print(f"min_impurity_decrease={min_imp_dec}")
    print(f"Train Accuracy: {train_acc:.3f}, Test Accuracy: {test_acc:.3f}")
    print(f"Tree Depth: {dt.get_depth()}, Number of Leaves: {dt.get_n_leaves()}")
    print()

Running the example gives an output like:

min_impurity_decrease=0.0
Train Accuracy: 1.000, Test Accuracy: 0.885
Tree Depth: 11, Number of Leaves: 59

min_impurity_decrease=0.1
Train Accuracy: 0.828, Test Accuracy: 0.805
Tree Depth: 1, Number of Leaves: 2

min_impurity_decrease=0.2
Train Accuracy: 0.828, Test Accuracy: 0.805
Tree Depth: 1, Number of Leaves: 2

min_impurity_decrease=0.5
Train Accuracy: 0.500, Test Accuracy: 0.500
Tree Depth: 0, Number of Leaves: 1

The key steps in this example are:

  1. Generate a synthetic classification dataset with informative and redundant features
  2. Split the data into train and test sets
  3. Train decision trees with different min_impurity_decrease thresholds
  4. Evaluate training and test accuracy to show impact on overfitting and tree size

Some tips and heuristics for setting min_impurity_decrease:

  1. Start from the default of 0.0 and increase gradually; the effect is dataset-dependent
  2. Try small values first (e.g. 0.001 to 0.01), since the weighted decrease shrinks with node size
  3. Compare train and test accuracy: a large gap at 0.0 suggests raising the threshold
  4. Inspect get_depth() and get_n_leaves() to see how aggressively each value prunes

Issues to consider when tuning min_impurity_decrease:

  1. It interacts with other pruning controls such as max_depth, min_samples_leaf, and ccp_alpha
  2. Setting it too high collapses the tree to a single leaf and underfits, as the 0.5 run above shows
  3. Sample weights, if provided, enter the weighted impurity calculation and shift the effective threshold
