The min_impurity_decrease parameter in scikit-learn’s DecisionTreeClassifier controls when a node is split, based on the decrease in impurity that the split would achieve.
Decision trees recursively split nodes based on feature values to create homogeneous subsets. The splits are chosen to maximize the decrease in impurity, which is measured by Gini impurity or entropy.
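Concretely, scikit-learn splits a node only if the weighted impurity decrease is at least min_impurity_decrease, where the decrease is weighted by the fraction of samples reaching the node. A minimal sketch of that computation, following the formula given in the scikit-learn documentation (the function name here is illustrative):

# Weighted impurity decrease, as defined in the scikit-learn docs:
# N_t / N * (impurity - N_t_R / N_t * right_impurity
#                     - N_t_L / N_t * left_impurity)
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

# A node holding 200 of 1000 samples (Gini 0.5), split into two
# children of 100 samples each with Gini 0.3 and 0.1
print(weighted_impurity_decrease(1000, 200, 100, 100, 0.5, 0.3, 0.1))
# prints 0.06; this split is made only if min_impurity_decrease <= 0.06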
A higher value of min_impurity_decrease requires a larger decrease in impurity for a split to occur, leading to smaller trees. This can help prevent overfitting by avoiding splits that only marginally reduce impurity.
The default value for min_impurity_decrease is 0.0, allowing nodes to be split as long as there is any decrease in impurity. In practice, values between 0.0 and 0.5 are commonly explored, depending on the complexity of the dataset and the desired tree size; effective values usually sit near the low end of that range, because the decrease is weighted by the fraction of samples reaching the node.
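To see why 0.5 is the upper limit for a binary problem, note that Gini impurity peaks at 0.5 for a perfectly balanced node, so no split can reduce impurity by more than that:

# Gini impurity of a binary node with positive-class proportion p
def gini(p):
    return 1.0 - (p**2 + (1.0 - p)**2)

print(gini(0.5))  # 0.5, the maximum possible for two classes
print(gini(0.9))  # 0.18, purer nodes leave less impurity to remove

The complete example below trains trees with several thresholds on a synthetic dataset and compares accuracy and tree size.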
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.1, 0.2, 0.5]
train_accuracies = []
test_accuracies = []

for min_imp_dec in min_impurity_decrease_values:
    dt = DecisionTreeClassifier(min_impurity_decrease=min_imp_dec, random_state=42)
    dt.fit(X_train, y_train)

    # Evaluate on both splits to expose overfitting
    train_pred = dt.predict(X_train)
    test_pred = dt.predict(X_test)
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

    # Report accuracy alongside tree size
    print(f"min_impurity_decrease={min_imp_dec}")
    print(f"Train Accuracy: {train_acc:.3f}, Test Accuracy: {test_acc:.3f}")
    print(f"Tree Depth: {dt.get_depth()}, Number of Leaves: {dt.get_n_leaves()}")
    print()
Running the example gives an output like:
min_impurity_decrease=0.0
Train Accuracy: 1.000, Test Accuracy: 0.885
Tree Depth: 11, Number of Leaves: 59

min_impurity_decrease=0.1
Train Accuracy: 0.828, Test Accuracy: 0.805
Tree Depth: 1, Number of Leaves: 2

min_impurity_decrease=0.2
Train Accuracy: 0.828, Test Accuracy: 0.805
Tree Depth: 1, Number of Leaves: 2

min_impurity_decrease=0.5
Train Accuracy: 0.500, Test Accuracy: 0.500
Tree Depth: 0, Number of Leaves: 1
With the default of 0.0, the tree grows to 59 leaves and overfits, scoring perfectly on the training set but lower on the test set. At 0.1, only the root split clears the threshold, producing a depth-1 stump; at 0.5, no split does, leaving a single leaf. The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train decision trees with different min_impurity_decrease thresholds
- Evaluate training and test accuracy to show the impact on overfitting and tree size
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default of 0.0 and increase if the tree shows signs of overfitting
- Values explored in practice typically fall between 0.0 and 0.5, though effective thresholds are often much smaller
- Higher values produce smaller trees that are less prone to overfitting noisy data
- Setting the value too high can lead to underfitting by preventing useful splits
Issues to consider when tuning min_impurity_decrease:
- The optimal value depends on the specific dataset and problem
- It should be tuned in conjunction with other tree parameters like max_depth (see the grid-search sketch after this list)
- Very small impurity decreases may not be meaningful, so a small non-zero value is often best
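As a practical illustration of joint tuning, the sketch below searches min_impurity_decrease together with max_depth by cross-validation, reusing X_train and y_train from the example above. The grid values are illustrative, not recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search small thresholds jointly with tree depth
param_grid = {
    "min_impurity_decrease": [0.0, 0.001, 0.01, 0.05],
    "max_depth": [None, 3, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")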