
Configure DecisionTreeRegressor "min_impurity_decrease" Parameter

The min_impurity_decrease parameter in scikit-learn’s DecisionTreeRegressor is a pruning parameter that controls the complexity of the decision tree. It sets the minimum decrease in impurity required to make a split at a node.
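For reference, scikit-learn compares min_impurity_decrease against the weighted impurity decrease of a candidate split, which the documentation defines as N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), where N is the total number of samples, N_t the samples at the current node, and N_t_L / N_t_R the samples in the left and right children. The helper below is a small illustrative sketch of that formula; the function name and the example numbers are ours, not part of the library.

def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    # N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
    return (n_t / n) * (impurity
                        - (n_t_r / n_t) * right_impurity
                        - (n_t_l / n_t) * left_impurity)

# A node holding 100 of 1000 samples, split 60/40, with node variance (MSE impurity)
# of 50.0 dropping to 30.0 (left) and 20.0 (right); the split is made only if this
# value is at least min_impurity_decrease
print(weighted_impurity_decrease(1000, 100, 60, 40, 50.0, 30.0, 20.0))  # 2.4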

Decision Tree Regression is a non-parametric supervised learning method that infers decision rules from the data features to predict a target variable. The min_impurity_decrease parameter influences the tree’s structure and complexity.

A higher value of min_impurity_decrease results in smaller, more pruned trees, as it requires a larger decrease in impurity for a split to occur. This can help prevent overfitting.
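To see this pruning effect directly, the short sketch below compares tree size for a few thresholds on synthetic data; the exact node counts and depths will vary with the dataset.

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Larger thresholds should produce smaller, shallower trees
for value in [0.0, 1.0, 10.0]:
    dt = DecisionTreeRegressor(min_impurity_decrease=value, random_state=42).fit(X, y)
    print(f"min_impurity_decrease={value}: nodes={dt.tree_.node_count}, depth={dt.get_depth()}")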

The default value for min_impurity_decrease is 0.0, meaning that any split that decreases the impurity is allowed.

In practice, small values such as 0.0 to 1.0 are a common starting point, but the useful scale depends on the variance of the target (the impurity measure used for regression), so the value should be tuned for each dataset and the desired complexity of the model.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.01, 0.1, 1.0]
mse_scores = []

for min_impurity_decrease in min_impurity_decrease_values:
    dt = DecisionTreeRegressor(min_impurity_decrease=min_impurity_decrease, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_impurity_decrease={min_impurity_decrease}, MSE: {mse:.3f}")

Running the example gives an output like:

min_impurity_decrease=0.0, MSE: 6350.428
min_impurity_decrease=0.01, MSE: 6818.518
min_impurity_decrease=0.1, MSE: 6433.889
min_impurity_decrease=1.0, MSE: 6455.289

The key steps in this example are:

  1. Generate a synthetic regression dataset with some added noise
  2. Split the data into train and test sets
  3. Train DecisionTreeRegressor models with different min_impurity_decrease values
  4. Evaluate the mean squared error (MSE) of each model on the test set

Some tips and heuristics for setting min_impurity_decrease:

  * Start with the default of 0.0 and increase it gradually if the tree overfits; larger values prune more aggressively and yield smaller trees.
  * The useful scale depends on the impurity of the target: for regression the node impurity is the variance (MSE), so targets with large variance may need values well above 1.0 before pruning has any effect.
  * Tune the value with cross-validation rather than a single train/test split; a sketch of this follows below.

Issues to consider:

  * Setting the value too high can underfit, producing a tree too shallow to capture the signal.
  * The parameter interacts with other pruning controls such as max_depth, min_samples_split, min_samples_leaf, and ccp_alpha, so consider tuning them together.

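As a sketch of the cross-validation tuning mentioned above (the candidate grid values are illustrative, not recommendations):

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Search candidate thresholds with 5-fold cross-validation, scoring by negative MSE
param_grid = {"min_impurity_decrease": [0.0, 0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)
print(grid.best_params_)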

See Also