The min_impurity_decrease parameter in scikit-learn's GradientBoostingClassifier controls the minimum reduction in impurity required to split an internal node.
Gradient Boosting is an ensemble learning method that sequentially adds decision trees, each one correcting the errors made by the trees before it. Within each tree, a candidate split is only accepted if it reduces the node's impurity by at least min_impurity_decrease, where the decrease is weighted by the fraction of training samples reaching the node.
Setting a higher value for min_impurity_decrease will result in a more conservative model that only splits nodes when there is a significant decrease in impurity. This can help to prevent overfitting.
The default value for min_impurity_decrease is 0.0, meaning any split that reduces impurity at all is permitted.
In practice, values between 0.0 and 0.5 are commonly explored, with the best choice depending on the noise level and complexity of the dataset.
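To make the threshold concrete, here is a minimal sketch of the weighted impurity decrease that scikit-learn compares against min_impurity_decrease (the formula is from the scikit-learn documentation; the helper names gini and weighted_impurity_decrease are hypothetical, not part of the library's API):
import numpy as np

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_impurity_decrease(y_parent, y_left, y_right, n_total):
    # scikit-learn's documented formula:
    # N_t / N * (impurity - N_t_R / N_t * right_impurity
    #                     - N_t_L / N_t * left_impurity)
    n_t = len(y_parent)
    return (n_t / n_total) * (
        gini(y_parent)
        - (len(y_right) / n_t) * gini(y_right)
        - (len(y_left) / n_t) * gini(y_left)
    )

# A hypothetical node holding 10 of 100 training samples
parent = np.array([0] * 5 + [1] * 5)
left, right = parent[:4], parent[4:]  # one candidate split
print(weighted_impurity_decrease(parent, left, right, n_total=100))  # ~0.033
The split above would be accepted with the default min_impurity_decrease=0.0 but rejected with, say, 0.1. The full example below trains GradientBoostingClassifier with several min_impurity_decrease values and compares test accuracy: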
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.1, 0.3, 0.5]
accuracies = []
for min_impurity_decrease in min_impurity_decrease_values:
    gbc = GradientBoostingClassifier(min_impurity_decrease=min_impurity_decrease, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_impurity_decrease={min_impurity_decrease}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
min_impurity_decrease=0.0, Accuracy: 0.900
min_impurity_decrease=0.1, Accuracy: 0.905
min_impurity_decrease=0.3, Accuracy: 0.895
min_impurity_decrease=0.5, Accuracy: 0.905
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different min_impurity_decrease values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default value of 0.0 and increase it if the model seems to be overfitting (see the sketch after this list)
- Higher values will lead to a more conservative model that is less likely to overfit
- The optimal value depends on the noise level and complexity of the dataset
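A quick way to check for overfitting is to compare train and test accuracy: a large gap suggests the model is memorizing the training data, and raising min_impurity_decrease should narrow it. A minimal sketch, reusing the train/test split from the example above:
for value in [0.0, 0.1, 0.3]:
    gbc = GradientBoostingClassifier(min_impurity_decrease=value, random_state=42)
    gbc.fit(X_train, y_train)
    # A train accuracy far above test accuracy indicates overfitting
    train_acc = gbc.score(X_train, y_train)
    test_acc = gbc.score(X_test, y_test)
    print(f"min_impurity_decrease={value}: train={train_acc:.3f}, test={test_acc:.3f}")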
Issues to consider:
- Setting min_impurity_decrease too high can cause the model to underfit
- The effect of min_impurity_decrease interacts with other parameters like max_depth and learning_rate
- It may be necessary to tune min_impurity_decrease in conjunction with other parameters to find the optimal configuration (a sketch follows this list)
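To illustrate joint tuning, here is a minimal sketch using GridSearchCV; the grid values are illustrative assumptions, not recommendations:
from sklearn.model_selection import GridSearchCV

# Illustrative grid; value ranges are assumptions, not recommendations
param_grid = {
    "min_impurity_decrease": [0.0, 0.01, 0.1],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)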