The min_impurity_decrease parameter in scikit-learn’s GradientBoostingClassifier controls the minimum decrease in impurity required to split an internal node.
Gradient Boosting is an ensemble learning method that sequentially adds decision trees to correct the errors made by the previous trees. Within each of those trees, a node is split only if the split induces an impurity decrease greater than or equal to min_impurity_decrease.
Setting a higher value for min_impurity_decrease results in a more conservative model that only splits nodes when there is a significant decrease in impurity, which can help prevent overfitting.
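Under the hood, scikit-learn compares min_impurity_decrease against a weighted impurity decrease, N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), where N is the total number of samples, N_t the number at the current node, and N_t_L and N_t_R the numbers in the left and right children. The toy numbers below are hypothetical and use Gini-style impurities purely to illustrate the arithmetic (the regression trees inside gradient boosting actually use a squared-error criterion on residuals, but the weighting is the same):

# Hypothetical split: a node holding all 8 training samples, perfectly mixed,
# splits into two pure children of 4 samples each.
N = 8                                  # total number of samples
N_t, N_t_L, N_t_R = 8, 4, 4            # samples at the node and its children
impurity = 0.5                         # impurity of the parent node
left_impurity = right_impurity = 0.0   # both children are pure

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
print(decrease)  # 0.5 -- the split is made only if this is >= min_impurity_decrease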
The default value for min_impurity_decrease is 0.0, meaning that any split producing even a very small decrease in impurity is allowed. In practice, values between 0.0 and 0.5 are commonly tried, with the right choice depending on the noise level and complexity of the dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.1, 0.3, 0.5]
accuracies = []

for min_impurity_decrease in min_impurity_decrease_values:
    gbc = GradientBoostingClassifier(min_impurity_decrease=min_impurity_decrease, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_impurity_decrease={min_impurity_decrease}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
min_impurity_decrease=0.0, Accuracy: 0.900
min_impurity_decrease=0.1, Accuracy: 0.905
min_impurity_decrease=0.3, Accuracy: 0.895
min_impurity_decrease=0.5, Accuracy: 0.905
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different min_impurity_decrease values
- Evaluate the accuracy of each model on the test set
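To see the "more conservative model" behavior directly rather than through accuracy alone, a fitted GradientBoostingClassifier exposes its underlying trees via the estimators_ attribute, and each tree reports its size through tree_.node_count. A minimal sketch, reusing X_train and y_train from the example above (exact counts will vary, but they should typically shrink as min_impurity_decrease grows):

for value in [0.0, 0.1, 0.3, 0.5]:
    gbc = GradientBoostingClassifier(min_impurity_decrease=value, random_state=42)
    gbc.fit(X_train, y_train)
    # estimators_ is an array of fitted regression trees; binary
    # classification uses a single column, so ravel() covers both cases
    total_nodes = sum(tree.tree_.node_count for tree in gbc.estimators_.ravel())
    print(f"min_impurity_decrease={value}, total tree nodes: {total_nodes}")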
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default value of 0.0 and increase it if the model seems to be overfitting (one way to check is sketched after this list)
- Higher values will lead to a more conservative model that is less likely to overfit
- The optimal value depends on the noise level and complexity of the dataset
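One way to check for the overfitting mentioned in the first tip is to compare train and test accuracy as min_impurity_decrease varies; a large gap at 0.0 that narrows at higher values suggests the parameter is providing useful regularization. A minimal sketch, reusing the split from the main example:

for value in [0.0, 0.1, 0.3, 0.5]:
    gbc = GradientBoostingClassifier(min_impurity_decrease=value, random_state=42)
    gbc.fit(X_train, y_train)
    train_acc = gbc.score(X_train, y_train)  # accuracy on the training set
    test_acc = gbc.score(X_test, y_test)     # accuracy on the held-out test set
    print(f"min_impurity_decrease={value}, train={train_acc:.3f}, "
          f"test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")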
Issues to consider:
- Setting min_impurity_decrease too high can cause the model to underfit
- The effect of min_impurity_decrease interacts with other parameters like max_depth and learning_rate
- It may be necessary to tune min_impurity_decrease in conjunction with other parameters to find the optimal configuration, as in the grid-search sketch below
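A minimal sketch of such joint tuning with scikit-learn’s GridSearchCV, reusing X_train and y_train from the main example (the grid values are illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "min_impurity_decrease": [0.0, 0.1, 0.3],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,                  # 5-fold cross-validation on the training set
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.3f}")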