The min_impurity_decrease parameter in scikit-learn's GradientBoostingClassifier controls the minimum reduction in impurity required to split an internal node.
Gradient Boosting is an ensemble learning method that sequentially adds decision trees, each one correcting the errors made by the trees before it. Within each tree, a candidate split is only accepted if it reduces the node's impurity by at least min_impurity_decrease, where the decrease is weighted by the fraction of training samples reaching the node.
Setting a higher value for min_impurity_decrease will result in a more conservative model that only splits nodes when there is a significant decrease in impurity. This can help to prevent overfitting.
The default value for min_impurity_decrease is 0.0, meaning any split that reduces impurity at all is permitted.
In practice, values between 0.0 and 0.5 are commonly explored, with the best choice depending on the noise level and complexity of the dataset.
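To make the threshold concrete, here is a minimal sketch of the weighted impurity decrease that scikit-learn compares against min_impurity_decrease (the formula is from the scikit-learn documentation; the helper names gini and weighted_impurity_decrease are hypothetical, not part of the library's API):
import numpy as np

def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_impurity_decrease(y_parent, y_left, y_right, n_total):
    # scikit-learn's documented formula:
    # N_t / N * (impurity - N_t_R / N_t * right_impurity
    #                     - N_t_L / N_t * left_impurity)
    n_t = len(y_parent)
    return (n_t / n_total) * (
        gini(y_parent)
        - (len(y_right) / n_t) * gini(y_right)
        - (len(y_left) / n_t) * gini(y_left)
    )

# A hypothetical node holding 10 of 100 training samples
parent = np.array([0] * 5 + [1] * 5)
left, right = parent[:4], parent[4:]  # one candidate split
print(weighted_impurity_decrease(parent, left, right, n_total=100))  # ~0.033
The split above would be accepted with the default min_impurity_decrease=0.0 but rejected with, say, 0.1. The full example below trains GradientBoostingClassifier with several min_impurity_decrease values and compares test accuracy: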
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.1, 0.3, 0.5]
accuracies = []
for min_impurity_decrease in min_impurity_decrease_values:
    gbc = GradientBoostingClassifier(min_impurity_decrease=min_impurity_decrease, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_impurity_decrease={min_impurity_decrease}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
min_impurity_decrease=0.0, Accuracy: 0.900
min_impurity_decrease=0.1, Accuracy: 0.905
min_impurity_decrease=0.3, Accuracy: 0.895
min_impurity_decrease=0.5, Accuracy: 0.905
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different min_impurity_decrease values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default value of 0.0 and increase it if the model seems to be overfitting (see the sketch after this list)
- Higher values will lead to a more conservative model that is less likely to overfit
- The optimal value depends on the noise level and complexity of the dataset
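A quick way to check for overfitting is to compare train and test accuracy: a large gap suggests the model is memorizing the training data, and raising min_impurity_decrease should narrow it. A minimal sketch, reusing the train/test split from the example above:
for value in [0.0, 0.1, 0.3]:
    gbc = GradientBoostingClassifier(min_impurity_decrease=value, random_state=42)
    gbc.fit(X_train, y_train)
    # A train accuracy far above test accuracy indicates overfitting
    train_acc = gbc.score(X_train, y_train)
    test_acc = gbc.score(X_test, y_test)
    print(f"min_impurity_decrease={value}: train={train_acc:.3f}, test={test_acc:.3f}")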
Issues to consider:
- Setting min_impurity_decrease too high can cause the model to underfit
- The effect of min_impurity_decrease interacts with other parameters like max_depth and learning_rate
- It may be necessary to tune min_impurity_decrease in conjunction with other parameters to find the optimal configuration (a sketch follows this list)
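To illustrate joint tuning, here is a minimal sketch using GridSearchCV; the grid values are illustrative assumptions, not recommendations:
from sklearn.model_selection import GridSearchCV

# Illustrative grid; value ranges are assumptions, not recommendations
param_grid = {
    "min_impurity_decrease": [0.0, 0.01, 0.1],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)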