The min_impurity_decrease parameter in scikit-learn's GradientBoostingRegressor sets the minimum impurity decrease a candidate split must achieve before it is made. GradientBoostingRegressor is a powerful ensemble learning algorithm that builds an additive model in a forward stage-wise fashion, optimizing a loss function by fitting decision trees sequentially.
Raising min_impurity_decrease helps control the complexity of the model: the algorithm is prevented from making splits that would yield only a negligible reduction in impurity.
The default value for min_impurity_decrease is 0.0, which places no restriction on splitting. In practice, values between 0.0 and 0.1 are often cited as starting points, but the meaningful scale depends on the impurity of your targets: for regression trees the impurity is the variance of the target values at a node, so the useful range is dataset-dependent.
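Concretely, scikit-learn compares each candidate split against a weighted impurity decrease and makes the split only if this quantity is at least min_impurity_decrease. The sketch below mirrors the formula given in the scikit-learn documentation; the function name is just for illustration:

# Weighted impurity decrease, as defined in the scikit-learn docs:
# a candidate split is kept only if this value is >= min_impurity_decrease.
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    # N: total number of samples; N_t: samples at the current node;
    # N_t_L / N_t_R: samples in the left / right child after the split.
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

Because the node impurity for regression is the variance of the targets at that node, a useful threshold scales with the spread of the target values.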
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.01, 0.05, 0.1]
mse_scores = []
for min_impurity in min_impurity_decrease_values:
    gbr = GradientBoostingRegressor(min_impurity_decrease=min_impurity, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_impurity_decrease={min_impurity}, MSE: {mse:.3f}")
Running the example gives an output like:
min_impurity_decrease=0.0, MSE: 1234.753
min_impurity_decrease=0.01, MSE: 1234.753
min_impurity_decrease=0.05, MSE: 1234.753
min_impurity_decrease=0.1, MSE: 1234.753
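The identical scores are expected on this dataset: the impurity of a regression tree node is the variance of its targets, and make_regression produces targets here that vary on a scale of hundreds, so thresholds up to 0.1 are too small to block any split and all four models end up the same. To see the parameter take effect on data like this, values closer to the scale of the target variance are needed (see the tree-size sketch further below).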
The key steps in this example are:
- Generate a synthetic regression dataset with informative features.
- Split the data into training and testing sets.
- Train GradientBoostingRegressor models with different min_impurity_decrease values.
- Evaluate the mean squared error (MSE) of each model on the test set.
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default value of 0.0 and increase it to prevent overfitting.
- Higher values of min_impurity_decrease can lead to simpler models with potentially higher bias but lower variance (the sketch after this list makes this visible).
- Adjust the value based on the dataset size and noise level.
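One way to observe the simplification is to count the nodes of the fitted trees. This is a minimal sketch, reusing the synthetic data from above; the thresholds 100 and 1000 are chosen only because they are on the scale of this dataset's target variance:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

for mid in [0.0, 100.0, 1000.0]:
    gbr = GradientBoostingRegressor(min_impurity_decrease=mid, random_state=42)
    gbr.fit(X, y)
    # estimators_ is an (n_estimators, 1) array of DecisionTreeRegressor
    avg_nodes = np.mean([tree.tree_.node_count for tree in gbr.estimators_.ravel()])
    print(f"min_impurity_decrease={mid}: average nodes per tree = {avg_nodes:.1f}")

Higher thresholds should yield trees with fewer nodes, which is the bias/variance trade-off described above.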
Issues to consider:
- The optimal value depends on the specific dataset and problem at hand.
- Too high a value may lead to underfitting, while too low a value can result in overfitting.
- Experiment with different values to find the best trade-off between bias and variance; a small cross-validated grid search, sketched below, is one way to do this.
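Here is a minimal sketch of such an experiment using GridSearchCV; the candidate values are illustrative and should be adapted to the target scale of your data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Candidate thresholds spanning several orders of magnitude (illustrative)
param_grid = {"min_impurity_decrease": [0.0, 0.1, 1.0, 10.0, 100.0]}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)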