The min_impurity_decrease parameter in scikit-learn's GradientBoostingRegressor sets the minimum impurity decrease a candidate split must achieve before it is made. GradientBoostingRegressor is a powerful ensemble learning algorithm that builds an additive model in a forward stage-wise fashion, optimizing a loss function by fitting decision trees sequentially.
Raising min_impurity_decrease helps control the complexity of the model: the algorithm is prevented from making splits that would yield only a negligible reduction in impurity.
The default value for min_impurity_decrease is 0.0, which places no restriction on splitting. In practice, values between 0.0 and 0.1 are often cited as starting points, but the meaningful scale depends on the impurity of your targets: for regression trees the impurity is the variance of the target values at a node, so the useful range is dataset-dependent.
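Concretely, scikit-learn compares each candidate split against a weighted impurity decrease and makes the split only if this quantity is at least min_impurity_decrease. The sketch below mirrors the formula given in the scikit-learn documentation; the function name is just for illustration:

# Weighted impurity decrease, as defined in the scikit-learn docs:
# a candidate split is kept only if this value is >= min_impurity_decrease.
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    # N: total number of samples; N_t: samples at the current node;
    # N_t_L / N_t_R: samples in the left / right child after the split.
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

Because the node impurity for regression is the variance of the targets at that node, a useful threshold scales with the spread of the target values.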
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.01, 0.05, 0.1]
mse_scores = []
for min_impurity in min_impurity_decrease_values:
    gbr = GradientBoostingRegressor(min_impurity_decrease=min_impurity, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_impurity_decrease={min_impurity}, MSE: {mse:.3f}")
Running the example gives an output like:
min_impurity_decrease=0.0, MSE: 1234.753
min_impurity_decrease=0.01, MSE: 1234.753
min_impurity_decrease=0.05, MSE: 1234.753
min_impurity_decrease=0.1, MSE: 1234.753
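The identical scores are expected on this dataset: the impurity of a regression tree node is the variance of its targets, and make_regression produces targets here that vary on a scale of hundreds, so thresholds up to 0.1 are too small to block any split and all four models end up the same. To see the parameter take effect on data like this, values closer to the scale of the target variance are needed (see the tree-size sketch further below).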
The key steps in this example are:
- Generate a synthetic regression dataset with informative features.
- Split the data into training and testing sets.
- Train GradientBoostingRegressor models with different min_impurity_decrease values.
- Evaluate the mean squared error (MSE) of each model on the test set.
Some tips and heuristics for setting min_impurity_decrease:
- Start with the default value of 0.0 and increase it to prevent overfitting.
- Higher values of min_impurity_decrease can lead to simpler models with potentially higher bias but lower variance (the sketch after this list makes this visible).
- Adjust the value based on the dataset size and noise level.
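One way to observe the simplification is to count the nodes of the fitted trees. This is a minimal sketch, reusing the synthetic data from above; the thresholds 100 and 1000 are chosen only because they are on the scale of this dataset's target variance:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

for mid in [0.0, 100.0, 1000.0]:
    gbr = GradientBoostingRegressor(min_impurity_decrease=mid, random_state=42)
    gbr.fit(X, y)
    # estimators_ is an (n_estimators, 1) array of DecisionTreeRegressor
    avg_nodes = np.mean([tree.tree_.node_count for tree in gbr.estimators_.ravel()])
    print(f"min_impurity_decrease={mid}: average nodes per tree = {avg_nodes:.1f}")

Higher thresholds should yield trees with fewer nodes, which is the bias/variance trade-off described above.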
Issues to consider:
- The optimal value depends on the specific dataset and problem at hand.
- Too high a value may lead to underfitting, while too low a value can result in overfitting.
- Experiment with different values to find the best trade-off between bias and variance; a small cross-validated grid search, sketched below, is one way to do this.
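Here is a minimal sketch of such an experiment using GridSearchCV; the candidate values are illustrative and should be adapted to the target scale of your data:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Candidate thresholds spanning several orders of magnitude (illustrative)
param_grid = {"min_impurity_decrease": [0.0, 0.1, 1.0, 10.0, 100.0]}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)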