The min_weight_fraction_leaf parameter in scikit-learn's GradientBoostingRegressor controls the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.
GradientBoostingRegressor is a powerful regression technique that builds an ensemble of weak prediction models, typically decision trees, to minimize the loss function. The min_weight_fraction_leaf parameter sets the smallest weighted fraction of the total sample weight that any leaf node may hold; when no sample weights are passed to fit, every sample counts equally.
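To see the "weighted" part of the definition in action, here is a small sketch (assuming scikit-learn is installed) that passes explicit sample_weight values to fit and then inspects the leaves of the first tree in the ensemble; the weights and threshold chosen are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Give the first 100 samples three times the weight of the rest.
# Total weight = 100 * 3.0 + 100 * 1.0 = 400.
sample_weight = np.where(np.arange(200) < 100, 3.0, 1.0)

# min_weight_fraction_leaf=0.1 requires each leaf to hold at least
# 0.1 * 400 = 40 units of weight -- e.g. 40 low-weight samples,
# or as few as 14 high-weight ones (14 * 3 = 42 >= 40).
gbr = GradientBoostingRegressor(
    n_estimators=5, min_weight_fraction_leaf=0.1, random_state=0
)
gbr.fit(X, y, sample_weight=sample_weight)

# Inspect the first tree: leaves are nodes with no left child.
tree = gbr.estimators_[0, 0].tree_
is_leaf = tree.children_left == -1
print(tree.weighted_n_node_samples[is_leaf].min())  # at least 40.0
```

The same fraction therefore admits fewer heavily weighted samples per leaf than lightly weighted ones, which is the key difference from min_samples_leaf.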
Generally, higher values of min_weight_fraction_leaf can help prevent overfitting by ensuring that each leaf has a minimum amount of data. Setting this value too high, however, can lead to underfitting, where the model fails to capture the underlying patterns in the data.
The default value for min_weight_fraction_leaf is 0.0, which places no weight-based restriction on leaves; the parameter accepts values in the range 0.0 to 0.5.
In practice, values between 0.0 and 0.1 are commonly used, depending on the size and complexity of the dataset.
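When no sample weights are supplied, every sample has weight 1, so the fraction translates directly into a minimum sample count per leaf. A quick sketch to confirm this for the unweighted case (the 0.05 value is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# With 1000 unweighted samples, min_weight_fraction_leaf=0.05 means
# every leaf must contain at least 0.05 * 1000 = 50 samples.
gbr = GradientBoostingRegressor(
    n_estimators=5, min_weight_fraction_leaf=0.05, random_state=42
)
gbr.fit(X, y)

# Check the first tree in the ensemble: leaves have no left child.
tree = gbr.estimators_[0, 0].tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print(f"Smallest leaf holds {leaf_sizes.min()} samples")  # at least 50
```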
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_weight_fraction_leaf values
min_weight_fraction_leaf_values = [0.0, 0.01, 0.05, 0.1]
mse_scores = []
for value in min_weight_fraction_leaf_values:
    gbr = GradientBoostingRegressor(min_weight_fraction_leaf=value, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_weight_fraction_leaf={value}, MSE: {mse:.3f}")
Running the example gives an output like:
min_weight_fraction_leaf=0.0, MSE: 1234.753
min_weight_fraction_leaf=0.01, MSE: 1169.101
min_weight_fraction_leaf=0.05, MSE: 1138.279
min_weight_fraction_leaf=0.1, MSE: 1506.949
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into training and test sets.
- Train GradientBoostingRegressor models with different min_weight_fraction_leaf values.
- Evaluate the mean squared error (MSE) of each model on the test set.
Some tips and heuristics for setting min_weight_fraction_leaf:
- Start with the default value of 0.0 and increase it gradually to see the effect on model performance.
- Higher values can help prevent overfitting by ensuring each leaf has a minimum amount of data.
- The optimal value depends on the dataset’s size and distribution.
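Following the first tip, the search over candidate values can also be automated with cross-validation rather than a single train/test split. A sketch using GridSearchCV on the same synthetic data (the candidate grid here is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# 5-fold cross-validated search over candidate fractions.
param_grid = {"min_weight_fraction_leaf": [0.0, 0.01, 0.05, 0.1]}
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(f"Best min_weight_fraction_leaf: {grid.best_params_['min_weight_fraction_leaf']}")
```

Cross-validation averages performance over several folds, so the chosen value is less sensitive to one particular split than the single train/test comparison above.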
Issues to consider:
- Very small values might lead to overfitting, especially with noisy data.
- Very large values can lead to underfitting, reducing the model’s ability to capture complex patterns.
- Balancing between preventing overfitting and maintaining model complexity is key.
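One way to check which side of that balance a given value falls on is to compare training and test error: a large gap between the two suggests overfitting, while both being high suggests underfitting. A minimal sketch (the value grid is illustrative; the parameter accepts values up to 0.5):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for value in [0.0, 0.05, 0.2, 0.4]:
    gbr = GradientBoostingRegressor(min_weight_fraction_leaf=value, random_state=42)
    gbr.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, gbr.predict(X_train))
    test_mse = mean_squared_error(y_test, gbr.predict(X_test))
    # A wide train/test gap points to overfitting; two high errors to underfitting.
    print(f"value={value}: train MSE {train_mse:.1f}, test MSE {test_mse:.1f}")
```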