The min_samples_leaf parameter in scikit-learn’s GradientBoostingRegressor controls the minimum number of samples required to be at a leaf node.
Gradient boosting is an ensemble method that builds decision trees sequentially, with each new tree trained to correct the errors of the trees before it. The final prediction combines the outputs of all the trees.
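The min_samples_leaf parameter sets the minimum number of training samples that each leaf must contain in every decision tree of the ensemble. A split is only considered if it leaves at least that many samples in both branches, so the parameter controls how finely each tree can partition the data.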
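One way to see this constraint in action is to inspect the fitted trees directly. The sketch below assumes the fitted ensemble exposes its trees through the estimators_ attribute and reads leaf sizes from each tree's tree_ structure. The helper name smallest_leaf_size, the dataset settings, and the choice of 20 trees are illustrative and not part of the original example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor


def smallest_leaf_size(model):
    """Return the smallest number of training samples found in any leaf.

    Hypothetical helper for illustration, not a scikit-learn function.
    """
    sizes = []
    for tree in model.estimators_.ravel():
        t = tree.tree_
        is_leaf = t.children_left == -1  # -1 marks a node with no children
        sizes.append(t.n_node_samples[is_leaf].min())
    return int(min(sizes))


X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

smallest = {}
for value in [1, 20]:
    model = GradientBoostingRegressor(
        min_samples_leaf=value, n_estimators=20, random_state=42
    )
    model.fit(X, y)
    smallest[value] = smallest_leaf_size(model)
    print(f"min_samples_leaf={value}, smallest leaf: {smallest[value]}")
```

With min_samples_leaf=20, no leaf in any tree should hold fewer than 20 training samples, whereas the default setting leaves the leaf size unconstrained.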
Generally, setting min_samples_leaf to higher values can prevent overfitting, because each leaf's prediction is then based on a sufficient number of samples, which helps the model generalize. However, too high a value can lead to underfitting by making the trees too simple to capture the structure in the data.
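The trade-off can be made visible by comparing training and test error side by side. The following is a minimal sketch, assuming the same kind of synthetic data used elsewhere in this article. The value 400 is a deliberately extreme, illustrative choice: with 800 training rows, each tree can then split at most once.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

train_mse = {}
test_mse = {}
for value in [1, 10, 400]:  # 400 is intentionally far too large
    model = GradientBoostingRegressor(min_samples_leaf=value, random_state=42)
    model.fit(X_train, y_train)
    # Record the error on both splits to expose over- and underfitting
    train_mse[value] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[value] = mean_squared_error(y_test, model.predict(X_test))
    print(
        f"min_samples_leaf={value}: "
        f"train MSE={train_mse[value]:.1f}, test MSE={test_mse[value]:.1f}"
    )
```

A large gap between training and test error points toward overfitting, while high error on both splits, as expected for the extreme value, points toward underfitting.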
The default value for min_samples_leaf is 1.
In practice, values between 1 and 10 are commonly used, balancing model complexity and generalization.
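The default can be confirmed from an unconfigured estimator, as sketched below. The snippet also illustrates a detail not covered above: scikit-learn's documentation describes a float value as a fraction of the training samples, with ceil(min_samples_leaf * n_samples) used as the minimum. The fraction 0.05 and the dataset settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Read the default back from an estimator created with no arguments
default_value = GradientBoostingRegressor().get_params()["min_samples_leaf"]
print("default min_samples_leaf:", default_value)

# A float is interpreted as a fraction of the training rows:
# 0.05 of 800 rows should give a minimum of about 40 samples per leaf
X, y = make_regression(n_samples=800, n_features=10, noise=0.1, random_state=42)
model = GradientBoostingRegressor(
    min_samples_leaf=0.05, n_estimators=10, random_state=42
)
model.fit(X, y)

# Inspect the leaves of the first tree in the ensemble
first_tree = model.estimators_[0, 0].tree_
leaf_sizes = first_tree.n_node_samples[first_tree.children_left == -1]
print("smallest leaf in first tree:", leaf_sizes.min())
```

The fractional form can be convenient when the same configuration is reused across datasets of different sizes, since the minimum leaf size then scales with the data.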
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10]
errors = []
for min_samples in min_samples_leaf_values:
    gbr = GradientBoostingRegressor(min_samples_leaf=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"min_samples_leaf={min_samples}, MSE: {error:.3f}")
Running the example gives an output like:
min_samples_leaf=1, MSE: 1234.753
min_samples_leaf=5, MSE: 1185.637
min_samples_leaf=10, MSE: 1154.572
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into training and test sets.
- Train GradientBoostingRegressor models with different min_samples_leaf values.
- Evaluate the Mean Squared Error (MSE) of each model on the test set.
Some tips and heuristics for setting min_samples_leaf:
- Start with the default value of 1 and adjust based on model performance and complexity.
- Lower values may lead to overfitting, while higher values may cause underfitting.
- Balance bias against variance by comparing candidate values on held-out data rather than on training error.
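The tips above can be turned into a small search over candidate values. The following is a sketch rather than a recommended recipe: it assumes GridSearchCV with 5-fold cross-validation and negated MSE as the score, and the candidate list, dataset size, and n_estimators=50 are illustrative choices made to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate values are an assumption; widen or narrow the list for real data
candidates = [1, 5, 10, 20]
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    param_grid={"min_samples_leaf": candidates},
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X, y)

print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])
print("cross-validated MSE:", -search.best_score_)
```

Because min_samples_leaf interacts with other tree settings such as max_depth and n_estimators, the same grid could be extended to tune them jointly.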
Issues to consider:
- The optimal value depends on the dataset size and distribution.
- Too small min_samples_leaf can lead to overly complex models.
- Too large min_samples_leaf can lead to overly simplified models.