The min_samples_split parameter in scikit-learn’s GradientBoostingRegressor controls the minimum number of samples required to split an internal node.
Gradient Boosting is an ensemble learning method that builds models sequentially, with each new model attempting to correct the errors of the previous ones.
A node that holds fewer than min_samples_split samples is not split further and becomes a leaf. Raising the value therefore limits how finely each tree can partition the data, which helps prevent the model from learning overly specific patterns in the training data (overfitting).
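One way to see this effect is to count the nodes in the fitted trees. The sketch below is illustrative only: the dataset size, max_depth=8, n_estimators=20, and the value 50 are assumptions chosen so that the split limit, rather than the depth limit, is what stops tree growth.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic dataset; the sizes are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

node_counts = {}
for min_samples in [2, 50]:
    # max_depth=8 is deeper than the default of 3, so min_samples_split
    # becomes the constraint that actually stops nodes from splitting
    gbr = GradientBoostingRegressor(
        min_samples_split=min_samples, max_depth=8, n_estimators=20, random_state=42
    )
    gbr.fit(X, y)
    # estimators_ is an array of fitted DecisionTreeRegressor objects
    node_counts[min_samples] = sum(
        tree.tree_.node_count for tree in gbr.estimators_.ravel()
    )
    print(f"min_samples_split={min_samples}, total nodes: {node_counts[min_samples]}")
```

The larger setting should produce noticeably smaller trees, since any node with fewer than 50 samples is left as a leaf.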
The default value for min_samples_split is 2.
In practice, values between 2 and 20 are commonly used, depending on the size and variability of the dataset.
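The parameter also accepts a float, which scikit-learn interprets as a fraction of the training samples and converts to a count with ceil(min_samples_split * n_samples). The sketch below checks that a fraction and its equivalent integer count lead to the same predictions; the fraction 0.024 is an arbitrary illustrative value, not a recommendation.

```python
import math

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

fraction = 0.024  # illustrative value
# 800 training samples: ceil(0.024 * 800) = ceil(19.2) = 20
equivalent_count = math.ceil(fraction * len(X_train))

gbr_fraction = GradientBoostingRegressor(min_samples_split=fraction, random_state=42)
gbr_count = GradientBoostingRegressor(min_samples_split=equivalent_count, random_state=42)

pred_fraction = gbr_fraction.fit(X_train, y_train).predict(X_test)
pred_count = gbr_count.fit(X_train, y_train).predict(X_test)

print(f"{fraction} of {len(X_train)} samples -> min_samples_split={equivalent_count}")
print("Same predictions:", np.allclose(pred_fraction, pred_count))
```

A fraction can be convenient when the same configuration is reused across datasets of different sizes, because the threshold scales with the number of training samples.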
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []
for min_samples in min_samples_split_values:
    gbr = GradientBoostingRegressor(min_samples_split=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_samples}, MSE: {mse:.3f}")
Running the example gives an output like:
min_samples_split=2, MSE: 3052.375
min_samples_split=5, MSE: 2933.520
min_samples_split=10, MSE: 2996.977
min_samples_split=20, MSE: 3035.567
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into training and testing sets.
- Train GradientBoostingRegressor models with different min_samples_split values.
- Evaluate and compare the mean squared error (MSE) of each model on the test set.
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and adjust based on model performance and dataset size.
- Higher values may prevent overfitting but can lead to underfitting if set too high.
- Experiment with values in small increments and compare cross-validated scores to find a good setting for your data.
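The tips above can be automated with a cross-validated grid search. This is a minimal sketch, assuming a small synthetic dataset, 3-fold cross-validation, and n_estimators=50 to keep the run short; the candidate grid reuses the values from the example above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Smaller dataset than the main example so the search runs quickly (assumption)
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Candidate values to compare
param_grid = {"min_samples_split": [2, 5, 10, 20]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=3,
)
search.fit(X, y)

print("Best min_samples_split:", search.best_params_["min_samples_split"])
print(f"Best cross-validated MSE: {-search.best_score_:.3f}")
```

Averaging over folds makes the comparison less sensitive to one particular train/test split than a single held-out evaluation.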
Issues to consider:
- The optimal value of min_samples_split is dataset-dependent.
- Too low values might cause overfitting, especially with noisy data.
- Larger datasets generally benefit from higher min_samples_split values.
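A rough way to check for the overfitting and underfitting mentioned above is to compare training and test error for each setting. The sketch below reuses the synthetic dataset from the main example; the value 200 is an assumed, deliberately large setting added for contrast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gaps = {}
for min_samples in [2, 20, 200]:
    gbr = GradientBoostingRegressor(min_samples_split=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, gbr.predict(X_train))
    test_mse = mean_squared_error(y_test, gbr.predict(X_test))
    # Gap between test and training error for this setting
    gaps[min_samples] = test_mse - train_mse
    print(
        f"min_samples_split={min_samples}, "
        f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}"
    )
```

A training error far below the test error points toward overfitting, while high error on both sets points toward underfitting.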