SKLearner Home | About | Contact | Examples

Configure GradientBoostingRegressor "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s GradientBoostingRegressor controls the minimum number of samples required to split an internal node.

Gradient Boosting is an ensemble learning method that builds models sequentially, with each new model attempting to correct the errors of the previous ones.

The min_samples_split parameter specifies the minimum number of samples required to split an internal node. This parameter helps prevent the model from learning overly specific patterns in the training data (overfitting).

The default value for min_samples_split is 2.

In practice, values between 2 and 20 are commonly used, depending on the size and variability of the dataset.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []

for min_samples in min_samples_split_values:
    gbr = GradientBoostingRegressor(min_samples_split=min_samples, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_samples}, MSE: {mse:.3f}")

Running the example gives an output like:

min_samples_split=2, MSE: 3052.375
min_samples_split=5, MSE: 2933.520
min_samples_split=10, MSE: 2996.977
min_samples_split=20, MSE: 3035.567

The key steps in this example are:

  1. Generate a synthetic regression dataset.
  2. Split the data into training and testing sets.
  3. Train GradientBoostingRegressor models with different min_samples_split values.
  4. Evaluate and compare the mean squared error (MSE) of each model on the test set.

Some tips and heuristics for setting min_samples_split:

Issues to consider:



See Also