The subsample parameter in scikit-learn’s GradientBoostingRegressor controls the fraction of samples used for fitting each base learner in the ensemble.
GradientBoostingRegressor is a boosting algorithm that combines the predictions of multiple weak learners (usually decision trees) to produce a strong learner. It iteratively fits new models to the residuals of the previous models.
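To make the residual-fitting loop concrete, here is a minimal sketch of the idea using plain decision trees. This is illustrative only (the names learning_rate, prediction, and trees are ours), not how GradientBoostingRegressor is implemented internally:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
learning_rate = 0.1
prediction = np.full(y.shape, y.mean())  # start from a constant prediction
trees = []
for _ in range(50):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                         # new weak learner targets the residuals
    prediction += learning_rate * tree.predict(X)  # shrink its contribution and add it
    trees.append(tree)
Each round reduces the remaining error a little, which is the core mechanism subsample modifies by giving each tree only a fraction of the rows.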
The subsample parameter specifies the fraction of training samples used to fit each base learner. Valid values lie in the interval (0.0, 1.0]; values below 1.0 result in Stochastic Gradient Boosting.
The default value for subsample is 1.0, meaning all samples are used.
In practice, values between 0.5 and 1.0 are commonly used to introduce randomness and potentially reduce overfitting.
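Setting it is a one-line change; the 0.8 below is just an illustrative value inside that range, not a recommendation:
from sklearn.ensemble import GradientBoostingRegressor
# Each tree is fit on a random 80% of the training rows
model = GradientBoostingRegressor(subsample=0.8, random_state=42)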
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different subsample values
subsample_values = [0.5, 0.7, 1.0]
mses = []
for s in subsample_values:
    gbr = GradientBoostingRegressor(subsample=s, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mses.append(mse)
    print(f"subsample={s}, MSE: {mse:.3f}")
Running the example gives an output like:
subsample=0.5, MSE: 719.012
subsample=0.7, MSE: 1066.050
subsample=1.0, MSE: 1234.753
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into training and test sets.
- Train GradientBoostingRegressor models with different subsample values.
- Evaluate the mean squared error (MSE) of each model on the test set.
Some tips and heuristics for setting subsample:
- Lower subsample values can reduce overfitting by adding randomness to each boosting round.
- Start with the default value and experiment with values between 0.5 and 1.0; a small grid search is often enough (see the sketch after this list).
- Watch training time: lower values make each tree cheaper to fit, but you may need more trees (n_estimators) to reach the same accuracy.
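A quick way to compare several settings is an ordinary grid search. The sketch below reuses the X_train/y_train split from the example above and tunes only subsample, which is a simplification of a real tuning run:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {"subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)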
Issues to consider:
- The optimal subsample value depends on the dataset and problem complexity.
- Values that are too low can lead to underfitting and noisier predictions from run to run.
- Test multiple values to find the balance between bias and variance (see the staged-error sketch below).
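One way to observe that trade-off is to track the test error after every boosting stage; staged_predict yields the ensemble's prediction as each tree is added. The sketch below reuses the train/test split from the example above:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
for s in [0.5, 1.0]:
    gbr = GradientBoostingRegressor(subsample=s, random_state=42)
    gbr.fit(X_train, y_train)
    # One test MSE per boosting stage
    stage_mses = [mean_squared_error(y_test, y_pred)
                  for y_pred in gbr.staged_predict(X_test)]
    best_stage = stage_mses.index(min(stage_mses))
    print(f"subsample={s}: best test MSE {stage_mses[best_stage]:.3f} "
          f"at stage {best_stage + 1}")
If the lowest test error occurs well before the final stage, the model is overfitting in later rounds, which is exactly the behavior a smaller subsample can help dampen.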