The subsample parameter in scikit-learn's GradientBoostingRegressor controls the fraction of samples used for fitting each base learner in the ensemble.
GradientBoostingRegressor is a boosting algorithm that combines the predictions of multiple weak learners (usually decision trees) to produce a strong learner. It iteratively fits each new model to the residuals of the previous models.
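To make that residual-fitting idea concrete, here is a minimal hand-rolled sketch that boosts shallow decision trees on a toy one-dimensional problem. It is illustrative only and not scikit-learn's actual implementation; the dataset, learning rate, and number of iterations are arbitrary choices.
# Minimal sketch of the residual-fitting idea behind gradient boosting
# (illustrative only -- not scikit-learn's actual implementation).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start the ensemble at the mean
trees = []
for _ in range(50):
    residuals = y - prediction              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                  # fit a weak learner to the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(f"Training MSE after boosting: {np.mean((y - prediction) ** 2):.4f}")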
The subsample parameter specifies the fraction of samples to be used for fitting each base learner. Valid values are greater than 0.0 and at most 1.0.
The default value for subsample is 1.0, meaning all samples are used.
In practice, values between 0.5 and 1.0 are commonly used to introduce randomness and potentially reduce overfitting.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different subsample values
subsample_values = [0.5, 0.7, 1.0]
mses = []
for s in subsample_values:
    gbr = GradientBoostingRegressor(subsample=s, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mses.append(mse)
    print(f"subsample={s}, MSE: {mse:.3f}")
Running the example gives an output like:
subsample=0.5, MSE: 719.012
subsample=0.7, MSE: 1066.050
subsample=1.0, MSE: 1234.753
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into training and test sets.
- Train GradientBoostingRegressor models with different subsample values.
- Evaluate the mean squared error (MSE) of each model on the test set.
Some tips and heuristics for setting subsample:
- Lower subsample values can reduce overfitting by adding randomness.
- Start with the default value and experiment with values between 0.5 and 1.0 (see the grid-search sketch after this list).
- Keep an eye on training time: lower values mean each tree is fit on fewer samples, though you may need more boosting iterations to reach the same accuracy.
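As a starting point for that experimentation, one option is a simple grid search over subsample with cross-validated MSE. The sketch below reuses the synthetic dataset from the example above; the candidate grid is just an assumption.
# Sketch of tuning subsample with GridSearchCV on the same synthetic dataset
# as the example above (the candidate grid is an arbitrary choice).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

param_grid = {"subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)

print(f"Best subsample: {grid.best_params_['subsample']}")
print(f"Best CV MSE: {-grid.best_score_:.3f}")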
Issues to consider:
- The optimal subsample value depends on the dataset and the complexity of the problem.
- Values that are too low can lead to underfitting and high variance in the model's predictions.
- Test multiple values to find the right balance between bias and variance, as in the cross-validation sketch below.
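One way to check that balance is to cross-validate a few candidate values and look at both the mean MSE (an indication of bias) and its spread across folds (an indication of variance). The dataset and candidate values below are assumptions for illustration.
# Sketch of comparing several subsample values with cross-validation,
# reporting mean MSE and its spread across folds (dataset and values are
# arbitrary choices for illustration).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

for s in [0.3, 0.5, 0.7, 1.0]:
    gbr = GradientBoostingRegressor(subsample=s, random_state=42)
    scores = -cross_val_score(gbr, X, y, scoring="neg_mean_squared_error", cv=5)
    print(f"subsample={s}: MSE {scores.mean():.3f} (+/- {scores.std():.3f})")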