Gradient Boosting is an ensemble method that combines weak learners, typically decision trees, into a strong learner in an iterative fashion. The subsample parameter controls the fraction of training samples used to fit each individual base learner.

Setting subsample to a value less than 1.0 turns the algorithm into stochastic gradient boosting: each tree is fit on a random subset of the training data. This randomness can help reduce overfitting and improve generalization.

The default value of subsample is 1.0, meaning every tree is built on all of the samples. Common values range from 0.5 to 0.9.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different subsample values
subsample_values = [0.5, 0.7, 0.9, 1.0]
f1_scores = []
for subsample in subsample_values:
    gbc = GradientBoostingClassifier(subsample=subsample, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    f1_scores.append(f1)
    print(f"subsample={subsample}, F1-score: {f1:.3f}")
Running the example gives an output like:
subsample=0.5, F1-score: 0.797
subsample=0.7, F1-score: 0.801
subsample=0.9, F1-score: 0.781
subsample=1.0, F1-score: 0.780
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different subsample values
- Evaluate the macro-averaged F1-score of each model on the test set
Some tips and heuristics for setting subsample:
- Smaller values introduce more randomness and can help prevent overfitting
- Larger datasets can often benefit from smaller subsample values
- Decreasing subsample may require increasing n_estimators for best performance (see the tuning sketch after this list)
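As a rough illustration of the last tip, the sketch below tunes subsample and n_estimators together with GridSearchCV, since the two parameters interact. It assumes the X_train and y_train arrays from the example above, and the grid values are arbitrary illustrative choices rather than recommendations.

from sklearn.model_selection import GridSearchCV
# Illustrative grid: search subsample and n_estimators jointly because they interact
param_grid = {
    "subsample": [0.5, 0.7, 1.0],
    "n_estimators": [100, 200, 400],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid,
                    scoring="f1_macro", cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(f"Best CV macro F1: {grid.best_score_:.3f}")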
Issues to consider:
- Very small subsample values can lead to underfitting
- Smaller subsample values make each tree cheaper to fit, but more trees are often needed to compensate, which can increase overall training time
- The randomness introduced by subsample can affect reproducibility unless random_state is fixed (see the sketch after this list)
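To make the reproducibility point concrete, the short sketch below (reusing X_train, X_test, and y_train from the example above) fits the same stochastic model twice with a fixed random_state; the predictions match exactly, whereas leaving random_state unset would generally give slightly different models from run to run.

import numpy as np
# With subsample < 1.0 each tree sees a random subset of the data,
# but a fixed random_state makes the whole procedure repeatable
m1 = GradientBoostingClassifier(subsample=0.5, random_state=0).fit(X_train, y_train)
m2 = GradientBoostingClassifier(subsample=0.5, random_state=0).fit(X_train, y_train)
print(np.array_equal(m1.predict(X_test), m2.predict(X_test)))  # prints True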