Gradient Boosting is an ensemble method that combines weak learners, typically decision trees, into a strong learner in an iterative fashion. The subsample parameter controls the fraction of training samples used to fit each individual base learner.

Setting subsample to a value less than 1.0 turns the algorithm into stochastic gradient boosting: each tree is fit on a random subset of the training data. This randomness can help reduce overfitting and improve generalization.

The default value of subsample is 1.0, meaning every tree is built on all of the samples. Common values range from 0.5 to 0.9.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different subsample values
subsample_values = [0.5, 0.7, 0.9, 1.0]
f1_scores = []
for subsample in subsample_values:
    gbc = GradientBoostingClassifier(subsample=subsample, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    f1_scores.append(f1)
    print(f"subsample={subsample}, F1-score: {f1:.3f}")
Running the example gives an output like:
subsample=0.5, F1-score: 0.797
subsample=0.7, F1-score: 0.801
subsample=0.9, F1-score: 0.781
subsample=1.0, F1-score: 0.780
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different subsample values
- Evaluate the macro-averaged F1-score of each model on the test set
Some tips and heuristics for setting subsample:
- Smaller values introduce more randomness and can help prevent overfitting
- Larger datasets can often benefit from smaller subsample values
- Decreasing subsample may require increasing n_estimators for best performance (see the tuning sketch after this list)
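As a rough illustration of the last tip, the sketch below tunes subsample and n_estimators together with GridSearchCV, since the two parameters interact. It assumes the X_train and y_train arrays from the example above, and the grid values are arbitrary illustrative choices rather than recommendations.

from sklearn.model_selection import GridSearchCV
# Illustrative grid: search subsample and n_estimators jointly because they interact
param_grid = {
    "subsample": [0.5, 0.7, 1.0],
    "n_estimators": [100, 200, 400],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid,
                    scoring="f1_macro", cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(f"Best CV macro F1: {grid.best_score_:.3f}")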
Issues to consider:
- Very small subsample values can lead to underfitting
- Smaller subsample values make each tree cheaper to fit, but more trees are often needed to compensate, which can increase overall training time
- The randomness introduced by subsample can affect reproducibility unless random_state is fixed (see the sketch after this list)
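To make the reproducibility point concrete, the short sketch below (reusing X_train, X_test, and y_train from the example above) fits the same stochastic model twice with a fixed random_state; the predictions match exactly, whereas leaving random_state unset would generally give slightly different models from run to run.

import numpy as np
# With subsample < 1.0 each tree sees a random subset of the data,
# but a fixed random_state makes the whole procedure repeatable
m1 = GradientBoostingClassifier(subsample=0.5, random_state=0).fit(X_train, y_train)
m2 = GradientBoostingClassifier(subsample=0.5, random_state=0).fit(X_train, y_train)
print(np.array_equal(m1.predict(X_test), m2.predict(X_test)))  # prints True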