Gradient Boosting is an ensemble method that iteratively combines weak learners, typically decision trees, into a strong learner. The `subsample` parameter controls the fraction of training samples used to fit each individual base learner.
Setting `subsample` to a value less than 1.0 turns the procedure into stochastic gradient boosting: each tree is fit on a random subset of the training data, and this added randomness can help reduce overfitting and improve generalization.
The default value of `subsample` is 1.0, meaning that all samples are used to build each tree. Common values range from 0.5 to 0.9.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Generate synthetic multiclass dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different subsample values and evaluate each model
subsample_values = [0.5, 0.7, 0.9, 1.0]
f1_scores = []
for subsample in subsample_values:
    gbc = GradientBoostingClassifier(subsample=subsample, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    f1_scores.append(f1)
    print(f"subsample={subsample}, F1-score: {f1:.3f}")
```
Running the example gives an output like:
```
subsample=0.5, F1-score: 0.797
subsample=0.7, F1-score: 0.801
subsample=0.9, F1-score: 0.781
subsample=1.0, F1-score: 0.780
```
The key steps in this example are:

- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `subsample` values
- Evaluate the macro-averaged F1-score of each model on the test set
Some tips and heuristics for setting `subsample`:

- Smaller values introduce more randomness and can help prevent overfitting
- Larger datasets can often benefit from smaller `subsample` values
- Decreasing `subsample` may require increasing `n_estimators` for best performance; see the tuning sketch after this list
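Because a lower `subsample` often pairs best with more trees, one option is to tune the two parameters together. The snippet below is a minimal sketch using `GridSearchCV` on the same synthetic dataset as the example above; the grid values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Same synthetic dataset as the earlier example
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Search subsample and n_estimators jointly, since lowering subsample
# often needs more trees to reach the same performance
param_grid = {
    'subsample': [0.5, 0.7, 1.0],      # illustrative values
    'n_estimators': [100, 200, 300],   # illustrative values
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, scoring='f1_macro', cv=3, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```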
Issues to consider:

- Very small `subsample` values can lead to underfitting
- Computational cost can increase with smaller `subsample`, since more trees are often needed
- The randomness introduced by `subsample` can affect reproducibility; a reproducibility sketch follows this list
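On the reproducibility point: with `subsample` < 1.0, each tree is fit on a randomly drawn subset of rows, so fixing `random_state` is what makes repeated fits identical. A minimal sketch, assuming the same synthetic dataset as above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Same seed: the row subsampling is identical, so the fitted models agree
a = GradientBoostingClassifier(subsample=0.7, random_state=0).fit(X, y)
b = GradientBoostingClassifier(subsample=0.7, random_state=0).fit(X, y)
print((a.predict(X) == b.predict(X)).all())  # True

# No seed: the subsampling differs between fits, so results may differ
c = GradientBoostingClassifier(subsample=0.7).fit(X, y)
d = GradientBoostingClassifier(subsample=0.7).fit(X, y)
print((c.predict(X) == d.predict(X)).all())  # may be False
```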