Configure GradientBoostingClassifier "subsample" Parameter

Gradient boosting is an ensemble method that iteratively combines weak learners, typically decision trees, into a strong learner. The subsample parameter sets the fraction of training samples used to fit each individual base learner.

Setting subsample to a value below 1.0 turns the algorithm into stochastic gradient boosting: each tree is fit on a random fraction of the training samples, drawn without replacement. The added randomness typically lowers variance at the cost of some extra bias, which can reduce overfitting and improve generalization.
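One visible side effect of stochastic gradient boosting in scikit-learn is the oob_improvement_ attribute, which tracks the loss improvement on the samples left out of each iteration. The sketch below assumes a small synthetic dataset and 50 estimators, both chosen only for illustration, and compares a subsampled model with a default one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic dataset (sizes are illustrative assumptions)
X, y = make_classification(n_samples=300, n_informative=5, random_state=0)

# subsample < 1.0: each tree is fit on a random half of the rows
stochastic = GradientBoostingClassifier(n_estimators=50, subsample=0.5,
                                        random_state=0)
stochastic.fit(X, y)

# subsample = 1.0 (the default): every tree is fit on every row
full = GradientBoostingClassifier(n_estimators=50, random_state=0)
full.fit(X, y)

# One out-of-bag improvement value per boosting iteration
print(stochastic.oob_improvement_.shape)

# The attribute is only available when subsample < 1.0
print(hasattr(full, "oob_improvement_"))
```

Because the out-of-bag samples are not used to fit the tree at a given iteration, these values can serve as a rough, built-in signal of how many iterations are still helping.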

The default value for subsample is 1.0, meaning that all samples are used to build each tree. Common values range from 0.5 to 0.9.
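The default can be checked directly on an unfitted estimator. The sketch below also probes two out-of-range values, on the assumption (not stated above) that scikit-learn only accepts subsample in the interval (0.0, 1.0] and raises a ValueError at fit time otherwise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# The constructor default, as described above
print(GradientBoostingClassifier().subsample)

# Tiny dataset, only needed to trigger parameter validation in fit()
X, y = make_classification(n_samples=100, random_state=0)

# Assumption: values outside (0.0, 1.0] are rejected when fit() is called
rejected = []
for bad_value in (0.0, 1.5):
    try:
        GradientBoostingClassifier(subsample=bad_value).fit(X, y)
    except ValueError:
        rejected.append(bad_value)

print(rejected)
```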

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different subsample values
subsample_values = [0.5, 0.7, 0.9, 1.0]
f1_scores = []

for subsample in subsample_values:
    gbc = GradientBoostingClassifier(subsample=subsample, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    f1_scores.append(f1)
    print(f"subsample={subsample}, F1-score: {f1:.3f}")

Running the example produces output similar to the following; exact scores may differ across scikit-learn versions:

subsample=0.5, F1-score: 0.797
subsample=0.7, F1-score: 0.801
subsample=0.9, F1-score: 0.781
subsample=1.0, F1-score: 0.780

The key steps in this example are:

  1. Generate a synthetic multiclass classification dataset
  2. Split the data into train and test sets
  3. Train GradientBoostingClassifier models with different subsample values
  4. Evaluate the macro-averaged F1-score of each model on the test set
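The steps above score each model on a single train/test split, so small differences between subsample values can be hard to interpret. A possible variation, sketched here with cross_val_score, reports the mean and spread of the macro-averaged F1-score across folds. The smaller dataset, the 3 folds, and the 30 estimators are assumptions made to keep the run fast, not settings from the example above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Smaller dataset than the main example, to keep the sketch quick
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Map each subsample value to (mean F1, standard deviation) across folds
cv_results = {}
for subsample in [0.5, 0.7, 0.9, 1.0]:
    gbc = GradientBoostingClassifier(n_estimators=30, subsample=subsample,
                                     random_state=42)
    scores = cross_val_score(gbc, X, y, cv=3, scoring="f1_macro")
    cv_results[subsample] = (scores.mean(), scores.std())
    print(f"subsample={subsample}, F1-score: "
          f"{scores.mean():.3f} +/- {scores.std():.3f}")
```

If the standard deviation across folds is larger than the gap between two subsample values, the data probably does not support preferring one over the other.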

Some tips and heuristics for setting subsample:

Issues to consider:
