The `loss` parameter in scikit-learn's `GradientBoostingClassifier` specifies the loss function to be optimized when fitting the model.
Gradient Boosting is an ensemble method that sequentially adds weak learners to minimize a loss function. Each new learner is fit on the negative gradient of the loss function with respect to the ensemble’s predictions.
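To make the stagewise update concrete, here is a minimal sketch for binary classification with log loss. This is an illustrative toy, not scikit-learn's actual implementation; the helper name `toy_gradient_boost` and the `max_depth=3` trees are arbitrary choices for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative sketch (not scikit-learn's implementation) of the stagewise
# update for binary log loss: each regression tree is fit to the negative
# gradient of the loss with respect to the current raw scores.
def toy_gradient_boost(X, y, n_estimators=10, learning_rate=0.1):
    raw = np.zeros(len(y), dtype=float)  # F_0: constant zero raw score
    trees = []
    for _ in range(n_estimators):
        p = 1.0 / (1.0 + np.exp(-raw))       # current predicted probabilities
        residuals = y - p                    # negative gradient of log loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        raw += learning_rate * tree.predict(X)  # additive stagewise update
        trees.append(tree)
    return trees
```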
The `loss` parameter determines how the discrepancy between the predicted probabilities and the actual class labels is measured. Different loss functions have different mathematical formulations and properties that affect the model's behavior and performance.
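One way to see the difference is to compare how each loss penalizes a prediction as a function of the margin. This is a hedged illustration using the standard margin-based forms, with labels taken as -1/+1 and `f` standing in for the raw model score:

```python
import numpy as np

# Compare the two losses as functions of the margin m = y_signed * f,
# where y_signed is the label in {-1, +1} and f is the raw model score.
margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
log_loss = np.log(1 + np.exp(-margins))  # logistic loss: roughly linear for large negative margins
exp_loss = np.exp(-margins)              # exponential loss: grows exponentially for misclassified points
for m, ll, el in zip(margins, log_loss, exp_loss):
    print(f"margin {m:+.1f}: log_loss={ll:.3f}, exponential={el:.3f}")
```

The exponential loss grows much faster for badly misclassified points, which is why it corrects them more aggressively but is also more sensitive to noisy labels.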
The default value for `loss` is `'log_loss'`, which corresponds to logistic regression loss and is used for probabilistic classification. The `'exponential'` loss is also commonly used and corresponds to the AdaBoost algorithm.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different loss values
loss_functions = ['log_loss', 'exponential']
accuracies = []
for loss in loss_functions:
    gbc = GradientBoostingClassifier(loss=loss, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"loss='{loss}', Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
loss='log_loss', Accuracy: 0.915
loss='exponential', Accuracy: 0.890
```
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `loss` functions
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `loss`:

- `'log_loss'` generally works well and is a good default choice
- Consider `'exponential'` loss to more aggressively correct misclassified samples
- On noisy datasets, `'exponential'` loss can overfit because it penalizes badly misclassified points exponentially; `'log_loss'` is usually more robust to label noise
- Matching the loss to the evaluation metric can help optimize the right objective (see the sketch below)
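As one way to act on the last tip, here is a hedged sketch that continues from the example above. It assumes log loss is also the evaluation metric, so cross-validation is scored with `scoring='neg_log_loss'` to keep model selection aligned with the training objective:

```python
from sklearn.model_selection import cross_val_score

# Continuing from the example above: when the evaluation metric is log loss,
# scoring with 'neg_log_loss' keeps model selection aligned with the objective.
for loss in ['log_loss', 'exponential']:
    gbc = GradientBoostingClassifier(loss=loss, random_state=42)
    scores = cross_val_score(gbc, X_train, y_train, cv=5, scoring='neg_log_loss')
    print(f"loss='{loss}', mean CV neg log loss: {scores.mean():.3f}")
```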