The `loss` parameter in scikit-learn's `GradientBoostingClassifier` specifies the loss function to be optimized when fitting the model.
Gradient Boosting is an ensemble method that sequentially adds weak learners to minimize a loss function. Each new learner is fit on the negative gradient of the loss function with respect to the ensemble’s predictions.
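To make the stagewise update concrete, here is a minimal sketch for binary classification with log loss. This is an illustrative toy, not scikit-learn's actual implementation; the helper name `toy_gradient_boost` and the `max_depth=3` trees are arbitrary choices for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative sketch (not scikit-learn's implementation) of the stagewise
# update for binary log loss: each regression tree is fit to the negative
# gradient of the loss with respect to the current raw scores.
def toy_gradient_boost(X, y, n_estimators=10, learning_rate=0.1):
    raw = np.zeros(len(y), dtype=float)  # F_0: constant zero raw score
    trees = []
    for _ in range(n_estimators):
        p = 1.0 / (1.0 + np.exp(-raw))       # current predicted probabilities
        residuals = y - p                    # negative gradient of log loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        raw += learning_rate * tree.predict(X)  # additive stagewise update
        trees.append(tree)
    return trees
```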
The `loss` parameter determines how the discrepancy between the predicted probabilities and the actual class labels is measured. Different loss functions have different mathematical formulations and properties that affect the model's behavior and performance.
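One way to see the difference is to compare how each loss penalizes a prediction as a function of the margin. This is a hedged illustration using the standard margin-based forms, with labels taken as -1/+1 and `f` standing in for the raw model score:

```python
import numpy as np

# Compare the two losses as functions of the margin m = y_signed * f,
# where y_signed is the label in {-1, +1} and f is the raw model score.
margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
log_loss = np.log(1 + np.exp(-margins))  # logistic loss: roughly linear for large negative margins
exp_loss = np.exp(-margins)              # exponential loss: grows exponentially for misclassified points
for m, ll, el in zip(margins, log_loss, exp_loss):
    print(f"margin {m:+.1f}: log_loss={ll:.3f}, exponential={el:.3f}")
```

The exponential loss grows much faster for badly misclassified points, which is why it corrects them more aggressively but is also more sensitive to noisy labels.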
The default value for `loss` is `'log_loss'`, which corresponds to logistic regression loss and is used for probabilistic classification. The `'exponential'` loss is also commonly used and corresponds to the AdaBoost algorithm.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different loss values
loss_functions = ['log_loss', 'exponential']
accuracies = []
for loss in loss_functions:
    gbc = GradientBoostingClassifier(loss=loss, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"loss='{loss}', Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
loss='log_loss', Accuracy: 0.915
loss='exponential', Accuracy: 0.890
```
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `loss` functions
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `loss`:

- `'log_loss'` generally works well and is a good default choice
- Consider `'exponential'` loss to more aggressively correct misclassified samples
- On noisy datasets, `'exponential'` loss can overfit because it penalizes badly misclassified points exponentially; `'log_loss'` is usually more robust to label noise
- Matching the loss to the evaluation metric can help optimize the right objective (see the sketch below)
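As one way to act on the last tip, here is a hedged sketch that continues from the example above. It assumes log loss is also the evaluation metric, so cross-validation is scored with `scoring='neg_log_loss'` to keep model selection aligned with the training objective:

```python
from sklearn.model_selection import cross_val_score

# Continuing from the example above: when the evaluation metric is log loss,
# scoring with 'neg_log_loss' keeps model selection aligned with the objective.
for loss in ['log_loss', 'exponential']:
    gbc = GradientBoostingClassifier(loss=loss, random_state=42)
    scores = cross_val_score(gbc, X_train, y_train, cv=5, scoring='neg_log_loss')
    print(f"loss='{loss}', mean CV neg log loss: {scores.mean():.3f}")
```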