The loss parameter in scikit-learn's HistGradientBoostingClassifier determines the loss function used to fit the model.
HistGradientBoostingClassifier is a gradient boosting algorithm that builds histogram-based decision trees. It is designed for efficiency and can handle large datasets with high-dimensional features.
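As a rough illustration (the dataset size here is arbitrary and timings vary by machine), the histogram binning, controlled by the max_bins parameter, keeps fitting fast even on a fairly large dataset:

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# A moderately large synthetic dataset
X, y = make_classification(n_samples=100_000, n_features=50, random_state=42)

# max_bins=255 is the default; each feature is discretized
# into at most this many histogram bins before tree building
clf = HistGradientBoostingClassifier(max_bins=255, random_state=42)

start = time.perf_counter()
clf.fit(X, y)
print(f"Fitted on {X.shape[0]} samples in {time.perf_counter() - start:.1f}s")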
The loss parameter specifies the objective function used during training. Different loss functions can affect the model's performance and behavior, especially in handling imbalanced datasets or outliers.
The default value for loss is 'log_loss' for both binary and multi-class classification. In practice, 'log_loss' is commonly used for balanced datasets. The example below trains the classifier with this loss and evaluates it on a synthetic dataset:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with each loss function; in recent scikit-learn versions,
# 'log_loss' is the only value accepted for classification
loss_functions = ['log_loss']
results = []
for loss in loss_functions:
    clf = HistGradientBoostingClassifier(loss=loss, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_proba = clf.predict_proba(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    logloss = log_loss(y_test, y_pred_proba)
    results.append((loss, accuracy, logloss))
    print(f"Loss: {loss}, Accuracy: {accuracy:.4f}, Log Loss: {logloss:.4f}")
Running the example gives an output like:
Loss: log_loss, Accuracy: 0.9425, Log Loss: 0.1669
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train a HistGradientBoostingClassifier model for each loss function in the list
- Evaluate each model's accuracy and log loss on the test set
Some tips and heuristics for choosing the loss parameter:
- Use 'log_loss' for balanced datasets and when you need well-calibrated probabilities; in recent scikit-learn versions it is also the only loss HistGradientBoostingClassifier accepts for classification
- An exponential (AdaBoost-style) loss, which puts more emphasis on misclassified examples, is not available in HistGradientBoostingClassifier; if you want it, use GradientBoostingClassifier with loss='exponential' (binary problems only). For imbalanced data with HistGradientBoostingClassifier, the class_weight parameter is the usual lever. A quick comparison of the two losses is sketched after this list.
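As a minimal, hedged sketch (reusing the synthetic dataset from the example above, with the Brier score as just one way to probe probability calibration), the following compares the histogram-based model's default log loss against a GradientBoostingClassifier fit with loss='exponential':

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingClassifier
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Histogram-based model with its default log loss
hist = HistGradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Classic gradient boosting with an AdaBoost-style exponential loss
exp = GradientBoostingClassifier(loss='exponential', random_state=42).fit(X_train, y_train)

# Lower Brier score indicates better-calibrated positive-class probabilities
for name, model in [("log_loss (hist)", hist), ("exponential (GB)", exp)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, proba):.4f}")

Either model may score better on a given dataset, so treat this as a diagnostic rather than a rule.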
Issues to consider:
- The choice of loss function can impact model performance, especially on imbalanced datasets
- Different loss functions may require different learning rates or numbers of iterations for optimal performance (see the early-stopping sketch after this list)
- The best loss function may depend on your specific evaluation metric and business objectives
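As a sketch of that learning-rate/iteration trade-off (the grid of learning rates below is an arbitrary choice for illustration), the built-in early stopping lets each learning rate pick its own iteration count:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for lr in [0.01, 0.1, 0.3]:
    clf = HistGradientBoostingClassifier(
        learning_rate=lr,
        max_iter=1000,           # upper bound; early stopping usually halts sooner
        early_stopping=True,     # holds out part of the training set internally
        validation_fraction=0.1,
        n_iter_no_change=10,
        random_state=42,
    )
    clf.fit(X_train, y_train)
    ll = log_loss(y_test, clf.predict_proba(X_test))
    print(f"learning_rate={lr}: stopped at {clf.n_iter_} iterations, test log loss {ll:.4f}")

Smaller learning rates typically need more boosting iterations to reach a comparable test log loss, which is why the two are usually tuned together.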