The learning_rate parameter in scikit-learn’s HistGradientBoostingClassifier controls the contribution of each tree to the final prediction.
HistGradientBoostingClassifier is a histogram-based gradient boosting algorithm that builds an additive model in a forward stage-wise fashion. It uses decision trees as weak learners and optimizes a differentiable loss function.
The learning_rate parameter shrinks the contribution of each tree, slowing down the learning process. A lower learning rate requires more trees to achieve the same reduction in the loss function, which can lead to better generalization.
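To make this trade-off concrete, here is a minimal sketch (on an illustrative synthetic dataset) that holds the product of learning rate and tree count roughly constant: a higher rate with fewer trees versus a lower rate with more trees. The two configurations typically reach similar test loss, with the lower rate simply taking longer to train:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Illustrative data; any classification dataset works the same way
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same "total learning" budget: rate x iterations held constant
for lr, n_iter in [(0.2, 100), (0.05, 400)]:
    model = HistGradientBoostingClassifier(
        learning_rate=lr, max_iter=n_iter, random_state=0
    ).fit(X_train, y_train)
    loss = log_loss(y_test, model.predict_proba(X_test))
    print(f"learning_rate={lr}, max_iter={n_iter}, test log-loss: {loss:.3f}")
```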
The default value for learning_rate is 0.1.
In practice, values between 0.01 and 0.3 are commonly used, with smaller values often yielding better performance at the cost of increased training time.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different learning_rate values
learning_rates = [0.01, 0.1, 0.3, 1.0]
accuracies = []
for lr in learning_rates:
    hgbc = HistGradientBoostingClassifier(learning_rate=lr, max_iter=100, random_state=42)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"learning_rate={lr}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
learning_rate=0.01, Accuracy: 0.853
learning_rate=0.1, Accuracy: 0.912
learning_rate=0.3, Accuracy: 0.917
learning_rate=1.0, Accuracy: 0.643
```
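Since the loop collects the scores in the `accuracies` list, a quick plot makes the pattern easy to see. This optional addition assumes matplotlib is installed (it is not used elsewhere in the example):

```python
import matplotlib.pyplot as plt

# Visualize test accuracy against learning_rate on a log-scaled x-axis
plt.plot(learning_rates, accuracies, marker="o")
plt.xscale("log")
plt.xlabel("learning_rate")
plt.ylabel("Test accuracy")
plt.title("HistGradientBoostingClassifier: learning_rate vs. accuracy")
plt.show()
```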
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different learning_rate values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting learning_rate:
- Start with the default value of 0.1, then adjust based on validation performance; a small search over a log-spaced range is often enough (see the sketch after this list)
- Lower learning rates often lead to better generalization but require more iterations to converge
- HistGradientBoostingClassifier has no built-in learning-rate schedule, so rather than decaying the rate during training, pair a low rate with a larger max_iter and enable early_stopping so training stops once the validation score plateaus
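Building on the first tip, here is a minimal grid-search sketch over a log-spaced range. The grid, dataset, and scoring choice are illustrative assumptions, not prescriptive settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_classes=3, n_informative=10,
                           random_state=42)

# Log-spaced grid covering the commonly used 0.01-0.3 range
param_grid = {"learning_rate": [0.01, 0.03, 0.1, 0.3]}
search = GridSearchCV(
    HistGradientBoostingClassifier(max_iter=200, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```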
Issues to consider:
- The optimal learning rate depends on the specific dataset and problem
- Very low learning rates converge slowly and will underfit unless max_iter is increased to compensate
- High learning rates can cause unstable training or overfitting
- There’s often a trade-off between model performance and training time, since lower rates need proportionally more boosting iterations (see the timing sketch below)
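To put rough numbers on that last trade-off, here is a small timing sketch. The dataset and the two configurations are illustrative, and absolute times depend entirely on your hardware:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# A low rate with many trees vs. a higher rate with few trees
for lr, n_iter in [(0.01, 1000), (0.1, 100)]:
    model = HistGradientBoostingClassifier(learning_rate=lr, max_iter=n_iter,
                                           random_state=42)
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=3)
    elapsed = time.perf_counter() - start
    print(f"learning_rate={lr}, max_iter={n_iter}: "
          f"accuracy={scores.mean():.3f}, time={elapsed:.1f}s")
```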