The `learning_rate` parameter in scikit-learn's `GradientBoostingRegressor` controls the contribution of each tree to the final model.
Gradient Boosting is an ensemble learning method that builds models sequentially, with each new model attempting to correct the errors made by its predecessors. The `learning_rate` parameter determines the weight of each individual tree's prediction in the final ensemble.
Generally, a lower `learning_rate` leads to a more robust model but requires more trees to achieve the same performance. A higher `learning_rate` can speed up training but may lead to overfitting if set too high.
The default value for `learning_rate` is 0.1. In practice, values between 0.01 and 0.3 are commonly used, depending on the size and complexity of the dataset.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different learning_rate values
learning_rate_values = [0.01, 0.1, 0.2, 0.3]
errors = []
for lr in learning_rate_values:
    gbr = GradientBoostingRegressor(learning_rate=lr, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"learning_rate={lr}, Mean Squared Error: {error:.3f}")
```
Running the example gives an output like:

```
learning_rate=0.01, Mean Squared Error: 8323.362
learning_rate=0.1, Mean Squared Error: 1234.753
learning_rate=0.2, Mean Squared Error: 1002.691
learning_rate=0.3, Mean Squared Error: 1130.817
```
The key steps in this example are:

- Generate a synthetic regression dataset with relevant features
- Split the data into train and test sets
- Train `GradientBoostingRegressor` models with different `learning_rate` values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting `learning_rate`:

- Start with the default value of 0.1 and adjust based on model performance
- Lower `learning_rate` values generally require more trees to reach optimal performance
- Balance `learning_rate` against the number of trees (`n_estimators`) to avoid overfitting or underfitting
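The trade-off between `learning_rate` and `n_estimators` can be seen directly by varying both together. The following sketch reuses the same synthetic dataset as the main example; the specific parameter pairs are illustrative, not tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Same synthetic dataset as the main example
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare a small learning_rate with few vs. many trees,
# alongside the default setting (illustrative pairs only)
results = {}
for lr, n in [(0.01, 100), (0.01, 1000), (0.1, 100)]:
    gbr = GradientBoostingRegressor(learning_rate=lr, n_estimators=n, random_state=42)
    gbr.fit(X_train, y_train)
    results[(lr, n)] = mean_squared_error(y_test, gbr.predict(X_test))
    print(f"learning_rate={lr}, n_estimators={n}, MSE: {results[(lr, n)]:.3f}")
```

With `learning_rate=0.01`, 100 trees underfit badly; giving the same small step size many more trees closes much of the gap, which is why the two parameters are usually tuned together.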
Issues to consider:

- Lower `learning_rate` values can lead to longer training times
- Too high a `learning_rate` can cause the model to overfit
- The optimal `learning_rate` depends on the specific dataset and problem complexity
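One way to observe the overfitting risk directly is to track test error as trees are added, using the model's `staged_predict` method. This is a sketch on the same synthetic dataset as above, with a deliberately aggressive `learning_rate` chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Same synthetic dataset as the main example
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An aggressive learning_rate with many trees (illustrative settings)
gbr = GradientBoostingRegressor(learning_rate=0.3, n_estimators=500, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after each boosting stage,
# so we can see where test error bottoms out
test_mse = [mean_squared_error(y_test, y_pred) for y_pred in gbr.staged_predict(X_test)]
best_stage = int(np.argmin(test_mse)) + 1
print(f"Lowest test MSE at {best_stage} trees: {min(test_mse):.3f}")
print(f"Test MSE with all 500 trees: {test_mse[-1]:.3f}")
```

If the test error curve flattens or rises after some stage, trees beyond that point are fitting noise; in that case lower the `learning_rate`, reduce `n_estimators`, or use scikit-learn's built-in early stopping (`n_iter_no_change` with `validation_fraction`).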