The `learning_rate` parameter in scikit-learn's `GradientBoostingRegressor` controls the contribution of each tree to the final model.
Gradient Boosting is an ensemble learning method that builds models sequentially, with each new model attempting to correct the errors made by its predecessors. The `learning_rate` parameter determines the weight of each individual tree's prediction in the final ensemble.
Generally, a lower `learning_rate` leads to a more robust model but requires more trees to achieve the same performance. A higher `learning_rate` can speed up training but may lead to overfitting if set too high.
The default value for `learning_rate` is 0.1. In practice, values between 0.01 and 0.3 are commonly used, depending on the size and complexity of the dataset.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different learning_rate values
learning_rate_values = [0.01, 0.1, 0.2, 0.3]
errors = []
for lr in learning_rate_values:
    gbr = GradientBoostingRegressor(learning_rate=lr, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"learning_rate={lr}, Mean Squared Error: {error:.3f}")
```
Running the example gives an output like:

```
learning_rate=0.01, Mean Squared Error: 8323.362
learning_rate=0.1, Mean Squared Error: 1234.753
learning_rate=0.2, Mean Squared Error: 1002.691
learning_rate=0.3, Mean Squared Error: 1130.817
```
The key steps in this example are:

- Generate a synthetic regression dataset with relevant features
- Split the data into train and test sets
- Train `GradientBoostingRegressor` models with different `learning_rate` values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting `learning_rate`:

- Start with the default value of 0.1 and adjust based on model performance
- Lower `learning_rate` values generally require more trees to reach optimal performance
- Balance `learning_rate` against the number of trees (`n_estimators`) to avoid overfitting or underfitting
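The trade-off between `learning_rate` and `n_estimators` can be seen directly by varying both together. The following sketch reuses the same synthetic dataset as the main example; the specific parameter pairs are illustrative, not tuned values:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Same synthetic dataset as the main example
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare a small learning_rate with few vs. many trees,
# alongside the default setting (illustrative pairs only)
results = {}
for lr, n in [(0.01, 100), (0.01, 1000), (0.1, 100)]:
    gbr = GradientBoostingRegressor(learning_rate=lr, n_estimators=n, random_state=42)
    gbr.fit(X_train, y_train)
    results[(lr, n)] = mean_squared_error(y_test, gbr.predict(X_test))
    print(f"learning_rate={lr}, n_estimators={n}, MSE: {results[(lr, n)]:.3f}")
```

With `learning_rate=0.01`, 100 trees underfit badly; giving the same small step size many more trees closes much of the gap, which is why the two parameters are usually tuned together.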
Issues to consider:

- Lower `learning_rate` values can lead to longer training times
- Too high a `learning_rate` can cause the model to overfit
- The optimal `learning_rate` depends on the specific dataset and problem complexity
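One way to observe the overfitting risk directly is to track test error as trees are added, using the model's `staged_predict` method. This is a sketch on the same synthetic dataset as above, with a deliberately aggressive `learning_rate` chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Same synthetic dataset as the main example
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An aggressive learning_rate with many trees (illustrative settings)
gbr = GradientBoostingRegressor(learning_rate=0.3, n_estimators=500, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after each boosting stage,
# so we can see where test error bottoms out
test_mse = [mean_squared_error(y_test, y_pred) for y_pred in gbr.staged_predict(X_test)]
best_stage = int(np.argmin(test_mse)) + 1
print(f"Lowest test MSE at {best_stage} trees: {min(test_mse):.3f}")
print(f"Test MSE with all 500 trees: {test_mse[-1]:.3f}")
```

If the test error curve flattens or rises after some stage, trees beyond that point are fitting noise; in that case lower the `learning_rate`, reduce `n_estimators`, or use scikit-learn's built-in early stopping (`n_iter_no_change` with `validation_fraction`).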