The `criterion` parameter in scikit-learn's `GradientBoostingRegressor` controls the function used to measure the quality of a split in the underlying decision trees.
Gradient Boosting is an ensemble learning method that builds a predictive model by combining multiple weak learners, typically decision trees, in a stage-wise fashion. At each node of each tree, the `criterion` determines how candidate splits are scored so the best one can be chosen.
The default value for `criterion` is “friedman_mse”.
The two supported values are “friedman_mse” and “squared_error”. “friedman_mse” is mean squared error with Friedman's improvement score, which can find slightly better splits in some cases and is generally the recommended choice; “squared_error” is plain mean squared error.
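As a quick check, the default can be read off a freshly constructed estimator, since scikit-learn stores constructor parameters as public attributes:
from sklearn.ensemble import GradientBoostingRegressor
# The criterion passed to (or defaulted by) the constructor is kept as an attribute
print(GradientBoostingRegressor().criterion)  # friedman_mse
The full example below trains a model with each supported value and compares the test-set mean squared error.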
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criterion_values = ['friedman_mse', 'squared_error']
errors = []
for criterion in criterion_values:
    gbr = GradientBoostingRegressor(criterion=criterion, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"criterion={criterion}, MSE: {error:.3f}")
Running the example gives an output like:
criterion=friedman_mse, MSE: 1234.753
criterion=squared_error, MSE: 1234.753
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into train and test sets.
- Train `GradientBoostingRegressor` models with different `criterion` values.
- Evaluate and compare the mean squared error for each model on the test set.
Some tips and heuristics for setting `criterion`:
- Start with the default “friedman_mse” and compare against “squared_error” to understand their impact; a cross-validated grid search over both values, as sketched after this list, makes the comparison systematic.
- Do not expect large differences: both criteria are based on squared error (as the identical MSE values above suggest), with “friedman_mse” additionally using Friedman's improvement score to rank candidate splits.
- Consider the nature of your dataset when choosing the criterion, as the way splits are scored can affect which trees are grown and, in turn, model performance.
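A minimal sketch of that grid search, assuming the same synthetic data as in the example above (only `criterion` is tuned here; in practice you would tune it alongside parameters such as `n_estimators` and `learning_rate`):
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
# Same synthetic dataset as the main example
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Cross-validated comparison of the two supported criterion values
param_grid = {"criterion": ["friedman_mse", "squared_error"]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # e.g. {'criterion': 'friedman_mse'}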
Issues to consider:
- Different criteria can affect training time; timing the fit for each value, as in the sketch below, is the simplest way to measure this on your data.
- Neither criterion changes the model's robustness to outliers, since both are squared-error based; for noisy or heavy-tailed targets, the `loss` parameter (e.g. “huber” or “absolute_error”) is the more relevant knob.
- Evaluate the impact on both predictive performance and computational cost for your specific use case.
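A rough timing sketch, again assuming the synthetic data from the main example (wall-clock numbers will vary by machine, so treat this as a comparison aid rather than a benchmark):
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Time the fit for each supported criterion value
for criterion in ["friedman_mse", "squared_error"]:
    start = time.perf_counter()
    GradientBoostingRegressor(criterion=criterion, random_state=42).fit(X, y)
    print(f"criterion={criterion}: fit took {time.perf_counter() - start:.2f}s")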