The `criterion` parameter in scikit-learn's `GradientBoostingRegressor` controls the function used to measure the quality of a split in the underlying decision trees.
Gradient Boosting is an ensemble learning method that builds a predictive model by combining multiple weak learners, typically decision trees, in a stage-wise fashion. At each node of each tree, the `criterion` determines how candidate splits are scored so the best one can be chosen.
The default value for `criterion` is “friedman_mse”.
The two supported values are “friedman_mse” and “squared_error”. “friedman_mse” is mean squared error with Friedman's improvement score, which can find slightly better splits in some cases and is generally the recommended choice; “squared_error” is plain mean squared error.
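As a quick check, the default can be read off a freshly constructed estimator, since scikit-learn stores constructor parameters as public attributes:
from sklearn.ensemble import GradientBoostingRegressor
# The criterion passed to (or defaulted by) the constructor is kept as an attribute
print(GradientBoostingRegressor().criterion)  # friedman_mse
The full example below trains a model with each supported value and compares the test-set mean squared error.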
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criterion_values = ['friedman_mse', 'squared_error']
errors = []
for criterion in criterion_values:
    gbr = GradientBoostingRegressor(criterion=criterion, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    error = mean_squared_error(y_test, y_pred)
    errors.append(error)
    print(f"criterion={criterion}, MSE: {error:.3f}")
Running the example gives an output like:
criterion=friedman_mse, MSE: 1234.753
criterion=squared_error, MSE: 1234.753
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into train and test sets.
- Train `GradientBoostingRegressor` models with different `criterion` values.
- Evaluate and compare the mean squared error for each model on the test set.
Some tips and heuristics for setting `criterion`:
- Start with the default “friedman_mse” and compare against “squared_error” to understand their impact; a cross-validated grid search over both values, as sketched after this list, makes the comparison systematic.
- Do not expect large differences: both criteria are based on squared error (as the identical MSE values above suggest), with “friedman_mse” additionally using Friedman's improvement score to rank candidate splits.
- Consider the nature of your dataset when choosing the criterion, as the way splits are scored can affect which trees are grown and, in turn, model performance.
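A minimal sketch of that grid search, assuming the same synthetic data as in the example above (only `criterion` is tuned here; in practice you would tune it alongside parameters such as `n_estimators` and `learning_rate`):
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
# Same synthetic dataset as the main example
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Cross-validated comparison of the two supported criterion values
param_grid = {"criterion": ["friedman_mse", "squared_error"]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # e.g. {'criterion': 'friedman_mse'}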
Issues to consider:
- Different criteria can affect training time; timing the fit for each value, as in the sketch below, is the simplest way to measure this on your data.
- Neither criterion changes the model's robustness to outliers, since both are squared-error based; for noisy or heavy-tailed targets, the `loss` parameter (e.g. “huber” or “absolute_error”) is the more relevant knob.
- Evaluate the impact on both predictive performance and computational cost for your specific use case.
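A rough timing sketch, again assuming the synthetic data from the main example (wall-clock numbers will vary by machine, so treat this as a comparison aid rather than a benchmark):
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Time the fit for each supported criterion value
for criterion in ["friedman_mse", "squared_error"]:
    start = time.perf_counter()
    GradientBoostingRegressor(criterion=criterion, random_state=42).fit(X, y)
    print(f"criterion={criterion}: fit took {time.perf_counter() - start:.2f}s")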