The `criterion` parameter in scikit-learn's `GradientBoostingClassifier` determines the function used to measure the quality of a split at each node of the decision trees.

Gradient Boosting is an ensemble method that sequentially adds decision trees, each one correcting the errors made by the previous trees. The `criterion` parameter affects how the algorithm decides to split nodes when building these trees.

The default value for `criterion` is `'friedman_mse'`, which refers to the mean squared error with improvement score by Friedman. The other supported value is `'squared_error'`, the regular mean squared error.
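To confirm the default on your installed version, you can inspect the estimator's parameters directly (a quick sanity check, assuming a recent scikit-learn release):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Inspect the default value of the criterion parameter
clf = GradientBoostingClassifier()
print(clf.get_params()["criterion"])  # 'friedman_mse' in recent versions
```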
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different criterion values
criterion_values = ['friedman_mse', 'squared_error']
accuracies = []

for criterion in criterion_values:
    gb = GradientBoostingClassifier(criterion=criterion, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion='{criterion}', Accuracy: {accuracy:.3f}")
```
Running the example gives output like:

```
criterion='friedman_mse', Accuracy: 0.785
criterion='squared_error', Accuracy: 0.785
```
The key steps in this example are:

- Generate a synthetic multiclass classification dataset with informative and noise features
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `criterion` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `criterion`:

- The default `'friedman_mse'` often works well and is a good starting point
- `'squared_error'` may be better for small datasets or noisy data
Issues to consider:

- The choice of `criterion` can affect both training time and model performance
- Differences between criteria may be more noticeable on certain types of datasets
- It's worth experimenting with different values, but the default is often a solid choice
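If you want to experiment systematically rather than loop by hand, `criterion` can be searched like any other hyperparameter with `GridSearchCV` (a minimal sketch on a synthetic dataset; in practice you would likely add other parameters such as `learning_rate` or `max_depth` to the grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_informative=5, random_state=42)

# Cross-validated search over the supported criterion values
param_grid = {"criterion": ["friedman_mse", "squared_error"]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```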