The `criterion` parameter in scikit-learn's `GradientBoostingClassifier` determines the function used to measure the quality of a split at each node of the decision trees.

Gradient Boosting is an ensemble method that sequentially adds decision trees, each one correcting the errors made by the previous trees. The `criterion` parameter affects how the algorithm decides to split nodes when building these trees.

The default value for `criterion` is `'friedman_mse'`, which refers to the mean squared error with improvement score by Friedman. The other supported value is `'squared_error'`, the regular mean squared error.
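To confirm the default on your installed version, you can inspect the estimator's parameters directly (a quick sanity check, assuming a recent scikit-learn release):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Inspect the default value of the criterion parameter
clf = GradientBoostingClassifier()
print(clf.get_params()["criterion"])  # 'friedman_mse' in recent versions
```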
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different criterion values
criterion_values = ['friedman_mse', 'squared_error']
accuracies = []

for criterion in criterion_values:
    gb = GradientBoostingClassifier(criterion=criterion, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"criterion='{criterion}', Accuracy: {accuracy:.3f}")
```
Running the example gives output like:

```
criterion='friedman_mse', Accuracy: 0.785
criterion='squared_error', Accuracy: 0.785
```
The key steps in this example are:

- Generate a synthetic multiclass classification dataset with informative and noise features
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `criterion` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `criterion`:

- The default `'friedman_mse'` often works well and is a good starting point
- `'squared_error'` may be better for small datasets or noisy data
Issues to consider:

- The choice of `criterion` can affect both training time and model performance
- Differences between criteria may be more noticeable on certain types of datasets
- It's worth experimenting with different values, but the default is often a solid choice
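If you want to experiment systematically rather than loop by hand, `criterion` can be searched like any other hyperparameter with `GridSearchCV` (a minimal sketch on a synthetic dataset; in practice you would likely add other parameters such as `learning_rate` or `max_depth` to the grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_informative=5, random_state=42)

# Cross-validated search over the supported criterion values
param_grid = {"criterion": ["friedman_mse", "squared_error"]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```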