Configuring the tol parameter in GradientBoostingClassifier controls the tolerance for the early-stopping criterion based on improvement in the loss.
Gradient Boosting is an ensemble technique that builds models sequentially, with each model trying to correct the errors of the previous ones. The tol parameter specifies the minimum improvement in the loss required for training to continue: when the loss fails to improve by at least tol for n_iter_no_change consecutive iterations, training stops. Note that tol only takes effect when n_iter_no_change is set; otherwise the model always trains for the full n_estimators boosting stages.
The default value for tol is 1e-4. Common values range from 1e-4 to 1e-2, depending on the desired trade-off between training time and model performance.
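As a minimal sketch of how tol is usually paired with early stopping (n_iter_no_change and validation_fraction are real GradientBoostingClassifier parameters; the specific values here are only illustrative):
from sklearn.ensemble import GradientBoostingClassifier

# Stop once the validation loss improves by less than tol
# for 10 consecutive boosting iterations.
gb = GradientBoostingClassifier(
    tol=1e-4,                 # minimum required improvement in the loss
    n_iter_no_change=10,      # patience before stopping
    validation_fraction=0.1,  # share of training data held out for the loss check
    random_state=42,
)
The full example below compares several tol values on a synthetic dataset.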
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different tol values
tol_values = [1e-4, 1e-3, 1e-2]
accuracies = []
for tol in tol_values:
    gb = GradientBoostingClassifier(tol=tol, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"tol={tol}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
tol=0.0001, Accuracy: 0.880
tol=0.001, Accuracy: 0.880
tol=0.01, Accuracy: 0.880
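All three runs reach the same accuracy because n_iter_no_change is left at its default of None, so tol is never consulted and each model simply fits the full default of 100 boosting stages. To see tol take effect, enable early stopping and inspect the fitted model's n_estimators_ attribute, which records how many stages were actually built. A sketch reusing the split from above (the exact stage counts and accuracies will vary):
for tol in [1e-4, 1e-3, 1e-2]:
    gb = GradientBoostingClassifier(tol=tol, n_iter_no_change=10,
                                    validation_fraction=0.1, random_state=42)
    gb.fit(X_train, y_train)
    # n_estimators_ reports how many stages were fitted before early stopping triggered
    print(f"tol={tol}, stages fitted: {gb.n_estimators_}, "
          f"test accuracy: {gb.score(X_test, y_test):.3f}")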
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features
- Split the data into training and test sets
- Train GradientBoostingClassifier models with different tol values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting tol (these apply when early stopping is enabled via n_iter_no_change):
- Start with the default value and adjust based on the required balance between training time and performance
- Higher tol values can reduce training time but may lead to underfitting
- Lower tol values increase training time and can improve performance, but may risk overfitting
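To put numbers on the training-time side of that trade-off, the fits can be timed directly. A sketch using the standard-library time module and the same train/test split, with early stopping enabled so that tol actually matters (timings are machine-dependent):
import time

for tol in [1e-4, 1e-3, 1e-2]:
    gb = GradientBoostingClassifier(tol=tol, n_iter_no_change=10, random_state=42)
    start = time.perf_counter()
    gb.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"tol={tol}, fit time: {elapsed:.2f}s, "
          f"test accuracy: {gb.score(X_test, y_test):.3f}")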
Issues to consider:
- The optimal tol value depends on the complexity of the dataset and the desired balance between training efficiency and model performance
- Too high a tol value may stop training prematurely, leading to suboptimal models
- Too low a tol value can increase the risk of overfitting due to excessive training
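Because the best tol is data-dependent, it can be tuned like any other hyperparameter. A hedged sketch using GridSearchCV, with an illustrative grid over tol and n_iter_no_change and the training data from the example above:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tol": [1e-4, 1e-3, 1e-2],
    "n_iter_no_change": [5, 10],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)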