Configuring the tol parameter in GradientBoostingClassifier controls the tolerance for the early-stopping criterion based on improvement in the loss.
Gradient Boosting is an ensemble technique that builds models sequentially, with each model trying to correct the errors of the previous ones. The tol parameter specifies the minimum improvement in the loss required for training to continue: when the loss fails to improve by at least tol for n_iter_no_change consecutive iterations, training stops. Note that tol only takes effect when n_iter_no_change is set; otherwise the model always trains for the full n_estimators boosting stages.
The default value for tol is 1e-4. Common values range from 1e-4 to 1e-2, depending on the desired trade-off between training time and model performance.
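As a minimal sketch of how tol is usually paired with early stopping (n_iter_no_change and validation_fraction are real GradientBoostingClassifier parameters; the specific values here are only illustrative):
from sklearn.ensemble import GradientBoostingClassifier

# Stop once the validation loss improves by less than tol
# for 10 consecutive boosting iterations.
gb = GradientBoostingClassifier(
    tol=1e-4,                 # minimum required improvement in the loss
    n_iter_no_change=10,      # patience before stopping
    validation_fraction=0.1,  # share of training data held out for the loss check
    random_state=42,
)
The full example below compares several tol values on a synthetic dataset.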
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different tol values
tol_values = [1e-4, 1e-3, 1e-2]
accuracies = []
for tol in tol_values:
    gb = GradientBoostingClassifier(tol=tol, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"tol={tol}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
tol=0.0001, Accuracy: 0.880
tol=0.001, Accuracy: 0.880
tol=0.01, Accuracy: 0.880
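All three runs reach the same accuracy because n_iter_no_change is left at its default of None, so tol is never consulted and each model simply fits the full default of 100 boosting stages. To see tol take effect, enable early stopping and inspect the fitted model's n_estimators_ attribute, which records how many stages were actually built. A sketch reusing the split from above (the exact stage counts and accuracies will vary):
for tol in [1e-4, 1e-3, 1e-2]:
    gb = GradientBoostingClassifier(tol=tol, n_iter_no_change=10,
                                    validation_fraction=0.1, random_state=42)
    gb.fit(X_train, y_train)
    # n_estimators_ reports how many stages were fitted before early stopping triggered
    print(f"tol={tol}, stages fitted: {gb.n_estimators_}, "
          f"test accuracy: {gb.score(X_test, y_test):.3f}")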
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features
- Split the data into training and test sets
- Train GradientBoostingClassifier models with different tol values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting tol (these apply when early stopping is enabled via n_iter_no_change):
- Start with the default value and adjust based on the required balance between training time and performance
- Higher tol values can reduce training time but may lead to underfitting
- Lower tol values increase training time and can improve performance, but may risk overfitting
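To put numbers on the training-time side of that trade-off, the fits can be timed directly. A sketch using the standard-library time module and the same train/test split, with early stopping enabled so that tol actually matters (timings are machine-dependent):
import time

for tol in [1e-4, 1e-3, 1e-2]:
    gb = GradientBoostingClassifier(tol=tol, n_iter_no_change=10, random_state=42)
    start = time.perf_counter()
    gb.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"tol={tol}, fit time: {elapsed:.2f}s, "
          f"test accuracy: {gb.score(X_test, y_test):.3f}")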
Issues to consider:
- The optimal tol value depends on the complexity of the dataset and the desired balance between training efficiency and model performance
- Too high a tol value may stop training prematurely, leading to suboptimal models
- Too low a tol value can increase the risk of overfitting due to excessive training
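Because the best tol is data-dependent, it can be tuned like any other hyperparameter. A hedged sketch using GridSearchCV, with an illustrative grid over tol and n_iter_no_change and the training data from the example above:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tol": [1e-4, 1e-3, 1e-2],
    "n_iter_no_change": [5, 10],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)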