The `n_iter_no_change` parameter in scikit-learn's `HistGradientBoostingClassifier` controls early stopping during training. Early stopping is a technique to prevent overfitting by halting training when the model's performance on a held-out validation set stops improving. The `n_iter_no_change` parameter specifies the number of iterations with no improvement after which training stops; note that it only takes effect when early stopping is enabled via the `early_stopping` parameter.
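As a minimal sketch of how these options fit together, the configuration below forces early stopping on so that `n_iter_no_change` is guaranteed to apply (the specific values are illustrative, not recommendations):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative configuration: early stopping is forced on so that
# n_iter_no_change is guaranteed to take effect.
clf = HistGradientBoostingClassifier(
    early_stopping=True,      # the 'auto' default enables it only for >10,000 samples
    n_iter_no_change=10,      # stop after 10 iterations without improvement
    validation_fraction=0.1,  # share of the training data held out for the check
    tol=1e-7,                 # absolute tolerance when comparing validation scores
    random_state=42,
)
```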
This parameter helps balance model performance against training time. A smaller value can lead to faster training but potentially underfit models, while a larger value allows more iterations for potential improvement but may increase training time.
The default value for `n_iter_no_change` is 10. In practice, values between 5 and 50 are commonly used, depending on the dataset size and complexity.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_iter_no_change values
n_iter_values = [5, 10, 20, 50]
results = []

for n in n_iter_values:
    start_time = time.time()
    hgbc = HistGradientBoostingClassifier(n_iter_no_change=n, random_state=42)
    hgbc.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    results.append((n, accuracy, train_time, hgbc.n_iter_))
    print(f"n_iter_no_change={n}, Accuracy: {accuracy:.3f}, "
          f"Training Time: {train_time:.2f}s, Iterations: {hgbc.n_iter_}")
```
Running the example gives an output like:
```
n_iter_no_change=5, Accuracy: 0.912, Training Time: 0.87s, Iterations: 100
n_iter_no_change=10, Accuracy: 0.912, Training Time: 0.83s, Iterations: 100
n_iter_no_change=20, Accuracy: 0.912, Training Time: 0.84s, Iterations: 100
n_iter_no_change=50, Accuracy: 0.912, Training Time: 0.83s, Iterations: 100
```
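Note that every run reports 100 iterations, which is the default `max_iter` ceiling: early stopping never actually triggered. With the default `early_stopping='auto'`, scikit-learn only enables early stopping when the training set has more than 10,000 samples, and after the 80/20 split this one has 8,000, so `n_iter_no_change` had no effect. To see it influence the stopping point, enable early stopping explicitly; a hedged sketch reusing `X_train` and `y_train` from the example above:

```python
# Reuses X_train/y_train from the example above. With early stopping
# forced on, training can halt before max_iter when the validation
# score stops improving for n_iter_no_change consecutive iterations.
hgbc = HistGradientBoostingClassifier(
    early_stopping=True,
    n_iter_no_change=5,
    validation_fraction=0.1,
    random_state=42,
)
hgbc.fit(X_train, y_train)
print(hgbc.n_iter_)  # may now be well below the max_iter ceiling of 100
```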
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `HistGradientBoostingClassifier` models with different `n_iter_no_change` values
- Measure training time, final number of iterations, and test set accuracy for each model (the per-iteration scores behind early stopping can also be inspected, as sketched below)
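When early stopping is active, the fitted estimator records per-iteration scores that show where improvement stalled. A brief sketch, again reusing the training data from the example above:

```python
# With early stopping on, the fitted model exposes per-iteration scores.
# Under the default scoring='loss', these are negative losses, so higher is better.
hgbc = HistGradientBoostingClassifier(early_stopping=True, n_iter_no_change=10,
                                      random_state=42)
hgbc.fit(X_train, y_train)
print(hgbc.n_iter_)                 # number of iterations actually run
print(hgbc.train_score_[-3:])       # last few scores on the training data
print(hgbc.validation_score_[-3:])  # last few scores on the held-out validation data
```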
Some tips and heuristics for setting `n_iter_no_change`:

- Start with the default value of 10 and adjust based on model performance and training time
- For larger datasets, consider increasing `n_iter_no_change` to allow more iterations for potential improvement
- If training time is a concern, use a smaller value but monitor for potential underfitting
Issues to consider:
- The optimal value depends on the dataset size, complexity, and the trade-off between training time and model performance
- Too small a value may lead to premature stopping and underfitting
- Too large a value may unnecessarily increase training time without significant performance gains
- Consider using cross-validation to find the best value for your specific dataset, as sketched below
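As a hedged sketch of that last point, a `GridSearchCV` over `n_iter_no_change` could look like the following, reusing the training data from the example above (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid search over n_iter_no_change; early_stopping=True
# ensures the parameter actually affects training.
param_grid = {"n_iter_no_change": [5, 10, 20, 50]}
search = GridSearchCV(
    HistGradientBoostingClassifier(early_stopping=True, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)
```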