Configure GradientBoostingClassifier "validation_fraction" Parameter

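The validation_fraction parameter in scikit-learn’s GradientBoostingClassifier specifies the proportion of training data to set aside as validation data for early stopping. It only takes effect when n_iter_no_change is set to an integer; with the default n_iter_no_change=None, early stopping is off and validation_fraction is ignored.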

GradientBoostingClassifier is an ensemble learning method that builds a series of decision trees, where each tree corrects the errors of the previous ones.

The validation_fraction parameter helps prevent overfitting: the held-out portion of the training data is scored after each boosting iteration, and training stops once the validation score has failed to improve by at least tol for n_iter_no_change consecutive iterations.
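
As a rough illustration of this behavior, the sketch below fits the same model with and without n_iter_no_change and compares the fitted n_estimators_ attribute. The dataset, n_estimators=100, and n_iter_no_change=5 are arbitrary choices for the demonstration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data, chosen only for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)

# Default n_iter_no_change=None: early stopping is off, so validation_fraction has no effect
full = GradientBoostingClassifier(n_estimators=100, validation_fraction=0.3, random_state=42)
full.fit(X, y)

# Setting n_iter_no_change enables early stopping on the held-out 30%
early = GradientBoostingClassifier(
    n_estimators=100, validation_fraction=0.3, n_iter_no_change=5, random_state=42
)
early.fit(X, y)

print("Trees fitted without early stopping:", full.n_estimators_)
print("Trees fitted with early stopping:", early.n_estimators_)
```

Without early stopping the model always fits all 100 trees; with it, the count can be lower, depending on when the validation score stops improving.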

The default value for validation_fraction is 0.1, and the value must lie strictly between 0 and 1. Common values range from 0.1 to 0.3, depending on the size and complexity of the dataset.
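
To get a feel for what these fractions mean in absolute terms, the following sketch counts how many samples would be held out from an 800-sample training set. It uses a stratified train_test_split as a stand-in for the estimator's internal hold-out, which is an assumption made for illustration; the exact internal split may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 800 samples, matching the size of an 80% training split of 1000 samples
X, y = make_classification(n_samples=800, n_features=20, n_informative=15, random_state=42)

held_out = {}
for vf in [0.1, 0.2, 0.3]:
    # Stand-in for the internal validation split (assumption for illustration)
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=vf, stratify=y, random_state=42
    )
    held_out[vf] = (len(X_fit), len(X_val))
    print(f"validation_fraction={vf}: {len(X_fit)} samples for fitting, {len(X_val)} for validation")
```

Every sample moved into the validation set is one that the trees never see during fitting, which is the trade-off behind the choice of fraction.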

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different validation_fraction values (n_iter_no_change=10 enables early stopping)
validation_fraction_values = [0.1, 0.2, 0.3]
accuracies = []

for vf in validation_fraction_values:
    gbc = GradientBoostingClassifier(validation_fraction=vf, n_iter_no_change=10, random_state=42)
    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"validation_fraction={vf}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

validation_fraction=0.1, Accuracy: 0.875
validation_fraction=0.2, Accuracy: 0.885
validation_fraction=0.3, Accuracy: 0.880

The key steps in this example are:

Generate a synthetic binary classification dataset with informative features.
Split the data into training and test sets.
Train GradientBoostingClassifier models with different validation_fraction values.
Evaluate the accuracy of each model on the test set.

Some tips and heuristics for setting validation_fraction:

Start with the default value of 0.1 and adjust based on model performance and dataset size.
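A higher validation_fraction may be beneficial for smaller datasets to ensure sufficient validation data; on larger datasets, a small fraction already yields many validation samples.
Check the fitted n_estimators_ attribute to see how many trees were built before early stopping triggered, and compare it against n_estimators.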
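
The estimator does not expose its internal validation scores, so one way to watch validation performance directly is to hold out data manually and score each boosting stage. The sketch below does this with staged_predict_proba and log_loss; the manual 20% split and the variable names are assumptions for illustration, not what the estimator does internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)

# Manual hold-out, used here only to monitor the validation loss curve
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X_fit, y_fit)

# Validation log loss after each boosting stage (1 tree, 2 trees, ...)
val_losses = [log_loss(y_val, proba) for proba in gbc.staged_predict_proba(X_val)]
best_n_trees = val_losses.index(min(val_losses)) + 1

print(f"Lowest validation log loss {min(val_losses):.3f} at {best_n_trees} trees")
```

If the lowest loss occurs well before the final stage, early stopping is likely to save training time without hurting accuracy.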

Issues to consider:

Too small a validation_fraction may not provide enough data for reliable early stopping.
Too large a validation_fraction reduces the effective training data size, potentially impacting model performance.
The optimal validation_fraction is dataset-dependent and may require tuning.
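
Since the best value is dataset-dependent, one option is to treat validation_fraction as an ordinary hyperparameter and tune it with cross-validation. The following is a minimal sketch using GridSearchCV; the grid values, cv=3, and the smaller dataset are illustrative choices to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, n_informative=15, random_state=42)

# Candidate fractions to compare (illustrative grid)
param_grid = {"validation_fraction": [0.1, 0.2, 0.3]}

search = GridSearchCV(
    GradientBoostingClassifier(n_iter_no_change=10, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best validation_fraction:", search.best_params_["validation_fraction"])
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
```

Differences between fractions are often small relative to fold-to-fold noise, so it is worth checking the spread of scores in search.cv_results_ before reading much into the winner.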

See Also