The `validation_fraction` parameter in scikit-learn's `GradientBoostingRegressor` controls the fraction of training data set aside for validation during training.
Gradient Boosting is a machine learning technique for regression problems, which builds an ensemble of trees sequentially to minimize the loss function. It is known for its accuracy and ability to handle various types of data.
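The sequential idea can be sketched by hand for squared-error loss, where each new tree is fit to the residuals of the current ensemble (a simplified illustration, not scikit-learn's internal implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

learning_rate = 0.1
pred = np.full(len(y), y.mean())      # start from the mean prediction
trees = []
for _ in range(50):
    residuals = y - pred              # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

# Training error shrinks as trees are added to the ensemble
print(np.mean((y - pred) ** 2))
```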
The `validation_fraction` parameter specifies the fraction of training data to set aside as validation data for early stopping. It is only used when `n_iter_no_change` is set to an integer; the held-out set is then used to monitor the validation error during training and stop early to prevent overfitting.
The default value for `validation_fraction` is 0.1 (10% of the training data).
In practice, values between 0.1 and 0.3 are commonly used depending on the dataset size and complexity.
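A minimal sketch of pairing `validation_fraction` with `n_iter_no_change`, which enables early stopping (parameter values here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Hold out 20% of the training data for validation and stop once the
# validation score fails to improve for 10 consecutive iterations.
gbr = GradientBoostingRegressor(
    n_estimators=500,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
)
gbr.fit(X, y)

# n_estimators_ reports how many trees were actually fit; it falls
# below n_estimators whenever early stopping triggers
print(gbr.n_estimators_)
```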
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different validation_fraction values
validation_fraction_values = [0.1, 0.2, 0.3]
mse_scores = []
for vf in validation_fraction_values:
    # n_iter_no_change enables early stopping; without it, validation_fraction is ignored
    gbr = GradientBoostingRegressor(validation_fraction=vf, n_estimators=100,
                                    n_iter_no_change=10, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"validation_fraction={vf}, MSE: {mse:.3f}")
Running the example prints the test-set MSE for each `validation_fraction` setting; the exact values depend on the dataset and on where early stopping halts each model.
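As a side note, `validation_fraction` is only consulted when `n_iter_no_change` is set; with early stopping disabled (the default), every setting produces an identical model. A quick check of that behavior:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=42)

# Early stopping is off (n_iter_no_change is None by default), so
# validation_fraction is ignored and both models are trained identically.
m1 = GradientBoostingRegressor(validation_fraction=0.1, random_state=42).fit(X, y)
m2 = GradientBoostingRegressor(validation_fraction=0.3, random_state=42).fit(X, y)

print(np.allclose(m1.predict(X), m2.predict(X)))  # True
```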
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train `GradientBoostingRegressor` models with different `validation_fraction` values
- Evaluate the mean squared error (MSE) of each model on the test set
Some tips and heuristics for setting `validation_fraction`:
- Use the default value of 0.1 initially and adjust based on validation error
- Larger values of `validation_fraction` may be beneficial for larger datasets to ensure robust early stopping
- Monitor the validation error to decide when to stop training to prevent overfitting
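To monitor the error curve directly rather than relying only on the internal validation split, `staged_predict` scores the ensemble after every boosting iteration; a sketch on a held-out set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=200, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after each boosting stage, so we can
# trace the validation error curve and find the best iteration count.
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbr.staged_predict(X_val)]
best_iter = int(np.argmin(val_errors)) + 1
print(f"Best iteration: {best_iter} of {gbr.n_estimators}")
```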
Issues to consider:
- The optimal `validation_fraction` value depends on the dataset size and variability
- Too small a fraction might not provide a reliable validation set, while too large a fraction reduces the amount of training data
- Consider the computational cost: a larger validation fraction means more held-out samples must be scored at every boosting iteration
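As a rough sanity check on that trade-off, the held-out validation set is approximately `validation_fraction` times the training set size (the sample counts below are for illustration only):

```python
# Rough sanity check: how many samples does a given validation_fraction
# actually reserve?  With small datasets, small fractions leave very few
# validation points, making the early-stopping signal noisy.
n_train = 500  # hypothetical training set size
for vf in [0.05, 0.1, 0.2, 0.3]:
    n_val = int(n_train * vf)  # approximate; sklearn's internal split may round up
    print(f"validation_fraction={vf}: {n_val} validation samples, "
          f"{n_train - n_val} left for fitting the trees")
```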