The cv parameter in scikit-learn’s StackingRegressor controls the cross-validation strategy used to generate the base-model predictions on which the meta-model is trained.
StackingRegressor is an ensemble method that combines multiple base regressors by training a meta-regressor on their predictions. The cv parameter determines how these predictions are generated during training.
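To make the mechanism concrete, the following is a minimal sketch of the idea using cross_val_predict directly. It is an approximation for illustration, not a copy of StackingRegressor's internals; the dataset sizes and variable names such as oof_predictions are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

base_models = [
    RandomForestRegressor(n_estimators=10, random_state=0),
    LinearRegression(),
]

# One column of out-of-fold predictions per base model: every row is predicted
# by a copy of the model that did not see that row during fitting.
oof_predictions = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models]
)

# The meta-regressor is trained on those predictions, not on the raw features.
meta_model = LinearRegression().fit(oof_predictions, y)
print(oof_predictions.shape)
```

Because the meta-regressor only ever sees predictions made on held-out rows, it learns how to weight the base models without being misled by how well they fit their own training data.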
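Setting cv affects the trade-off between bias and variance in the stacked model: with more folds, each base model is fit on a larger share of the training data when producing its held-out predictions, so those predictions look more like what the final base models will produce. It also impacts training time, since every base model is refit once per fold, and the risk of overfitting, since the meta-regressor should only see predictions made on data the base models were not trained on.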
The default value for cv is None, which uses 5-fold cross-validation. Common alternatives include an integer number of folds such as 10-fold cross-validation (cv=10), leave-one-out cross-validation via a LeaveOneOut instance, or a specific KFold or GroupKFold instance.
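As a rough illustration of these options, the sketch below fits the same stack with several cv settings. The small dataset and the labels in cv_options are assumptions made for the example; leave-one-out is passed as a LeaveOneOut() splitter object.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut

# Kept deliberately small so the leave-one-out run finishes quickly
X, y = make_regression(n_samples=60, n_features=5, noise=0.5, random_state=0)

estimators = [
    ("rf", RandomForestRegressor(n_estimators=5, random_state=0)),
    ("lr", LinearRegression()),
]

cv_options = {
    "default (None, 5-fold)": None,
    "integer (10-fold)": 10,
    "shuffled KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
}

fitted = {}
for label, cv in cv_options.items():
    stack = StackingRegressor(
        estimators=estimators,
        final_estimator=LinearRegression(),
        cv=cv,
    )
    fitted[label] = stack.fit(X, y)
    print(f"{label}: fitted {len(stack.estimators_)} base models")
```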
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import StackingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('lr', LinearRegression())
]
# Train with different cv values
cv_values = [3, 5, 10]
mse_scores = []
for cv in cv_values:
    stack = StackingRegressor(
        estimators=base_models,
        final_estimator=LinearRegression(),
        cv=cv
    )
    stack.fit(X_train, y_train)
    y_pred = stack.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"cv={cv}, MSE: {mse:.3f}")
# Find best cv value
best_cv = cv_values[np.argmin(mse_scores)]
print(f"Best cv value: {best_cv}")
Running the example gives an output like:
cv=3, MSE: 0.238
cv=5, MSE: 0.238
cv=10, MSE: 0.238
Best cv value: 10
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define base models for the StackingRegressor
- Train StackingRegressor models with different cv values
- Evaluate the mean squared error of each model on the test set
- Identify the best cv value based on lowest MSE
Some tips and heuristics for setting cv:
- Start with the default 5-fold cross-validation and experiment with other values
- Use cv=LeaveOneOut() (from sklearn.model_selection) for small datasets to maximize training data usage
- Consider using GroupKFold for datasets with grouped observations; GroupKFold needs group labels to split, so one option is to pass precomputed splits as cv
- Balance higher cv values, which give more stable estimates, against computational cost
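For grouped observations, one approach is to precompute the GroupKFold splits and pass them as cv, since cv also accepts an iterable of train/test index pairs. The sketch below assumes a hypothetical grouping of 12 groups with 10 samples each; the groups array is invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

X, y = make_regression(n_samples=120, n_features=5, noise=0.5, random_state=0)

# Hypothetical grouping: 12 groups of 10 consecutive samples each
groups = np.repeat(np.arange(12), 10)

# Precompute the folds so that no group is split between train and test
splits = list(GroupKFold(n_splits=4).split(X, y, groups=groups))

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
        ("lr", LinearRegression()),
    ],
    final_estimator=LinearRegression(),
    cv=splits,  # list of (train_indices, test_indices) pairs
)
stack.fit(X, y)
print(f"Fitted with {len(splits)} group-aware folds")
```

The precomputed indices refer to rows of the array passed to fit, so they have to be regenerated whenever the training data changes.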
Issues to consider:
- Higher cv values increase training time but may lead to more robust models
- Very high cv values (or leave-one-out) can be computationally expensive for large datasets
- The optimal cv value depends on the dataset size, feature dimensionality, and base models used
- Ensure that each fold in cross-validation contains a representative sample of the data
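To get a feel for the cost side of these issues, the sketch below counts how often a base model is fitted for different cv values. CountingLinearRegression is a hypothetical helper written for this example, not part of scikit-learn, and the counter only works with the default sequential execution (n_jobs left unset).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression


class CountingLinearRegression(LinearRegression):
    """LinearRegression that counts fit calls (illustrative helper only)."""

    n_fits = 0  # class-level counter shared by all clones

    def fit(self, X, y, sample_weight=None):
        CountingLinearRegression.n_fits += 1
        return super().fit(X, y, sample_weight=sample_weight)


X, y = make_regression(n_samples=100, n_features=5, noise=0.5, random_state=0)

fit_counts = {}
for cv in [3, 5, 10]:
    CountingLinearRegression.n_fits = 0
    stack = StackingRegressor(
        estimators=[("lr", CountingLinearRegression())],
        final_estimator=LinearRegression(),
        cv=cv,
    )
    stack.fit(X, y)
    fit_counts[cv] = CountingLinearRegression.n_fits
    print(f"cv={cv}: base model fitted {fit_counts[cv]} times")
```

Each base model is fitted once per fold to produce the held-out predictions and once more on the full training set, so the number of fits grows linearly with cv; with leave-one-out it grows with the number of training samples.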