
Configure StackingRegressor "cv" Parameter

The cv parameter in scikit-learn's StackingRegressor controls the cross-validation strategy used to generate the base-model predictions on which the meta-model is trained.

StackingRegressor is an ensemble method that combines multiple base regressors by training a meta-regressor on their predictions. The cv parameter determines how these predictions are generated during training.
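To make the mechanism concrete, the sketch below approximates by hand what cv controls: each base regressor produces out-of-fold predictions with cross_val_predict, and those predictions become the training features of the meta-regressor. This illustrates the idea and is not scikit-learn's internal code; the dataset, the choice of models, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset (illustrative values only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=0)

# Two hypothetical base regressors
base = [LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0)]

# One column of out-of-fold predictions per base regressor:
# every row is predicted by a model that never saw that row during fitting
oof = np.column_stack(
    [cross_val_predict(model, X_demo, y_demo, cv=5) for model in base]
)

# The meta-regressor is trained on the out-of-fold predictions, not on X_demo
meta = Ridge().fit(oof, y_demo)

print(oof.shape)  # (200, 2): one row per sample, one column per base model
```

Changing cv=5 in cross_val_predict changes how much data each base model sees when producing its column of predictions, which is the same knob that StackingRegressor exposes.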

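The choice of cv affects how the meta-model is trained. With more folds, each base model is fit on a larger share of the training data when producing its out-of-fold predictions, at the cost of more model fits and longer training time. Because the meta-model sees only out-of-fold predictions, cross-validation also reduces the risk that it overfits to base models that have memorized the training data.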
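The effect on training time can be estimated with simple arithmetic. The helper below is a hypothetical function (not part of scikit-learn) and assumes a plain k-fold strategy: each base estimator is fit once per fold to produce out-of-fold predictions, once more on the full training set for use at prediction time, and the final estimator is fit once.

```python
def stacking_fit_count(n_base_estimators: int, n_folds: int) -> int:
    """Rough count of model fits for one StackingRegressor.fit call (k-fold cv assumed)."""
    cv_fits = n_base_estimators * n_folds  # fits used for out-of-fold predictions
    full_fits = n_base_estimators          # each base estimator refit on all training data
    final_fit = 1                          # the meta-regressor
    return cv_fits + full_fits + final_fit


for k in (3, 5, 10):
    print(f"cv={k}: {stacking_fit_count(2, k)} fits")
# cv=3: 9 fits
# cv=5: 13 fits
# cv=10: 23 fits
```

With two base models, moving from cv=3 to cv=10 roughly multiplies the number of fits by 2.5, which matters most when the base models are expensive to train.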

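The default value for cv is None, which uses 5-fold cross-validation. Common alternatives include an integer number of folds (for example cv=10), a splitter instance such as KFold, GroupKFold, or LeaveOneOut, and, in scikit-learn 1.1 and later, the string "prefit" for base estimators that have already been fitted.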
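The following sketch shows how several of these options can be passed to cv. The dataset, base models, and dictionary labels are assumptions chosen for illustration. GroupKFold is left out because it additionally requires group labels for the samples.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.tree import DecisionTreeRegressor

# Small dataset so that leave-one-out stays cheap (illustrative values only)
X_small, y_small = make_regression(n_samples=120, n_features=5, noise=0.5, random_state=0)

estimators = [
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

# Different ways of specifying the cross-validation strategy
cv_options = {
    "default (None, 5-fold)": None,
    "integer (10-fold)": 10,
    "shuffled KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "leave-one-out": LeaveOneOut(),
}

fitted = {}
for label, cv in cv_options.items():
    model = StackingRegressor(estimators=estimators, final_estimator=Ridge(), cv=cv)
    fitted[label] = model.fit(X_small, y_small)
    print(label, "->", fitted[label].predict(X_small[:1]).shape)
```

Leave-one-out fits every base model once per training sample, so it is usually practical only for small datasets and inexpensive base models.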

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import StackingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('lr', LinearRegression())
]

# Train with different cv values
cv_values = [3, 5, 10]
mse_scores = []

for cv in cv_values:
    stack = StackingRegressor(
        estimators=base_models,
        final_estimator=LinearRegression(),
        cv=cv
    )
    stack.fit(X_train, y_train)
    y_pred = stack.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"cv={cv}, MSE: {mse:.3f}")

# Find best cv value
best_cv = cv_values[np.argmin(mse_scores)]
print(f"Best cv value: {best_cv}")

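Running the example gives an output like the following. The three scores agree to three decimal places, so the reported best value reflects differences smaller than the printed precision: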

cv=3, MSE: 0.238
cv=5, MSE: 0.238
cv=10, MSE: 0.238
Best cv value: 10

The key steps in this example are:

  1. Generate a synthetic regression dataset
  2. Split the data into train and test sets
  3. Define base models for the StackingRegressor
  4. Train StackingRegressor models with different cv values
  5. Evaluate the mean squared error of each model on the test set
  6. Identify the best cv value based on lowest MSE
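Steps 5 and 6 pick cv using the test set, which means the test score is no longer an unbiased estimate for the chosen model. One possible alternative, sketched below under the assumption that only training data is used for selection, is to score each candidate with cross_val_score. The smaller dataset, the outer cv=3, and the variable names are choices made for this example, not part of the original code.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the training split (illustrative values only)
X_tr, y_tr = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=42)

models = [
    ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
    ("lr", LinearRegression()),
]

cv_mse = {}
for inner_cv in (3, 5, 10):
    stack = StackingRegressor(
        estimators=models,
        final_estimator=LinearRegression(),
        cv=inner_cv,  # folds used inside the stack for out-of-fold predictions
    )
    # Outer cross-validation scores the whole stack on held-out training folds
    scores = cross_val_score(stack, X_tr, y_tr, cv=3, scoring="neg_mean_squared_error")
    cv_mse[inner_cv] = -scores.mean()  # flip sign: scorer returns negative MSE
    print(f"cv={inner_cv}, mean CV MSE: {cv_mse[inner_cv]:.3f}")

best_inner_cv = min(cv_mse, key=cv_mse.get)
print(f"Best cv value by cross-validated MSE: {best_inner_cv}")
```

The test set can then be kept for a single final evaluation of the stack built with the selected cv value.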

Some tips and heuristics for setting cv:

Issues to consider:


