
Configure StackingClassifier "cv" Parameter

The cv parameter in scikit-learn’s StackingClassifier controls the cross-validation strategy used when fitting base estimators.

StackingClassifier is an ensemble method that combines multiple classification models via a meta-classifier. The cv parameter determines how the training data is split for fitting the base estimators.

Proper configuration of cv can help prevent overfitting and ensure that the base estimators generalize well to unseen data.

The default value for cv is None, which means 5-fold cross-validation (using StratifiedKFold when the estimator is a classifier and the target is binary or multiclass). Common alternatives include a different number of folds (e.g., 3 or 10), an explicit splitter such as StratifiedKFold or KFold, or any custom cross-validation iterator.
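As a quick sanity check on how an integer cv is interpreted, scikit-learn's check_cv utility resolves a cv argument to the concrete splitter that will be used; this sketch assumes a multiclass target:

```python
import numpy as np
from sklearn.model_selection import check_cv

y = np.array([0, 1, 2] * 10)  # small multiclass target for illustration

# With a classifier, an integer cv resolves to StratifiedKFold
print(check_cv(5, y, classifier=True))

# Without the classifier flag, it resolves to plain KFold
print(check_cv(5, y, classifier=False))
```

This is the same resolution StackingClassifier performs internally when you pass an integer to cv.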

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base classifiers
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svm', SVC(kernel='rbf', random_state=42)),
    ('lr', LogisticRegression(random_state=42))
]

# Define different cv configurations
cv_configs = [
    ('Default (5-fold)', 5),
    ('3-fold', 3),
    ('10-fold', 10),
    ('Stratified 5-fold', StratifiedKFold(n_splits=5, shuffle=True, random_state=42)),
    ('KFold 5-fold', KFold(n_splits=5, shuffle=True, random_state=42))
]

# Train and evaluate StackingClassifier with different cv configurations
for cv_name, cv in cv_configs:
    clf = StackingClassifier(estimators=estimators, cv=cv,
                             final_estimator=LogisticRegression())
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{cv_name} - Accuracy: {accuracy:.3f}")

Running the example gives an output like:

Default (5-fold) - Accuracy: 0.885
3-fold - Accuracy: 0.880
10-fold - Accuracy: 0.880
Stratified 5-fold - Accuracy: 0.885
KFold 5-fold - Accuracy: 0.880

The key steps in this example are:

  1. Generate a synthetic multi-class classification dataset
  2. Split the data into train and test sets
  3. Define base classifiers for the StackingClassifier
  4. Create different cross-validation configurations
  5. Train StackingClassifier models with different cv settings
  6. Evaluate the accuracy of each model on the test set
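A single train/test split can be a noisy basis for choosing between cv settings. As a sketch, the whole stacked model can instead be scored with cross_val_score, where the outer cv evaluates the ensemble and the inner cv (the StackingClassifier parameter) only controls how out-of-fold predictions for the meta-classifier are generated:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
]

for inner_cv in (3, 5):
    clf = StackingClassifier(estimators=estimators, cv=inner_cv,
                             final_estimator=LogisticRegression(max_iter=1000))
    # Outer 5-fold CV scores the full stack on held-out folds
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"inner cv={inner_cv}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean and standard deviation across outer folds give a better sense of whether an observed difference between cv settings is real or noise.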

Some tips and heuristics for setting cv:

  - Use a stratified strategy (the default for classifiers) so each fold preserves the class proportions; this matters most with imbalanced or small datasets
  - More folds give each base estimator more training data per split, but fitting cost grows roughly linearly with the number of folds; 5 or 10 folds is a common compromise
  - Pass a splitter with shuffle=True and a fixed random_state when the data may be ordered (e.g., sorted by class)

Issues to consider:

  - Each base estimator is fit once per fold plus once on the full training set, so training time can dominate with many folds or expensive base models
  - For grouped or time-ordered data, pass an appropriate iterator (e.g., GroupKFold or TimeSeriesSplit) to avoid leakage between folds
