The `cv` parameter in scikit-learn's `StackingClassifier` controls the cross-validation strategy used when fitting the base estimators. `StackingClassifier` is an ensemble method that combines multiple classification models via a meta-classifier: the base estimators' out-of-fold predictions become the training features for the final estimator, and `cv` determines how the training data is split to produce those predictions. Proper configuration of `cv` helps prevent overfitting and ensures that the stacked model generalizes well to unseen data.

The default (`cv=None`) is 5-fold cross-validation, which is stratified for classification targets. Common alternatives include a different number of folds (e.g., 3 or 10), an explicit `StratifiedKFold` or `KFold` object, or a custom cross-validation iterator.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Define base classifiers
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svm', SVC(kernel='rbf', random_state=42)),
    ('lr', LogisticRegression(random_state=42))
]

# Define different cv configurations
cv_configs = [
    ('Default (5-fold)', 5),
    ('3-fold', 3),
    ('10-fold', 10),
    ('Stratified 5-fold', StratifiedKFold(n_splits=5, shuffle=True, random_state=42)),
    ('KFold 5-fold', KFold(n_splits=5, shuffle=True, random_state=42))
]

# Train and evaluate StackingClassifier with each cv configuration
for cv_name, cv in cv_configs:
    clf = StackingClassifier(estimators=estimators, cv=cv,
                             final_estimator=LogisticRegression())
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{cv_name} - Accuracy: {accuracy:.3f}")
Running the example gives output like the following:
Default (5-fold) - Accuracy: 0.885
3-fold - Accuracy: 0.880
10-fold - Accuracy: 0.880
Stratified 5-fold - Accuracy: 0.885
KFold 5-fold - Accuracy: 0.880
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Define base classifiers for the StackingClassifier
- Create different cross-validation configurations
- Train StackingClassifier models with different cv settings
- Evaluate the accuracy of each model on the test set
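To make the role of `cv` concrete: `StackingClassifier` builds the final estimator's training features from out-of-fold predictions of the base estimators (internally via scikit-learn's `cross_val_predict`), so the meta-classifier never learns from a prediction made on data the base estimator was fitted on. A minimal sketch of that mechanism for one base estimator, reusing `X_train` and `y_train` from the script above:

from sklearn.model_selection import cross_val_predict

# Out-of-fold probability predictions for a single base estimator;
# StackingClassifier produces such columns for every base estimator
# and then fits final_estimator on them.
oof_proba = cross_val_predict(
    RandomForestClassifier(n_estimators=10, random_state=42),
    X_train, y_train, cv=5, method='predict_proba')
print(oof_proba.shape)  # (800, 3): one row per training sample, one column per class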
Some tips and heuristics for setting `cv`:
- Use stratified cross-validation for imbalanced datasets
- Increase the number of folds for smaller datasets to reduce bias
- Consider computational cost when choosing the number of folds
- Use domain-specific cross-validation strategies when appropriate (e.g., grouped or time-ordered data), keeping in mind the constraint discussed in the sketch after this list
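One caveat when going beyond the built-in folds: `StackingClassifier` builds its meta-features with `cross_val_predict`, which requires the test folds to form a partition of the training data, so splitters such as `TimeSeriesSplit` that never place the earliest rows in a test fold will raise an error here. Splitters that do partition the data work, and ones that need extra information (such as group labels) can be handled by precomputing the splits, since `cv` also accepts an iterable of `(train, test)` index arrays. A hedged sketch with `GroupKFold` and made-up group labels (illustrative only; the synthetic data above has no real group structure):

import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical group labels, e.g., one group per data source
rng = np.random.RandomState(42)
groups = rng.randint(0, 10, size=len(X_train))

# Precompute (train, test) index splits and pass them as cv
group_splits = list(GroupKFold(n_splits=5).split(X_train, y_train, groups))
grouped_clf = StackingClassifier(estimators=estimators,
                                 final_estimator=LogisticRegression(),
                                 cv=group_splits)
grouped_clf.fit(X_train, y_train)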
Issues to consider:
- The optimal cv strategy depends on the size and characteristics of the dataset
- More folds generally provide a better estimate of model performance but increase computational cost (see the timing sketch after this list)
- Stratified cross-validation helps maintain class distribution in each fold
- Custom cross-validation strategies may be necessary for specialized problems or data structures
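On the cost point: with `k` folds, each base estimator is fitted `k` times to produce the out-of-fold meta-features (plus once more on the full training set), so fit time grows roughly linearly with the fold count. A rough timing loop, not a rigorous benchmark, reusing the objects defined above:

import time

for n_folds in (3, 5, 10):
    clf = StackingClassifier(estimators=estimators, cv=n_folds,
                             final_estimator=LogisticRegression())
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    print(f"cv={n_folds}: fit took {time.perf_counter() - start:.2f}s")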