The final_estimator parameter in scikit-learn’s StackingClassifier determines the model used to combine predictions from the base estimators. StackingClassifier is an ensemble method that fits multiple base classifiers on the original dataset, then uses their predictions as input to a final classifier. The final_estimator is this last layer, which learns how best to combine the base predictions.
By default, final_estimator is set to LogisticRegression(). However, any classifier can be used, allowing for complex ensemble architectures.
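As a quick check of that default, here is a minimal sketch (the toy dataset and the two base classifiers are illustrative choices, not part of the main example below) that omits final_estimator and then inspects the fitted final_estimator_ attribute:
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Small toy dataset, just enough to fit the stack
X, y = make_classification(n_samples=200, random_state=0)
# No final_estimator argument, so the default applies
clf = StackingClassifier(estimators=[
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('nb', GaussianNB())
])
clf.fit(X, y)
# The fitted meta-model is a LogisticRegression instance
print(type(clf.final_estimator_).__name__)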
Common choices for final_estimator include LogisticRegression, RandomForestClassifier, and GradientBoostingClassifier, each offering different trade-offs between interpretability and performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base estimators
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
]
# Define final estimators to compare
final_estimators = [
    ('default', None),  # Uses default LogisticRegression
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
]
# Train and evaluate models with different final estimators
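# Note: cv=5 means the base estimators' predictions used to train the
# final estimator are produced via 5-fold cross_val_predict, which avoids
# training the meta-model on in-sample base predictions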
for name, final_estimator in final_estimators:
    clf = StackingClassifier(estimators=base_estimators, final_estimator=final_estimator, cv=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f"Final Estimator: {name}")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"F1 Score: {f1:.3f}\n")
Running the example gives an output like:
Final Estimator: default
Accuracy: 0.910
F1 Score: 0.911
Final Estimator: rf
Accuracy: 0.895
F1 Score: 0.896
Final Estimator: gb
Accuracy: 0.885
F1 Score: 0.886
Key steps in this example:
- Generate a synthetic multi-class classification dataset
- Split data into train and test sets
- Define base estimators (RandomForest, SVC, KNN)
- Create StackingClassifier with different final estimators
- Train models and evaluate performance using accuracy and F1 score
Tips for choosing an appropriate final estimator:
- Consider the problem complexity and dataset size
- Experiment with different classifiers to find the best performer
- Balance between model interpretability and predictive power
- Use cross-validation to ensure robust performance estimates (see the sketch after this list)
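As one way to apply that last tip, the following sketch reuses base_estimators, final_estimators, X_train, and y_train from the example above and scores each stacking variant with 5-fold cross-validation instead of a single train/test split (the scoring metric here is an illustrative assumption):
from sklearn.model_selection import cross_val_score
# Compare candidate final estimators on the training data only
for name, final_estimator in final_estimators:
    clf = StackingClassifier(estimators=base_estimators, final_estimator=final_estimator, cv=5)
    # Outer 5-fold CV on top of the stack's internal CV: slow but robust
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")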
Issues to consider:
- More complex final estimators may lead to overfitting
- Computational cost increases with more sophisticated final estimators
- The best final estimator often depends on the choice of base estimators
- Ensure the final estimator can handle the output format of the base estimators (see the inspection sketch below)
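On that last point, a fitted StackingClassifier exposes a transform method that returns the meta-features the final estimator actually sees; the sketch below assumes the most recently fitted clf and the X_test from the example above. With the default stack_method='auto', base estimators that implement predict_proba each contribute one column per class, so three 3-class base estimators would typically yield 9 meta-features:
# Inspect the meta-features passed to the final estimator
meta_features = clf.transform(X_test)
print(meta_features.shape)  # expected here: (200, 9), i.e. 3 estimators x 3 classes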