The final_estimator parameter in scikit-learn’s StackingClassifier determines the model used to combine predictions from the base estimators. StackingClassifier is an ensemble method that fits multiple base classifiers on the original dataset, then uses their predictions as input to a final classifier. The final_estimator is this last layer, which learns how best to combine the base predictions.
By default, final_estimator is set to LogisticRegression(). However, any classifier can be used, allowing for complex ensemble architectures.
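As a quick check of that default, here is a minimal sketch (the toy dataset and the two base classifiers are illustrative choices, not part of the main example below) that omits final_estimator and then inspects the fitted final_estimator_ attribute:
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Small toy dataset, just enough to fit the stack
X, y = make_classification(n_samples=200, random_state=0)
# No final_estimator argument, so the default applies
clf = StackingClassifier(estimators=[
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('nb', GaussianNB())
])
clf.fit(X, y)
# The fitted meta-model is a LogisticRegression instance
print(type(clf.final_estimator_).__name__)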
Common choices for final_estimator include LogisticRegression, RandomForestClassifier, and GradientBoostingClassifier, each offering different trade-offs between interpretability and performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base estimators
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
]
# Define final estimators to compare
final_estimators = [
    ('default', None),  # Uses default LogisticRegression
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42))
]
# Train and evaluate models with different final estimators
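# Note: cv=5 means the base estimators' predictions used to train the
# final estimator are produced via 5-fold cross_val_predict, which avoids
# training the meta-model on in-sample base predictions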
for name, final_estimator in final_estimators:
    clf = StackingClassifier(estimators=base_estimators, final_estimator=final_estimator, cv=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f"Final Estimator: {name}")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"F1 Score: {f1:.3f}\n")
Running the example gives an output like:
Final Estimator: default
Accuracy: 0.910
F1 Score: 0.911
Final Estimator: rf
Accuracy: 0.895
F1 Score: 0.896
Final Estimator: gb
Accuracy: 0.885
F1 Score: 0.886
Key steps in this example:
- Generate a synthetic multi-class classification dataset
- Split data into train and test sets
- Define base estimators (RandomForest, SVC, KNN)
- Create StackingClassifier with different final estimators
- Train models and evaluate performance using accuracy and F1 score
Tips for choosing an appropriate final estimator:
- Consider the problem complexity and dataset size
- Experiment with different classifiers to find the best performer
- Balance between model interpretability and predictive power
- Use cross-validation to ensure robust performance estimates (see the sketch after this list)
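As one way to apply that last tip, the following sketch reuses base_estimators, final_estimators, X_train, and y_train from the example above and scores each stacking variant with 5-fold cross-validation instead of a single train/test split (the scoring metric here is an illustrative assumption):
from sklearn.model_selection import cross_val_score
# Compare candidate final estimators on the training data only
for name, final_estimator in final_estimators:
    clf = StackingClassifier(estimators=base_estimators, final_estimator=final_estimator, cv=5)
    # Outer 5-fold CV on top of the stack's internal CV: slow but robust
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")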
Issues to consider:
- More complex final estimators may lead to overfitting
- Computational cost increases with more sophisticated final estimators
- The best final estimator often depends on the choice of base estimators
- Ensure the final estimator can handle the output format of the base estimators (see the inspection sketch below)
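On that last point, a fitted StackingClassifier exposes a transform method that returns the meta-features the final estimator actually sees; the sketch below assumes the most recently fitted clf and the X_test from the example above. With the default stack_method='auto', base estimators that implement predict_proba each contribute one column per class, so three 3-class base estimators would typically yield 9 meta-features:
# Inspect the meta-features passed to the final estimator
meta_features = clf.transform(X_test)
print(meta_features.shape)  # expected here: (200, 9), i.e. 3 estimators x 3 classes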