The estimators parameter in scikit-learn’s StackingClassifier defines the set of base estimators used in the ensemble.
Stacking is an ensemble learning technique that combines multiple classification models via a meta-classifier, which is trained on the predictions of the base models. The estimators parameter specifies the base classifiers to be stacked, as a list of (name, estimator) tuples.
Selecting appropriate base estimators is crucial for the performance of the stacked model. A diverse set of base models often leads to better generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different sets of estimators as lists of (name, estimator) tuples
estimators_sets = [
    [('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
     ('svm', SVC(kernel='rbf', probability=True, random_state=42))],
    [('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
     ('lr', LogisticRegression(random_state=42)),
     ('nb', GaussianNB())]
]

# Train and evaluate StackingClassifier with different estimator sets
for i, estimators in enumerate(estimators_sets, 1):
    stacking_clf = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(random_state=42),
        cv=5
    )
    stacking_clf.fit(X_train, y_train)
    y_pred = stacking_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Estimator Set {i}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
Estimator Set 1, Accuracy: 0.860
Estimator Set 2, Accuracy: 0.845
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Define different sets of base estimators
- Create StackingClassifier instances with different estimator sets
- Train each stacking classifier and evaluate its accuracy on the test set
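As a complement to the example above, the following is a minimal sketch of what the estimators parameter produces once the ensemble is fitted. The small dataset, the two base models, and the cv=3 and max_iter=1000 settings are illustrative assumptions, not part of the original example. It reads the fitted base models back by name through named_estimators_ and uses transform to look at the features the final estimator is trained on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Small illustrative 3-class dataset (sizes chosen only to keep this fast)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# The estimators parameter is a list of (name, estimator) tuples
base_estimators = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
]

clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
clf.fit(X, y)

# Fitted base models are accessible by the names given in the tuples
fitted_names = list(clf.named_estimators_.keys())

# transform() returns the base-model outputs that feed the final estimator
meta_features = clf.transform(X)

print(fitted_names)
print(meta_features.shape)
```

Both base models here expose predict_proba, so with three classes each one should contribute three columns to the meta-features. Adding or removing a tuple in the list changes the width of this matrix accordingly.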
Some tips for configuring the estimators parameter:
- Choose a diverse set of base models to capture different aspects of the data
- Consider the computational cost of each estimator, especially for large datasets
- Balance weak learners (e.g., decision stumps) against strong learners (e.g., random forests)
- Experiment with different combinations to find the optimal set for your specific problem
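One way to act on the last tip is to score several candidate combinations with cross-validation rather than a single train/test split. The sketch below is only an illustration under assumed choices: the candidate pool (a shallow tree, naive Bayes, k-nearest neighbors), the pair-wise combinations, and the small dataset are all hypothetical and would be replaced by models suited to your problem.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset so the search finishes quickly
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Hypothetical pool of candidate base models, keyed by name
candidates = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "nb": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Score every pair of candidates as a stacked ensemble
results = {}
for names in combinations(candidates, 2):
    estimators = [(name, candidates[name]) for name in names]
    clf = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    )
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    results[names] = scores.mean()

# Report combinations from best to worst mean accuracy
for names, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{' + '.join(names)}: {score:.3f}")
```

The number of combinations grows quickly with the size of the pool, and each stacked fit retrains every base model several times, so the computational-cost tip above applies here as well.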
Issues to consider:
- Using too many complex estimators may lead to overfitting
- There’s a trade-off between model complexity and interpretability
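- Cross-validation is crucial when selecting and evaluating base estimators; StackingClassifier also uses it internally (the cv parameter) to generate the out-of-fold predictions that train the final estimator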
- The performance of the stacked model depends heavily on the choice of base estimators
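A simple check on these issues is to compare the stacked model against each of its base estimators under the same cross-validation scheme. If the stack does not beat its best base model, the extra complexity may not be worth it. The following is a rough sketch with assumed models, dataset sizes, and fold counts, none of which come from the example above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

base_estimators = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("nb", GaussianNB()),
]

stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Evaluate each base model alone, then the stack, with the same outer folds
cv_scores = {}
for name, model in base_estimators + [("stack", stack)]:
    cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

The outer cross-validation here is separate from the cv argument of StackingClassifier, which only controls how the training data for the final estimator is produced.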