The `estimator` parameter in scikit-learn's `BaggingClassifier` determines the base model used in the ensemble.

`BaggingClassifier` is an ensemble method that creates multiple instances of a base estimator, trains each on random subsets of the data, and combines their predictions through voting. This approach helps reduce overfitting and improves generalization.
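To make the resampling concrete, here is a minimal sketch of the `BaggingClassifier` arguments that control it (the values shown are illustrative, not recommendations):

```python
# A minimal sketch of the knobs behind the resampling described above.
from sklearn.ensemble import BaggingClassifier

bagging = BaggingClassifier(
    n_estimators=10,  # number of base models in the ensemble
    max_samples=0.8,  # fraction of samples drawn to train each base model
    bootstrap=True,   # draw samples with replacement (classic bagging)
    random_state=42,
)
# Once fitted, predict() aggregates the base models' votes.
```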
By default, `BaggingClassifier` uses `DecisionTreeClassifier` as its base estimator. However, you can specify any classifier that follows scikit-learn's estimator API, such as `LogisticRegression`, `SVC`, or custom estimators.

The choice of base estimator can significantly impact the ensemble's performance, bias-variance trade-off, and computational requirements.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators (None falls back to the default DecisionTreeClassifier)
estimators = {
    'Default (DecisionTree)': None,
    'LogisticRegression': LogisticRegression(),
    'SVC': SVC()
}

# Train and evaluate BaggingClassifier with different base estimators
for name, estimator in estimators.items():
    bagging = BaggingClassifier(estimator=estimator, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"{name}:")
    print(f"  Accuracy: {accuracy:.3f}")
    print(f"  F1-score: {f1:.3f}")
```
Running the example gives an output like:

```
Default (DecisionTree):
  Accuracy: 0.800
  F1-score: 0.796
LogisticRegression:
  Accuracy: 0.770
  F1-score: 0.779
SVC:
  Accuracy: 0.945
  F1-score: 0.948
```
Key steps in this example:

- Generate a synthetic classification dataset with informative features
- Split the data into train and test sets
- Create `BaggingClassifier` instances with different base estimators
- Train each ensemble and evaluate its performance on the test set
- Compare accuracy and F1-scores for different base estimators
Tips for choosing and configuring base estimators:

- Consider the bias-variance trade-off of different base models
- Experiment with various estimators to find the best fit for your data
- Adjust the number of base estimators (`n_estimators`) based on the chosen base model
- Fine-tune hyperparameters of the base estimator for optimal performance, as shown in the sketch after this list
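One convenient way to tune the ensemble and its base estimator together is a grid search over nested parameters. This is a minimal sketch, assuming scikit-learn 1.2 or later (where the parameter is named `estimator`, so nested parameters use the `estimator__` prefix); it reuses `X_train` and `y_train` from the example above:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

bagging = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                            random_state=42)

# The ensemble's own parameters and the base estimator's parameters
# (prefixed with `estimator__`) can be searched together.
param_grid = {
    'n_estimators': [10, 50],          # ensemble size
    'estimator__C': [0.1, 1.0, 10.0],  # regularization strength of the base model
}

search = GridSearchCV(bagging, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print(search.best_params_)
```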
Issues to consider:
- Complex base estimators may lead to overfitting or increased computational cost (the sketch after this list shows one way to limit base-model complexity)
- Simple base estimators might not capture complex patterns in the data
- The effectiveness of different base estimators can vary depending on the dataset characteristics
- Ensemble diversity is crucial; using varied base estimators can improve robustness
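To illustrate the complexity trade-off, here is a minimal sketch comparing shallow and fully grown decision trees as base estimators; shallow trees are weaker individually but cheaper to train and less prone to overfitting. It reuses the train/test split from the example above:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Shallow trees: higher bias, lower variance, faster to train
shallow = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                            n_estimators=50, random_state=42)
# Fully grown trees: lower bias, higher variance, more expensive
deep = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=None),
                         n_estimators=50, random_state=42)

for name, model in [('max_depth=3', shallow), ('max_depth=None', deep)]:
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```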