The `n_estimators` parameter in scikit-learn's `BaggingClassifier` controls the number of base estimators in the ensemble.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple bootstrap samples of the original dataset (subsets drawn with replacement), trains a classifier on each sample, and combines their predictions. The `n_estimators` parameter determines how many such classifiers are created and combined.
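To make the mechanism concrete, here is a minimal hand-rolled sketch of bagging (an illustration, not the library's implementation): draw bootstrap samples, fit one tree per sample, and combine predictions by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Fit one tree per bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: majority vote across the trees' predictions (binary labels here)
votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble training accuracy:", (majority == y).mean())
```

`BaggingClassifier` handles all of this internally; `n_estimators` simply sets how many such classifiers are built.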
Increasing `n_estimators` generally improves model performance by reducing variance, but it also increases computational cost. There's usually a point of diminishing returns where adding more estimators provides minimal benefit.
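One way to see the variance reduction is to compare the fold-to-fold spread of cross-validation scores for a single tree against a bagged ensemble. This is a sketch; exact numbers depend on the data, and the `estimator` argument assumes scikit-learn 1.2 or later (earlier versions call it `base_estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "bagged (50 trees)": BaggingClassifier(estimator=DecisionTreeClassifier(),
                                           n_estimators=50, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    # Bagging typically raises the mean score and shrinks the spread
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```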
The default value for `n_estimators` in `BaggingClassifier` is 10.
In practice, values between 10 and 100 are commonly used, depending on the dataset size and complexity, as well as computational resources available.
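A quick way to confirm the default on your installed version:

```python
from sklearn.ensemble import BaggingClassifier

# Constructor parameters are stored as attributes of the estimator
print(BaggingClassifier().n_estimators)  # 10
```

The example below compares several `n_estimators` values on a synthetic dataset: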
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_estimators values
n_estimators_values = [5, 10, 50, 100]
accuracies = []

for n in n_estimators_values:
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=n, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"n_estimators={n}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
n_estimators=5, Accuracy: 0.805
n_estimators=10, Accuracy: 0.865
n_estimators=50, Accuracy: 0.875
n_estimators=100, Accuracy: 0.880
```
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Create `BaggingClassifier` models with different `n_estimators` values
- Train each model and evaluate its accuracy on the test set
Tips and heuristics for setting `n_estimators`:
- Start with the default value of 10 and gradually increase
- Monitor performance improvement as you increase `n_estimators`
- Consider the trade-off between model performance and training time
- Use cross-validation to find the optimal value for your specific dataset (see the sketch after this list)
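A minimal sketch of the cross-validation tip, using `GridSearchCV` over a small grid of `n_estimators` values (the grid values here are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

grid = GridSearchCV(
    estimator=BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42),
    param_grid={"n_estimators": [10, 25, 50, 100]},  # illustrative grid
    cv=5,
)
grid.fit(X, y)

print("Best n_estimators:", grid.best_params_["n_estimators"])
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```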
Issues to consider:
- The optimal number of estimators can vary significantly depending on the dataset
- Increasing `n_estimators` always increases computational cost
- Very high values of `n_estimators` may lead to overfitting on some datasets
- The base estimator choice can affect the optimal `n_estimators` value (see the sketch after this list)
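To illustrate the last point, here is a sketch that uses `BaggingClassifier`'s built-in out-of-bag estimate (`oob_score=True`) to track accuracy for two different base estimators as `n_estimators` grows; the curves typically flatten at different points:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Two base estimators: a full-depth tree and a heavily restricted one
bases = {
    "deep tree": DecisionTreeClassifier(),
    "depth-3 tree": DecisionTreeClassifier(max_depth=3),
}

for name, base in bases.items():
    for n in [20, 50, 100]:
        model = BaggingClassifier(estimator=base, n_estimators=n,
                                  oob_score=True, random_state=42)
        model.fit(X, y)
        # oob_score_ estimates generalization accuracy without a holdout set
        print(f"{name}, n_estimators={n}: OOB accuracy={model.oob_score_:.3f}")
```

Note that very small `n_estimators` values can leave some samples with no out-of-bag predictions, in which case scikit-learn emits a warning.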