The `n_estimators` parameter in scikit-learn's `BaggingClassifier` controls the number of base estimators in the ensemble.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple bootstrap samples of the original dataset (subsets drawn with replacement), trains a classifier on each sample, and combines their predictions. The `n_estimators` parameter determines how many such classifiers are created and combined.
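To make the mechanism concrete, here is a minimal hand-rolled sketch of bagging (an illustration, not the library's implementation): draw bootstrap samples, fit one tree per sample, and combine predictions by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Fit one tree per bootstrap sample (rows drawn with replacement)
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: majority vote across the trees' predictions (binary labels here)
votes = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble training accuracy:", (majority == y).mean())
```

`BaggingClassifier` handles all of this internally; `n_estimators` simply sets how many such classifiers are built.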
Increasing `n_estimators` generally improves model performance by reducing variance, but it also increases computational cost. There's usually a point of diminishing returns where adding more estimators provides minimal benefit.
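One way to see the variance reduction is to compare the fold-to-fold spread of cross-validation scores for a single tree against a bagged ensemble. This is a sketch; exact numbers depend on the data, and the `estimator` argument assumes scikit-learn 1.2 or later (earlier versions call it `base_estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "bagged (50 trees)": BaggingClassifier(estimator=DecisionTreeClassifier(),
                                           n_estimators=50, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    # Bagging typically raises the mean score and shrinks the spread
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```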
The default value for `n_estimators` in `BaggingClassifier` is 10.
In practice, values between 10 and 100 are commonly used, depending on the dataset size and complexity, as well as computational resources available.
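A quick way to confirm the default on your installed version:

```python
from sklearn.ensemble import BaggingClassifier

# Constructor parameters are stored as attributes of the estimator
print(BaggingClassifier().n_estimators)  # 10
```

The example below compares several `n_estimators` values on a synthetic dataset: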
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_estimators values
n_estimators_values = [5, 10, 50, 100]
accuracies = []

for n in n_estimators_values:
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=n, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"n_estimators={n}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
n_estimators=5, Accuracy: 0.805
n_estimators=10, Accuracy: 0.865
n_estimators=50, Accuracy: 0.875
n_estimators=100, Accuracy: 0.880
```
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Create `BaggingClassifier` models with different `n_estimators` values
- Train each model and evaluate its accuracy on the test set
Tips and heuristics for setting `n_estimators`:
- Start with the default value of 10 and gradually increase
- Monitor performance improvement as you increase `n_estimators`
- Consider the trade-off between model performance and training time
- Use cross-validation to find the optimal value for your specific dataset (see the sketch after this list)
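A minimal sketch of the cross-validation tip, using `GridSearchCV` over a small grid of `n_estimators` values (the grid values here are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

grid = GridSearchCV(
    estimator=BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42),
    param_grid={"n_estimators": [10, 25, 50, 100]},  # illustrative grid
    cv=5,
)
grid.fit(X, y)

print("Best n_estimators:", grid.best_params_["n_estimators"])
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```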
Issues to consider:
- The optimal number of estimators can vary significantly depending on the dataset
- Increasing `n_estimators` always increases computational cost
- Very high values of `n_estimators` may lead to overfitting on some datasets
- The base estimator choice can affect the optimal `n_estimators` value (see the sketch after this list)
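To illustrate the last point, here is a sketch that uses `BaggingClassifier`'s built-in out-of-bag estimate (`oob_score=True`) to track accuracy for two different base estimators as `n_estimators` grows; the curves typically flatten at different points:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Two base estimators: a full-depth tree and a heavily restricted one
bases = {
    "deep tree": DecisionTreeClassifier(),
    "depth-3 tree": DecisionTreeClassifier(max_depth=3),
}

for name, base in bases.items():
    for n in [20, 50, 100]:
        model = BaggingClassifier(estimator=base, n_estimators=n,
                                  oob_score=True, random_state=42)
        model.fit(X, y)
        # oob_score_ estimates generalization accuracy without a holdout set
        print(f"{name}, n_estimators={n}: OOB accuracy={model.oob_score_:.3f}")
```

Note that very small `n_estimators` values can leave some samples with no out-of-bag predictions, in which case scikit-learn emits a warning.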