The `bootstrap` parameter in scikit-learn's `BaggingClassifier` determines whether bootstrap samples are used when building base estimators.

Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a model on each subset, and combines their predictions. The `bootstrap` parameter controls how these subsets are created.
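To make the mechanism concrete, here is a minimal from-scratch sketch of bagging with majority voting. This is an illustration of the idea, not scikit-learn's implementation; the helper `bagging_predict` is hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_estimators=10, seed=None):
    """Simplified bagging: fit each tree on a bootstrap sample, then majority-vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)  # shape: (n_estimators, n_test_samples)
    # Majority vote across estimators for each test point
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=200, random_state=0)
preds = bagging_predict(X[:150], y[:150], X[150:], seed=0)
```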
When `bootstrap` is set to `True`, samples are drawn with replacement, so the same instance can appear multiple times in a subset. When it is `False`, the whole dataset is used to train each base estimator. The default value for `bootstrap` is `True`.
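You can observe this difference directly by inspecting the training indices recorded in a fitted model's `estimators_samples_` attribute. A minimal sketch on a small toy dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20, random_state=42)

for bootstrap in [True, False]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=3, bootstrap=bootstrap,
                              random_state=42).fit(X, y)
    # estimators_samples_ lists the training indices drawn for each estimator
    idx = model.estimators_samples_[0]
    print(f"bootstrap={bootstrap}: {len(np.unique(idx))} unique of {len(idx)} drawn")
```

With `bootstrap=True`, each estimator typically sees only about 63% of the distinct training instances (the remaining draws are duplicates); with `False`, every index appears exactly once.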
In practice, `True` (the default) is commonly used because the extra randomness tends to reduce overfitting, while `False` may be preferred for smaller datasets where you want every instance to be used by each base estimator.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate with different bootstrap values
bootstrap_values = [True, False]
base_estimator = DecisionTreeClassifier(random_state=42)

for bootstrap in bootstrap_values:
    bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10,
                                bootstrap=bootstrap, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"bootstrap={bootstrap}, Accuracy: {accuracy:.3f}")
```
Running the example gives output like:

```
bootstrap=True, Accuracy: 0.865
bootstrap=False, Accuracy: 0.800
```
The key steps in this example are:

- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Create `BaggingClassifier` instances with different `bootstrap` values
- Train models and evaluate their accuracy on the test set
Some tips and heuristics for setting `bootstrap`:

- Use `True` (default) to introduce more randomness and potentially reduce overfitting
- Consider `False` for smaller datasets to ensure all instances are used in each base estimator
- Experiment with both options and choose based on cross-validation performance (see the sketch after this list)
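For that last tip, a minimal sketch comparing both settings with `cross_val_score` (5-fold accuracy; the dataset and seeds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

for bootstrap in [True, False]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                              n_estimators=10, bootstrap=bootstrap, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)  # default scoring is accuracy
    print(f"bootstrap={bootstrap}, CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```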
Issues to consider:

- The impact of `bootstrap` can vary depending on the base estimator and dataset characteristics
- When `False`, the diversity of base estimators relies more on random subsets of features (see the sketch after this list)
- Using `True` may lead to some instances being overrepresented or not used at all in some base estimators
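On the second point: when `bootstrap=False`, you can still inject diversity through feature subsampling with the `max_features` parameter. A sketch with illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Every estimator sees all samples (bootstrap=False), but each gets a random
# half of the features, so the ensemble members still differ from one another.
model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                          n_estimators=10, bootstrap=False,
                          max_features=0.5, random_state=42).fit(X, y)
print(model.estimators_features_[0])  # feature indices used by the first tree
```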