The `bootstrap` parameter in scikit-learn’s `BaggingClassifier` determines whether bootstrap samples are used when building base estimators.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a model on each subset, and combines their predictions. The `bootstrap` parameter controls how these subsets are created.
When `bootstrap=True`, samples are drawn with replacement, so the same instance can appear multiple times in a subset. When `bootstrap=False`, samples are drawn without replacement, which with the default `max_samples=1.0` means each base estimator is trained on the whole dataset.
The default value for `bootstrap` is `True`.
In practice, `True` is commonly used because the extra randomness tends to reduce variance and overfitting, while `False` may be preferred for smaller datasets where you want every instance to contribute to every base estimator.
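To make the with/without-replacement distinction concrete, here is a minimal sketch using NumPy alone, independent of `BaggingClassifier`; the array and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)
indices = np.arange(10)  # ten instance indices

# bootstrap=True: draw with replacement -> duplicates likely, some indices absent
print(rng.choice(indices, size=10, replace=True))

# bootstrap=False (with max_samples=1.0): every index appears exactly once
print(rng.choice(indices, size=10, replace=False))
```

The full example below compares both settings inside `BaggingClassifier` itself.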
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different bootstrap values and compare test accuracy
bootstrap_values = [True, False]
base_estimator = DecisionTreeClassifier(random_state=42)
for bootstrap in bootstrap_values:
    bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10,
                                bootstrap=bootstrap, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"bootstrap={bootstrap}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
bootstrap=True, Accuracy: 0.865
bootstrap=False, Accuracy: 0.800
```
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Create `BaggingClassifier` instances with different `bootstrap` values
- Train the models and evaluate their accuracy on the test set
Some tips and heuristics for setting `bootstrap`:
- Use `True` (the default) to introduce more randomness and potentially reduce overfitting
- Consider `False` for smaller datasets to ensure all instances are used in each base estimator
- Experiment with both options and choose based on cross-validation performance, as in the sketch after this list
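As a sketch of the last tip, assuming the imports and training data from the example above, the two settings can be compared with 5-fold cross-validation instead of a single train/test split:

```python
from sklearn.model_selection import cross_val_score

# Compare both bootstrap settings with 5-fold cross-validation
for bootstrap in [True, False]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                              n_estimators=10, bootstrap=bootstrap,
                              random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"bootstrap={bootstrap}, CV accuracy: {scores.mean():.3f} "
          f"(+/- {scores.std():.3f})")
```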
Issues to consider:
- The impact of `bootstrap` can vary depending on the base estimator and dataset characteristics
- When `bootstrap=False`, the diversity of the base estimators relies more on random subsets of features (for example via `max_features`)
- With `bootstrap=True`, some instances may be overrepresented while others are never drawn at all in a given base estimator; those never-drawn instances are what out-of-bag evaluation uses (see the sketch below)
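As a rough sketch of the last two points, again reusing the dataset and imports from the example above: `oob_score=True` evaluates the ensemble on the instances each bootstrap sample left out (it requires `bootstrap=True`), while `max_features` is one way to keep the ensemble diverse when `bootstrap=False`. The value `max_features=0.5` here is an arbitrary choice for illustration.

```python
# Out-of-bag evaluation uses the instances each bootstrap sample left out;
# it is only available when bootstrap=True.
oob_model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                              n_estimators=50, bootstrap=True,
                              oob_score=True, random_state=42)
oob_model.fit(X_train, y_train)
print(f"Out-of-bag accuracy: {oob_model.oob_score_:.3f}")

# With bootstrap=False, each estimator sees the full training set, so
# diversity comes from feature subsampling instead (here: half the features).
feat_model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                               n_estimators=50, bootstrap=False,
                               max_features=0.5, random_state=42)
feat_model.fit(X_train, y_train)
print(f"bootstrap=False + max_features=0.5, test accuracy: "
      f"{accuracy_score(y_test, feat_model.predict(X_test)):.3f}")
```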