The `bootstrap` parameter in scikit-learn's `BaggingClassifier` determines whether bootstrap samples are used when building base estimators.

Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a model on each subset, and combines their predictions. The `bootstrap` parameter controls how these subsets are created.
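To make the mechanism concrete, here is a minimal from-scratch sketch of bagging with majority voting. This is an illustration of the idea, not scikit-learn's implementation; the helper `bagging_predict` is hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_estimators=10, seed=None):
    """Simplified bagging: fit each tree on a bootstrap sample, then majority-vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n indices with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)  # shape: (n_estimators, n_test_samples)
    # Majority vote across estimators for each test point
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=200, random_state=0)
preds = bagging_predict(X[:150], y[:150], X[150:], seed=0)
```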
When `bootstrap` is set to `True`, samples are drawn with replacement, so the same instance can appear multiple times in a subset. When it is `False`, the whole dataset is used to train each base estimator. The default value for `bootstrap` is `True`.
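You can observe this difference directly by inspecting the training indices recorded in a fitted model's `estimators_samples_` attribute. A minimal sketch on a small toy dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20, random_state=42)

for bootstrap in [True, False]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=3, bootstrap=bootstrap,
                              random_state=42).fit(X, y)
    # estimators_samples_ lists the training indices drawn for each estimator
    idx = model.estimators_samples_[0]
    print(f"bootstrap={bootstrap}: {len(np.unique(idx))} unique of {len(idx)} drawn")
```

With `bootstrap=True`, each estimator typically sees only about 63% of the distinct training instances (the remaining draws are duplicates); with `False`, every index appears exactly once.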
In practice, `True` (the default) is commonly used because the extra randomness tends to reduce overfitting, while `False` may be preferred for smaller datasets where you want every instance to be used by each base estimator.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate with different bootstrap values
bootstrap_values = [True, False]
base_estimator = DecisionTreeClassifier(random_state=42)

for bootstrap in bootstrap_values:
    bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10,
                                bootstrap=bootstrap, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"bootstrap={bootstrap}, Accuracy: {accuracy:.3f}")
```
Running the example gives output like:

```
bootstrap=True, Accuracy: 0.865
bootstrap=False, Accuracy: 0.800
```
The key steps in this example are:

- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Create `BaggingClassifier` instances with different `bootstrap` values
- Train models and evaluate their accuracy on the test set
Some tips and heuristics for setting `bootstrap`:

- Use `True` (default) to introduce more randomness and potentially reduce overfitting
- Consider `False` for smaller datasets to ensure all instances are used in each base estimator
- Experiment with both options and choose based on cross-validation performance (see the sketch after this list)
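For that last tip, a minimal sketch comparing both settings with `cross_val_score` (5-fold accuracy; the dataset and seeds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

for bootstrap in [True, False]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                              n_estimators=10, bootstrap=bootstrap, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)  # default scoring is accuracy
    print(f"bootstrap={bootstrap}, CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```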
Issues to consider:

- The impact of `bootstrap` can vary depending on the base estimator and dataset characteristics
- When `False`, the diversity of base estimators relies more on random subsets of features (see the sketch after this list)
- Using `True` may lead to some instances being overrepresented or not used at all in some base estimators
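On the second point: when `bootstrap=False`, you can still inject diversity through feature subsampling with the `max_features` parameter. A sketch with illustrative values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Every estimator sees all samples (bootstrap=False), but each gets a random
# half of the features, so the ensemble members still differ from one another.
model = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42),
                          n_estimators=10, bootstrap=False,
                          max_features=0.5, random_state=42).fit(X, y)
print(model.estimators_features_[0])  # feature indices used by the first tree
```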