The bootstrap parameter in scikit-learn’s BaggingRegressor controls whether bootstrap samples are used when building base estimators.
Bagging (Bootstrap Aggregating) is an ensemble method that combines predictions from multiple base estimators to improve generalization and reduce overfitting. The bootstrap parameter determines whether individual estimators are trained on bootstrap samples or the original dataset.
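When bootstrap is True, each base estimator is trained on a sample drawn from the training data with replacement, so some rows appear several times and others not at all. This introduces randomness and can help reduce overfitting. When False, rows are drawn without replacement; with the default max_samples and max_features settings, every estimator then sees the full dataset, which may lead to stronger individual models but leaves very little diversity in the ensemble.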
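To make the difference concrete, the sketch below inspects which training rows each estimator actually received. It relies on the fitted model's estimators_samples_ attribute (the in-bag row indices per estimator); the dataset size and n_estimators=5 are arbitrary choices made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Small synthetic dataset; sizes are arbitrary and chosen only for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

unique_fraction = {}
for bootstrap in (True, False):
    model = BaggingRegressor(n_estimators=5, bootstrap=bootstrap, random_state=0)
    model.fit(X, y)
    # estimators_samples_ holds the row indices drawn for each base estimator
    fractions = [len(np.unique(idx)) / len(X) for idx in model.estimators_samples_]
    unique_fraction[bootstrap] = float(np.mean(fractions))
    print(f"bootstrap={bootstrap}: mean fraction of distinct rows = {unique_fraction[bootstrap]:.3f}")
```

With replacement, roughly 63% of the rows are expected to be distinct in each sample (about 1 - 1/e for large samples). Without replacement and with default settings, every estimator receives every row.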
The default value for bootstrap is True.
In practice, both True and False can be effective depending on the dataset and base estimator characteristics.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different bootstrap values
bootstrap_values = [True, False]
mse_scores = []
for bootstrap in bootstrap_values:
    bagging = BaggingRegressor(n_estimators=10, bootstrap=bootstrap, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"bootstrap={bootstrap}, MSE: {mse:.3f}")
# Calculate relative difference
relative_diff = (mse_scores[1] - mse_scores[0]) / mse_scores[0] * 100
print(f"Relative difference: {relative_diff:.2f}%")
Running the example gives an output like:
bootstrap=True, MSE: 7486.481
bootstrap=False, MSE: 17501.890
Relative difference: 133.78%
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train BaggingRegressor models with bootstrap set to True and False
- Evaluate the mean squared error of each model on the test set
- Compare the relative difference in performance between the two configurations
Some tips and heuristics for setting bootstrap:
- Use True (default) when working with smaller datasets or when base estimators tend to overfit
- Consider False when using strong base estimators or when working with larger datasets
- Experiment with both options and compare performance using cross-validation
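The last tip can be sketched with cross_val_score. This is a minimal illustration, not a tuned benchmark: the dataset size, cv=5, and n_estimators=10 are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data; parameters are illustrative assumptions
X, y = make_regression(n_samples=300, n_features=20, noise=0.1, random_state=42)

cv_mse = {}
for bootstrap in (True, False):
    model = BaggingRegressor(n_estimators=10, bootstrap=bootstrap, random_state=42)
    # scikit-learn maximizes scores, so MSE is exposed as its negative
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[bootstrap] = -scores.mean()
    print(f"bootstrap={bootstrap}, mean CV MSE: {cv_mse[bootstrap]:.3f} (+/- {scores.std():.3f})")
```

Averaging over several folds gives a steadier comparison than a single train/test split, where the gap between the two settings can depend heavily on which rows land in the test set.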
Issues to consider:
- The optimal choice depends on the characteristics of your dataset and base estimators
- Using bootstrap=True may reduce overfitting, but bagging mainly reduces variance, so the ensemble can still underfit if the base estimators are weak
- Setting bootstrap=False may lead to stronger individual models but less diversity in the ensemble
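One rough way to see the diversity issue is to measure how much the individual estimators disagree with one another. The sketch below is an illustration under assumptions: the disagreement measure (standard deviation of per-estimator predictions, averaged over test points) is an ad hoc choice rather than a scikit-learn metric, and the dataset settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data; parameters are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

spread = {}
for bootstrap in (True, False):
    model = BaggingRegressor(n_estimators=10, bootstrap=bootstrap, random_state=42)
    model.fit(X_train, y_train)
    # Predict with each base estimator on the feature columns it was trained on
    per_estimator = np.array([
        est.predict(X_test[:, feats])
        for est, feats in zip(model.estimators_, model.estimators_features_)
    ])
    # Std. dev. across estimators for each test point, averaged over test points
    spread[bootstrap] = float(per_estimator.std(axis=0).mean())
    print(f"bootstrap={bootstrap}: mean prediction spread = {spread[bootstrap]:.3f}")
```

A smaller spread means the estimators are closer to being copies of one another, in which case averaging them adds little over a single model.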