Configure BaggingRegressor "bootstrap" Parameter

The bootstrap parameter in scikit-learn’s BaggingRegressor controls whether bootstrap samples are used when building base estimators.

Bagging (Bootstrap Aggregating) is an ensemble method that combines predictions from multiple base estimators to improve generalization and reduce overfitting. The bootstrap parameter determines whether individual estimators are trained on bootstrap samples or the original dataset.

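When bootstrap is True, each base estimator is trained on a random sample of the training data drawn with replacement, so some rows appear more than once and others are left out. This randomness makes the estimators differ from one another, which is what lets averaging reduce variance and overfitting. When bootstrap is False, rows are drawn without replacement; unless max_samples is lowered, each estimator then sees the full training set, which may give stronger individual models but less diversity in the ensemble.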
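
To make the difference between the two sampling schemes concrete, here is a minimal NumPy sketch. It only imitates the idea of drawing row indices with and without replacement and is not scikit-learn's internal code; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# With replacement (the bootstrap=True idea): some rows repeat, others are missed
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Without replacement at full size (the bootstrap=False idea): every row exactly once
full_idx = rng.permutation(n_samples)

unique_bootstrap = np.unique(bootstrap_idx).size / n_samples
unique_full = np.unique(full_idx).size / n_samples

print(f"Fraction of distinct rows, with replacement: {unique_bootstrap:.3f}")
print(f"Fraction of distinct rows, without replacement: {unique_full:.3f}")
```

For large datasets a bootstrap sample is expected to contain roughly 63% of the distinct rows (1 - 1/e), which is why each estimator ends up seeing a noticeably different view of the data.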

The default value for bootstrap is True.
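
A quick way to confirm the defaults in an installed version is to inspect the estimator's parameters. This is a small sketch using get_params(); the defaults dictionary is just an illustrative name.

```python
from sklearn.ensemble import BaggingRegressor

# Read the default sampling-related parameters from a freshly constructed estimator
params = BaggingRegressor().get_params()
defaults = {name: params[name] for name in ("bootstrap", "bootstrap_features")}
print(defaults)
```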

In practice, either setting can be effective; which one works better depends on the dataset and on how sensitive the base estimator is to changes in its training sample.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit one ensemble per bootstrap setting (the default base estimator is a decision tree)
bootstrap_values = [True, False]
mse_scores = []

for bootstrap in bootstrap_values:
    bagging = BaggingRegressor(n_estimators=10, bootstrap=bootstrap, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"bootstrap={bootstrap}, MSE: {mse:.3f}")

# Relative change in MSE when going from bootstrap=True to bootstrap=False, in percent
relative_diff = (mse_scores[1] - mse_scores[0]) / mse_scores[0] * 100
print(f"Relative difference: {relative_diff:.2f}%")

Running the example gives an output like:

bootstrap=True, MSE: 7486.481
bootstrap=False, MSE: 17501.890
Relative difference: 133.78%

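The exact numbers may vary with the scikit-learn version. The large gap is expected in this setup: with bootstrap=False and max_samples left unchanged, every tree is fit on the same rows, so the ensemble behaves much like a single decision tree and gains little from averaging.

The key steps in this example are: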

Generate a synthetic regression dataset
Split the data into train and test sets
Train BaggingRegressor models with bootstrap set to True and False
Evaluate the mean squared error of each model on the test set
Compare the relative difference in performance between the two configurations

Some tips and heuristics for setting bootstrap:

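Use True (default) when working with smaller datasets or when base estimators have high variance and tend to overfit, as unpruned decision trees do
Consider False mainly together with a reduced max_samples or max_features, for example on larger datasets; without either, deterministic base estimators are all trained on the same data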
Experiment with both options and compare performance using cross-validation
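
The last tip can be sketched with cross_val_score. The dataset size, the number of folds, and the cv_mse dictionary are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic dataset so the comparison runs quickly
X, y = make_regression(n_samples=300, n_features=20, noise=0.1, random_state=42)

cv_mse = {}
for bootstrap in (True, False):
    model = BaggingRegressor(n_estimators=10, bootstrap=bootstrap, random_state=42)
    # scikit-learn reports negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[bootstrap] = -scores.mean()
    print(f"bootstrap={bootstrap}, mean CV MSE: {cv_mse[bootstrap]:.3f}")
```

Averaging over several folds gives a steadier comparison than the single train/test split used earlier.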

Issues to consider:

The optimal choice depends on the characteristics of your dataset and base estimators
Using bootstrap=True reduces variance but not bias, so it helps little when the base estimators already underfit
Setting bootstrap=False may lead to stronger individual models but less diversity in the ensemble, unless max_samples or max_features is lowered
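
The diversity point can be checked on a fitted model through its estimators_samples_ attribute, which lists the row indices drawn for each base estimator. The sketch below counts distinct rows per estimator; the configs dictionary, its labels, and the max_samples=0.5 value are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Three sampling setups to compare (labels are only for printing)
configs = {
    "bootstrap=True": dict(bootstrap=True),
    "bootstrap=False": dict(bootstrap=False),
    "bootstrap=False, max_samples=0.5": dict(bootstrap=False, max_samples=0.5),
}

unique_counts = {}
for label, params in configs.items():
    model = BaggingRegressor(n_estimators=5, random_state=0, **params).fit(X, y)
    # Number of distinct training rows seen by each base estimator
    unique_counts[label] = [np.unique(idx).size for idx in model.estimators_samples_]
    print(label, unique_counts[label])
```

With bootstrap=False alone, every estimator sees all 200 rows, so any diversity has to come from the base estimator itself. Lowering max_samples gives each estimator a different half of the data without repeating rows.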

See Also