The bootstrap_features parameter in scikit-learn’s BaggingRegressor controls whether features are sampled with replacement when building individual base estimators.
Bagging is an ensemble method that combines the predictions of multiple base estimators, each trained on a random subset of the data, to improve generalization and reduce overfitting.
When bootstrap_features is set to True, each base estimator is trained on a random subset of features, potentially with duplicates. This can increase the diversity of the ensemble and may improve performance on datasets with many irrelevant features.
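Conceptually, the per-estimator feature sampling resembles a draw of feature indices with or without replacement. The snippet below is a minimal NumPy illustration of that idea, not scikit-learn's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10

# With replacement: indices may repeat, as with bootstrap_features=True
with_replacement = rng.choice(n_features, size=n_features, replace=True)

# Without replacement: every feature appears exactly once
without_replacement = rng.choice(n_features, size=n_features, replace=False)

print(sorted(with_replacement))
print(sorted(without_replacement))
```

Because sampling with replacement can repeat indices, some features are duplicated while others are left out entirely, which is the source of the extra ensemble diversity.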
The default value for bootstrap_features is False, meaning all features are used for each base estimator.
In practice, setting bootstrap_features to True can be beneficial when dealing with high-dimensional data or when you want to increase the randomness in the ensemble.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different bootstrap_features values
bootstrap_features_values = [False, True]
mse_scores = []
for bootstrap_features in bootstrap_features_values:
    br = BaggingRegressor(n_estimators=10, bootstrap_features=bootstrap_features, random_state=42)
    br.fit(X_train, y_train)
    y_pred = br.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"bootstrap_features={bootstrap_features}, MSE: {mse:.3f}")
# Compare relative performance
relative_performance = (mse_scores[0] - mse_scores[1]) / mse_scores[0] * 100
print(f"Relative improvement: {relative_performance:.2f}%")
Running the example gives an output like:
bootstrap_features=False, MSE: 7486.481
bootstrap_features=True, MSE: 17215.800
Relative improvement: -129.96%

Note that the "improvement" here is negative: on this synthetic dataset, where all 20 features are informative, bootstrap_features=True more than doubles the test MSE. The setting tends to pay off mainly when many features are irrelevant, so always verify its effect on your own data.
The key steps in this example are:
- Generate a synthetic regression dataset with multiple features
- Split the data into train and test sets
- Train BaggingRegressor models with different bootstrap_features values
- Evaluate the mean squared error of each model on the test set
- Compare the relative performance improvement
Some tips and heuristics for setting bootstrap_features:
- Use bootstrap_features=True when dealing with high-dimensional data
- Consider using it in combination with max_features to control feature subset size
- Experiment with both True and False values to see which performs better on your specific dataset
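To illustrate the max_features combination mentioned above, the sketch below (using an assumed synthetic dataset) limits each base estimator to 10 of 50 features, drawn with replacement. The per-estimator feature indices are available afterwards in the fitted model's estimators_features_ attribute:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=42)

# Each of the 25 estimators sees 10 features sampled with replacement
br = BaggingRegressor(
    n_estimators=25,
    max_features=10,
    bootstrap_features=True,
    random_state=42,
)
br.fit(X, y)

# Feature indices used by the first estimator; duplicates are possible
print(br.estimators_features_[0])
```

Without max_features, each estimator draws as many feature indices as there are features; combining the two parameters gives direct control over subset size and overlap.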
Issues to consider:
- Setting bootstrap_features=True may increase training time due to the additional randomization
- The impact of bootstrap_features can vary depending on the nature of your dataset and the base estimator used
- When using bootstrap_features=True, consider increasing n_estimators to ensure sufficient feature coverage
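The feature-coverage concern in the last point can be checked directly from estimators_features_. This sketch (on an assumed 30-feature synthetic dataset, with 5 features per estimator) counts how many distinct features the ensemble touches as n_estimators grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=30, noise=0.1, random_state=42)

coverage = {}
for n_estimators in [5, 50]:
    br = BaggingRegressor(
        n_estimators=n_estimators,
        max_features=5,
        bootstrap_features=True,
        random_state=42,
    ).fit(X, y)
    # Distinct feature indices used across the whole ensemble
    covered = np.unique(np.concatenate(br.estimators_features_))
    coverage[n_estimators] = covered.size
    print(f"n_estimators={n_estimators}: {covered.size}/30 features used")
```

With only 5 estimators drawing 5 indices each, at most 25 of the 30 features can appear, so some features never influence any prediction; a larger ensemble closes that gap.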