The bootstrap_features parameter in scikit-learn’s BaggingRegressor controls whether features are sampled with replacement when building individual base estimators.
Bagging is an ensemble method that combines the predictions of multiple base estimators, each trained on a random subset of the data, to improve generalization and reduce overfitting.
When bootstrap_features is set to True, each base estimator is trained on a random subset of features, potentially with duplicates. This can increase the diversity of the ensemble and may improve performance on datasets with many irrelevant features.
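Conceptually, the per-estimator feature sampling resembles a draw of feature indices with or without replacement. The snippet below is a minimal NumPy illustration of that idea, not scikit-learn's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 10

# With replacement: indices may repeat, as with bootstrap_features=True
with_replacement = rng.choice(n_features, size=n_features, replace=True)

# Without replacement: every feature appears exactly once
without_replacement = rng.choice(n_features, size=n_features, replace=False)

print(sorted(with_replacement))
print(sorted(without_replacement))
```

Because sampling with replacement can repeat indices, some features are duplicated while others are left out entirely, which is the source of the extra ensemble diversity.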
The default value for bootstrap_features is False, meaning all features are used for each base estimator.
In practice, setting bootstrap_features to True can be beneficial when dealing with high-dimensional data or when you want to increase the randomness in the ensemble.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different bootstrap_features values
bootstrap_features_values = [False, True]
mse_scores = []
for bootstrap_features in bootstrap_features_values:
    br = BaggingRegressor(n_estimators=10, bootstrap_features=bootstrap_features, random_state=42)
    br.fit(X_train, y_train)
    y_pred = br.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"bootstrap_features={bootstrap_features}, MSE: {mse:.3f}")
# Compare relative performance
relative_performance = (mse_scores[0] - mse_scores[1]) / mse_scores[0] * 100
print(f"Relative improvement: {relative_performance:.2f}%")
Running the example gives an output like:
bootstrap_features=False, MSE: 7486.481
bootstrap_features=True, MSE: 17215.800
Relative improvement: -129.96%

Note that the "improvement" here is negative: on this synthetic dataset, where all 20 features are informative, bootstrap_features=True more than doubles the test MSE. The setting tends to pay off mainly when many features are irrelevant, so always verify its effect on your own data.
The key steps in this example are:
- Generate a synthetic regression dataset with multiple features
- Split the data into train and test sets
- Train BaggingRegressor models with different bootstrap_features values
- Evaluate the mean squared error of each model on the test set
- Compare the relative performance improvement
Some tips and heuristics for setting bootstrap_features:
- Use bootstrap_features=True when dealing with high-dimensional data
- Consider using it in combination with max_features to control feature subset size
- Experiment with both True and False values to see which performs better on your specific dataset
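To illustrate the max_features combination mentioned above, the sketch below (using an assumed synthetic dataset) limits each base estimator to 10 of 50 features, drawn with replacement. The per-estimator feature indices are available afterwards in the fitted model's estimators_features_ attribute:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=42)

# Each of the 25 estimators sees 10 features sampled with replacement
br = BaggingRegressor(
    n_estimators=25,
    max_features=10,
    bootstrap_features=True,
    random_state=42,
)
br.fit(X, y)

# Feature indices used by the first estimator; duplicates are possible
print(br.estimators_features_[0])
```

Without max_features, each estimator draws as many feature indices as there are features; combining the two parameters gives direct control over subset size and overlap.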
Issues to consider:
- Setting bootstrap_features=True may increase training time due to the additional randomization
- The impact of bootstrap_features can vary depending on the nature of your dataset and the base estimator used
- When using bootstrap_features=True, consider increasing n_estimators to ensure sufficient feature coverage
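The feature-coverage concern in the last point can be checked directly from estimators_features_. This sketch (on an assumed 30-feature synthetic dataset, with 5 features per estimator) counts how many distinct features the ensemble touches as n_estimators grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=30, noise=0.1, random_state=42)

coverage = {}
for n_estimators in [5, 50]:
    br = BaggingRegressor(
        n_estimators=n_estimators,
        max_features=5,
        bootstrap_features=True,
        random_state=42,
    ).fit(X, y)
    # Distinct feature indices used across the whole ensemble
    covered = np.unique(np.concatenate(br.estimators_features_))
    coverage[n_estimators] = covered.size
    print(f"n_estimators={n_estimators}: {covered.size}/30 features used")
```

With only 5 estimators drawing 5 indices each, at most 25 of the 30 features can appear, so some features never influence any prediction; a larger ensemble closes that gap.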