The `bootstrap_features` parameter in scikit-learn's `BaggingRegressor` controls whether features are sampled with replacement when building each base estimator.

Bagging is an ensemble method that combines predictions from multiple base estimators to improve generalization and reduce overfitting.
When `bootstrap_features` is set to `True`, each base estimator is trained on a random subset of features, potentially with duplicates. This can increase the diversity of the ensemble and may improve performance on datasets with many irrelevant features.
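To see how these feature subsets are drawn, you can inspect the fitted ensemble's `estimators_features_` attribute, which records the feature indices each base estimator was trained on. The sketch below uses a small toy dataset (shapes and parameter values are illustrative); with `bootstrap_features=True`, the same index can appear more than once in an estimator's subset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Toy dataset: 5 features so the sampled indices are easy to read
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

br = BaggingRegressor(n_estimators=5, bootstrap_features=True, random_state=0)
br.fit(X, y)

# estimators_features_ lists the feature indices drawn for each base
# estimator; sampling with replacement means duplicates can appear
for i, feats in enumerate(br.estimators_features_):
    print(f"Estimator {i}: features {sorted(feats.tolist())}")
```

With the default `max_features=1.0`, each estimator still draws as many indices as there are features, but duplicated indices mean some features are effectively left out of that estimator.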
The default value for `bootstrap_features` is `False`, meaning all features are used for each base estimator.
In practice, setting `bootstrap_features` to `True` can be beneficial when dealing with high-dimensional data or when you want to increase the randomness in the ensemble.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different bootstrap_features values
bootstrap_features_values = [False, True]
mse_scores = []

for bootstrap_features in bootstrap_features_values:
    br = BaggingRegressor(n_estimators=10, bootstrap_features=bootstrap_features, random_state=42)
    br.fit(X_train, y_train)
    y_pred = br.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"bootstrap_features={bootstrap_features}, MSE: {mse:.3f}")

# Compare relative performance
relative_performance = (mse_scores[0] - mse_scores[1]) / mse_scores[0] * 100
print(f"Relative improvement: {relative_performance:.2f}%")
```
Running the example gives an output like:
```
bootstrap_features=False, MSE: 7486.481
bootstrap_features=True, MSE: 17215.800
Relative improvement: -129.96%
```

Note that the "improvement" is negative here: on this particular synthetic dataset, feature bootstrapping substantially hurts accuracy, which is why it is worth evaluating both settings on your own data.
The key steps in this example are:
- Generate a synthetic regression dataset with multiple features
- Split the data into train and test sets
- Train `BaggingRegressor` models with different `bootstrap_features` values
- Evaluate the mean squared error of each model on the test set
- Compare the relative performance improvement
Some tips and heuristics for setting `bootstrap_features`:

- Use `bootstrap_features=True` when dealing with high-dimensional data
- Consider using it in combination with `max_features` to control feature subset size
- Experiment with both `True` and `False` values to see which performs better on your specific dataset
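As a sketch of the `max_features` combination mentioned above (the values `max_features=0.5` and `n_estimators=50` are illustrative, not tuned), each base estimator here draws half the features, with replacement:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each base estimator draws 50% of the features, sampled with replacement
br = BaggingRegressor(
    n_estimators=50,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42,
)
br.fit(X_train, y_train)

mse = mean_squared_error(y_test, br.predict(X_test))
print(f"max_features=0.5, bootstrap_features=True, MSE: {mse:.3f}")
```

Here `max_features` fixes how many indices are drawn per estimator (10 of the 20 features), while `bootstrap_features` controls whether those draws are with replacement.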
Issues to consider:
- Setting `bootstrap_features=True` may increase training time due to the additional randomization
- The impact of `bootstrap_features` can vary depending on the nature of your dataset and the base estimator used
- When using `bootstrap_features=True`, consider increasing `n_estimators` to ensure sufficient feature coverage