The bootstrap_features parameter in scikit-learn's BaggingClassifier controls whether features are sampled with replacement when training base estimators.
Bagging (Bootstrap Aggregating) is an ensemble method that combines the predictions of multiple models to reduce variance and improve generalization. The bootstrap_features parameter determines if and how features are randomly sampled for each base estimator.
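For intuition on the "aggregating" half of bagging, here is a toy majority vote over the predictions of three hypothetical base estimators (the prediction arrays are made up for illustration):
import numpy as np
# Each row holds one base estimator's class predictions for five samples (made up)
predictions = np.array([[0, 1, 1, 0, 1],
                        [0, 1, 0, 0, 1],
                        [1, 1, 1, 0, 0]])
# Majority vote across estimators gives the ensemble prediction per sample
ensemble_pred = (predictions.mean(axis=0) >= 0.5).astype(int)
print("ensemble prediction:", ensemble_pred)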
When bootstrap_features is True, features are sampled with replacement, allowing some features to be selected multiple times while others may not be selected at all. This increases diversity among base estimators, potentially improving the ensemble's performance.
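To see what sampling with replacement means in isolation, here is a minimal NumPy sketch, independent of scikit-learn (the feature count of 10 is arbitrary):
import numpy as np
rng = np.random.default_rng(42)
n_features = 10
# With replace=True, indices can repeat, so some features are drawn several
# times while others are never drawn
sampled = rng.choice(n_features, size=n_features, replace=True)
print("sampled feature indices:", sorted(sampled))
print("features never drawn:", sorted(set(range(n_features)) - set(sampled)))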
The default value for bootstrap_features is False; combined with the default max_features=1.0, this means every base estimator is trained on all features. Setting it to True enables feature sampling with replacement, which can be beneficial for high-dimensional datasets or when features are correlated.
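You can verify this behavior on a fitted model via its estimators_features_ attribute, which records the feature indices drawn for each base estimator. A minimal sketch (the dataset is synthetic and the counts are illustrative):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
for flag in [False, True]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=3,
                              bootstrap_features=flag,
                              random_state=0).fit(X, y)
    # Count how many distinct features each base estimator actually received
    unique_counts = [len(np.unique(f)) for f in model.estimators_features_]
    print(f"bootstrap_features={flag}, unique features per estimator: {unique_counts}")
# With False each estimator sees all 20 features; with True duplicates appear,
# so the number of unique features per estimator drops below 20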
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different bootstrap_features values
bootstrap_features_values = [False, True]
accuracies = []
for bootstrap_features in bootstrap_features_values:
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=100,
                                bootstrap_features=bootstrap_features,
                                random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"bootstrap_features={bootstrap_features}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
bootstrap_features=False, Accuracy: 0.900
bootstrap_features=True, Accuracy: 0.915
The key steps in this example are:
- Generate a synthetic classification dataset with informative, redundant, and repeated features
- Split the data into train and test sets
- Train BaggingClassifier models with different bootstrap_features values
- Evaluate the accuracy of each model on the test set
Tips for setting bootstrap_features:
- Use True for high-dimensional datasets or when features may be correlated
- Consider the trade-off: sampling with replacement increases diversity among base estimators, but individual estimators may miss informative features
- Experiment with different values to find the optimal setting for your specific dataset, as in the grid-search sketch after this list
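One way to run that experiment systematically is a small grid search over bootstrap_features together with max_features, reusing X_train and y_train from the example above (the parameter grid here is only an illustration):
from sklearn.model_selection import GridSearchCV
param_grid = {
    "bootstrap_features": [False, True],
    "max_features": [0.5, 0.75, 1.0],  # fraction of features drawn per estimator
}
search = GridSearchCV(BaggingClassifier(estimator=DecisionTreeClassifier(),
                                        n_estimators=100,
                                        random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")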
Issues to consider:
- Feature sampling may not be beneficial for datasets with few highly informative features
- Enabling bootstrap_features can increase computational cost
- The impact of feature sampling depends on the choice of base estimator and dataset characteristics, as the rough comparison below illustrates
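As a rough illustration of the last point, the same comparison can be repeated with a different base estimator; KNeighborsClassifier is chosen arbitrarily here, and results will vary with the dataset (this reuses the train/test split from the example above):
from sklearn.neighbors import KNeighborsClassifier
for base in [DecisionTreeClassifier(), KNeighborsClassifier()]:
    for flag in [False, True]:
        model = BaggingClassifier(estimator=base, n_estimators=50,
                                  bootstrap_features=flag, random_state=42)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{type(base).__name__}, bootstrap_features={flag}: {acc:.3f}")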