The bootstrap_features parameter in scikit-learn's BaggingClassifier controls whether features are sampled with replacement when training base estimators.
Bagging (Bootstrap Aggregating) is an ensemble method that combines the predictions of multiple models to reduce variance and improve generalization. The bootstrap_features parameter determines if and how features are randomly sampled for each base estimator.
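For intuition on the "aggregating" half of bagging, here is a toy majority vote over the predictions of three hypothetical base estimators (the prediction arrays are made up for illustration):
import numpy as np
# Each row holds one base estimator's class predictions for five samples (made up)
predictions = np.array([[0, 1, 1, 0, 1],
                        [0, 1, 0, 0, 1],
                        [1, 1, 1, 0, 0]])
# Majority vote across estimators gives the ensemble prediction per sample
ensemble_pred = (predictions.mean(axis=0) >= 0.5).astype(int)
print("ensemble prediction:", ensemble_pred)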
When bootstrap_features is True, features are sampled with replacement, allowing some features to be selected multiple times while others may not be selected at all. This increases diversity among base estimators, potentially improving the ensemble's performance.
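To see what sampling with replacement means in isolation, here is a minimal NumPy sketch, independent of scikit-learn (the feature count of 10 is arbitrary):
import numpy as np
rng = np.random.default_rng(42)
n_features = 10
# With replace=True, indices can repeat, so some features are drawn several
# times while others are never drawn
sampled = rng.choice(n_features, size=n_features, replace=True)
print("sampled feature indices:", sorted(sampled))
print("features never drawn:", sorted(set(range(n_features)) - set(sampled)))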
The default value for bootstrap_features is False; combined with the default max_features=1.0, this means every base estimator is trained on all features. Setting it to True enables feature sampling with replacement, which can be beneficial for high-dimensional datasets or when features are correlated.
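You can verify this behavior on a fitted model via its estimators_features_ attribute, which records the feature indices drawn for each base estimator. A minimal sketch (the dataset is synthetic and the counts are illustrative):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
for flag in [False, True]:
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=3,
                              bootstrap_features=flag,
                              random_state=0).fit(X, y)
    # Count how many distinct features each base estimator actually received
    unique_counts = [len(np.unique(f)) for f in model.estimators_features_]
    print(f"bootstrap_features={flag}, unique features per estimator: {unique_counts}")
# With False each estimator sees all 20 features; with True duplicates appear,
# so the number of unique features per estimator drops below 20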
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_repeated=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different bootstrap_features values
bootstrap_features_values = [False, True]
accuracies = []
for bootstrap_features in bootstrap_features_values:
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                n_estimators=100,
                                bootstrap_features=bootstrap_features,
                                random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"bootstrap_features={bootstrap_features}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
bootstrap_features=False, Accuracy: 0.900
bootstrap_features=True, Accuracy: 0.915
The key steps in this example are:
- Generate a synthetic classification dataset with informative, redundant, and repeated features
- Split the data into train and test sets
- Train BaggingClassifier models with different bootstrap_features values
- Evaluate the accuracy of each model on the test set
Tips for setting bootstrap_features:
- Use True for high-dimensional datasets or when features may be correlated
- Consider the trade-off: sampling with replacement increases diversity among base estimators, but individual estimators may miss informative features
- Experiment with different values to find the optimal setting for your specific dataset, as in the grid-search sketch after this list
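One way to run that experiment systematically is a small grid search over bootstrap_features together with max_features, reusing X_train and y_train from the example above (the parameter grid here is only an illustration):
from sklearn.model_selection import GridSearchCV
param_grid = {
    "bootstrap_features": [False, True],
    "max_features": [0.5, 0.75, 1.0],  # fraction of features drawn per estimator
}
search = GridSearchCV(BaggingClassifier(estimator=DecisionTreeClassifier(),
                                        n_estimators=100,
                                        random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")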
Issues to consider:
- Feature sampling may not be beneficial for datasets with few highly informative features
- Enabling bootstrap_features can increase computational cost
- The impact of feature sampling depends on the choice of base estimator and dataset characteristics, as the rough comparison below illustrates
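As a rough illustration of the last point, the same comparison can be repeated with a different base estimator; KNeighborsClassifier is chosen arbitrarily here, and results will vary with the dataset (this reuses the train/test split from the example above):
from sklearn.neighbors import KNeighborsClassifier
for base in [DecisionTreeClassifier(), KNeighborsClassifier()]:
    for flag in [False, True]:
        model = BaggingClassifier(estimator=base, n_estimators=50,
                                  bootstrap_features=flag, random_state=42)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{type(base).__name__}, bootstrap_features={flag}: {acc:.3f}")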