The max_features parameter in scikit-learn's BaggingClassifier controls the number of features randomly selected for each base estimator.

Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a classifier on each subset, and combines their predictions. The max_features parameter determines how many features are considered when creating these subsets.
Setting max_features to less than the total number of features can help reduce overfitting by introducing randomness into feature selection. It also increases diversity among the models in the ensemble, which can lead to better generalization.
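One way to see the diversity effect is to measure how often the fitted base estimators disagree with one another. The sketch below uses the fitted estimators_ and estimators_features_ attributes and an ad-hoc pairwise disagreement measure; the dataset and settings are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
# Synthetic dataset for the sketch
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
def mean_disagreement(bagging, X):
    # Predict with each fitted base estimator on its own feature subset
    preds = np.array([est.predict(X[:, feats])
                      for est, feats in zip(bagging.estimators_,
                                            bagging.estimators_features_)])
    n = len(preds)
    # Average fraction of samples on which two base estimators disagree
    return np.mean([(preds[i] != preds[j]).mean()
                    for i in range(n) for j in range(i + 1, n)])
for mf in (1.0, 0.5):
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              max_features=mf, n_estimators=20,
                              random_state=42).fit(X, y)
    print(f"max_features={mf}, mean pairwise disagreement: "
          f"{mean_disagreement(model, X):.3f}")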
The default value for max_features is 1.0, which means all features are used: a float is interpreted as a fraction of the total number of features, while an int is interpreted as an absolute count. Common choices include the square root or log2 of the total number of features, or a fraction between 0.5 and 0.8.
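As a minimal sketch of the two forms (the 20-feature dataset is hypothetical), both models below draw 10 features for each base estimator:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# With 20 input features, a float of 0.5 and an int of 10 are equivalent:
# the float is read as a fraction of the features, the int as a count
clf_fraction = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                 max_features=0.5, random_state=42)
clf_count = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              max_features=10, random_state=42)

The complete example below compares several max_features settings on a synthetic dataset.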
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_features values
max_features_values = [0.5, 0.7, 1.0, round(np.sqrt(X.shape[1])), round(np.log2(X.shape[1]))]
accuracies = []
for max_feat in max_features_values:
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                max_features=max_feat,
                                n_estimators=100,
                                random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={max_feat}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_features=0.5, Accuracy: 0.895
max_features=0.7, Accuracy: 0.890
max_features=1.0, Accuracy: 0.880
max_features=4, Accuracy: 0.860
max_features=4, Accuracy: 0.860
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train BaggingClassifier models with different max_features values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default value of 1.0 and experiment with lower values, for example with a small grid search as sketched after this list
- For datasets with many features, try values derived from the square root or log2 of the feature count; note that BaggingClassifier expects an int or float for max_features, so compute these values yourself rather than passing the 'sqrt' or 'log2' strings accepted by RandomForestClassifier
- Consider the trade-off between ensemble diversity and each estimator's access to important features
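As a rough sketch of the first tip, max_features can be tuned with cross-validation; the candidate grid, n_estimators, and dataset below are illustrative, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
# Synthetic dataset for the sketch
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
# Candidate fractions of features to draw for each base estimator
param_grid = {"max_features": [0.3, 0.5, 0.7, 1.0]}
search = GridSearchCV(BaggingClassifier(estimator=DecisionTreeClassifier(),
                                        n_estimators=50, random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)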
Issues to consider:
- Lower max_features values increase randomness but may exclude important features from some base estimators; the sketch after this list shows how to check which features each estimator received
- Higher values may lead to overfitting if many irrelevant features are present
- The optimal value depends on the specific dataset and problem
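To check feature coverage at a given setting, you can count how often each feature index appears in the fitted model's estimators_features_ attribute; a minimal sketch, with an illustrative dataset and settings:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
# Small synthetic dataset for the sketch
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            max_features=0.5, n_estimators=10,
                            random_state=42).fit(X, y)
# estimators_features_ holds the column indices drawn for each base estimator;
# counting how often each index appears shows which features were rarely used
counts = np.bincount(np.concatenate(bagging.estimators_features_),
                     minlength=X.shape[1])
for idx, count in enumerate(counts):
    print(f"feature {idx}: used by {count} of {len(bagging.estimators_)} estimators")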