The max_features parameter in scikit-learn's BaggingClassifier controls the number of features randomly selected for each base estimator.

Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a classifier on each subset, and combines their predictions. The max_features parameter determines how many features are considered when creating these subsets.
Setting max_features to less than the total number of features can help reduce overfitting by introducing randomness into feature selection. It also increases diversity among the models in the ensemble, which can lead to better generalization.
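One way to see the diversity effect is to measure how often the fitted base estimators disagree with one another. The sketch below uses the fitted estimators_ and estimators_features_ attributes and an ad-hoc pairwise disagreement measure; the dataset and settings are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
# Synthetic dataset for the sketch
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
def mean_disagreement(bagging, X):
    # Predict with each fitted base estimator on its own feature subset
    preds = np.array([est.predict(X[:, feats])
                      for est, feats in zip(bagging.estimators_,
                                            bagging.estimators_features_)])
    n = len(preds)
    # Average fraction of samples on which two base estimators disagree
    return np.mean([(preds[i] != preds[j]).mean()
                    for i in range(n) for j in range(i + 1, n)])
for mf in (1.0, 0.5):
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              max_features=mf, n_estimators=20,
                              random_state=42).fit(X, y)
    print(f"max_features={mf}, mean pairwise disagreement: "
          f"{mean_disagreement(model, X):.3f}")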
The default value for max_features is 1.0, which means all features are used: a float is interpreted as a fraction of the total number of features, while an int is interpreted as an absolute count. Common choices include the square root or log2 of the total number of features, or a fraction between 0.5 and 0.8.
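As a minimal sketch of the two forms (the 20-feature dataset is hypothetical), both models below draw 10 features for each base estimator:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# With 20 input features, a float of 0.5 and an int of 10 are equivalent:
# the float is read as a fraction of the features, the int as a count
clf_fraction = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                 max_features=0.5, random_state=42)
clf_count = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              max_features=10, random_state=42)

The complete example below compares several max_features settings on a synthetic dataset.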
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_features values
max_features_values = [0.5, 0.7, 1.0, round(np.sqrt(X.shape[1])), round(np.log2(X.shape[1]))]
accuracies = []
for max_feat in max_features_values:
    bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                                max_features=max_feat,
                                n_estimators=100,
                                random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={max_feat}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_features=0.5, Accuracy: 0.895
max_features=0.7, Accuracy: 0.890
max_features=1.0, Accuracy: 0.880
max_features=4, Accuracy: 0.860
max_features=4, Accuracy: 0.860
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train BaggingClassifier models with different max_features values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default value of 1.0 and experiment with lower values, for example with a small grid search as sketched after this list
- For datasets with many features, try values derived from the square root or log2 of the feature count; note that BaggingClassifier expects an int or float for max_features, so compute these values yourself rather than passing the 'sqrt' or 'log2' strings accepted by RandomForestClassifier
- Consider the trade-off between ensemble diversity and each estimator's access to important features
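As a rough sketch of the first tip, max_features can be tuned with cross-validation; the candidate grid, n_estimators, and dataset below are illustrative, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
# Synthetic dataset for the sketch
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
# Candidate fractions of features to draw for each base estimator
param_grid = {"max_features": [0.3, 0.5, 0.7, 1.0]}
search = GridSearchCV(BaggingClassifier(estimator=DecisionTreeClassifier(),
                                        n_estimators=50, random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)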
Issues to consider:
- Lower max_features values increase randomness but may exclude important features from some base estimators; the sketch after this list shows how to check which features each estimator received
- Higher values may lead to overfitting if many irrelevant features are present
- The optimal value depends on the specific dataset and problem
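To check feature coverage at a given setting, you can count how often each feature index appears in the fitted model's estimators_features_ attribute; a minimal sketch, with an illustrative dataset and settings:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
# Small synthetic dataset for the sketch
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            max_features=0.5, n_estimators=10,
                            random_state=42).fit(X, y)
# estimators_features_ holds the column indices drawn for each base estimator;
# counting how often each index appears shows which features were rarely used
counts = np.bincount(np.concatenate(bagging.estimators_features_),
                     minlength=X.shape[1])
for idx, count in enumerate(counts):
    print(f"feature {idx}: used by {count} of {len(bagging.estimators_)} estimators")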