The `max_features` parameter in scikit-learn's `BaggingRegressor` controls the number of features randomly selected for each base estimator.

`BaggingRegressor` is an ensemble method that fits multiple base regressors on random subsets of the original dataset and aggregates their predictions. The `max_features` parameter determines the size of the feature subset each base estimator is trained on.

Setting `max_features` helps balance the trade-off between model diversity and individual estimator performance. Lower values increase diversity but may reduce the performance of individual estimators, while higher values do the opposite.

The default value for `max_features` is 1.0, which means all features are used. Common values range from 0.5 to 1.0, depending on the dataset's characteristics and the desired balance between diversity and individual estimator strength.
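Note that `max_features` accepts either a float (interpreted as a fraction of the total features) or an int (an absolute feature count). A minimal sketch illustrating both forms on an arbitrary 10-feature dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# max_features as a float: a fraction of the features per base estimator
frac_model = BaggingRegressor(max_features=0.5, random_state=0).fit(X, y)

# max_features as an int: an absolute number of features per base estimator
count_model = BaggingRegressor(max_features=5, random_state=0).fit(X, y)

# Both give each base estimator a 5-feature subset, visible in
# the fitted ensemble's estimators_features_ attribute
print(len(frac_model.estimators_features_[0]))   # 5
print(len(count_model.estimators_features_[0]))  # 5
```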
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [0.5, 0.7, 0.9, 1.0]
mse_scores = []

for max_feat in max_features_values:
    bgr = BaggingRegressor(max_features=max_feat, random_state=42)
    bgr.fit(X_train, y_train)
    y_pred = bgr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_features={max_feat}, MSE: {mse:.3f}")

# Find best max_features value
best_max_features = max_features_values[np.argmin(mse_scores)]
print(f"Best max_features value: {best_max_features}")
```
Running the example gives an output like:
```
max_features=0.5, MSE: 13034.131
max_features=0.7, MSE: 9486.080
max_features=0.9, MSE: 7813.246
max_features=1.0, MSE: 7486.481
Best max_features value: 1.0
```
The key steps in this example are:
- Generate a synthetic regression dataset with 20 features
- Split the data into train and test sets
- Train `BaggingRegressor` models with different `max_features` values
- Evaluate the mean squared error of each model on the test set
- Identify the best `max_features` value based on the lowest MSE
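The manual loop above can also be expressed as a cross-validated search. A sketch using `GridSearchCV` (the grid values mirror the example, and the scoring choice is one reasonable option):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Cross-validated search over max_features instead of a manual loop
param_grid = {"max_features": [0.5, 0.7, 0.9, 1.0]}
grid = GridSearchCV(
    BaggingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)

# Best value according to cross-validation, not a single held-out split
print(grid.best_params_)
```

Using cross-validation makes the choice of `max_features` less sensitive to a single train/test split than the manual loop.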
Some tips and heuristics for setting `max_features`:
- Start with the default value of 1.0 and decrease it to increase ensemble diversity
- For high-dimensional datasets, lower values (e.g., 0.5-0.7) often work well
- Consider the trade-off between diversity and individual estimator performance
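One rough way to see the diversity effect directly (a sketch, not part of the original example) is to compare the spread of per-estimator predictions at different `max_features` settings, using the fitted ensemble's `estimators_` and `estimators_features_` attributes. Lower `max_features` typically yields a larger spread:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)

def mean_prediction_spread(model, X):
    # Each base estimator was trained on its own feature subset, so X
    # must be sliced with the matching entry of estimators_features_
    preds = np.array([
        est.predict(X[:, feats])
        for est, feats in zip(model.estimators_, model.estimators_features_)
    ])
    # Average, over samples, of the std of predictions across estimators
    return preds.std(axis=0).mean()

for max_feat in [0.5, 1.0]:
    model = BaggingRegressor(max_features=max_feat, random_state=42).fit(X, y)
    print(f"max_features={max_feat}, spread: {mean_prediction_spread(model, X):.2f}")
```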
Issues to consider:
- The optimal `max_features` value depends on the dataset's characteristics
- Lower values can increase training time, as more base estimators may be needed
- Very low values might lead to underfitting of individual base estimators
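To illustrate the last point, a quick sketch (the extreme value is chosen purely for demonstration): with `max_features=0.05` on a 20-feature dataset, each base estimator sees only a single feature, and test error degrades sharply compared with the default:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 0.05 * 20 features = 1 feature per base estimator: each tree underfits badly
for max_feat in [0.05, 1.0]:
    bgr = BaggingRegressor(max_features=max_feat, random_state=42).fit(X_train, y_train)
    mse = mean_squared_error(y_test, bgr.predict(X_test))
    print(f"max_features={max_feat}, MSE: {mse:.1f}")
```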