The `max_features` parameter in scikit-learn's `BaggingRegressor` controls the number of features randomly selected for each base estimator.

`BaggingRegressor` is an ensemble method that fits multiple base regressors on random subsets of the original dataset and aggregates their predictions. The `max_features` parameter determines the size of the feature subset each base estimator is trained on.

Setting `max_features` helps balance the trade-off between model diversity and individual estimator performance. Lower values increase diversity but may reduce the performance of individual estimators, while higher values do the opposite.

The default value for `max_features` is 1.0, which means all features are used. Common values range from 0.5 to 1.0, depending on the dataset's characteristics and the desired balance between diversity and individual estimator strength.
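Note that `max_features` accepts either a float (interpreted as a fraction of the total features) or an int (an absolute feature count). A minimal sketch illustrating both forms on an arbitrary 10-feature dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# max_features as a float: a fraction of the features per base estimator
frac_model = BaggingRegressor(max_features=0.5, random_state=0).fit(X, y)

# max_features as an int: an absolute number of features per base estimator
count_model = BaggingRegressor(max_features=5, random_state=0).fit(X, y)

# Both give each base estimator a 5-feature subset, visible in
# the fitted ensemble's estimators_features_ attribute
print(len(frac_model.estimators_features_[0]))   # 5
print(len(count_model.estimators_features_[0]))  # 5
```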
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [0.5, 0.7, 0.9, 1.0]
mse_scores = []

for max_feat in max_features_values:
    bgr = BaggingRegressor(max_features=max_feat, random_state=42)
    bgr.fit(X_train, y_train)
    y_pred = bgr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_features={max_feat}, MSE: {mse:.3f}")

# Find best max_features value
best_max_features = max_features_values[np.argmin(mse_scores)]
print(f"Best max_features value: {best_max_features}")
```
Running the example gives an output like:
```
max_features=0.5, MSE: 13034.131
max_features=0.7, MSE: 9486.080
max_features=0.9, MSE: 7813.246
max_features=1.0, MSE: 7486.481
Best max_features value: 1.0
```
The key steps in this example are:
- Generate a synthetic regression dataset with 20 features
- Split the data into train and test sets
- Train `BaggingRegressor` models with different `max_features` values
- Evaluate the mean squared error of each model on the test set
- Identify the best `max_features` value based on the lowest MSE
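The manual loop above can also be expressed as a cross-validated search. A sketch using `GridSearchCV` (the grid values mirror the example, and the scoring choice is one reasonable option):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Cross-validated search over max_features instead of a manual loop
param_grid = {"max_features": [0.5, 0.7, 0.9, 1.0]}
grid = GridSearchCV(
    BaggingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)

# Best value according to cross-validation, not a single held-out split
print(grid.best_params_)
```

Using cross-validation makes the choice of `max_features` less sensitive to a single train/test split than the manual loop.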
Some tips and heuristics for setting `max_features`:
- Start with the default value of 1.0 and decrease it to increase ensemble diversity
- For high-dimensional datasets, lower values (e.g., 0.5-0.7) often work well
- Consider the trade-off between diversity and individual estimator performance
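One rough way to see the diversity effect directly (a sketch, not part of the original example) is to compare the spread of per-estimator predictions at different `max_features` settings, using the fitted ensemble's `estimators_` and `estimators_features_` attributes. Lower `max_features` typically yields a larger spread:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)

def mean_prediction_spread(model, X):
    # Each base estimator was trained on its own feature subset, so X
    # must be sliced with the matching entry of estimators_features_
    preds = np.array([
        est.predict(X[:, feats])
        for est, feats in zip(model.estimators_, model.estimators_features_)
    ])
    # Average, over samples, of the std of predictions across estimators
    return preds.std(axis=0).mean()

for max_feat in [0.5, 1.0]:
    model = BaggingRegressor(max_features=max_feat, random_state=42).fit(X, y)
    print(f"max_features={max_feat}, spread: {mean_prediction_spread(model, X):.2f}")
```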
Issues to consider:
- The optimal `max_features` value depends on the dataset's characteristics
- Lower values can increase training time, as more base estimators may be needed
- Very low values might lead to underfitting of individual base estimators
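To illustrate the last point, a quick sketch (the extreme value is chosen purely for demonstration): with `max_features=0.05` on a 20-feature dataset, each base estimator sees only a single feature, and test error degrades sharply compared with the default:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 0.05 * 20 features = 1 feature per base estimator: each tree underfits badly
for max_feat in [0.05, 1.0]:
    bgr = BaggingRegressor(max_features=max_feat, random_state=42).fit(X_train, y_train)
    mse = mean_squared_error(y_test, bgr.predict(X_test))
    print(f"max_features={max_feat}, MSE: {mse:.1f}")
```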