The max_samples parameter in scikit-learn's BaggingRegressor controls the number of samples drawn from the training set to train each base estimator.
Bagging, short for Bootstrap Aggregating, is an ensemble method that combines predictions from multiple base estimators trained on different subsets of the data. The max_samples parameter determines the size of these subsets.
Setting max_samples affects the diversity of the base estimators. Smaller values increase diversity but may lead to underfitting, while larger values reduce diversity but can improve individual estimator performance.
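The default value for max_samples is 1.0, which means each base estimator is trained on a sample as large as the training set. Because BaggingRegressor draws with replacement by default (bootstrap=True), that sample contains duplicate rows and covers only about 63% of the distinct training rows on average. A float is interpreted as a fraction of the training set, while an integer is an absolute number of samples. In practice, values between 0.5 and 1.0 are commonly used, depending on the dataset size and desired trade-off between diversity and individual estimator performance.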
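To make the effect of max_samples on subset size concrete, the following minimal sketch inspects the fitted estimators_samples_ attribute, which holds the row indices drawn for each base estimator. The dataset size and the two max_samples values are arbitrary choices for illustration, and the exact fractions will vary with the random seed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Illustrative dataset; the size is an arbitrary choice
X, y = make_regression(n_samples=800, n_features=20, noise=0.1, random_state=42)

summary = {}
for samples in [0.5, 1.0]:
    bagging = BaggingRegressor(n_estimators=10, max_samples=samples, random_state=42)
    bagging.fit(X, y)
    # Number of rows drawn for each base estimator (duplicates included)
    drawn = [len(idx) for idx in bagging.estimators_samples_]
    # Share of distinct training rows seen by each base estimator
    unique_frac = [len(np.unique(idx)) / X.shape[0] for idx in bagging.estimators_samples_]
    summary[samples] = (drawn[0], float(np.mean(unique_frac)))
    print(f"max_samples={samples}: drawn={drawn[0]}, mean unique fraction={np.mean(unique_frac):.2f}")
```

With bootstrap sampling left at its default, the unique fraction should come out noticeably lower than the max_samples value itself, because some rows are drawn more than once.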
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_samples values
max_samples_values = [0.1, 0.5, 0.8, 1.0]
mse_scores = []
for samples in max_samples_values:
    bagging = BaggingRegressor(n_estimators=10, max_samples=samples, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples={samples}, MSE: {mse:.3f}")
# Find best max_samples value
best_max_samples = max_samples_values[np.argmin(mse_scores)]
print(f"Best max_samples value: {best_max_samples}")
Running the example gives an output like:
max_samples=0.1, MSE: 14174.491
max_samples=0.5, MSE: 9104.987
max_samples=0.8, MSE: 7360.430
max_samples=1.0, MSE: 7486.481
Best max_samples value: 0.8
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train BaggingRegressor models with different max_samples values
- Evaluate the mean squared error of each model on the test set
- Identify the best max_samples value based on the lowest MSE
Some tips and heuristics for setting max_samples:
- Start with the default value of 1.0 and decrease it to find a balance between diversity and performance
- For large datasets, smaller values (e.g., 0.5-0.8) often work well
- For smaller datasets, using higher values or even 1.0 may be necessary to ensure sufficient training data for each base estimator
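One way to act on these heuristics is to search over a few candidate values with cross-validation instead of relying on a single train/test split, where small differences in MSE can be noise. The sketch below uses GridSearchCV for this; the candidate grid, the 5-fold setting, and the scoring choice are assumptions made for illustration, not recommendations from the example above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

# Same kind of synthetic data as in the main example
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Candidate values to compare; this grid is an illustrative assumption
param_grid = {"max_samples": [0.1, 0.5, 0.8, 1.0]}

grid = GridSearchCV(
    BaggingRegressor(n_estimators=10, random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
grid.fit(X, y)

# Report the mean cross-validated MSE for each candidate
for value, score in zip(param_grid["max_samples"], grid.cv_results_["mean_test_score"]):
    print(f"max_samples={value}, mean CV MSE: {-score:.3f}")

best_max_samples = grid.best_params_["max_samples"]
print(f"Best max_samples value: {best_max_samples}")
```

Averaging over folds gives a steadier ranking of the candidates, at the cost of fitting the ensemble several times per value.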
Issues to consider:
- The optimal max_samples value depends on the dataset size and complexity
- Lower values increase diversity but may lead to underfitting of individual estimators
- Higher values reduce diversity but can improve individual estimator performance
- Smaller values shrink each estimator's training set and usually shorten fitting time, so weigh computational cost against model performance when adjusting this parameter
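The computational side of this trade-off can be checked directly by timing the fit for a few max_samples values. This is a rough sketch: the dataset size and candidate values are arbitrary, and wall-clock timings depend on the machine and fluctuate between runs, so they are best read as a trend, not as exact figures.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Illustrative dataset; larger data makes timing differences easier to see
X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=42)

fit_times = {}
for samples in [0.1, 0.5, 1.0]:
    bagging = BaggingRegressor(n_estimators=10, max_samples=samples, random_state=42)
    start = time.perf_counter()
    bagging.fit(X, y)
    # Elapsed wall-clock time for fitting the whole ensemble
    fit_times[samples] = time.perf_counter() - start
    print(f"max_samples={samples}, fit time: {fit_times[samples]:.3f}s")
```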