Configure ExtraTreesRegressor "max_samples" Parameter

The max_samples parameter in scikit-learn’s ExtraTreesRegressor controls how many samples are drawn from X to train each base estimator. It only takes effect when bootstrap=True.

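Extra Trees Regressor is an ensemble method that fits a number of randomized decision trees and averages their predictions. By default (bootstrap=False) every tree is trained on the whole training set. With bootstrap=True, each tree is trained on a sub-sample drawn with replacement, and max_samples determines the size of that sub-sample.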
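
The dependence on bootstrap can be checked directly. The sketch below fits the same model with bootstrap=True and bootstrap=False while max_samples=0.5. Recent scikit-learn releases are expected to reject the second combination with a ValueError, while older releases may accept it and ignore max_samples, so the code reports the outcome rather than assuming one. The helper name fits_without_error is made up for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)


def fits_without_error(bootstrap):
    """Hypothetical helper: True if fitting with max_samples=0.5 succeeds."""
    model = ExtraTreesRegressor(
        n_estimators=10, bootstrap=bootstrap, max_samples=0.5, random_state=0
    )
    try:
        model.fit(X, y)
    except ValueError as exc:
        # Behaviour depends on the installed scikit-learn version
        print(f"bootstrap={bootstrap}: rejected ({exc})")
        return False
    print(f"bootstrap={bootstrap}: fit succeeded")
    return True


ok_with_bootstrap = fits_without_error(True)
ok_without_bootstrap = fits_without_error(False)
```

If the second call reports a successful fit, the installed version most likely ignores max_samples when bootstrap=False, so the setting has no effect there.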

When max_samples is set to a float, it must lie in the interval (0.0, 1.0] and represents a fraction of the number of training samples passed to fit. If an integer is used, it represents the absolute number of samples drawn for each tree.
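
To make the float and integer cases concrete, here is a minimal sketch of how a max_samples value translates into a per-tree sample count. It mirrors the rule stated in the scikit-learn documentation for the float case, max(round(n_samples * max_samples), 1). The function resolved_sample_count is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def resolved_sample_count(n_samples, max_samples):
    """Hypothetical helper mirroring the documented rule; not a scikit-learn API."""
    if max_samples is None:
        # Default: draw as many samples as there are training rows
        return n_samples
    if isinstance(max_samples, int):
        # Integer: absolute number of samples per tree
        return max_samples
    # Float: fraction of the training set, never fewer than one sample
    return max(round(n_samples * max_samples), 1)


n_train = 800  # e.g. 80% of a 1000-row dataset
for value in (None, 0.5, 0.7, 200):
    print(f"max_samples={value} -> {resolved_sample_count(n_train, value)} samples per tree")
```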

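The default value for max_samples is None, which means X.shape[0] samples are drawn for each base estimator. With bootstrap=True these are drawn with replacement, so a tree typically sees only about 63% of the distinct training rows. With the default bootstrap=False, every tree is trained on the whole dataset.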
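
The effect of drawing with replacement can be simulated without scikit-learn. The sketch below uses only the standard library and a made-up helper, unique_fraction, to estimate what share of distinct rows ends up in one bootstrap sample. It is an illustration of the sampling arithmetic, not of scikit-learn's internal code.

```python
import random


def unique_fraction(n_samples, n_draws, seed=0):
    """Hypothetical helper: share of distinct rows hit by n_draws draws with replacement."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_draws)}
    return len(drawn) / n_samples


n = 10_000
frac_full = unique_fraction(n, n)       # like max_samples=None with bootstrap=True
frac_half = unique_fraction(n, n // 2)  # like max_samples=0.5

print(f"n draws:   {frac_full:.3f} of rows are distinct")
print(f"n/2 draws: {frac_half:.3f} of rows are distinct")
```

For large n the first value approaches 1 - 1/e (about 0.632), which is why lowering max_samples shrinks the amount of distinct data each tree sees by less than the fraction alone might suggest.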

In practice, values between 0.5 and 0.8 are commonly used when max_samples is set as a float, or values ranging from hundreds to thousands when set as an integer, depending on the dataset size.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_samples values (bootstrap=True is needed for max_samples to apply)
max_samples_values = [0.5, 0.7, 0.9, None]
mse_scores = []

for samples in max_samples_values:
    etr = ExtraTreesRegressor(n_estimators=100, bootstrap=True, max_samples=samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples={samples}, MSE: {mse:.3f}")

# Find best max_samples value
best_max_samples = max_samples_values[np.argmin(mse_scores)]
print(f"Best max_samples value: {best_max_samples}")

Running the example gives an output like:

max_samples=0.5, MSE: 2451.041
max_samples=0.7, MSE: 2340.609
max_samples=0.9, MSE: 2140.951
max_samples=None, MSE: 2122.805
Best max_samples value: None

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train ExtraTreesRegressor models with different max_samples values
Evaluate the mean squared error of each model on the test set
Identify the best max_samples value based on lowest MSE
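
A single train/test split can make the ranking of candidate values sensitive to how the data happened to be split. As a hedged alternative to the steps above, the sketch below compares the same candidates with 3-fold cross-validation via GridSearchCV. The smaller dataset, n_estimators=50 and cv=3 are assumptions chosen to keep the run short, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

candidates = [0.5, 0.7, 0.9, None]
search = GridSearchCV(
    estimator=ExtraTreesRegressor(n_estimators=50, bootstrap=True, random_state=42),
    param_grid={"max_samples": candidates},
    scoring="neg_mean_squared_error",  # scores are negated MSE, higher is better
    cv=3,
)
search.fit(X, y)

# Report the mean cross-validated MSE for each candidate
for params, score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(f"max_samples={params['max_samples']}, mean CV MSE: {-score:.3f}")
print(f"Best max_samples value: {search.best_params_['max_samples']}")
```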

Some tips and heuristics for setting max_samples:

Start with the default value (None) and experiment with lower fractions
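Using a smaller fraction makes the trees less correlated, which can reduce overfitting on some datasets, but it is not guaranteed to help: in the example above, the smallest fraction gave the highest MSE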
Consider the trade-off between computational efficiency and model performance
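
One rough way to see the computational side of this trade-off is to compare how large the fitted trees are. The sketch below uses the average node count per tree as a proxy for training and prediction cost; that proxy, the helper name mean_node_count, and the values 0.3 and 20 trees are assumptions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=800, n_features=10, noise=0.1, random_state=42)


def mean_node_count(max_samples):
    """Hypothetical helper: average number of nodes per fitted tree."""
    model = ExtraTreesRegressor(
        n_estimators=20, bootstrap=True, max_samples=max_samples, random_state=42
    )
    model.fit(X, y)
    # Each fitted tree exposes its structure through the tree_ attribute
    counts = [tree.tree_.node_count for tree in model.estimators_]
    return sum(counts) / len(counts)


small = mean_node_count(0.3)
full = mean_node_count(None)
print(f"max_samples=0.3:  {small:.0f} nodes per tree on average")
print(f"max_samples=None: {full:.0f} nodes per tree on average")
```

Smaller trees are cheaper to build and to evaluate, but whether the accuracy cost is acceptable has to be measured on the dataset at hand.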

Issues to consider:

The optimal max_samples value depends on the size and complexity of the dataset
Using a very small fraction may result in underfitting or high variance
The effect of max_samples interacts with other parameters like n_estimators
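
Because max_samples and n_estimators interact, it can help to vary them together. The sketch below does this using the out-of-bag (OOB) R^2 score, which is available when bootstrap=True and oob_score=True and avoids a separate validation split. The specific grid values are assumptions picked to keep the run small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

oob_scores = {}
for n_estimators in (50, 200):
    for max_samples in (0.5, None):
        model = ExtraTreesRegressor(
            n_estimators=n_estimators,
            bootstrap=True,   # required for both max_samples and oob_score
            oob_score=True,   # score each row using only trees that did not see it
            max_samples=max_samples,
            random_state=42,
        )
        model.fit(X, y)
        oob_scores[(n_estimators, max_samples)] = model.oob_score_

for (n_estimators, max_samples), score in oob_scores.items():
    print(f"n_estimators={n_estimators}, max_samples={max_samples}, OOB R^2: {score:.3f}")
```

OOB scores are only an estimate, so a promising combination is worth confirming with cross-validation or a held-out test set.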

See Also