The max_samples parameter in scikit-learn's ExtraTreesRegressor controls the number of samples drawn from X to train each base estimator. It only takes effect when bootstrap=True.
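Extra Trees Regressor is an ensemble method that fits a number of randomized decision trees and averages their predictions. By default (bootstrap=False) each tree is trained on the whole dataset; with bootstrap=True, each tree is trained on a sub-sample drawn with replacement, and the max_samples parameter determines the size of these sub-samples.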
When max_samples is set to a float in the interval (0, 1], it represents a fraction of the total number of training samples. If an integer is used, it represents the absolute number of samples.
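The float and integer interpretations can be made concrete with a small sketch. The helper below, n_samples_per_tree, is a hypothetical name and not part of scikit-learn; it only mirrors the rule stated in the scikit-learn documentation, where a float draws max(round(n_samples * max_samples), 1) rows.

```python
def n_samples_per_tree(n_samples, max_samples):
    """Hypothetical helper: rows drawn per tree for a given max_samples."""
    if max_samples is None:
        # None: draw as many rows as the training set contains
        return n_samples
    if isinstance(max_samples, int):
        # Integer: an absolute number of rows
        if not 1 <= max_samples <= n_samples:
            raise ValueError("integer max_samples must be in [1, n_samples]")
        return max_samples
    if isinstance(max_samples, float):
        # Float: a fraction of the training set, with at least one row
        if not 0.0 < max_samples <= 1.0:
            raise ValueError("float max_samples must be in (0.0, 1.0]")
        return max(round(n_samples * max_samples), 1)
    raise TypeError("max_samples must be None, an int, or a float")


# 800 training rows, as left by an 80/20 split of 1000 samples
for value in (None, 0.5, 0.75, 200):
    print(f"max_samples={value}: {n_samples_per_tree(800, value)} rows per tree")
```

Note that 1 (an integer) and 1.0 (a float) mean very different things here: one row per tree versus a full-size sample.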
The default value for max_samples is None. With bootstrap=True, this means each base estimator is trained on a bootstrap sample of the same size as the training set (X.shape[0] rows drawn with replacement).
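Because max_samples depends on bootstrapping, it helps to check how it behaves with the default bootstrap=False. The sketch below is illustrative only: it assumes that recent scikit-learn releases raise a ValueError at fit time when max_samples is set without bootstrap=True, while older releases may silently ignore it, so both outcomes are handled.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Default configuration: no bootstrapping and no max_samples
default_model = ExtraTreesRegressor(n_estimators=10, random_state=0)
print(f"bootstrap={default_model.bootstrap}, max_samples={default_model.max_samples}")

# max_samples without bootstrap=True: expected to be rejected by recent
# releases (assumption); older ones may accept it and ignore the setting
try:
    ExtraTreesRegressor(n_estimators=10, max_samples=0.5, random_state=0).fit(X, y)
    outcome = "accepted (max_samples may have been ignored)"
except ValueError as exc:
    outcome = f"rejected: {exc}"
print(outcome)

# With bootstrap=True, max_samples takes effect
subsampled = ExtraTreesRegressor(
    n_estimators=10, bootstrap=True, max_samples=0.5, random_state=0
).fit(X, y)
print(f"fitted {len(subsampled.estimators_)} trees with max_samples=0.5")
```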
In practice, values between 0.5 and 0.8 are commonly used when max_samples is set as a float, or values ranging from hundreds to thousands when set as an integer, depending on the dataset size.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_samples values (bootstrap=True is required for max_samples)
max_samples_values = [0.5, 0.7, 0.9, None]
mse_scores = []
for samples in max_samples_values:
    etr = ExtraTreesRegressor(n_estimators=100, bootstrap=True, max_samples=samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples={samples}, MSE: {mse:.3f}")

# Find best max_samples value
best_max_samples = max_samples_values[np.argmin(mse_scores)]
print(f"Best max_samples value: {best_max_samples}")
Running the example gives an output like:
max_samples=0.5, MSE: 2451.041
max_samples=0.7, MSE: 2340.609
max_samples=0.9, MSE: 2140.951
max_samples=None, MSE: 2122.805
Best max_samples value: None
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different max_samples values
- Evaluate the mean squared error of each model on the test set
- Identify the best max_samples value based on lowest MSE
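The example compares max_samples values on a single train/test split, so the ranking can depend on how the data happened to be divided. As a hedged alternative, the same comparison can be sketched with 5-fold cross-validation using GridSearchCV. The dataset size, n_estimators=50, and cv=5 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Search over the same candidate values, scoring by (negated) MSE
search = GridSearchCV(
    estimator=ExtraTreesRegressor(n_estimators=50, bootstrap=True, random_state=42),
    param_grid={"max_samples": [0.5, 0.7, 0.9, None]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# Report the mean cross-validated MSE for each candidate
results = search.cv_results_
for value, score in zip(results["param_max_samples"], results["mean_test_score"]):
    print(f"max_samples={value}, CV MSE: {-score:.3f}")
print(f"Best max_samples value: {search.best_params_['max_samples']}")
```

Scores are negated because scikit-learn scorers follow a higher-is-better convention.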
Some tips and heuristics for setting max_samples:
- Start with the default value (None) and experiment with lower fractions
- Using a smaller fraction can help reduce overfitting and improve generalization
- Consider the trade-off between computational efficiency and model performance
Issues to consider:
- The optimal max_samples value depends on the size and complexity of the dataset
- Using a very small fraction may result in underfitting or high variance
- The effect of max_samples interacts with other parameters like n_estimators
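To explore how max_samples interacts with n_estimators, one option is to vary both and compare out-of-bag (OOB) scores, which are available when bootstrap=True and oob_score=True. This is a minimal sketch on synthetic data; the grid values are arbitrary assumptions, and the results on a real dataset may look quite different.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

oob_scores = {}
for n_estimators, max_samples in product([50, 200], [0.3, 0.7]):
    model = ExtraTreesRegressor(
        n_estimators=n_estimators,
        bootstrap=True,  # needed for both max_samples and oob_score
        max_samples=max_samples,
        oob_score=True,
        random_state=42,
    )
    model.fit(X, y)
    # oob_score_ is the R^2 computed on rows each tree did not see
    oob_scores[(n_estimators, max_samples)] = model.oob_score_
    print(f"n_estimators={n_estimators}, max_samples={max_samples}, "
          f"OOB R^2: {model.oob_score_:.3f}")
```

Smaller sub-samples leave more out-of-bag rows per tree, so the OOB estimate needs no separate validation set.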