The max_samples parameter in scikit-learn's RandomForestRegressor controls the number of samples drawn from X to train each base estimator (decision tree) in the ensemble.
Random Forest is an ensemble learning method that combines predictions from multiple decision trees to improve generalization performance. Each tree is trained on a bootstrap sample of the original dataset.
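To make the bootstrap idea concrete, here is a minimal sketch in plain NumPy (not scikit-learn's internal sampling code): it draws a bootstrap sample of indices and counts how many original points appear at least once. With sampling with replacement, roughly 63.2% of the points are expected to show up:

import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000

# Draw a bootstrap sample: n_samples indices sampled with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of original points appearing at least once (~63.2% in expectation)
unique_fraction = len(np.unique(bootstrap_idx)) / n_samples
print(f"Unique samples in bootstrap: {unique_fraction:.1%}")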
The max_samples parameter determines the size of the bootstrap sample for each tree. It can be set as an integer representing the absolute number of samples, or as a float in the interval (0, 1] representing a fraction of the total number of samples.
The default value for max_samples is None, which means each tree draws n_samples samples with replacement (equivalent to max_samples=n_samples), so each bootstrap sample is the same size as the training set. Note that max_samples only has an effect when bootstrap=True (the default).
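As a quick illustration of the three accepted forms (the values here are arbitrary and only meant to show the parameter types):

from sklearn.ensemble import RandomForestRegressor

rf_int = RandomForestRegressor(max_samples=500)       # each tree draws 500 samples
rf_frac = RandomForestRegressor(max_samples=0.5)      # each tree draws 50% of n_samples
rf_default = RandomForestRegressor(max_samples=None)  # each tree draws n_samples samples (default)

The complete example below compares several max_samples values on a synthetic regression task: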
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       n_targets=1, noise=0.5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate with different max_samples values
max_samples_values = [None, 0.5, 0.8, 1.0]
mse_scores = []

for max_samples in max_samples_values:
    rf = RandomForestRegressor(n_estimators=100, max_samples=max_samples, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_samples={max_samples}, MSE: {mse:.3f}")
Running the example gives an output like:
max_samples=None, MSE: 208.564
max_samples=0.5, MSE: 236.305
max_samples=0.8, MSE: 224.914
max_samples=1.0, MSE: 208.564
The key steps in this example are:

- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train RandomForestRegressor models with different max_samples values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting max_samples:

- Smaller values introduce more randomness and diversity among the trees, which can improve generalization to new data
- Larger values make the trees more similar to each other but can sometimes improve performance, depending on the problem
- Values between 0.5 and 0.8 can be a good starting point for tuning this parameter (see the tuning sketch after this list)
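One way to act on that heuristic is a small grid search over the suggested range. The sketch below uses GridSearchCV on the same kind of synthetic data as the main example; the grid values and cv=5 are arbitrary choices for illustration, not recommendations from scikit-learn:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.5, random_state=42)

# Search a small grid around the suggested 0.5-0.8 range
param_grid = {"max_samples": [0.5, 0.6, 0.7, 0.8]}
grid = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=42),
                    param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
print(f"Best max_samples: {grid.best_params_['max_samples']}")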
Issues to consider with max_samples:

- There is a tradeoff between introducing randomness/diversity and model performance
- Smaller max_samples values are more computationally efficient, as each tree is trained on fewer data points (a rough timing sketch follows this list)
- The optimal value for max_samples depends on the specific dataset and problem at hand
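To see the efficiency effect directly, the rough timing sketch below fits the same forest with different max_samples values. The absolute times depend on your machine, but smaller values should fit noticeably faster:

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=5000, n_features=10, noise=0.5, random_state=42)

for max_samples in [0.25, 0.5, None]:
    rf = RandomForestRegressor(n_estimators=100, max_samples=max_samples, random_state=42)
    start = time.perf_counter()
    rf.fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"max_samples={max_samples}: fit in {elapsed:.2f}s")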