The max_features parameter in scikit-learn’s RandomForestRegressor controls the number of features considered when looking for the best split at each node of the decision trees.
This parameter can be an integer, float, or string value. If an integer, it specifies the absolute number of features to consider. If a float between 0 and 1, it represents the fraction of the total features to consider. If set to “sqrt”, it considers the square root of the total features, and if set to “log2”, it considers the base-2 logarithm of the total features.
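As a quick illustration of how these settings translate into a per-split feature count, the sketch below reimplements the interpretation rules described above for a 20-feature dataset. The helper function features_per_split is hypothetical, written here only to mirror the described behavior, not part of scikit-learn's API.

```python
import math

# Hypothetical helper mirroring how the different max_features settings
# are interpreted for a dataset with n_features columns (illustrative only).
def features_per_split(max_features, n_features=20):
    if isinstance(max_features, int):
        return max_features                             # absolute count
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))   # fraction of total
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))       # square root
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))       # base-2 logarithm
    raise ValueError(f"unsupported value: {max_features!r}")

for value in [5, 0.5, "sqrt", "log2"]:
    print(value, "->", features_per_split(value))
```

Note that with 20 features, “sqrt” and “log2” both truncate to 4 features per split, which explains why those two settings can produce identical models on such a dataset.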
By default, max_features is set to 1.0 for RandomForestRegressor, which means all features are considered at every split. (Older scikit-learn versions used the string “auto”, which for the regressor was likewise equivalent to using all features; “auto” was deprecated in version 1.1 and later removed. It is only for classifiers that “auto” meant “sqrt”.) The default is often a good starting point and works well in practice.
Adjusting max_features affects the bias-variance tradeoff of the model. A smaller value injects more randomness and decorrelates the trees, which tends to increase bias but decrease variance; a larger value does the opposite, decreasing bias at the cost of higher variance.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=10,
noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_features values
max_features_values = [5, 10, "sqrt", "log2"]
mse_scores = []
for max_features in max_features_values:
    rf = RandomForestRegressor(max_features=max_features, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_features={max_features}, MSE: {mse:.3f}")
Running the example gives an output like:
max_features=5, MSE: 8887.757
max_features=10, MSE: 7194.433
max_features=sqrt, MSE: 9878.019
max_features=log2, MSE: 9878.019
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noisy features
- Split the data into train and test sets
- Train RandomForestRegressor models with different max_features values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default (all features) and compare it against common alternatives such as “sqrt” and “log2”
- If the dataset has a large number of noisy or irrelevant features, consider using a smaller value
- If underfitting, try increasing max_features to reduce bias
- If overfitting, try decreasing max_features to increase bias and reduce variance
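Rather than applying these heuristics by hand, one common approach is to let cross-validation pick the value. The sketch below uses GridSearchCV over a candidate set of max_features values; the specific grid, n_estimators=50, and 3-fold CV are illustrative choices, not recommendations from the text above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative sketch: tune max_features with cross-validation.
X, y = make_regression(n_samples=500, n_features=20, n_informative=10,
                       noise=0.1, random_state=42)

param_grid = {"max_features": [0.3, 0.5, "sqrt", "log2", 1.0]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)
print("best max_features:", search.best_params_["max_features"])
```

Because scoring uses negative MSE, the best score is the least-negative one; search.best_estimator_ is then a forest already refit on the full dataset with the winning value.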
Issues to consider:
- The optimal value depends on the specific dataset and problem
- There is a tradeoff between model complexity and generalization performance
- Using too low a value can lead to underfitting, while too high a value can cause overfitting
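One way to see the complexity/generalization tradeoff described above is to compare train and test error side by side across max_features values; a large gap between the two suggests overfitting, while high error on both suggests underfitting. The values swept below (and n_estimators=50) are illustrative choices for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative sketch: compare train vs. test MSE across max_features
# values to see where the bias-variance tradeoff lands.
X, y = make_regression(n_samples=500, n_features=20, n_informative=10,
                       noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for mf in [2, "sqrt", 0.5, 1.0]:
    rf = RandomForestRegressor(n_estimators=50, max_features=mf,
                               random_state=42)
    rf.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, rf.predict(X_train))
    test_mse = mean_squared_error(y_test, rf.predict(X_test))
    print(f"max_features={mf}: train MSE={train_mse:.1f}, "
          f"test MSE={test_mse:.1f}")
```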