The max_leaf_nodes parameter in scikit-learn's RandomForestRegressor controls the maximum number of leaf nodes in each decision tree, effectively limiting the size of the trees.
Random Forest is an ensemble of decision trees, and the size of each tree can impact the model's performance and computational cost. The max_leaf_nodes parameter provides an alternative to max_depth for controlling tree size: when it is set, trees are grown best-first (splitting the node with the largest impurity reduction first) rather than depth-first.
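To make the contrast concrete, here is a minimal sketch on a single DecisionTreeRegressor (a random forest applies the same limits to every tree it builds); the synthetic dataset and the values 5 and 32 are illustrative choices, not from the example below:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# max_depth caps size indirectly: depth 5 allows at most 2**5 = 32 leaves
depth_limited = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X, y)

# max_leaf_nodes caps the leaf count directly, whatever depth the branches reach
leaf_limited = DecisionTreeRegressor(max_leaf_nodes=32, random_state=42).fit(X, y)

print(depth_limited.get_n_leaves(), depth_limited.get_depth())
print(leaf_limited.get_n_leaves(), leaf_limited.get_depth())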
By default, max_leaf_nodes is set to None, which allows the trees to grow without any limit on the number of leaf nodes. This can lead to large, complex trees that may overfit the training data.
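To see how large unconstrained trees become, one option is to fit a forest with the defaults and inspect each tree's size (a minimal sketch; n_estimators=10 just keeps the output short):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Default max_leaf_nodes=None: each tree grows until its leaves are pure
rf = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Report the size of every fully grown tree in the ensemble
for i, tree in enumerate(rf.estimators_):
    print(f"tree {i}: {tree.get_n_leaves()} leaves, depth {tree.get_depth()}")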
In practice, common values for max_leaf_nodes are powers of 2, such as 64, 128, or 256, depending on the size and complexity of the dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_leaf_nodes values
max_leaf_nodes_values = [None, 64, 128, 256]
mse_scores = []
for max_leaf in max_leaf_nodes_values:
    rf = RandomForestRegressor(max_leaf_nodes=max_leaf, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_leaf_nodes={max_leaf}, MSE: {mse:.3f}")
Running the example gives an output like:
max_leaf_nodes=None, MSE: 2621.793
max_leaf_nodes=64, MSE: 3175.176
max_leaf_nodes=128, MSE: 2767.662
max_leaf_nodes=256, MSE: 2636.956
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train RandomForestRegressor models with different max_leaf_nodes values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting max_leaf_nodes:
- Start with no limit (None) and decrease the value until performance starts to degrade (see the sketch after this list)
- Smaller trees are more interpretable but may underfit the data
- Larger trees can capture more complex relationships but may overfit, especially on small datasets
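The first tip can be applied with cross-validation rather than a single train/test split; a minimal sketch (the candidate values and 5-fold CV are illustrative choices):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Sweep from no limit down through smaller caps and watch where CV error rises
for max_leaf in [None, 512, 256, 128, 64, 32]:
    rf = RandomForestRegressor(max_leaf_nodes=max_leaf, random_state=42)
    scores = cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"max_leaf_nodes={max_leaf}, CV MSE: {-scores.mean():.3f}")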
Issues to consider:
- The optimal value for max_leaf_nodes depends on the size and complexity of the dataset
- Tree size also interacts with other parameters like max_depth, so consider tuning them together (see the sketch below)
- Larger trees have higher computational costs for training and prediction
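Because the two size limits interact, one way to tune them together is a joint grid search; a minimal GridSearchCV sketch on the same kind of synthetic data (the grid values are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Search over both size limits jointly; whichever limit binds first wins
param_grid = {
    "max_leaf_nodes": [None, 64, 256],
    "max_depth": [None, 5, 10],
}
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)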