Configure ExtraTreesRegressor "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s ExtraTreesRegressor controls the maximum number of leaf nodes in each tree.

Extremely Randomized Trees (Extra Trees) is an ensemble method similar to Random Forests, but with additional randomization in the tree-building process: split thresholds are drawn at random rather than searched for exhaustively. For regression, it builds multiple decision trees and averages their predictions.

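max_leaf_nodes caps the number of leaf nodes in each tree. When it is set, scikit-learn grows each tree in best-first fashion, splitting the node with the largest impurity reduction first, until the cap is reached. This limits tree complexity directly and tree depth only indirectly. Lower values produce simpler trees, which can reduce overfitting but may increase bias.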
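As a rough illustration of the cap, the sketch below fits two small ensembles and inspects the fitted trees through estimators_, get_n_leaves() and get_depth(). The dataset size, the cap of 16 and the helper name tree_stats are arbitrary choices made for this illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset (sizes chosen only for illustration)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)


def tree_stats(max_leaf_nodes):
    """Hypothetical helper: fit an ensemble and return the largest
    leaf count and depth found among its individual trees."""
    model = ExtraTreesRegressor(
        n_estimators=20, max_leaf_nodes=max_leaf_nodes, random_state=0
    )
    model.fit(X, y)
    leaves = [tree.get_n_leaves() for tree in model.estimators_]
    depths = [tree.get_depth() for tree in model.estimators_]
    return max(leaves), max(depths)


capped_leaves, capped_depth = tree_stats(16)
free_leaves, free_depth = tree_stats(None)

print(f"max_leaf_nodes=16:   max leaves={capped_leaves}, max depth={capped_depth}")
print(f"max_leaf_nodes=None: max leaves={free_leaves}, max depth={free_depth}")
```

No tree in the capped ensemble should exceed 16 leaves, while the unrestricted trees grow far larger. The depth is reported only for comparison, since max_leaf_nodes does not bound it directly.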

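The default value for max_leaf_nodes is None, which places no limit on the number of leaves, so tree growth is governed only by the other stopping parameters (such as max_depth and min_samples_split). Common values range from 10 to 1000, depending on dataset size and complexity.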

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 50, 100, None]
mse_scores = []

for max_nodes in max_leaf_nodes_values:
    etr = ExtraTreesRegressor(n_estimators=100, max_leaf_nodes=max_nodes, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_leaf_nodes={max_nodes}, MSE: {mse:.3f}")

Running the example gives an output like:

max_leaf_nodes=10, MSE: 5725.319
max_leaf_nodes=50, MSE: 3250.931
max_leaf_nodes=100, MSE: 2627.548
max_leaf_nodes=None, MSE: 2036.183

The key steps in this example are:

1. Generate a synthetic regression dataset
2. Split the data into train and test sets
3. Train ExtraTreesRegressor models with different max_leaf_nodes values
4. Evaluate the Mean Squared Error (MSE) of each model on the test set

Some tips and heuristics for setting max_leaf_nodes:

Start with a small value (e.g., 10) and increase it until performance plateaus
Use cross-validation to find the optimal value for your specific dataset
Consider the trade-off between model complexity and performance
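The cross-validation tip can be sketched with GridSearchCV. This is a minimal example under stated assumptions: the candidate grid, the 5-fold split, the reduced dataset size and the scoring choice are all illustrative and should be adapted to your own data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Smaller synthetic dataset so the search runs quickly (illustrative only)
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Candidate values to compare; None means no limit on leaf nodes
param_grid = {"max_leaf_nodes": [10, 50, 100, None]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
)
search.fit(X, y)

best_value = search.best_params_["max_leaf_nodes"]
print(f"Best max_leaf_nodes: {best_value}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

Because the search scores every candidate on held-out folds, it gives a less optimistic estimate than tuning against a single test split.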

Issues to consider:

Smaller max_leaf_nodes values can help prevent overfitting but may underfit complex datasets
Larger values allow for more complex trees but may lead to overfitting on smaller datasets
The optimal value depends on the size, dimensionality, and complexity of your dataset
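One way to check for the underfitting and overfitting issues listed above is to compare training and test error for each setting. The sketch below reuses the synthetic dataset from the main example. The candidate values and the results dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Map each max_leaf_nodes value to its (train MSE, test MSE) pair
results = {}
for max_nodes in [10, 100, None]:
    model = ExtraTreesRegressor(
        n_estimators=100, max_leaf_nodes=max_nodes, random_state=42
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[max_nodes] = (train_mse, test_mse)
    print(f"max_leaf_nodes={max_nodes}, train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

High error on both sets suggests the cap is too tight and the model underfits. A training error far below the test error suggests the trees are memorizing the training data.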
