Configure DecisionTreeRegressor "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s DecisionTreeRegressor controls the maximum number of leaf nodes in the tree.

Decision tree regression is a non-parametric supervised learning method that predicts the target variable by learning simple decision rules inferred from the data features. The max_leaf_nodes parameter caps the total number of leaves the tree can have. When it is set, scikit-learn grows the tree in best-first fashion, expanding the nodes with the largest impurity reduction first, so the cap also limits the tree’s depth and overall complexity.

Smaller values of max_leaf_nodes lead to simpler trees that are less likely to overfit but may underfit the data. Larger values allow more complex trees that can capture finer patterns but are prone to overfitting.

By default, there is no limit to the number of leaf nodes (max_leaf_nodes=None).
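
To make the effect of the cap concrete, the following minimal sketch fits a few trees and inspects their size with get_n_leaves() and get_depth(). The dataset size and the cap values (5, 20, None) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset; the sizes here are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

leaf_counts = {}
for cap in [5, 20, None]:
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=42)
    tree.fit(X, y)
    # Record how large the fitted tree actually became
    leaf_counts[cap] = tree.get_n_leaves()
    print(f"max_leaf_nodes={cap}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}")
```

A capped tree should never report more leaves than its cap, while the unrestricted tree typically grows far larger on continuous targets.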

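In practice, the optimal value depends on the dataset and the trade-off between bias and variance. Values between 10 and 100 are a common starting range, but the best choice should be confirmed on held-out data or with cross-validation.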

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 50, 100, None]
mse_scores = []

for max_leaf_nodes in max_leaf_nodes_values:
    dt = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_leaf_nodes={max_leaf_nodes}, MSE: {mse:.3f}")

Running the example gives an output like:

max_leaf_nodes=10, MSE: 8729.677
max_leaf_nodes=50, MSE: 6180.483
max_leaf_nodes=100, MSE: 5892.833
max_leaf_nodes=None, MSE: 6350.428

In this run, the test error falls as the cap rises from 10 to 100 and then increases again for the unrestricted tree, which is consistent with the unrestricted tree overfitting the training data.

The key steps in this example are:

Generate a synthetic regression dataset with relevant features and noise
Split the data into train and test sets
Train DecisionTreeRegressor models with different max_leaf_nodes values
Evaluate the mean squared error of each model on the test set

Some tips and heuristics for setting max_leaf_nodes:

Start with the default value (no limit) and decrease it until performance on held-out data starts to degrade
Lower values can help prevent overfitting, especially on smaller datasets
Higher values allow more complex trees that can capture finer patterns in the data
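
Rather than tuning by hand, the search over candidate values can be automated. The sketch below is one possible approach using GridSearchCV with 5-fold cross-validation; the candidate grid, the fold count, and the dataset size are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; sizes are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate caps to compare, including the default (None = no limit)
param_grid = {"max_leaf_nodes": [10, 25, 50, 100, None]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X, y)

best_value = search.best_params_["max_leaf_nodes"]
print(f"Best max_leaf_nodes: {best_value}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

The selected value is specific to this dataset and grid, so the search should be repeated on the data of interest.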

Issues to consider:

The optimal value of max_leaf_nodes depends on the size and complexity of the dataset
Setting it too low can lead to underfitting, while setting it too high can cause overfitting
It’s important to find the right balance based on the bias-variance trade-off
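
One way to see the underfitting and overfitting regimes described above is to compare training and test error for the same model. The following is a minimal sketch under assumed settings (the cap values 5, 50, and None are arbitrary): a large gap between training and test MSE suggests overfitting, while high error on both suggests underfitting.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; sizes are illustrative assumptions
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = []
for cap in [5, 50, None]:
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=42)
    tree.fit(X_train, y_train)
    # Error on the data the tree was fitted to versus unseen data
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results.append((cap, train_mse, test_mse))
    print(f"max_leaf_nodes={cap}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Training error should shrink as the cap is relaxed, so the test error is the number to watch when choosing a value.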
