
Configure DecisionTreeRegressor "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s DecisionTreeRegressor controls the maximum number of leaf nodes in the tree.

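Decision Tree Regression is a non-parametric supervised learning method that predicts the target variable by learning simple decision rules inferred from the data features. The max_leaf_nodes parameter caps the total number of leaf nodes the tree can have, which indirectly limits the tree’s depth and complexity. When the parameter is set, scikit-learn grows the tree in best-first fashion, splitting the node that gives the largest impurity reduction until the limit is reached.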

Smaller values of max_leaf_nodes lead to simpler trees that are less likely to overfit but may underfit the data. Larger values allow more complex trees that can capture finer patterns but are prone to overfitting.
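
One way to see this trade-off is to compare training and test error for a small, a moderate, and an unlimited number of leaves. The sketch below is illustrative only: the dataset settings and the candidate values 5, 50, and None are assumptions chosen for the demonstration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative settings, assumed for this sketch)
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Map each max_leaf_nodes value to its (train MSE, test MSE) pair
results = {}
for max_leaf_nodes in [5, 50, None]:
    dt = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    dt.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, dt.predict(X_train))
    test_mse = mean_squared_error(y_test, dt.predict(X_test))
    results[max_leaf_nodes] = (train_mse, test_mse)
    print(f"max_leaf_nodes={max_leaf_nodes}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Training error should fall as more leaves are allowed, while test error need not follow. A large gap between the two for the unlimited tree is the usual sign of overfitting.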

By default, there is no limit to the number of leaf nodes (max_leaf_nodes=None).
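
The effect of the default can be checked directly with the fitted tree's get_n_leaves() and get_depth() methods. This is a minimal sketch on an assumed synthetic dataset; the cap of 8 leaves is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset (illustrative settings, assumed for this sketch)
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

# One tree with a cap on leaf nodes, one with the default (no limit)
capped = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)
unlimited = DecisionTreeRegressor(random_state=0).fit(X, y)

print("capped:    leaves =", capped.get_n_leaves(), "depth =", capped.get_depth())
print("unlimited: leaves =", unlimited.get_n_leaves(), "depth =", unlimited.get_depth())
```

The capped tree never exceeds the requested number of leaves, while the unlimited tree keeps splitting until other stopping criteria (such as min_samples_split) apply.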

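In practice, the optimal value depends on the dataset and the trade-off between bias and variance. Values between 10 and 100 are a common starting range, but the best setting is usually found by comparing candidates with cross-validation.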
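
A value can be selected by cross-validation, for example with GridSearchCV. The sketch below is one possible setup, not a prescribed procedure: the candidate grid, the 5-fold split, and the dataset settings are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative settings, assumed for this sketch)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate values to compare; None means no limit on leaf nodes
param_grid = {"max_leaf_nodes": [10, 25, 50, 100, None]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X, y)

print("best max_leaf_nodes:", search.best_params_["max_leaf_nodes"])
print(f"best cross-validated MSE: {-search.best_score_:.3f}")
```

Because the score is negated MSE, the sign is flipped before printing. The selected value is specific to this dataset and grid and should not be read as a general default.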

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 50, 100, None]
mse_scores = []

for max_leaf_nodes in max_leaf_nodes_values:
    dt = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_leaf_nodes={max_leaf_nodes}, MSE: {mse:.3f}")

Running the example gives an output like:

max_leaf_nodes=10, MSE: 8729.677
max_leaf_nodes=50, MSE: 6180.483
max_leaf_nodes=100, MSE: 5892.833
max_leaf_nodes=None, MSE: 6350.428

The key steps in this example are:

  1. Generate a synthetic regression dataset with relevant features and noise
  2. Split the data into train and test sets
  3. Train DecisionTreeRegressor models with different max_leaf_nodes values
  4. Evaluate the mean squared error of each model on the test set

Some tips and heuristics for setting max_leaf_nodes:

Issues to consider:


