The max_leaf_nodes parameter in scikit-learn’s DecisionTreeClassifier sets an upper bound on the number of leaf nodes in the tree. A leaf node is a node with no children. When max_leaf_nodes is set, the tree is grown best-first (the split with the largest impurity reduction is made next) until the cap is reached, which limits how far the tree can grow and how complex its decision boundaries can become.
Smaller values of max_leaf_nodes lead to simpler trees that are less likely to overfit but may underfit the data. Larger values allow more complex trees that can capture finer patterns but are more prone to overfitting.
The default value for max_leaf_nodes is None, which means there is no limit on the number of leaf nodes.
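As a quick illustration of the cap itself, the minimal sketch below (on a small made-up dataset, separate from the main example that follows) fits an unconstrained tree and a capped tree and reports their leaf counts with get_n_leaves():

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only to illustrate the leaf-count cap
X, y = make_classification(n_samples=500, random_state=0)

# With the default max_leaf_nodes=None, the tree grows until other stopping rules apply
unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Unconstrained leaves:", unconstrained.get_n_leaves())

# With a cap, the tree is grown best-first and never exceeds that many leaves
capped = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
print("Capped leaves:", capped.get_n_leaves())  # at most 8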
In practice, this parameter is often tuned to find a good balance between model complexity and generalization performance. Useful values commonly range from a few dozen to a few hundred, depending on the size and nature of the dataset. The example below trains trees with several max_leaf_nodes settings on a synthetic multiclass dataset and compares their test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=7, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train trees with different max_leaf_nodes values
max_leaf_nodes_values = [None, 10, 50, 100, 200]
accuracies = []
for max_leaf_nodes in max_leaf_nodes_values:
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_leaf_nodes={max_leaf_nodes}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_leaf_nodes=None, Accuracy: 0.785
max_leaf_nodes=10, Accuracy: 0.735
max_leaf_nodes=50, Accuracy: 0.780
max_leaf_nodes=100, Accuracy: 0.780
max_leaf_nodes=200, Accuracy: 0.780
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train decision trees with different max_leaf_nodes values
- Evaluate the accuracy of each tree on the test set
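To relate these steps back to the overfitting discussion, the same loop can also report training accuracy alongside test accuracy; a large gap between the two suggests the tree is overfitting. A minimal variation, reusing X_train, X_test, y_train, y_test and max_leaf_nodes_values from the example above:

# Reuses the data split and max_leaf_nodes_values from the example above
for max_leaf_nodes in max_leaf_nodes_values:
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    print(f"max_leaf_nodes={max_leaf_nodes}, train: {train_acc:.3f}, test: {test_acc:.3f}")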
Some tips and heuristics for setting max_leaf_nodes:
- Start with the default (no limit) and decrease the value until performance starts to degrade (a cross-validated version of this search is sketched after this list)
- Lower values create simpler trees that are less likely to overfit
- Higher values allow more complex trees but risk overfitting to noise in the training data
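One way to apply the first tip systematically is to search over a grid of candidate values with cross-validation rather than relying on a single train/test split. A minimal sketch using GridSearchCV, assuming the same X_train and y_train as in the example; the candidate values here are illustrative, not recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate caps, from no limit down to very small trees (illustrative values)
param_grid = {"max_leaf_nodes": [None, 200, 100, 50, 25, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best max_leaf_nodes:", search.best_params_["max_leaf_nodes"])
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")

Because refit=True by default, the tuned tree is available afterwards as search.best_estimator_ and can be evaluated once on the held-out test set.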
Issues to consider:
- The optimal value depends on the size and complexity of the dataset
- Setting the value too high can lead to overfitting, while too low can result in underfitting
- There are diminishing returns in performance and increased computational costs for very high values