The max_leaf_nodes parameter in scikit-learn’s DecisionTreeClassifier sets an upper bound on the number of leaf nodes in the tree. A leaf node is a node with no children. When max_leaf_nodes is set, the tree is grown best-first (the split with the largest impurity reduction is made next) until the cap is reached, which limits how far the tree can grow and how complex its decision boundaries can become.
Smaller values of max_leaf_nodes lead to simpler trees that are less likely to overfit but may underfit the data. Larger values allow more complex trees that can capture finer patterns but are more prone to overfitting.
The default value for max_leaf_nodes is None, which means there is no limit on the number of leaf nodes.
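As a quick illustration of the cap itself, the minimal sketch below (on a small made-up dataset, separate from the main example that follows) fits an unconstrained tree and a capped tree and reports their leaf counts with get_n_leaves():

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used only to illustrate the leaf-count cap
X, y = make_classification(n_samples=500, random_state=0)

# With the default max_leaf_nodes=None, the tree grows until other stopping rules apply
unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Unconstrained leaves:", unconstrained.get_n_leaves())

# With a cap, the tree is grown best-first and never exceeds that many leaves
capped = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
print("Capped leaves:", capped.get_n_leaves())  # at most 8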
In practice, this parameter is often tuned to find a good balance between model complexity and generalization performance. Useful values commonly range from a few dozen to a few hundred, depending on the size and nature of the dataset. The example below trains trees with several max_leaf_nodes settings on a synthetic multiclass dataset and compares their test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=7, random_state=42)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train trees with different max_leaf_nodes values
max_leaf_nodes_values = [None, 10, 50, 100, 200]
accuracies = []
for max_leaf_nodes in max_leaf_nodes_values:
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=42)
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_leaf_nodes={max_leaf_nodes}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_leaf_nodes=None, Accuracy: 0.785
max_leaf_nodes=10, Accuracy: 0.735
max_leaf_nodes=50, Accuracy: 0.780
max_leaf_nodes=100, Accuracy: 0.780
max_leaf_nodes=200, Accuracy: 0.780
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train decision trees with different max_leaf_nodes values
- Evaluate the accuracy of each tree on the test set
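To relate these steps back to the overfitting discussion, the same loop can also report training accuracy alongside test accuracy; a large gap between the two suggests the tree is overfitting. A minimal variation, reusing X_train, X_test, y_train, y_test and max_leaf_nodes_values from the example above:

# Reuses the data split and max_leaf_nodes_values from the example above
for max_leaf_nodes in max_leaf_nodes_values:
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    test_acc = accuracy_score(y_test, tree.predict(X_test))
    print(f"max_leaf_nodes={max_leaf_nodes}, train: {train_acc:.3f}, test: {test_acc:.3f}")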
Some tips and heuristics for setting max_leaf_nodes:
- Start with the default (no limit) and decrease the value until performance starts to degrade (a cross-validated version of this search is sketched after this list)
- Lower values create simpler trees that are less likely to overfit
- Higher values allow more complex trees but risk overfitting to noise in the training data
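One way to apply the first tip systematically is to search over a grid of candidate values with cross-validation rather than relying on a single train/test split. A minimal sketch using GridSearchCV, assuming the same X_train and y_train as in the example; the candidate values here are illustrative, not recommendations:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate caps, from no limit down to very small trees (illustrative values)
param_grid = {"max_leaf_nodes": [None, 200, 100, 50, 25, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best max_leaf_nodes:", search.best_params_["max_leaf_nodes"])
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")

Because refit=True by default, the tuned tree is available afterwards as search.best_estimator_ and can be evaluated once on the held-out test set.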
Issues to consider:
- The optimal value depends on the size and complexity of the dataset
- Setting the value too high can lead to overfitting, while too low can result in underfitting
- There are diminishing returns in performance and increased computational costs for very high values