The max_leaf_nodes parameter in scikit-learn’s DecisionTreeRegressor controls the maximum number of leaf nodes in the tree.
Decision Tree Regression is a non-parametric supervised learning method that predicts the target variable by learning simple decision rules inferred from the data features. The max_leaf_nodes parameter caps the total number of leaf nodes the tree can have, which in turn limits the tree’s depth and complexity. When it is set, scikit-learn grows the tree in best-first fashion, expanding the nodes that give the largest reduction in impurity first.
Smaller values of max_leaf_nodes lead to simpler trees that are less likely to overfit but may underfit the data. Larger values allow more complex trees that can capture finer patterns but are prone to overfitting.
By default, there is no limit to the number of leaf nodes (max_leaf_nodes=None).
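To make the effect of the cap concrete, the following minimal sketch fits trees with a few illustrative caps and reports their size using the estimator's get_n_leaves() and get_depth() methods. The dataset settings and the cap values are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative settings)
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

leaf_counts = {}
for cap in [10, 50, None]:
    # cap=None means the number of leaves is unlimited
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=42)
    tree.fit(X, y)
    leaf_counts[cap] = tree.get_n_leaves()
    print(f"max_leaf_nodes={cap}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}")
```

A capped tree never exceeds its limit, while the unlimited tree keeps splitting until no further split is possible, which on continuous targets typically means far more leaves.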
In practice, the optimal value depends on the dataset and the trade-off between bias and variance. As a rough starting point, values between 10 and 100 are commonly tried.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 50, 100, None]
mse_scores = []
for max_leaf_nodes in max_leaf_nodes_values:
    dt = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_leaf_nodes={max_leaf_nodes}, MSE: {mse:.3f}")
Running the example gives an output like:
max_leaf_nodes=10, MSE: 8729.677
max_leaf_nodes=50, MSE: 6180.483
max_leaf_nodes=100, MSE: 5892.833
max_leaf_nodes=None, MSE: 6350.428
The key steps in this example are:
- Generate a synthetic regression dataset with relevant features and noise
- Split the data into train and test sets
- Train DecisionTreeRegressor models with different max_leaf_nodes values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting max_leaf_nodes:
- Start with the default (no limit) as a baseline, then try progressively smaller values until performance on held-out data starts to degrade
- Lower values can help prevent overfitting, especially on smaller datasets
- Higher values allow more complex trees that can capture finer patterns in the data
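One way to put these heuristics into practice is a cross-validated search over candidate values. The sketch below uses GridSearchCV for this. The candidate grid, the 5-fold setting, and the dataset are assumptions for illustration, not recommended defaults.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative settings)
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical candidate grid; None means no limit on the number of leaves
param_grid = {"max_leaf_nodes": [10, 25, 50, 100, 200, None]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)

best_cap = search.best_params_["max_leaf_nodes"]
# GridSearchCV refits the best model on the full training set by default
test_mse = mean_squared_error(y_test, search.predict(X_test))

print(f"Best max_leaf_nodes: {best_cap}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
print(f"Test MSE: {test_mse:.3f}")
```

Selecting the value by cross-validation on the training data keeps the test set untouched for the final evaluation.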
Issues to consider:
- The optimal value of max_leaf_nodes depends on the size and complexity of the dataset
- Setting it too low can lead to underfitting, while setting it too high can cause overfitting
- It’s important to find the right balance based on the bias-variance trade-off
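A simple way to see where a given setting sits on the bias-variance trade-off is to compare training and test error. The sketch below does this for a few illustrative caps. The dataset and the cap values are assumptions, and the exact numbers will vary with the data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative settings)
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for cap in [5, 50, None]:
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=42)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results[cap] = (train_mse, test_mse)
    print(f"max_leaf_nodes={cap}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

High error on both sets suggests the tree is too constrained (underfitting). A training error near zero alongside a much larger test error suggests the tree is memorizing the training data (overfitting).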