Configure DecisionTreeClassifier "max_depth" Parameter

The max_depth parameter in scikit-learn’s DecisionTreeClassifier controls the maximum depth of the decision tree.

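Decision trees make predictions by recursively splitting the data based on feature values until a stopping criterion is met. The max_depth parameter sets an upper limit on the number of successive splits along any path from the root to a leaf, not on the total number of splits in the tree. Because every split is binary, a tree of depth d can have at most 2^d leaves.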
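As a rough illustration of how max_depth bounds the shape of the tree, the sketch below fits a few shallow trees and reports the depth and leaf count that each one actually reaches. It relies on the fitted estimator's get_depth() and get_n_leaves() methods. The dataset settings and the list of depths are assumptions chosen for illustration, and the 2 ** depth bound follows from every split being binary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; these settings are illustrative assumptions
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

shape_by_depth = []
for depth in [1, 2, 3, 4]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X, y)
    # get_depth() is the longest root-to-leaf path actually grown;
    # a binary tree of that depth cannot have more than 2 ** depth leaves
    shape_by_depth.append((depth, dt.get_depth(), dt.get_n_leaves()))
    print(f"max_depth={depth}: actual depth={dt.get_depth()}, "
          f"leaves={dt.get_n_leaves()}, leaf bound={2 ** depth}")
```

The fitted depth can be smaller than max_depth if the tree runs out of useful splits first, so the parameter is a ceiling rather than a target.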

By default, max_depth is set to None, allowing the tree to expand until all leaves are pure (contain samples from a single class) or contain fewer than min_samples_split samples. In practice, common values for max_depth range from 1 to 10, depending on the complexity of the dataset.
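To get a feel for what the default does, this minimal sketch fits one unrestricted tree and one capped tree on the same data and compares their size and training accuracy. The cap of 3 and the dataset settings are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; these settings are illustrative assumptions
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# max_depth=None (the default) places no limit on depth
full_tree = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X, y)
# An arbitrary cap for comparison
capped_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

for name, tree in [("max_depth=None", full_tree), ("max_depth=3", capped_tree)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"training accuracy={tree.score(X, y):.3f}")
```

The unrestricted tree will typically be much deeper and fit the training data far more closely, which is why its training accuracy alone says little about how well it generalizes.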

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_depth values
max_depth_values = [1, 3, 5, 10, None]
accuracies = []

for depth in max_depth_values:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_depth={depth}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

max_depth=1, Accuracy: 0.555
max_depth=3, Accuracy: 0.730
max_depth=5, Accuracy: 0.785
max_depth=10, Accuracy: 0.790
max_depth=None, Accuracy: 0.785

The key steps in this example are:

Generate a synthetic multiclass classification dataset
Split the data into train and test sets
Train DecisionTreeClassifier models with different max_depth values
Evaluate the accuracy of each model on the test set

Some tips and heuristics for setting max_depth:

Start with a small value and increase it until the performance plateaus
Use cross-validation to find the optimal value for your dataset
Consider the trade-off between model complexity and interpretability
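The cross-validation tip above can be sketched with GridSearchCV, which scores each candidate max_depth using k-fold cross-validation on the training set only. The candidate grid, the choice of 5 folds, and the accuracy scoring are assumptions for illustration and should be adapted to your data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; these settings are illustrative assumptions
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Candidate depths to compare; None means no depth limit
param_grid = {"max_depth": [1, 2, 3, 5, 7, 10, None]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best max_depth:", search.best_params_["max_depth"])
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
# The held-out test set is used only once, after the depth has been chosen
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

Selecting the depth on cross-validation folds rather than on the test set keeps the final test accuracy an honest estimate of performance on new data.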

Issues to consider:

Setting max_depth too low can lead to underfitting, where the model is too simple to capture the underlying patterns in the data
Setting max_depth too high can cause overfitting, where the model learns noise in the training data and fails to generalize well to new data
The optimal value for max_depth depends on the size and complexity of the dataset
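One simple way to spot the underfitting and overfitting described above is to compare training and test accuracy at each depth. This sketch reuses the same kind of synthetic data. The depth values are assumptions for illustration, and the size of the gap will vary from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; these settings are illustrative assumptions
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scores = []
for depth in [1, 3, 5, 10, None]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_acc = dt.score(X_train, y_train)
    test_acc = dt.score(X_test, y_test)
    scores.append((depth, train_acc, test_acc))
    # Low accuracy on both sets suggests underfitting;
    # a large train/test gap suggests overfitting
    print(f"max_depth={depth}: train={train_acc:.3f}, "
          f"test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

A depth where test accuracy has stopped improving while the gap keeps widening is usually a sign that further depth is only fitting noise.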
