The `ccp_alpha` parameter in scikit-learn's `DecisionTreeClassifier` controls the complexity of the tree via minimal cost-complexity pruning.
Pruning reduces the size of the decision tree by removing branches that provide little power to classify instances. This helps to prevent overfitting and can improve the model’s generalization performance on unseen data.
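To see this size reduction directly, you can compare leaf counts with and without pruning. Below is a minimal sketch on a synthetic dataset (the same one used in the full example further down); the value 0.01 is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# An unpruned tree keeps splitting until its leaves are (nearly) pure...
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)
# ...while even a small ccp_alpha (0.01 here, an arbitrary choice)
# removes many weak branches
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=0.01).fit(X, y)

print(f"Leaves without pruning: {unpruned.get_n_leaves()}")
print(f"Leaves with ccp_alpha=0.01: {pruned.get_n_leaves()}")
```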
The `ccp_alpha` parameter sets the complexity parameter α used during pruning: the subtree with the largest cost complexity that is still smaller than α will be chosen. The default value is 0.0, which means no pruning is performed.
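For intuition, minimal cost-complexity pruning scores a tree T with the cost-complexity measure R_α(T) = R(T) + α|T̃|, where R(T) is the total impurity of the leaves of T and |T̃| is the number of leaves. Increasing α penalizes larger trees more heavily, so higher `ccp_alpha` values produce smaller trees. (This is the standard definition, as used in the scikit-learn documentation.)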
In practice, `ccp_alpha` values between 0.0 and 0.1 are common, depending on the complexity of the dataset and how aggressively you want to guard against overfitting. The example below sweeps several values in this range:
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train and evaluate a tree for each ccp_alpha value
ccp_alphas = [0.0, 0.01, 0.05, 0.1]
accuracies = []
for ccp_alpha in ccp_alphas:
    dt = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"ccp_alpha={ccp_alpha}, Accuracy: {accuracy:.3f}")
```
The output will look similar to:

```
ccp_alpha=0.0, Accuracy: 0.875
ccp_alpha=0.01, Accuracy: 0.855
ccp_alpha=0.05, Accuracy: 0.735
ccp_alpha=0.1, Accuracy: 0.735
```
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train `DecisionTreeClassifier` models with different `ccp_alpha` values
- Evaluate the accuracy of each model on the test set and print the results
Tips and heuristics for setting `ccp_alpha`:
- Start with the default value of 0.0 and increase it incrementally to find the best value; scikit-learn can also compute the candidate alphas for you (see the sketch after this list)
- Higher values lead to more pruning, which can improve generalization
- Values that are too high may lead to underfitting
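Rather than guessing candidate values, `DecisionTreeClassifier.cost_complexity_pruning_path` returns the exact alphas at which the pruned tree changes. A minimal sketch, reusing the synthetic dataset and split from the example above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset and split as in the example above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Compute the sequence of effective alphas for the training set; each
# value corresponds to a distinct pruned subtree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)
print(path.ccp_alphas)
```

These alphas make a natural grid for the loop in the example above, instead of hand-picked values like 0.01 or 0.05.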
Issues to consider:
- The optimal `ccp_alpha` value depends on the specific dataset and problem, so it is worth tuning rather than fixing by hand (see the sketch after this list)
- It's important to balance the bias-variance tradeoff when pruning
- Pruning may not always improve performance, especially for small trees
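One practical way to address these points, sketched below under the same synthetic-data assumptions as above, is to cross-validate over the alphas produced by `cost_complexity_pruning_path` and let the data pick the value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# Candidate alphas: the points where the pruned tree actually changes;
# the clip guards against tiny negative values from floating-point error
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
candidate_alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# 5-fold cross-validation selects the alpha with the best mean accuracy
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
)
search.fit(X, y)
print(f"Best ccp_alpha: {search.best_params_['ccp_alpha']:.4f}")
```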