The `ccp_alpha` parameter in scikit-learn's `RandomForestClassifier` controls the complexity of the decision trees in the ensemble via cost complexity pruning.
Cost complexity pruning is a technique that reduces the size of a decision tree by removing branches that provide little power to classify instances. By setting `ccp_alpha` to a non-zero value, the algorithm prunes the branches with the smallest cost complexity, resulting in smaller, less complex trees.
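Concretely, scikit-learn's minimal cost-complexity pruning (as described in its documentation) minimizes

$$R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|$$

where $R(T)$ is the total impurity of the tree's leaves and $|\widetilde{T}|$ is the number of leaves. A larger $\alpha$ penalizes extra leaves more heavily, so more of the tree is pruned away.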
Pruning can help avoid overfitting, especially when dealing with noisy or high-dimensional datasets. However, setting `ccp_alpha` too high may lead to underfitting.
The default value for `ccp_alpha` is 0.0, which means no pruning is performed. In practice, small positive values between 0.001 and 0.1 are commonly used, depending on the characteristics of the dataset.
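Rather than guessing candidate values, they can be derived from the data: `DecisionTreeClassifier` exposes a `cost_complexity_pruning_path` method that returns the effective alphas at which nodes of a single tree would be pruned. A minimal sketch (the dataset here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Build a synthetic dataset and compute the pruning path of a single tree
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)

# path.ccp_alphas lists the effective alphas, from weakest to strongest pruning
print(path.ccp_alphas[:5])   # smallest candidates (near 0.0)
print(path.ccp_alphas[-5:])  # largest candidates (heaviest pruning)
```

The complete example below compares a few fixed `ccp_alpha` values end to end: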
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different ccp_alpha values
ccp_alpha_values = [0.0, 0.001, 0.01, 0.1]
accuracies = []
for ccp_alpha in ccp_alpha_values:
    rf = RandomForestClassifier(n_estimators=100, ccp_alpha=ccp_alpha, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"ccp_alpha={ccp_alpha}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
ccp_alpha=0.0, Accuracy: 0.915
ccp_alpha=0.001, Accuracy: 0.915
ccp_alpha=0.01, Accuracy: 0.885
ccp_alpha=0.1, Accuracy: 0.490
```

Accuracy is essentially unchanged for small values, dips slightly at 0.01, and collapses at 0.1, where the heavily pruned trees underfit.
The key steps in this example are:
- Generate a synthetic classification dataset with informative, redundant, and noise features
- Split the data into train and test sets
- Train `RandomForestClassifier` models with different `ccp_alpha` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `ccp_alpha`:

- Start with the default value of 0.0 and increase it in small increments
- Increasing `ccp_alpha` leads to smaller, less complex trees (see the sketch after this list)
- Too high a value can lead to underfitting
- The optimal value depends on the specific characteristics of the dataset
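The shrinking effect of larger `ccp_alpha` values is easy to verify: a fitted forest exposes its individual trees via `estimators_`, and each tree reports its node count. A quick sketch, reusing `X_train` and `y_train` from the example above:

```python
from sklearn.ensemble import RandomForestClassifier

# Measure how the average tree size shrinks as ccp_alpha grows
for ccp_alpha in [0.0, 0.001, 0.01, 0.1]:
    rf = RandomForestClassifier(n_estimators=100, ccp_alpha=ccp_alpha,
                                random_state=42)
    rf.fit(X_train, y_train)
    avg_nodes = sum(t.tree_.node_count for t in rf.estimators_) / len(rf.estimators_)
    print(f"ccp_alpha={ccp_alpha}, average nodes per tree: {avg_nodes:.1f}")
```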
Issues to consider:

- Pruning trades off bias and variance in the model
- It may be necessary to tune `ccp_alpha` together with other parameters like `max_depth` to find the best balance (a joint search is sketched below)
- Pruning increases the training time of the model
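For joint tuning, a standard `GridSearchCV` works; the grid below is an illustrative sketch, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search ccp_alpha and max_depth together; the best trade-off is data-dependent
param_grid = {
    "ccp_alpha": [0.0, 0.001, 0.01],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid, cv=5, n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

Note that a grid search multiplies training cost, on top of the extra time pruning itself adds.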