
Configure RandomForestClassifier "ccp_alpha" Parameter

The ccp_alpha parameter in scikit-learn’s RandomForestClassifier controls the complexity of the decision trees in the ensemble via cost complexity pruning.

Cost complexity pruning is a technique that reduces the size of a decision tree by removing branches that contribute little to classifying instances. When ccp_alpha is set to a positive value, each tree in the forest is pruned by removing the subtrees with the smallest effective cost complexity first, resulting in smaller, less complex trees.
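
The candidate pruning strengths for a given tree can be inspected with the cost_complexity_pruning_path method of DecisionTreeClassifier. The sketch below is illustrative (the small dataset is not the one used in the main example) and prints the first few candidate alphas together with the total leaf impurity of the corresponding pruned subtrees:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Compute the pruning path of a single, fully grown tree
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X, y)

# Each ccp_alpha corresponds to one pruned subtree; larger alphas give smaller trees
print("Candidate ccp_alpha values:", path.ccp_alphas[:5])
print("Total leaf impurities:", path.impurities[:5])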

Pruning can help avoid overfitting, especially when dealing with noisy or high-dimensional datasets. However, setting ccp_alpha too high may lead to underfitting.
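
As a rough illustration of that trade-off, the sketch below (a separate, deliberately noisy dataset, not the one used in the main example) compares train and test accuracy with and without moderate pruning; the exact effect will vary with the data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic dataset (flip_y adds label noise), purely for illustration
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

# Compare the train/test gap with no pruning vs. moderate pruning
for ccp_alpha in [0.0, 0.01]:
    rf = RandomForestClassifier(n_estimators=100, ccp_alpha=ccp_alpha, random_state=7)
    rf.fit(X_train, y_train)
    print(f"ccp_alpha={ccp_alpha}: "
          f"train={rf.score(X_train, y_train):.3f}, test={rf.score(X_test, y_test):.3f}")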

The default value for ccp_alpha is 0.0, which means no pruning is performed.

In practice, small positive values, often between 0.001 and 0.1, are used; the appropriate value depends on the characteristics of the dataset and is best chosen by cross-validation.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different ccp_alpha values
ccp_alpha_values = [0.0, 0.001, 0.01, 0.1]
accuracies = []

for ccp_alpha in ccp_alpha_values:
    rf = RandomForestClassifier(n_estimators=100, ccp_alpha=ccp_alpha, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"ccp_alpha={ccp_alpha}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

ccp_alpha=0.0, Accuracy: 0.915
ccp_alpha=0.001, Accuracy: 0.915
ccp_alpha=0.01, Accuracy: 0.885
ccp_alpha=0.1, Accuracy: 0.490
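
Accuracy is essentially unchanged for the small ccp_alpha values but drops sharply at 0.1, a sign of underfitting caused by over-aggressive pruning. Even when accuracy is similar, pruning still shrinks the trees; the sketch below (an extra check, not part of the example above, reusing the same synthetic data) reports the average node count per tree:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic dataset as in the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Average tree size (number of nodes) for each ccp_alpha value
for ccp_alpha in [0.0, 0.001, 0.01, 0.1]:
    rf = RandomForestClassifier(n_estimators=100, ccp_alpha=ccp_alpha, random_state=42)
    rf.fit(X, y)
    avg_nodes = np.mean([est.tree_.node_count for est in rf.estimators_])
    print(f"ccp_alpha={ccp_alpha}, average nodes per tree: {avg_nodes:.1f}")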

The key steps in this example are:

  1. Generate a synthetic classification dataset with informative, redundant, and noise features
  2. Split the data into train and test sets
  3. Train RandomForestClassifier models with different ccp_alpha values
  4. Evaluate the accuracy of each model on the test set
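
To select ccp_alpha systematically rather than from a hand-picked list, cross-validation can be used. The sketch below uses GridSearchCV on the same synthetic data; the grid values are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Same synthetic dataset as in the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Search a small grid of ccp_alpha values with 5-fold cross-validation
param_grid = {'ccp_alpha': [0.0, 0.0005, 0.001, 0.005, 0.01]}
grid = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)

print("Best ccp_alpha:", grid.best_params_['ccp_alpha'])
print(f"Best cross-validation accuracy: {grid.best_score_:.3f}")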

Some tips and heuristics for setting ccp_alpha:

  1. Start from the default of 0.0 and increase gradually; useful values are usually much smaller than 1
  2. Tune the value with cross-validation (for example with GridSearchCV, as sketched above) rather than picking it by hand
  3. Use cost_complexity_pruning_path on a single DecisionTreeClassifier to get a sense of the range of alphas that actually change the tree

Issues to consider:

  1. Setting ccp_alpha too high removes informative splits and causes underfitting, as seen at ccp_alpha=0.1 in the example above
  2. Random forests already limit overfitting through bagging and feature subsampling, so aggressive pruning is often unnecessary
  3. The candidate alphas depend on the impurity scale of the data, so a value that works on one dataset will not necessarily transfer to another


See Also