The splitter parameter in scikit-learn’s DecisionTreeClassifier controls the strategy used to choose the split at each node when building the decision tree. Decision trees are a non-parametric supervised learning method used for classification and regression.
The default value for splitter is “best”, which evaluates all candidate splits and chooses the one that most reduces impurity (Gini impurity by default for classification; the regressor counterpart uses squared error). The alternative value is “random”, which draws a random threshold for each candidate feature and keeps the best of those random splits.
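To make the difference concrete, here is a conceptual sketch of the two strategies for a single feature. This is only an illustration of the idea, not scikit-learn’s actual implementation (which, among other things, evaluates midpoints between sorted feature values and searches over many features at once):

import numpy as np

# Conceptual sketch (not scikit-learn's implementation): "best" vs. "random"
# threshold selection for one feature at one node
rng = np.random.default_rng(0)
x = rng.normal(size=100)        # feature values at the node
y = (x > 0.3).astype(int)       # binary labels

def gini(labels):
    # Gini impurity of a set of binary labels
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_impurity(threshold):
    # Size-weighted Gini impurity of the two children created by a threshold
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

# "best": evaluate every candidate threshold, keep the lowest impurity
best_t = min(np.unique(x), key=split_impurity)
# "random": draw one threshold uniformly between the feature's min and max
random_t = rng.uniform(x.min(), x.max())

print(f"best threshold:   {best_t:.3f}, impurity {split_impurity(best_t):.3f}")
print(f"random threshold: {random_t:.3f}, impurity {split_impurity(random_t):.3f}")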
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different splitter values
splitter_values = ["best", "random"]
accuracies = []
for splitter in splitter_values:
    dt = DecisionTreeClassifier(splitter=splitter, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"splitter={splitter}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
splitter=best, Accuracy: 0.875
splitter=random, Accuracy: 0.815
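Your exact numbers may vary. Because “random” draws thresholds stochastically, its accuracy also depends on the seed; a quick sketch, reusing X_train, X_test, y_train, and y_test from the example above, shows how much it moves:

# Re-fit the "random" strategy with several seeds to see the spread in accuracy
for seed in range(5):
    dt = DecisionTreeClassifier(splitter="random", random_state=seed)
    dt.fit(X_train, y_train)
    acc = accuracy_score(y_test, dt.predict(X_test))
    print(f"random_state={seed}, Accuracy: {acc:.3f}")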
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train DecisionTreeClassifier models with different splitter values
- Evaluate the accuracy of each model on the test set
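A single train/test split can be noisy, so it is worth confirming the gap between the two strategies with cross-validation. A short sketch, reusing X and y from above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for each splitting strategy
for splitter in ["best", "random"]:
    dt = DecisionTreeClassifier(splitter=splitter, random_state=42)
    scores = cross_val_score(dt, X, y, cv=5)
    print(f"splitter={splitter}, CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")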
Some tips and heuristics for setting splitter:
- “best” is the default and often provides good results
- “random” evaluates far fewer candidate splits, so it can reduce training time on large datasets (see the timing sketch at the end of this section)
Issues to consider:
- “random” splits may lead to slightly lower accuracy than “best”, as in the example above
- the computational cost of “best” versus its accuracy benefit on large datasets; the timing sketch below makes this trade-off concrete
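To weigh that trade-off concretely, here is a rough timing sketch; the dataset size is an arbitrary choice for illustration, and absolute times will depend on your machine:

import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fit both strategies on a larger synthetic dataset and time each fit
X_big, y_big = make_classification(n_samples=100_000, n_features=50,
                                   n_informative=20, random_state=42)
for splitter in ["best", "random"]:
    dt = DecisionTreeClassifier(splitter=splitter, random_state=42)
    start = time.perf_counter()
    dt.fit(X_big, y_big)
    print(f"splitter={splitter}, fit time: {time.perf_counter() - start:.2f}s")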