The splitter parameter in scikit-learn’s DecisionTreeClassifier controls the strategy used to choose the split at each node when building the decision tree. Decision trees are a non-parametric supervised learning method used for classification and regression.
The default value for splitter is “best”, which evaluates all candidate splits and chooses the one that most reduces impurity (Gini impurity by default for classification; the regressor counterpart uses squared error). The alternative value is “random”, which draws a random threshold for each candidate feature and keeps the best of those random splits.
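To make the difference concrete, here is a conceptual sketch of the two strategies for a single feature. This is only an illustration of the idea, not scikit-learn’s actual implementation (which, among other things, evaluates midpoints between sorted feature values and searches over many features at once):

import numpy as np

# Conceptual sketch (not scikit-learn's implementation): "best" vs. "random"
# threshold selection for one feature at one node
rng = np.random.default_rng(0)
x = rng.normal(size=100)        # feature values at the node
y = (x > 0.3).astype(int)       # binary labels

def gini(labels):
    # Gini impurity of a set of binary labels
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_impurity(threshold):
    # Size-weighted Gini impurity of the two children created by a threshold
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

# "best": evaluate every candidate threshold, keep the lowest impurity
best_t = min(np.unique(x), key=split_impurity)
# "random": draw one threshold uniformly between the feature's min and max
random_t = rng.uniform(x.min(), x.max())

print(f"best threshold:   {best_t:.3f}, impurity {split_impurity(best_t):.3f}")
print(f"random threshold: {random_t:.3f}, impurity {split_impurity(random_t):.3f}")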
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different splitter values
splitter_values = ["best", "random"]
accuracies = []
for splitter in splitter_values:
    dt = DecisionTreeClassifier(splitter=splitter, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"splitter={splitter}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
splitter=best, Accuracy: 0.875
splitter=random, Accuracy: 0.815
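Your exact numbers may vary. Because “random” draws thresholds stochastically, its accuracy also depends on the seed; a quick sketch, reusing X_train, X_test, y_train, and y_test from the example above, shows how much it moves:

# Re-fit the "random" strategy with several seeds to see the spread in accuracy
for seed in range(5):
    dt = DecisionTreeClassifier(splitter="random", random_state=seed)
    dt.fit(X_train, y_train)
    acc = accuracy_score(y_test, dt.predict(X_test))
    print(f"random_state={seed}, Accuracy: {acc:.3f}")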
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train DecisionTreeClassifier models with different splitter values
- Evaluate the accuracy of each model on the test set
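A single train/test split can be noisy, so it is worth confirming the gap between the two strategies with cross-validation. A short sketch, reusing X and y from above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for each splitting strategy
for splitter in ["best", "random"]:
    dt = DecisionTreeClassifier(splitter=splitter, random_state=42)
    scores = cross_val_score(dt, X, y, cv=5)
    print(f"splitter={splitter}, CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")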
Some tips and heuristics for setting splitter:
- “best” is the default and often provides good results
- “random” evaluates far fewer candidate splits, so it can reduce training time on large datasets (see the timing sketch at the end of this section)
Issues to consider:
- “random” splits may lead to slightly lower accuracy than “best”, as in the example above
- the computational cost of “best” versus its accuracy benefit on large datasets; the timing sketch below makes this trade-off concrete
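To weigh that trade-off concretely, here is a rough timing sketch; the dataset size is an arbitrary choice for illustration, and absolute times will depend on your machine:

import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fit both strategies on a larger synthetic dataset and time each fit
X_big, y_big = make_classification(n_samples=100_000, n_features=50,
                                   n_informative=20, random_state=42)
for splitter in ["best", "random"]:
    dt = DecisionTreeClassifier(splitter=splitter, random_state=42)
    start = time.perf_counter()
    dt.fit(X_big, y_big)
    print(f"splitter={splitter}, fit time: {time.perf_counter() - start:.2f}s")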