The `splitter` parameter in scikit-learn's `DecisionTreeRegressor` controls the strategy used to split nodes when building the tree.

Decision Tree Regression is a non-parametric supervised learning algorithm that learns a hierarchy of if-then-else decision rules to predict a continuous target variable. The `splitter` parameter determines how the split is chosen at each node.
The `splitter` parameter can be set to either `"best"` or `"random"`. When set to `"best"`, the algorithm evaluates all candidate splits and chooses the one that most improves the criterion, such as mean squared error (MSE) or mean absolute error (MAE). When set to `"random"`, the algorithm draws a random threshold for each of up to `max_features` randomly selected features and keeps the best of those random splits.
The default value for `splitter` is `"best"`, which generally leads to better performance but may overfit on some datasets. Using `"random"` can help reduce overfitting by introducing randomness into the tree-building process, but it may also lower predictive performance.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different splitter values
splitter_values = ["best", "random"]
mse_scores = []

for splitter in splitter_values:
    dt = DecisionTreeRegressor(splitter=splitter, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"splitter='{splitter}', MSE: {mse:.3f}")
```
Running the example gives an output like:

```
splitter='best', MSE: 6350.428
splitter='random', MSE: 6464.066
```
The key steps in this example are:

- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train `DecisionTreeRegressor` models with different `splitter` values
- Evaluate the mean squared error of each model on the test set (a cross-validation variant is sketched below)
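Since a single train/test split can be noisy, a cross-validated comparison may be more reliable. The snippet below is an illustrative extension, not part of the original example; it reuses `X` and `y` from above and scores each `splitter` value with 5-fold cross-validation:

```python
from sklearn.model_selection import cross_val_score

# Sketch: compare splitters with 5-fold cross-validation
# (reuses X, y from the example above)
for splitter in ["best", "random"]:
    dt = DecisionTreeRegressor(splitter=splitter, random_state=42)
    scores = cross_val_score(dt, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"splitter='{splitter}', mean MSE: {-scores.mean():.3f}")
```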
Some tips and heuristics for setting `splitter`:

- Use the default `"best"` value for most cases, as it generally leads to better performance
- Consider using `"random"` if the tree is overfitting or if randomness is desired
- When using `"random"`, also tune the `max_features` parameter to control the number of features considered at each split (a tuning sketch follows this list)
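One way to act on the last tip is to tune `splitter` and `max_features` together with a grid search. This is a minimal sketch that assumes the `X_train` and `y_train` arrays from the example above; the candidate `max_features` values are illustrative, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV

# Sketch: jointly tune splitter and max_features
# (candidate values below are illustrative)
param_grid = {
    "splitter": ["best", "random"],
    "max_features": [None, "sqrt", "log2", 0.5],
}
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```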
Issues to consider:

- The optimal `splitter` value depends on the dataset and the problem
- Using `"random"` may require more trees (in an ensemble) or a larger `max_depth` to achieve good performance
- The randomness introduced by `"random"` can make the model less interpretable (the sketch below compares tree sizes)
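To see the interpretability cost concretely, one can compare the size of the trees grown under each setting. The following sketch reuses `X_train` and `y_train` from the example above:

```python
# Sketch: compare tree complexity under each splitter; random splits
# tend to produce deeper trees with more leaves on the same data
for splitter in ["best", "random"]:
    dt = DecisionTreeRegressor(splitter=splitter, random_state=42)
    dt.fit(X_train, y_train)
    print(f"splitter='{splitter}': depth={dt.get_depth()}, leaves={dt.get_n_leaves()}")
```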