The criterion parameter in scikit-learn’s DecisionTreeRegressor determines the function used to measure the quality of a split at each node of the tree.
It supports the options “squared_error” (formerly “mse”) for mean squared error, “friedman_mse” for mean squared error with Friedman’s improvement score, “absolute_error” (formerly “mae”) for mean absolute error, and, in recent scikit-learn versions, “poisson” for Poisson deviance. The old “mse” and “mae” aliases were deprecated and have since been removed.
The default value for criterion is “squared_error”, which is generally a good choice for most regression problems. “absolute_error” can be more robust to outliers in the data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criterion_values = ["squared_error", "friedman_mse", "absolute_error"]
mse_scores = []
mae_scores = []
for criterion in criterion_values:
    dt = DecisionTreeRegressor(criterion=criterion, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse_scores.append(mse)
    mae_scores.append(mae)
    print(f"criterion={criterion}, MSE: {mse:.3f}, MAE: {mae:.3f}")
Running the example gives an output like:
criterion=squared_error, MSE: 481.286, MAE: 17.384
criterion=friedman_mse, MSE: 480.421, MAE: 17.443
criterion=absolute_error, MSE: 566.492, MAE: 18.470
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train DecisionTreeRegressor models with different criterion values
- Evaluate the mean squared error and mean absolute error of each model on the test set
Some tips and heuristics for setting criterion:
- “squared_error” is generally a good default choice for most regression problems
- “absolute_error” can be more robust to outliers in the data, but is typically slower to train than the squared-error criteria
- “friedman_mse” uses Friedman’s improvement score, a slightly modified version of the mean squared error criterion originally proposed for gradient boosting
Issues to consider:
- The optimal choice of criterion may depend on the specific characteristics of the dataset and problem
- There can be trade-offs between different criteria, such as “squared_error” penalizing large errors more heavily while “absolute_error” is less sensitive to outliers