The criterion parameter in scikit-learn’s DecisionTreeRegressor determines the function used to measure the quality of a split at each node of the tree.
It supports the options “squared_error” (formerly “mse”) for mean squared error, “friedman_mse” for mean squared error with Friedman’s improvement score, “absolute_error” (formerly “mae”) for mean absolute error, and, in recent scikit-learn versions, “poisson” for Poisson deviance. The old “mse” and “mae” aliases were deprecated and have since been removed.
The default value for criterion is “squared_error”, which is generally a good choice for most regression problems. “absolute_error” can be more robust to outliers in the data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criterion_values = ["squared_error", "friedman_mse", "absolute_error"]
mse_scores = []
mae_scores = []
for criterion in criterion_values:
    dt = DecisionTreeRegressor(criterion=criterion, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse_scores.append(mse)
    mae_scores.append(mae)
    print(f"criterion={criterion}, MSE: {mse:.3f}, MAE: {mae:.3f}")
Running the example gives an output like:
criterion=squared_error, MSE: 481.286, MAE: 17.384
criterion=friedman_mse, MSE: 480.421, MAE: 17.443
criterion=absolute_error, MSE: 566.492, MAE: 18.470
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train DecisionTreeRegressor models with different criterion values
- Evaluate the mean squared error and mean absolute error of each model on the test set
Some tips and heuristics for setting criterion:
- “squared_error” is generally a good default choice for most regression problems
- “absolute_error” can be more robust to outliers in the data, but is typically slower to train than the squared-error criteria
- “friedman_mse” uses Friedman’s improvement score, a slightly modified version of the mean squared error criterion originally proposed for gradient boosting
Issues to consider:
- The optimal choice of criterion may depend on the specific characteristics of the dataset and problem
- There can be trade-offs between different criteria, such as “squared_error” penalizing large errors more heavily while “absolute_error” is less sensitive to outliers