The criterion parameter in scikit-learn's ExtraTreesRegressor determines the function used to measure the quality of a split.
ExtraTreesRegressor is an ensemble method that fits a number of randomized decision trees on various sub-samples of the dataset. It uses averaging to improve predictive accuracy and control over-fitting.
The criterion parameter affects how the algorithm chooses the best split at each node. It can impact both the model's performance and the structure of the trees.
The default value for criterion is "squared_error". Other options include "friedman_mse" and "absolute_error".
Each criterion option may perform differently depending on the specific characteristics of your dataset and the nature of the regression problem.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criterion_options = ["squared_error", "friedman_mse", "absolute_error"]
mse_scores = []
for criterion in criterion_options:
    etr = ExtraTreesRegressor(n_estimators=100, criterion=criterion, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"criterion={criterion}, MSE: {mse:.3f}")
Running the example gives an output like:
criterion=squared_error, MSE: 2036.183
criterion=friedman_mse, MSE: 2022.844
criterion=absolute_error, MSE: 1894.340
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different criterion values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting criterion:
- “squared_error” is a good default choice for many regression problems
- “friedman_mse” can be beneficial for datasets with heteroscedastic noise
- “absolute_error” may be preferable when dealing with outliers or non-Gaussian error distributions
Issues to consider:
- The best criterion depends on the specific characteristics of your dataset
- Different criteria may lead to different tree structures and prediction patterns
- The impact of the criterion choice may vary with other parameters like max_depth or min_samples_split