Configure RandomForestRegressor "criterion" Parameter

The criterion parameter in scikit-learn’s RandomForestRegressor determines the function used to measure the quality of a split at each node of the trees in the ensemble.

Random Forest is an ensemble learning method that combines predictions from multiple decision trees. The criterion parameter controls how the splits are evaluated during the construction of these trees.

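The two most commonly used options for criterion are “squared_error” (default) and “absolute_error”; recent versions of scikit-learn also accept “friedman_mse” and “poisson”. “squared_error” scores a split by the reduction in mean squared error (MSE), which is equivalent to variance reduction, and predicts the mean of each leaf. “absolute_error” scores a split by the reduction in mean absolute error (MAE) and predicts the median of each leaf.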
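
To make the two criteria concrete, here is a small NumPy sketch of the quantities they measure at a single node. It is not scikit-learn's internal implementation; the function names (mse_impurity, mae_impurity, weighted_child_impurity) and the toy target values are hypothetical, chosen only for illustration.

```python
import numpy as np

def mse_impurity(y):
    """Mean squared deviation from the node mean (the "squared_error" view)."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

def mae_impurity(y):
    """Mean absolute deviation from the node median (the "absolute_error" view)."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(y - np.median(y))))

def weighted_child_impurity(left, right, impurity):
    """Impurity after a split: child impurities weighted by child size."""
    n_left, n_right = len(left), len(right)
    total = n_left * impurity(left) + n_right * impurity(right)
    return total / (n_left + n_right)

# Toy node targets containing one outlier (hypothetical values)
y_node = [1.0, 2.0, 3.0, 4.0, 100.0]
left, right = y_node[:2], y_node[2:]

for name, impurity in [("squared_error", mse_impurity), ("absolute_error", mae_impurity)]:
    parent = impurity(y_node)
    children = weighted_child_impurity(left, right, impurity)
    print(f"{name}: parent={parent:.3f}, after split={children:.3f}, "
          f"reduction={parent - children:.3f}")
```

A tree picks the split with the largest reduction under the chosen criterion. Note how the single outlier dominates the squared-error impurity far more than the absolute-error impurity.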

Both criteria usually give similar accuracy, but they differ in two practical ways: “absolute_error” is less sensitive to outliers in the target values, and it is significantly slower to train than “squared_error”.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different criterion values
criteria = ["squared_error", "absolute_error"]
results = []

for criterion in criteria:
    rf = RandomForestRegressor(n_estimators=100, criterion=criterion, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    results.append((criterion, mse, mae))

# Print results
for criterion, mse, mae in results:
    print(f"Criterion: {criterion}")
    print(f"Mean Squared Error: {mse:.3f}")
    print(f"Mean Absolute Error: {mae:.3f}\n")

Running the example produces output similar to the following; exact values may vary with the scikit-learn version:

Criterion: squared_error
Mean Squared Error: 2621.793
Mean Absolute Error: 40.177

Criterion: absolute_error
Mean Squared Error: 2604.009
Mean Absolute Error: 40.113

The key steps in this example are:

Generate a synthetic regression dataset with noise
Split the data into train and test sets
Train RandomForestRegressor models with different criterion values
Evaluate the performance using MSE and MAE metrics

Some tips and heuristics for choosing the criterion:

“squared_error” is a sensible default: it is fast to train and usually performs well
“squared_error” minimizes MSE, which penalizes large errors more heavily
“absolute_error” minimizes MAE, which is typically more robust to outliers in the target but is significantly slower to train

Issues to consider:

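The choice of criterion depends on which error metric matters for the problem and on how heavy-tailed the target values are
Differences between the two criteria are often small, as in the example output above, so compare them on your own data before committing to the slower “absolute_error”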

See Also