The criterion parameter in scikit-learn’s RandomForestRegressor determines the function used to measure the quality of a split at each node of the trees in the ensemble.
Random Forest is an ensemble learning method that combines predictions from multiple decision trees. The criterion parameter controls how the splits are evaluated during the construction of these trees.
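The two options for criterion compared here are “squared_error” (the default) and “absolute_error”; recent scikit-learn versions also accept “friedman_mse” and “poisson”. “squared_error” selects splits that minimize the mean squared error (MSE), while “absolute_error” selects splits that minimize the mean absolute error (MAE).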
Both criteria are generally effective and often give similar results, but they can behave differently when the target contains outliers, because large errors dominate MSE far more than MAE.
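To make the difference concrete, here is a simplified sketch of how one candidate split could be scored under each criterion. It is not scikit-learn's internal implementation; the helper names mse_impurity, mae_impurity and split_gain are hypothetical, and the toy targets are made up for illustration.

```python
import numpy as np

def mse_impurity(y):
    # Mean squared deviation from the node mean (the node's variance)
    return float(np.mean((y - np.mean(y)) ** 2))

def mae_impurity(y):
    # Mean absolute deviation from the node median
    return float(np.mean(np.abs(y - np.median(y))))

def split_gain(y, left_mask, impurity):
    # Parent impurity minus the sample-weighted impurity of the two children
    left, right = y[left_mask], y[~left_mask]
    weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
    return impurity(y) - weighted

# Toy targets at one node, and a candidate split into two groups of three
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
left_mask = np.array([True, True, True, False, False, False])

mse_gain = split_gain(y, left_mask, mse_impurity)
mae_gain = split_gain(y, left_mask, mae_impurity)
print(f"MSE-based gain: {mse_gain:.3f}")
print(f"MAE-based gain: {mae_gain:.3f}")
```

In both cases a tree would prefer the candidate split with the largest gain; only the impurity measure changes.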
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different criterion values
criteria = ["squared_error", "absolute_error"]
results = []
for criterion in criteria:
    rf = RandomForestRegressor(n_estimators=100, criterion=criterion, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    results.append((criterion, mse, mae))
# Print results
for criterion, mse, mae in results:
    print(f"Criterion: {criterion}")
    print(f"Mean Squared Error: {mse:.3f}")
    print(f"Mean Absolute Error: {mae:.3f}\n")
Running the example produces output similar to the following (exact values may vary with the scikit-learn version):
Criterion: squared_error
Mean Squared Error: 2621.793
Mean Absolute Error: 40.177
Criterion: absolute_error
Mean Squared Error: 2604.009
Mean Absolute Error: 40.113
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train RandomForestRegressor models with different criterion values
- Evaluate the performance using MSE and MAE metrics
Some tips and heuristics for choosing the criterion:
- Both “squared_error” and “absolute_error” are usually effective
- “squared_error” minimizes MSE, which penalizes large errors more heavily
- “absolute_error” minimizes MAE, which may be more robust to outliers but is noticeably slower to train than “squared_error”
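The robustness point can be illustrated without any model at all. The value that minimizes squared error over a set of targets is their mean, while the value that minimizes absolute error is their median. The sketch below, using made-up target values, shows how far each one moves when a single outlier is added.

```python
import numpy as np

# Hypothetical targets falling in one leaf, then the same leaf with one outlier
targets = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.append(targets, 500.0)

# Shift in the squared-error minimizer (mean) caused by the outlier
mean_shift = abs(np.mean(with_outlier) - np.mean(targets))

# Shift in the absolute-error minimizer (median) caused by the outlier
median_shift = abs(np.median(with_outlier) - np.median(targets))

print(f"Mean shifts by:   {mean_shift:.2f}")
print(f"Median shifts by: {median_shift:.2f}")
```

The median barely moves while the mean is pulled strongly toward the outlier, which is the intuition behind preferring “absolute_error” on data with extreme target values.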
Issues to consider:
- The choice of criterion depends on the specific problem and data characteristics
- Differences between the two criteria may be subtle in many cases