
Scikit-Learn "ExtraTreesRegressor" versus "RandomForestRegressor"

ExtraTreesRegressor and RandomForestRegressor are both ensemble methods used for regression tasks, but they differ in their approach to randomization and averaging.

ExtraTreesRegressor (Extremely Randomized Trees) draws split thresholds at random for each candidate feature, rather than searching for the optimal threshold, and averages the predictions of all trees. Key hyperparameters include n_estimators (number of trees), max_features (number of features to consider when looking for the best split), and min_samples_split (minimum number of samples required to split an internal node).
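As a quick illustration, the hyperparameters above can be set directly in the constructor. The specific values here are arbitrary examples for demonstration, not tuned recommendations:

```python
from sklearn.ensemble import ExtraTreesRegressor

# Illustrative hyperparameter values, not tuned recommendations
etr = ExtraTreesRegressor(
    n_estimators=200,     # number of trees in the ensemble
    max_features="sqrt",  # features considered at each split
    min_samples_split=4,  # minimum samples required to split a node
    random_state=0,
)
print(etr.get_params()["n_estimators"])
```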

RandomForestRegressor combines bagging (bootstrap sampling of the training data) with decision trees and per-split feature randomness to improve accuracy and reduce overfitting. Key hyperparameters include n_estimators, max_features, and bootstrap (whether bootstrap samples are used when building trees).
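Because RandomForestRegressor bootstraps by default, it can also report an out-of-bag score, an estimate of generalization performance computed from the samples each tree did not see. A minimal sketch (the hyperparameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

rfr = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,   # each tree is trained on a bootstrap sample
    oob_score=True,   # requires bootstrap=True
    random_state=0,
)
rfr.fit(X, y)
print(f"Out-of-bag R2: {rfr.oob_score_:.3f}")
```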

The main difference is that ExtraTreesRegressor introduces more randomness than RandomForestRegressor: it splits nodes at random thresholds and, by default, trains each tree on the full dataset rather than a bootstrap sample. This extra randomness tends to reduce variance (often at the cost of a slight increase in bias), while RandomForestRegressor's bootstrapping can make it more robust to noisy training data.
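The difference in defaults is easy to verify: in scikit-learn, ExtraTreesRegressor ships with bootstrap=False while RandomForestRegressor ships with bootstrap=True.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Default bootstrap settings differ between the two ensembles
print(ExtraTreesRegressor().get_params()["bootstrap"])    # False: full dataset per tree
print(RandomForestRegressor().get_params()["bootstrap"])  # True: bootstrap sample per tree
```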

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate ExtraTreesRegressor
etr = ExtraTreesRegressor(n_estimators=100, random_state=42)
etr.fit(X_train, y_train)
y_pred_etr = etr.predict(X_test)
print(f"ExtraTreesRegressor MSE: {mean_squared_error(y_test, y_pred_etr):.3f}")
print(f"ExtraTreesRegressor R2: {r2_score(y_test, y_pred_etr):.3f}")

# Fit and evaluate RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
rfr.fit(X_train, y_train)
y_pred_rfr = rfr.predict(X_test)
print(f"RandomForestRegressor MSE: {mean_squared_error(y_test, y_pred_rfr):.3f}")
print(f"RandomForestRegressor R2: {r2_score(y_test, y_pred_rfr):.3f}")

Running the example gives an output like:

ExtraTreesRegressor MSE: 6014.584
ExtraTreesRegressor R2: 0.845
RandomForestRegressor MSE: 7057.104
RandomForestRegressor R2: 0.818
The example works through the following steps:

  1. Generate a synthetic regression dataset using make_regression with 1000 samples and 20 features.
  2. Split the data into training and test sets using train_test_split.
  3. Instantiate ExtraTreesRegressor with 100 trees, fit it on the training data, and evaluate its performance on the test set using MSE and R-squared.
  4. Instantiate RandomForestRegressor with 100 trees, fit it on the training data, and evaluate its performance on the test set using MSE and R-squared.
  5. Compare the performance metrics of both models to understand their differences in handling regression tasks.
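A single train/test split can give a noisy comparison. As a follow-up, the same comparison can be sketched with 5-fold cross-validation via cross_val_score for a more stable estimate (using the same illustrative settings as above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

for model in (ExtraTreesRegressor(n_estimators=100, random_state=42),
              RandomForestRegressor(n_estimators=100, random_state=42)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{model.__class__.__name__}: mean R2 = {scores.mean():.3f}")
```

Averaging over folds makes it clearer whether one model's advantage is consistent or an artifact of a particular split.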


See Also