Configure ExtraTreesRegressor "random_state" Parameter

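The random_state parameter in scikit-learn’s ExtraTreesRegressor seeds the random choices made while building each tree: the split thresholds drawn for candidate features, the subset of features considered at each split (when max_features is smaller than the number of features), and the bootstrap sample of data points (only when bootstrap=True, which is not the default).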

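ExtraTreesRegressor is an ensemble method that builds multiple decision trees and averages their predictions to improve performance and reduce overfitting. It’s similar to Random Forest, but it draws split thresholds at random for each candidate feature instead of searching for the best threshold, and by default it trains every tree on the full training set rather than on a bootstrap sample.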

The random_state parameter ensures reproducibility of results. When set to a fixed value, it guarantees that the model will produce the same results given the same input data and parameters.
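As a minimal sketch of that guarantee, the snippet below fits the model twice with the same seed and once with a different one, then compares predictions on held-out rows. The helper name fit_predict, the dataset sizes, and the seeds 7 and 8 are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset (sizes chosen only to keep the sketch fast)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_fit, y_fit, X_new = X[:150], y[:150], X[150:]

def fit_predict(seed):
    """Hypothetical helper: fit a fresh model with `seed`, predict on held-out rows."""
    model = ExtraTreesRegressor(n_estimators=20, random_state=seed)
    return model.fit(X_fit, y_fit).predict(X_new)

# Same seed, same data, same parameters: predictions should match exactly
same_seed_match = np.array_equal(fit_predict(7), fit_predict(7))

# A different seed builds different trees, so predictions should differ
different_seed_match = np.array_equal(fit_predict(7), fit_predict(8))

print(f"same seed identical: {same_seed_match}")
print(f"different seed identical: {different_seed_match}")
```

Predictions are compared on held-out rows because fully grown trees typically reproduce the training targets exactly, which would hide the difference between seeds.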

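The default value for random_state is None, which means the regressor draws from NumPy’s global random number generator, so each fit generally produces a different model unless that generator has itself been seeded. In practice, it’s common to set random_state to a fixed integer (e.g., 42) for reproducibility.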
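Besides None and an integer, random_state also accepts a numpy.random.RandomState instance. The sketch below illustrates how the None and instance forms behave. The helper name predict_with_global_seed and the specific seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_fit, y_fit, X_new = X[:150], y[:150], X[150:]

def predict_with_global_seed(global_seed):
    """Hypothetical helper: seed NumPy's global RNG, then fit with random_state=None."""
    np.random.seed(global_seed)
    model = ExtraTreesRegressor(n_estimators=20, random_state=None)
    return model.fit(X_fit, y_fit).predict(X_new)

# random_state=None reads from the global RNG, so seeding it makes fits repeatable
global_seed_repeatable = np.array_equal(
    predict_with_global_seed(1), predict_with_global_seed(1)
)

# A shared RandomState instance advances as it is consumed, so two models
# fitted one after the other with the same instance see different random draws
rng = np.random.RandomState(3)
first = ExtraTreesRegressor(n_estimators=20, random_state=rng).fit(X_fit, y_fit).predict(X_new)
second = ExtraTreesRegressor(n_estimators=20, random_state=rng).fit(X_fit, y_fit).predict(X_new)
shared_instance_identical = np.array_equal(first, second)

print(f"global seed repeatable: {global_seed_repeatable}")
print(f"shared instance identical: {shared_instance_identical}")
```

An integer is usually the simplest choice: it does not depend on global state and gives the same model on every call to fit.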

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different random_state values
random_state_values = [None, 0, 42, 100]
mse_scores = []

for rs in random_state_values:
    etr = ExtraTreesRegressor(n_estimators=100, random_state=rs)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"random_state={rs}, MSE: {mse:.4f}")

# Check variability with random_state=None
none_mse_scores = []
for _ in range(5):
    etr = ExtraTreesRegressor(n_estimators=100, random_state=None)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    none_mse_scores.append(mse)

print(f"random_state=None, MSE range: {min(none_mse_scores):.4f} - {max(none_mse_scores):.4f}")

Running the example gives an output like:

random_state=None, MSE: 1948.6322
random_state=0, MSE: 1916.8944
random_state=42, MSE: 2036.1826
random_state=100, MSE: 1906.2123
random_state=None, MSE range: 1886.5176 - 2018.3518

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train ExtraTreesRegressor models with different random_state values
Evaluate the Mean Squared Error (MSE) of each model on the test set
Demonstrate the variability of results when random_state=None

Some tips for setting random_state:

Use a fixed integer value for reproducibility in research or production environments
Keep the same random_state value across your entire pipeline for consistent results
Try several seeds (or random_state=None) during model development to check that performance does not depend on one particular seed
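One way to apply the pipeline-wide tip is to define the seed once and pass it to every component that accepts random_state. This is only a sketch: the constant SEED, the helper run_pipeline, and the use of KFold with cross_val_score are illustrative choices rather than part of the example above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # hypothetical project-wide seed; the value itself is arbitrary

def run_pipeline(seed):
    """Hypothetical helper: data generation, CV splitting, and model share one seed."""
    X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=seed)
    cv = KFold(n_splits=3, shuffle=True, random_state=seed)
    model = ExtraTreesRegressor(n_estimators=20, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")

first_run = run_pipeline(SEED)
second_run = run_pipeline(SEED)

# With every source of randomness seeded, the per-fold scores should repeat exactly
runs_match = np.array_equal(first_run, second_run)
print(f"per-fold scores identical across runs: {runs_match}")
```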

Issues to consider:

Different random_state values lead to slightly different models, so test scores vary from seed to seed
Tuning hyperparameters against a single random_state can select settings that only look good for that seed
When reporting results, consider using multiple random seeds and reporting average performance
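The last point can be sketched as follows: train one model per seed and report the mean and standard deviation of the test MSE. The seed list, dataset size, and number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

seeds = [0, 1, 2, 3, 4]  # arbitrary seeds; more seeds give a steadier estimate
mse_per_seed = []
for seed in seeds:
    model = ExtraTreesRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    mse_per_seed.append(mean_squared_error(y_test, model.predict(X_test)))

# Summarize seed-to-seed variation with the mean and sample standard deviation
mean_mse = np.mean(mse_per_seed)
std_mse = np.std(mse_per_seed, ddof=1)
print(f"MSE over {len(seeds)} seeds: {mean_mse:.4f} +/- {std_mse:.4f}")
```

The train/test split is held fixed here, so the spread reflects only the model’s own randomness. Varying the split seed as well would also capture sampling variation in the data.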
