The `bootstrap` parameter in scikit-learn's `ExtraTreesRegressor` determines whether bootstrap samples are used when building trees.
Extra Trees Regressor is an ensemble method that builds a collection of unpruned decision trees and uses averaging to improve prediction accuracy and control over-fitting.
The `bootstrap` parameter controls whether individual trees are trained on bootstrap samples (random samples drawn with replacement) of the training data. When `True`, the sampling introduces additional randomness and can help reduce overfitting. When `False`, the whole training dataset is used to build each tree.
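To make the sampling distinction concrete, here is a small illustrative sketch (using NumPy directly, not part of scikit-learn's API) of what a bootstrap sample of row indices looks like compared with using every row:

```python
import numpy as np

rng = np.random.default_rng(42)
indices = np.arange(10)  # stand-in for the row indices of a training set

# bootstrap=True: each tree is fit on a sample drawn with replacement,
# so some rows appear more than once and others are left out
bootstrap_sample = rng.choice(indices, size=indices.size, replace=True)
print("bootstrap sample:", np.sort(bootstrap_sample))
print("unique rows seen:", np.unique(bootstrap_sample).size, "of", indices.size)

# bootstrap=False: every tree is fit on the full set of rows
print("full dataset:   ", indices)
```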
The default value for `bootstrap` is `False` in `ExtraTreesRegressor`.
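You can confirm the default programmatically with `get_params` (a minimal check, assuming scikit-learn is installed):

```python
from sklearn.ensemble import ExtraTreesRegressor

# The default configuration trains each tree on the whole dataset
print(ExtraTreesRegressor().get_params()["bootstrap"])  # False
```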
In practice, both `True` and `False` are commonly used, depending on the specific dataset and problem at hand. The example below compares the two settings on a synthetic regression dataset.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different bootstrap values
bootstrap_values = [False, True]
mse_scores = []
training_times = []

for bootstrap in bootstrap_values:
    start_time = time.time()
    etr = ExtraTreesRegressor(n_estimators=100, random_state=42, bootstrap=bootstrap)
    etr.fit(X_train, y_train)
    training_time = time.time() - start_time

    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    mse_scores.append(mse)
    training_times.append(training_time)

    print(f"bootstrap={bootstrap}, MSE: {mse:.3f}, Training Time: {training_time:.3f} seconds")
```
Running the example gives an output like:
```
bootstrap=False, MSE: 2036.183, Training Time: 0.223 seconds
bootstrap=True, MSE: 2122.805, Training Time: 0.198 seconds
```
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `ExtraTreesRegressor` models with `bootstrap` set to `False` and `True`
- Evaluate the mean squared error of each model on the test set
- Compare the performance and training time for each configuration
Some tips and heuristics for setting `bootstrap`:
- Use `bootstrap=True` when you want to introduce more randomness and potentially reduce overfitting
- Use `bootstrap=False` when you want each tree to see all the data, which may be beneficial for smaller datasets
- Experiment with both options and use cross-validation to determine which works best for your specific problem (see the sketch after this list)
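As a minimal sketch of that last tip, `cross_val_score` can compare both settings on the same synthetic dataset used above; the exact numbers will vary with your data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

for bootstrap in (False, True):
    etr = ExtraTreesRegressor(n_estimators=100, random_state=42, bootstrap=bootstrap)
    # 5-fold cross-validation; scikit-learn reports negated MSE, so flip the sign
    scores = cross_val_score(etr, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"bootstrap={bootstrap}, mean CV MSE: {-scores.mean():.3f}")
```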
Issues to consider:
- Using `bootstrap=True` may increase training time due to the sampling process
- The effect of bootstrapping can vary depending on the size and characteristics of your dataset
- When `bootstrap=True`, roughly 63.2% of the original samples are included in each bootstrap sample, which affects the diversity of trees in the ensemble (the calculation below shows where this figure comes from)
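The 63.2% figure comes from the probability that a given row is drawn at least once in n draws with replacement, 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 for large n. A quick numerical check (a sketch, not required to use the parameter):

```python
import numpy as np

n = 1000  # number of training samples
# Probability that a given sample appears at least once in a bootstrap sample
analytic = 1 - (1 - 1 / n) ** n
print(f"analytic fraction:  {analytic:.3f}")  # ~0.632

# Empirical check: average fraction of unique rows across simulated bootstrap samples
rng = np.random.default_rng(0)
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(100)]
print(f"simulated fraction: {np.mean(fractions):.3f}")  # also ~0.632
```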