The `bootstrap` parameter in scikit-learn's `ExtraTreesRegressor` determines whether bootstrap samples are used when building trees.
Extra Trees Regressor is an ensemble method that builds a collection of unpruned decision trees and uses averaging to improve prediction accuracy and control over-fitting.
The `bootstrap` parameter controls whether individual trees are trained on bootstrap samples (random samples drawn with replacement) of the training data. When `True`, the sampling introduces additional randomness and can help reduce overfitting. When `False`, the whole training dataset is used to build each tree.
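To make the sampling distinction concrete, here is a small illustrative sketch (using NumPy directly, not part of scikit-learn's API) of what a bootstrap sample of row indices looks like compared with using every row:

```python
import numpy as np

rng = np.random.default_rng(42)
indices = np.arange(10)  # stand-in for the row indices of a training set

# bootstrap=True: each tree is fit on a sample drawn with replacement,
# so some rows appear more than once and others are left out
bootstrap_sample = rng.choice(indices, size=indices.size, replace=True)
print("bootstrap sample:", np.sort(bootstrap_sample))
print("unique rows seen:", np.unique(bootstrap_sample).size, "of", indices.size)

# bootstrap=False: every tree is fit on the full set of rows
print("full dataset:   ", indices)
```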
The default value for `bootstrap` is `False` in `ExtraTreesRegressor`.
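You can confirm the default programmatically with `get_params` (a minimal check, assuming scikit-learn is installed):

```python
from sklearn.ensemble import ExtraTreesRegressor

# The default configuration trains each tree on the whole dataset
print(ExtraTreesRegressor().get_params()["bootstrap"])  # False
```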
In practice, both `True` and `False` are commonly used, depending on the specific dataset and problem at hand. The example below compares the two settings on a synthetic regression dataset.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different bootstrap values
bootstrap_values = [False, True]
mse_scores = []
training_times = []

for bootstrap in bootstrap_values:
    start_time = time.time()
    etr = ExtraTreesRegressor(n_estimators=100, random_state=42, bootstrap=bootstrap)
    etr.fit(X_train, y_train)
    training_time = time.time() - start_time

    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    mse_scores.append(mse)
    training_times.append(training_time)

    print(f"bootstrap={bootstrap}, MSE: {mse:.3f}, Training Time: {training_time:.3f} seconds")
```
Running the example gives an output like:
```
bootstrap=False, MSE: 2036.183, Training Time: 0.223 seconds
bootstrap=True, MSE: 2122.805, Training Time: 0.198 seconds
```
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `ExtraTreesRegressor` models with `bootstrap` set to `False` and `True`
- Evaluate the mean squared error of each model on the test set
- Compare the performance and training time for each configuration
Some tips and heuristics for setting `bootstrap`:
- Use `bootstrap=True` when you want to introduce more randomness and potentially reduce overfitting
- Use `bootstrap=False` when you want each tree to see all the data, which may be beneficial for smaller datasets
- Experiment with both options and use cross-validation to determine which works best for your specific problem (see the sketch after this list)
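As a minimal sketch of that last tip, `cross_val_score` can compare both settings on the same synthetic dataset used above; the exact numbers will vary with your data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

for bootstrap in (False, True):
    etr = ExtraTreesRegressor(n_estimators=100, random_state=42, bootstrap=bootstrap)
    # 5-fold cross-validation; scikit-learn reports negated MSE, so flip the sign
    scores = cross_val_score(etr, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"bootstrap={bootstrap}, mean CV MSE: {-scores.mean():.3f}")
```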
Issues to consider:
- Using `bootstrap=True` may increase training time due to the sampling process
- The effect of bootstrapping can vary depending on the size and characteristics of your dataset
- When `bootstrap=True`, roughly 63.2% of the original samples are included in each bootstrap sample, which affects the diversity of trees in the ensemble (the calculation below shows where this figure comes from)
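The 63.2% figure comes from the probability that a given row is drawn at least once in n draws with replacement, 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 for large n. A quick numerical check (a sketch, not required to use the parameter):

```python
import numpy as np

n = 1000  # number of training samples
# Probability that a given sample appears at least once in a bootstrap sample
analytic = 1 - (1 - 1 / n) ** n
print(f"analytic fraction:  {analytic:.3f}")  # ~0.632

# Empirical check: average fraction of unique rows across simulated bootstrap samples
rng = np.random.default_rng(0)
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(100)]
print(f"simulated fraction: {np.mean(fractions):.3f}")  # also ~0.632
```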