The `bootstrap` parameter in scikit-learn’s `RandomForestRegressor` controls whether bootstrap samples are used when building trees.
Bootstrapping is a resampling technique where each tree is trained on a random sample of the training data, drawn with replacement. This introduces randomness and diversity into the ensemble, which can help reduce overfitting.
By default, `bootstrap` is set to `True`, meaning that each tree is trained on a bootstrapped sample of the data. Setting `bootstrap=False` means that each tree is trained on the entire training dataset.
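To see what this resampling looks like, here is a minimal sketch in plain NumPy (independent of scikit-learn) that draws one bootstrap sample of row indices. On average such a sample contains roughly 63.2% of the distinct rows; the remainder are "out-of-bag" for that tree:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# One bootstrap sample: n row indices drawn from [0, n) with replacement
indices = rng.integers(0, n, size=n)

# Roughly 63.2% of distinct rows appear in a bootstrap sample on average;
# the remaining ~36.8% are "out-of-bag" for that tree
unique_fraction = np.unique(indices).size / n
print(f"Unique rows in bootstrap sample: {unique_fraction:.1%}")
```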
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       n_targets=1, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with bootstrap=True and bootstrap=False
rf_bootstrap = RandomForestRegressor(n_estimators=100, bootstrap=True,
                                     random_state=42)
rf_no_bootstrap = RandomForestRegressor(n_estimators=100, bootstrap=False,
                                        random_state=42)
rf_bootstrap.fit(X_train, y_train)
rf_no_bootstrap.fit(X_train, y_train)

# Evaluate performance on the test set
y_pred_bootstrap = rf_bootstrap.predict(X_test)
y_pred_no_bootstrap = rf_no_bootstrap.predict(X_test)
mse_bootstrap = mean_squared_error(y_test, y_pred_bootstrap)
mse_no_bootstrap = mean_squared_error(y_test, y_pred_no_bootstrap)
print(f"Bootstrap MSE: {mse_bootstrap:.3f}")
print(f"No Bootstrap MSE: {mse_no_bootstrap:.3f}")
```
Running the example gives an output like:

```
Bootstrap MSE: 208.093
No Bootstrap MSE: 458.141
```
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train `RandomForestRegressor` models with `bootstrap=True` and `bootstrap=False`
- Evaluate and compare the mean squared error of the models on the test set
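A side benefit of `bootstrap=True` is the out-of-bag (OOB) estimate: each tree can be evaluated on the rows left out of its bootstrap sample, giving a built-in validation score without a separate hold-out set (scikit-learn raises an error if `oob_score=True` is combined with `bootstrap=False`). A small sketch on the same kind of synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# oob_score=True is only valid together with bootstrap=True: each tree is
# scored on the rows that its bootstrap sample left out
rf = RandomForestRegressor(n_estimators=100, bootstrap=True,
                           oob_score=True, random_state=42)
rf.fit(X, y)
print(f"Out-of-bag R^2: {rf.oob_score_:.3f}")
```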
Some tips and heuristics for setting bootstrap:
- Bootstrapping generally improves performance by reducing variance, especially on smaller datasets
- Using `bootstrap=False` can be beneficial for large datasets where reducing training time is important
- If using `bootstrap=False`, it’s recommended to increase `n_estimators` to maintain performance
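Related to the training-time tip above: when `bootstrap=True`, the `max_samples` parameter (added in scikit-learn 0.22, and applicable only when bootstrapping is on) caps the size of each bootstrap sample, which is another way to cut per-tree training cost. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# max_samples applies only when bootstrap=True; here each tree is fit on a
# bootstrap sample of 50% of the rows instead of the full dataset
rf_small = RandomForestRegressor(n_estimators=100, bootstrap=True,
                                 max_samples=0.5, random_state=42)
rf_small.fit(X, y)
print(len(rf_small.estimators_))  # 100 trees, each fit on ~500 sampled rows
```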
Issues to consider:
- Bootstrapping adds a small resampling overhead during training, since each tree draws its own sample of the data (by default the same size as the training set)
- The randomness introduced by bootstrapping can slightly reduce the interpretability of feature importances
- The impact of bootstrapping may vary depending on the specific characteristics of the dataset
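The feature-importance point can be checked directly: refitting the forest with different seeds shows how much the bootstrap-driven randomness moves the importance values. A quick sketch, assuming the same synthetic data generator as above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# Refit with different seeds and see how much the importances move;
# bootstrap resampling is one source of this run-to-run variation
importances = np.array([
    RandomForestRegressor(n_estimators=50, bootstrap=True, random_state=seed)
    .fit(X, y)
    .feature_importances_
    for seed in range(3)
])
print("Max std across seeds:", importances.std(axis=0).max().round(3))
```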