The `bootstrap` parameter in scikit-learn’s `ExtraTreesClassifier` determines whether bootstrap samples are used when building trees.
Extra Trees, short for Extremely Randomized Trees, is an ensemble learning method similar to Random Forests. It builds multiple decision trees and aggregates their predictions to improve overall performance and reduce overfitting.
The `bootstrap` parameter controls whether individual trees are trained on bootstrap samples (random samples with replacement) of the training data. When `True`, each tree uses a random subset of the data, introducing more diversity among trees. When `False`, the whole dataset is used for each tree.
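To make the sampling idea concrete, here is a minimal sketch (not scikit-learn's internal implementation) that draws a bootstrap sample with NumPy; the tiny dataset and seed are arbitrary, and it simply shows that sampling with replacement repeats some rows and leaves others out:

```python
import numpy as np

# Minimal illustration of a bootstrap sample: draw row indices
# with replacement from a made-up dataset of 10 rows.
rng = np.random.default_rng(42)
X = np.arange(10).reshape(10, 1)

indices = rng.integers(0, len(X), size=len(X))  # indices drawn with replacement
bootstrap_sample = X[indices]

print("Sampled indices:", indices)
print("Distinct rows used:", np.unique(indices).size, "of", len(X))
```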
The default value for `bootstrap` is `False` in `ExtraTreesClassifier`.
In practice, both `True` and `False` are commonly used, depending on the specific problem and dataset characteristics.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different bootstrap values
bootstrap_values = [False, True]
accuracies = []

for bootstrap in bootstrap_values:
    etc = ExtraTreesClassifier(n_estimators=100, random_state=42, bootstrap=bootstrap)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"bootstrap={bootstrap}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
bootstrap=False, Accuracy: 0.845
bootstrap=True, Accuracy: 0.825
```
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `ExtraTreesClassifier` models with different `bootstrap` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting the `bootstrap` parameter:
- Use `bootstrap=False` (the default) for smaller datasets or when you want to ensure all data points are used in each tree
- Try `bootstrap=True` for larger datasets or when you want to introduce more randomness and diversity among trees
- Compare performance with both options to determine which works best for your specific problem, as in the sketch after this list
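As one way to run that comparison, the sketch below scores both settings with 5-fold cross-validation on the same kind of synthetic data used earlier; the fold count and scoring metric are illustrative choices, not a prescribed recipe:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, mirroring the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Score both bootstrap settings with 5-fold cross-validation
for bootstrap in [False, True]:
    model = ExtraTreesClassifier(n_estimators=100, random_state=42, bootstrap=bootstrap)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"bootstrap={bootstrap}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```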
Issues to consider:
- Using `bootstrap=True` may lead to better generalization on some datasets, but can also increase training time
- The impact of bootstrapping can vary depending on the size and characteristics of your dataset
- Bootstrapping interacts with other parameters like `max_samples`, which controls the size of the bootstrap sample when `bootstrap=True` (see the sketch below)
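For reference, here is a brief sketch of that interaction, again using illustrative synthetic data: `max_samples` is only honored when `bootstrap=True`, and in this sketch each tree trains on a bootstrap sample of roughly half the rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Illustrative synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# max_samples only takes effect when bootstrap=True; here each tree
# is trained on a bootstrap sample of about 50% of the rows.
etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, max_samples=0.5,
                           random_state=42)
etc.fit(X, y)
print("Training accuracy:", etc.score(X, y))
```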