The max_samples parameter in scikit-learn’s ExtraTreesClassifier controls the number of samples used to train each base estimator. ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees (extra-trees) on various sub-samples of the dataset, and max_samples determines the size of these sub-samples. Note that max_samples only takes effect when bootstrap=True; ExtraTreesClassifier defaults to bootstrap=False, in which case every tree is built on the full dataset.
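As a minimal sketch of turning the parameter on (the 0.5 here is an arbitrary illustration, not a recommendation):

from sklearn.ensemble import ExtraTreesClassifier

# max_samples requires bootstrap=True; ExtraTreesClassifier defaults to
# bootstrap=False, where each tree is built from the full training set
etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, max_samples=0.5)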
Setting max_samples to a value less than 1.0 can increase the diversity of the trees in the forest, potentially improving the model’s ability to generalize. However, using too small a value may result in underfitting.
The default value for max_samples is None; with bootstrap=True this means each tree’s bootstrap sample is the same size as the training set, and with the default bootstrap=False every tree simply uses all the samples. In practice, values between 0.5 and 1.0 are commonly used, depending on the dataset size and the desired trade-off between bias and variance. The example below compares test accuracy across several max_samples values:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_samples values (bootstrap=True is required for max_samples to apply)
max_samples_values = [None, 0.5, 0.7, 0.9]
accuracies = []
for samples in max_samples_values:
    etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, max_samples=samples, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_samples={samples}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_samples=None, Accuracy: 0.925
max_samples=0.5, Accuracy: 0.900
max_samples=0.7, Accuracy: 0.910
max_samples=0.9, Accuracy: 0.920
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different max_samples values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_samples:
- Start with the default value (None) and experiment with lower values, for example via cross-validation as sketched after this list
- Consider the size of your dataset; smaller datasets may benefit from higher max_samples values
- Balance between increasing tree diversity and retaining enough information for each tree
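One way to run that experiment is a cross-validated grid search. The sketch below reuses the synthetic dataset from above; the candidate values in param_grid are illustrative rather than recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# bootstrap=True is required, otherwise max_samples cannot be set
etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=42)

# None means full-size bootstrap samples; the others are fractions of the training set
param_grid = {"max_samples": [None, 0.5, 0.7, 0.9]}
search = GridSearchCV(etc, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(f"Best max_samples: {search.best_params_['max_samples']}, "
      f"CV accuracy: {search.best_score_:.3f}")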
Issues to consider:
- Lower max_samples values can increase training time, as more trees may be needed for convergence
- Very low values might lead to underfitting, while values close to 1.0 may not provide enough diversity
- The optimal max_samples value often interacts with other parameters like n_estimators, as the sketch below illustrates
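To probe that interaction, a rough sketch (both grids are arbitrary choices for illustration) that cross-validates every combination of n_estimators and max_samples:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Score every (n_estimators, max_samples) pair with 5-fold cross-validation
for n_estimators in [50, 100, 200]:
    for max_samples in [0.5, 0.7, 0.9]:
        etc = ExtraTreesClassifier(n_estimators=n_estimators, bootstrap=True,
                                   max_samples=max_samples, random_state=42)
        scores = cross_val_score(etc, X, y, cv=5)
        print(f"n_estimators={n_estimators}, max_samples={max_samples}, "
              f"CV accuracy: {scores.mean():.3f}")

If smaller sub-samples do need more trees to converge, the gap between max_samples settings should narrow as n_estimators grows.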