Configure ExtraTreesClassifier "max_samples" Parameter

The max_samples parameter in scikit-learn’s ExtraTreesClassifier controls how many samples are drawn from the training set to fit each base estimator. It only applies when bootstrap=True.

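ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees (extra-trees) and averages their predictions. By default (bootstrap=False) every tree is fit on the whole dataset. With bootstrap=True, each tree is fit on a sample drawn with replacement, and max_samples determines the size of that sample.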

Setting max_samples to a value less than 1.0 can increase the diversity of the trees in the forest, potentially improving the model’s ability to generalize. However, using too small a value may result in underfitting.

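The default value for max_samples is None. With bootstrap=True, this means each tree draws as many samples as the training set contains (X.shape[0]), with replacement, so a tree still sees only a subset of the distinct rows. An integer value sets an absolute number of samples, and a float in the interval (0.0, 1.0] sets a fraction of the training set.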
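To make the sizing rule concrete, the sketch below mirrors the rule given in the scikit-learn documentation for turning max_samples into a per-tree sample count. The helper name samples_per_tree is hypothetical (it is not part of scikit-learn), and the 800-row training set is an assumption chosen to match an 80/20 split of 1000 samples.

```python
def samples_per_tree(n_samples, max_samples=None):
    """Hypothetical helper: rows drawn per tree when bootstrap=True."""
    if max_samples is None:
        # Default: draw as many rows as the training set contains
        return n_samples
    if isinstance(max_samples, int):
        # Integer: absolute number of rows
        return max_samples
    # Float in (0.0, 1.0]: fraction of the training set, at least one row
    return max(round(n_samples * max_samples), 1)


n_train = 800  # assumed training set size
for value in [None, 0.5, 0.7, 0.9, 200]:
    print(f"max_samples={value}: {samples_per_tree(n_train, value)} rows per tree")
```

Because the rows are drawn with replacement, the number of distinct rows a tree sees is usually smaller than the count reported here.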

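In practice, values between 0.5 and 1.0 are a reasonable starting range. The best choice depends on the dataset size and the desired trade-off between bias and variance, so it is worth validating a few candidates rather than relying on a fixed value.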
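One way to pick a value is cross-validation. The following is a minimal sketch using GridSearchCV. The candidate grid, the 5-fold setting, and the synthetic dataset are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic dataset (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Candidate values to compare; None draws X.shape[0] rows per tree
param_grid = {"max_samples": [None, 0.5, 0.7, 0.9]}

# bootstrap=True is needed for max_samples to have any effect
search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best max_samples:", search.best_params_["max_samples"])
print(f"Mean CV accuracy: {search.best_score_:.3f}")
```

Differences between candidates are often small, so it helps to compare the full search.cv_results_ table before committing to the top-ranked value.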

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_samples values (bootstrap=True is required for max_samples to apply)
max_samples_values = [None, 0.5, 0.7, 0.9]
accuracies = []

for samples in max_samples_values:
    etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, max_samples=samples, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_samples={samples}, Accuracy: {accuracy:.3f}")

Running the example gives an output like:

max_samples=None, Accuracy: 0.925
max_samples=0.5, Accuracy: 0.900
max_samples=0.7, Accuracy: 0.910
max_samples=0.9, Accuracy: 0.920

The key steps in this example are:

  1. Generate a synthetic classification dataset with informative and redundant features
  2. Split the data into train and test sets
  3. Train ExtraTreesClassifier models with different max_samples values
  4. Evaluate the accuracy of each model on the test set
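Since max_samples requires bootstrap=True, each tree leaves some rows out of its sample, and those out-of-bag rows can serve as a validation set without a separate split. The sketch below is one possible variation on the steps above, using the oob_score option. The dataset and candidate values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic dataset (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

oob_scores = {}
for samples in [None, 0.5, 0.7, 0.9]:
    # oob_score=True scores each row using only the trees that did not draw it
    etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, oob_score=True,
                               max_samples=samples, random_state=42)
    etc.fit(X, y)
    oob_scores[samples] = etc.oob_score_
    print(f"max_samples={samples}, OOB accuracy: {etc.oob_score_:.3f}")
```

Smaller max_samples values leave more rows out of bag per tree, so the out-of-bag estimate is averaged over more trees for each row.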



