The max_samples parameter in scikit-learn’s ExtraTreesClassifier controls the number of samples used to train each base estimator. ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees (extra-trees) on various sub-samples of the dataset, and max_samples determines the size of these sub-samples. Note that max_samples only takes effect when bootstrap=True; ExtraTreesClassifier defaults to bootstrap=False, in which case every tree is built on the full dataset.
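As a minimal sketch of turning the parameter on (the 0.5 here is an arbitrary illustration, not a recommendation):

from sklearn.ensemble import ExtraTreesClassifier

# max_samples requires bootstrap=True; ExtraTreesClassifier defaults to
# bootstrap=False, where each tree is built from the full training set
etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, max_samples=0.5)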
Setting max_samples to a value less than 1.0 can increase the diversity of the trees in the forest, potentially improving the model’s ability to generalize. However, using too small a value may result in underfitting.
The default value for max_samples is None; with bootstrap=True this means each tree’s bootstrap sample is the same size as the training set, and with the default bootstrap=False every tree simply uses all the samples. In practice, values between 0.5 and 1.0 are commonly used, depending on the dataset size and the desired trade-off between bias and variance. The example below compares test accuracy across several max_samples values:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_samples values (bootstrap=True is required for max_samples to apply)
max_samples_values = [None, 0.5, 0.7, 0.9]
accuracies = []
for samples in max_samples_values:
    etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, max_samples=samples, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_samples={samples}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_samples=None, Accuracy: 0.925
max_samples=0.5, Accuracy: 0.900
max_samples=0.7, Accuracy: 0.910
max_samples=0.9, Accuracy: 0.920
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different max_samples values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_samples:
- Start with the default value (None) and experiment with lower values, for example via cross-validation as sketched after this list
- Consider the size of your dataset; smaller datasets may benefit from higher max_samples values
- Balance between increasing tree diversity and retaining enough information for each tree
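One way to run that experiment is a cross-validated grid search. The sketch below reuses the synthetic dataset from above; the candidate values in param_grid are illustrative rather than recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# bootstrap=True is required, otherwise max_samples cannot be set
etc = ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=42)

# None means full-size bootstrap samples; the others are fractions of the training set
param_grid = {"max_samples": [None, 0.5, 0.7, 0.9]}
search = GridSearchCV(etc, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(f"Best max_samples: {search.best_params_['max_samples']}, "
      f"CV accuracy: {search.best_score_:.3f}")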
Issues to consider:
- Lower max_samples values can increase training time, as more trees may be needed for convergence
- Very low values might lead to underfitting, while values close to 1.0 may not provide enough diversity
- The optimal max_samples value often interacts with other parameters like n_estimators, as the sketch below illustrates
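To probe that interaction, a rough sketch (both grids are arbitrary choices for illustration) that cross-validates every combination of n_estimators and max_samples:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Score every (n_estimators, max_samples) pair with 5-fold cross-validation
for n_estimators in [50, 100, 200]:
    for max_samples in [0.5, 0.7, 0.9]:
        etc = ExtraTreesClassifier(n_estimators=n_estimators, bootstrap=True,
                                   max_samples=max_samples, random_state=42)
        scores = cross_val_score(etc, X, y, cv=5)
        print(f"n_estimators={n_estimators}, max_samples={max_samples}, "
              f"CV accuracy: {scores.mean():.3f}")

If smaller sub-samples do need more trees to converge, the gap between max_samples settings should narrow as n_estimators grows.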