The `max_features` parameter in scikit-learn's `ExtraTreesClassifier` controls the number of features considered when looking for the best split.

`ExtraTreesClassifier` is an ensemble method that builds multiple randomized decision trees and combines their predictions. The `max_features` parameter introduces additional randomness into the tree-building process by limiting the number of features evaluated at each split.
Adjusting `max_features` affects the diversity of the trees in the ensemble. Lower values increase randomness and can help prevent overfitting, while higher values allow the algorithm to consider more features, potentially capturing more complex relationships.
The default value for `max_features` is `'sqrt'`, which considers the square root of the total number of features at each split. Common options include `'sqrt'`, `'log2'`, a fraction of the total features (e.g., `0.5`), or `None` to consider all features.
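To make these options concrete, here is a small sketch of how each setting resolves to a per-split feature count on a 20-feature dataset (the arithmetic mirrors what scikit-learn computes internally; the `resolved` dictionary is just for illustration):

```python
import math

n_features = 20

# How each max_features option maps to the number of features
# considered at each split (illustrative, not scikit-learn's code)
resolved = {
    "sqrt": max(1, int(math.sqrt(n_features))),  # sqrt(20) ~ 4.47 -> 4
    "log2": max(1, int(math.log2(n_features))),  # log2(20) ~ 4.32 -> 4
    0.5: max(1, int(0.5 * n_features)),          # half the features -> 10
    None: n_features,                            # all features -> 20
}

for option, count in resolved.items():
    print(f"max_features={option!r} -> {count} features per split")
```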
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = ['sqrt', 'log2', 0.5, 0.8, None]
accuracies = []
for mf in max_features_values:
    etc = ExtraTreesClassifier(n_estimators=100, max_features=mf, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
max_features=sqrt, Accuracy: 0.925
max_features=log2, Accuracy: 0.925
max_features=0.5, Accuracy: 0.930
max_features=0.8, Accuracy: 0.925
max_features=None, Accuracy: 0.930
```
The key steps in this example are:

- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train `ExtraTreesClassifier` models with different `max_features` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `max_features`:

- Start with the default `'sqrt'` and experiment with other values
- Lower values of `max_features` increase randomness and can help with high-dimensional data
- Higher values allow the model to consider more features, which may be beneficial for complex relationships
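One way to act on these tips is a small cross-validated search over candidate values rather than fixing one up front. A sketch using scikit-learn's `GridSearchCV` (the candidate list and dataset sizes here are arbitrary choices, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset for demonstration
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

# Search over a handful of max_features candidates with 5-fold CV
param_grid = {"max_features": ["sqrt", "log2", 0.3, 0.5, None]}
search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)

print(f"Best max_features: {search.best_params_['max_features']}")
print(f"Best CV accuracy: {search.best_score_:.3f}")
```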
Issues to consider:

- The optimal `max_features` value depends on the dataset's characteristics and the problem at hand
- Very low values may lead to underfitting, while very high values may cause overfitting
- The impact of `max_features` can vary depending on other parameters like `n_estimators` and `max_depth`
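The interaction with other parameters is easy to probe directly. A minimal sketch that cross-validates a few `max_features`/`max_depth` combinations on a synthetic dataset (the specific values compared are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic dataset for demonstration
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

# Compare max_features settings at two tree depths: the relative ranking
# of max_features values can shift as max_depth changes
for max_depth in [5, None]:
    for mf in ["sqrt", None]:
        etc = ExtraTreesClassifier(n_estimators=100, max_features=mf,
                                   max_depth=max_depth, random_state=42)
        score = cross_val_score(etc, X, y, cv=5).mean()
        print(f"max_depth={max_depth}, max_features={mf}: {score:.3f}")
```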