The `max_features` parameter in scikit-learn’s `RandomForestClassifier` controls the number of features considered when looking for the best split at each node in the decision trees.

Random Forest builds an ensemble of decision trees, and at each split in a tree, it considers a random subset of features. The `max_features` parameter determines the size of this subset.
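For reference, `max_features` accepts several value types; a quick sketch of the accepted forms (the specific numbers here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

rf_int = RandomForestClassifier(max_features=10)       # exactly 10 features per split
rf_frac = RandomForestClassifier(max_features=0.3)     # a fraction: 30% of all features
rf_sqrt = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features) per split
rf_all = RandomForestClassifier(max_features=None)     # all features per split
```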
Adjusting `max_features` affects the bias-variance tradeoff of the model. Using a smaller value introduces more randomness and reduces variance, but may increase bias if set too low.
The default value for `max_features` is `"sqrt"`, which considers a number of features equal to the square root of the total number of features. Another common value is `"log2"`, which uses the base-2 logarithm of the total number of features.
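To make those settings concrete, this computes the per-split feature counts for a 100-feature dataset (scikit-learn truncates the computed value down to an integer):

```python
import numpy as np

n_features = 100
print(int(np.sqrt(n_features)))   # "sqrt" -> 10 features per split
print(int(np.log2(n_features)))   # "log2" -> 6 features per split
```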
In practice, the optimal value depends on the specific dataset and problem. It is common to tune `max_features` with float values between 0.1 and 1.0, which scikit-learn interprets as fractions of the total number of features.
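One systematic way to do this is cross-validated grid search; a minimal sketch using `GridSearchCV` (the candidate grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=100, random_state=42)

# Candidate fractions of the total feature count (illustrative grid)
param_grid = {"max_features": [0.1, 0.25, 0.5, 0.75, 1.0]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The fuller example below compares several settings directly on a held-out test set.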
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset with many features
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           n_redundant=0, n_classes=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [0.1, "sqrt", "log2", 0.5, 1.0]
accuracies = []

for mf in max_features_values:
    rf = RandomForestClassifier(max_features=mf, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
max_features=0.1, Accuracy: 0.535
max_features=sqrt, Accuracy: 0.535
max_features=log2, Accuracy: 0.495
max_features=0.5, Accuracy: 0.510
max_features=1.0, Accuracy: 0.510
```
The key steps in this example are:
- Generate a synthetic multiclass classification dataset with 100 features
- Split the data into train and test sets
- Train `RandomForestClassifier` models with different `max_features` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `max_features`:

- For datasets with many features, start with a low value like 0.1 or `"sqrt"`
- If the model underfits, increase `max_features` to reduce bias
- If the model overfits, decrease `max_features` to increase randomness and reduce variance (comparing train and test accuracy, as sketched below, helps diagnose which case applies)
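A minimal sketch of that diagnosis, comparing training and test accuracy at a few settings (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for mf in [0.05, "sqrt", 0.5]:
    rf = RandomForestClassifier(max_features=mf, random_state=42).fit(X_train, y_train)
    # A large train/test gap suggests overfitting; low scores on both suggest underfitting
    print(f"max_features={mf}: train={rf.score(X_train, y_train):.3f}, "
          f"test={rf.score(X_test, y_test):.3f}")
```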
Issues to consider:

- Using a very low `max_features` value can lead to high bias and underfitting
- Setting `max_features` too high may not provide enough regularization, causing overfitting
- Higher values of `max_features` are more computationally expensive, as more features are considered at each split (the timing sketch below illustrates this)
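A rough way to see that cost is to time fits at a few settings with `time.perf_counter` (absolute timings will vary by machine):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=200, random_state=42)

for mf in ["sqrt", 0.5, 1.0]:
    start = time.perf_counter()
    RandomForestClassifier(max_features=mf, random_state=42).fit(X, y)
    # More features examined per split means more work per node
    print(f"max_features={mf}: fit took {time.perf_counter() - start:.2f}s")
```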