The `max_features` parameter in scikit-learn’s `RandomForestClassifier` controls the number of features considered when looking for the best split at each node in the decision trees.

Random Forest builds an ensemble of decision trees, and at each split in a tree, it considers a random subset of features. The `max_features` parameter determines the size of this subset.
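For reference, `max_features` accepts several value types; a quick sketch of the accepted forms (the specific numbers here are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

rf_int = RandomForestClassifier(max_features=10)       # exactly 10 features per split
rf_frac = RandomForestClassifier(max_features=0.3)     # a fraction: 30% of all features
rf_sqrt = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features) per split
rf_all = RandomForestClassifier(max_features=None)     # all features per split
```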
Adjusting `max_features` affects the bias-variance tradeoff of the model. Using a smaller value introduces more randomness and reduces variance, but may increase bias if set too low.
The default value for `max_features` is `"sqrt"`, which considers a number of features equal to the square root of the total number of features. Another common value is `"log2"`, which uses the base-2 logarithm of the total number of features.
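To make those settings concrete, this computes the per-split feature counts for a 100-feature dataset (scikit-learn truncates the computed value down to an integer):

```python
import numpy as np

n_features = 100
print(int(np.sqrt(n_features)))   # "sqrt" -> 10 features per split
print(int(np.log2(n_features)))   # "log2" -> 6 features per split
```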
In practice, the optimal value depends on the specific dataset and problem. It is common to tune `max_features` with float values between 0.1 and 1.0, which scikit-learn interprets as fractions of the total number of features.
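One systematic way to do this is cross-validated grid search; a minimal sketch using `GridSearchCV` (the candidate grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=100, random_state=42)

# Candidate fractions of the total feature count (illustrative grid)
param_grid = {"max_features": [0.1, 0.25, 0.5, 0.75, 1.0]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The fuller example below compares several settings directly on a held-out test set.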
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset with many features
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           n_redundant=0, n_classes=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [0.1, "sqrt", "log2", 0.5, 1.0]
accuracies = []

for mf in max_features_values:
    rf = RandomForestClassifier(max_features=mf, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
max_features=0.1, Accuracy: 0.535
max_features=sqrt, Accuracy: 0.535
max_features=log2, Accuracy: 0.495
max_features=0.5, Accuracy: 0.510
max_features=1.0, Accuracy: 0.510
```
The key steps in this example are:
- Generate a synthetic multiclass classification dataset with 100 features
- Split the data into train and test sets
- Train `RandomForestClassifier` models with different `max_features` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `max_features`:

- For datasets with many features, start with a low value like 0.1 or `"sqrt"`
- If the model underfits, increase `max_features` to reduce bias
- If the model overfits, decrease `max_features` to increase randomness and reduce variance (comparing train and test accuracy, as sketched below, helps diagnose which case applies)
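A minimal sketch of that diagnosis, comparing training and test accuracy at a few settings (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for mf in [0.05, "sqrt", 0.5]:
    rf = RandomForestClassifier(max_features=mf, random_state=42).fit(X_train, y_train)
    # A large train/test gap suggests overfitting; low scores on both suggest underfitting
    print(f"max_features={mf}: train={rf.score(X_train, y_train):.3f}, "
          f"test={rf.score(X_test, y_test):.3f}")
```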
Issues to consider:

- Using a very low `max_features` value can lead to high bias and underfitting
- Setting `max_features` too high may not provide enough regularization, causing overfitting
- Higher values of `max_features` are more computationally expensive, as more features are considered at each split (the timing sketch below illustrates this)
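A rough way to see that cost is to time fits at a few settings with `time.perf_counter` (absolute timings will vary by machine):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=200, random_state=42)

for mf in ["sqrt", 0.5, 1.0]:
    start = time.perf_counter()
    RandomForestClassifier(max_features=mf, random_state=42).fit(X, y)
    # More features examined per split means more work per node
    print(f"max_features={mf}: fit took {time.perf_counter() - start:.2f}s")
```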