The max_features parameter in scikit-learn's GradientBoostingClassifier controls the number of features to consider when looking for the best split at each node of the decision trees.
Gradient Boosting builds an ensemble of decision trees in a stage-wise fashion. The max_features parameter determines how many features are randomly sampled as split candidates at each node.
Smaller values of max_features can help reduce overfitting by introducing more randomness and diversity among the trees. However, setting it too low may lead to underfitting.
The default value for max_features is None, which means all features are considered at each split. Common alternatives are "sqrt" (square root of the total number of features) and "log2" (base-2 logarithm of the total number of features); a float such as 0.5 is interpreted as a fraction of the total number of features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=10,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [None, "sqrt", "log2", 0.5]
accuracies = []
for mf in max_features_values:
    gb = GradientBoostingClassifier(max_features=mf, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_features=None, Accuracy: 0.760
max_features=sqrt, Accuracy: 0.720
max_features=log2, Accuracy: 0.720
max_features=0.5, Accuracy: 0.750
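In this run, "sqrt" and "log2" give identical accuracy because the dataset has 20 features (the make_classification default) and both settings resolve to the same number of candidate features per split. The sketch below illustrates how each setting maps to a feature count; the resolved_max_features helper is hypothetical and only approximates scikit-learn's internal conversion:

import numpy as np

# Hypothetical helper: approximate how max_features settings resolve
# to a number of candidate features per split (assumes 20 total features)
def resolved_max_features(max_features, n_features=20):
    if max_features is None:
        return n_features                              # use all features
    if max_features == "sqrt":
        return max(1, int(np.sqrt(n_features)))        # sqrt(20) -> 4
    if max_features == "log2":
        return max(1, int(np.log2(n_features)))        # log2(20) -> 4
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # 0.5 * 20 -> 10
    return int(max_features)                           # int given directly

for mf in [None, "sqrt", "log2", 0.5]:
    print(f"max_features={mf!r} -> {resolved_max_features(mf)} of 20 features per split")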
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different max_features values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_features:

- Start with the default value (None) and try smaller values like "sqrt" or "log2" (a cross-validation sketch follows this list)
- Smaller values can help reduce overfitting, especially for high-dimensional datasets
- If the model is underfitting, try increasing max_features
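One way to apply these tips systematically is to cross-validate over a small grid of max_features candidates. A minimal sketch with GridSearchCV, assuming the X_train and y_train arrays from the example above:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validate a handful of max_features candidates
param_grid = {"max_features": [None, "sqrt", "log2", 0.3, 0.5]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best max_features:", grid.best_params_["max_features"])
print(f"Best CV accuracy: {grid.best_score_:.3f}")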
Issues to consider:
- The optimal value depends on the specific dataset and problem
- Setting max_features too low can lead to underfitting and poor performance
- Using a smaller value increases randomness, which can require more trees (n_estimators) for the model to converge, as shown in the sketch after this list
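To see the interaction between max_features and n_estimators in practice, staged_predict can score the ensemble after every boosting stage. A sketch, again reusing the train/test split from the example above:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Track test accuracy as trees are added, with and without feature subsampling
for mf in [None, "sqrt"]:
    gb = GradientBoostingClassifier(n_estimators=300, max_features=mf, random_state=42)
    gb.fit(X_train, y_train)
    staged_acc = [accuracy_score(y_test, y_pred)
                  for y_pred in gb.staged_predict(X_test)]
    best_stage = staged_acc.index(max(staged_acc)) + 1
    print(f"max_features={mf}: best accuracy {max(staged_acc):.3f} at {best_stage} trees")

If the restricted setting is still improving at the final stage, increasing n_estimators is the natural next step.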