The max_features parameter in scikit-learn’s HistGradientBoostingClassifier controls the number of features considered when looking for the best split.
HistGradientBoostingClassifier is an implementation of gradient boosting that uses histogram-based decision trees. It’s known for its speed and efficiency, especially on large datasets.
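As a rough illustration of that speed advantage, the sketch below times HistGradientBoostingClassifier against the classic GradientBoostingClassifier on a synthetic dataset; the dataset size is an arbitrary choice and the exact timings depend on your hardware.
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
# Synthetic dataset large enough for the histogram-based speedup to show
X, y = make_classification(n_samples=20000, n_features=20, random_state=42)
for Model in (GradientBoostingClassifier, HistGradientBoostingClassifier):
    start = perf_counter()
    Model(random_state=42).fit(X, y)
    print(f"{Model.__name__}: {perf_counter() - start:.1f}s")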
The max_features parameter sets the proportion of features that are randomly sampled and considered at each node split. It is a float in the interval (0, 1], expressed as a fraction of the number of input features. By default, max_features is set to 1.0, which considers all features at every split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_features values
max_features_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
accuracies = []
for mf in max_features_values:
    hgbc = HistGradientBoostingClassifier(max_features=mf, random_state=42)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_features=0.1, Accuracy: 0.904
max_features=0.3, Accuracy: 0.917
max_features=0.5, Accuracy: 0.918
max_features=0.7, Accuracy: 0.915
max_features=0.9, Accuracy: 0.913
max_features=1.0, Accuracy: 0.912
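To see the trend at a glance, the accuracies collected in the loop above can be plotted against the max_features values (this assumes matplotlib is installed; it is not required by the example itself):
import matplotlib.pyplot as plt
# Plot test accuracy as a function of max_features
plt.plot(max_features_values, accuracies, marker="o")
plt.xlabel("max_features")
plt.ylabel("Test accuracy")
plt.title("Effect of max_features on HistGradientBoostingClassifier")
plt.show()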
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different max_features values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default 1.0 and experiment with other values
- Unlike the forest estimators, this parameter does not accept the strings ‘sqrt’ or ‘log2’; the equivalent ratios sqrt(n_features)/n_features and log2(n_features)/n_features often work well and can reduce overfitting (see the sketch after this list)
- Any float in (0, 1] (e.g., 0.5) can be used, allowing fine-grained control
- Lower values can lead to more diverse trees but may miss important features
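Because max_features only accepts a float, the sqrt and log2 heuristics have to be expressed as proportions of the feature count. A minimal sketch, reusing X_train and y_train from the example above:
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
n_features = X_train.shape[1]
mf_sqrt = np.sqrt(n_features) / n_features   # ~0.22 for 20 features
mf_log2 = np.log2(n_features) / n_features   # ~0.22 for 20 features
hgbc = HistGradientBoostingClassifier(max_features=mf_sqrt, random_state=42)
hgbc.fit(X_train, y_train)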
Issues to consider:
- The optimal value depends on the dataset’s characteristics
- Values that are too low may result in underfitting, while values that are too high can lead to overfitting
- There’s a trade-off between model complexity and computational cost
- The effect of max_features can interact with other hyperparameters, so it is often tuned jointly with them (a joint tuning sketch follows this list)
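Because of those interactions, max_features is usually tuned together with other hyperparameters rather than in isolation. A minimal sketch using GridSearchCV, reusing X_train and y_train from the example above; the grid values are illustrative, not prescriptive:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Illustrative grid; adapt the ranges to the dataset at hand
param_grid = {
    "max_features": [0.3, 0.6, 1.0],
    "learning_rate": [0.05, 0.1],
    "max_leaf_nodes": [15, 31],
}
search = GridSearchCV(HistGradientBoostingClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)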