The `categorical_features` parameter in scikit-learn's `HistGradientBoostingClassifier` specifies which features should be treated as categorical.

Histogram-based gradient boosting is an efficient implementation of gradient boosting that can handle both numerical and categorical features, and the `categorical_features` parameter lets you explicitly define which features should be treated as categorical.
By default, `categorical_features` is set to `None`, which means no features are treated as categorical. Recent scikit-learn releases (1.4+) also accept `"from_dtype"`, which infers categorical features from the categorical dtypes of a pandas DataFrame and has since become the default; it does nothing for a plain NumPy array. You can also provide a boolean mask, a list of column indices, or, for DataFrame input, a list of feature names to specify categorical features manually.
In practice, it’s often beneficial to explicitly specify categorical features, especially when dealing with high-cardinality categorical variables or when you want fine-grained control over feature handling.
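For instance, these are equivalent ways to mark the last two of seven features as categorical (the column names in the last form are placeholders and assume a pandas DataFrame input):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Boolean mask: one entry per feature, True marks a categorical column
clf = HistGradientBoostingClassifier(categorical_features=[False] * 5 + [True] * 2)

# Integer indices of the categorical columns
clf = HistGradientBoostingClassifier(categorical_features=[5, 6])

# Feature names (scikit-learn >= 1.2, DataFrame input only; names here
# are hypothetical)
clf = HistGradientBoostingClassifier(categorical_features=["color", "city"])
```

The complete example below compares several `categorical_features` configurations on a synthetic dataset: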
```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Fix the seed for reproducibility
np.random.seed(42)

# Generate synthetic dataset
n_samples = 1000
n_features = 5

# Create numerical features
X_numerical = np.random.rand(n_samples, n_features)

# Create integer-coded categorical features
X_categorical = np.random.randint(0, 5, size=(n_samples, 2))

# Combine features: columns 0-4 are numerical, columns 5-6 are categorical
X = np.hstack([X_numerical, X_categorical])

# Generate target variable
y = (X[:, 0] + X[:, 5] > 1).astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different categorical_features configurations
configs = [
    ('Default (None)', None),
    ('All numerical', [False] * 7),
    ('Correct specification', [False] * 5 + [True] * 2)
]

for name, cat_features in configs:
    model = HistGradientBoostingClassifier(random_state=42, categorical_features=cat_features)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name}: Accuracy = {accuracy:.4f}")
```
Running the example gives an output like:

```
Default (None): Accuracy = 1.0000
All numerical: Accuracy = 1.0000
Correct specification: Accuracy = 1.0000
```

All three configurations reach perfect accuracy here because the target reduces to a simple threshold on column 5 (`y` is 1 whenever the integer value is at least 1), which the trees learn easily whether or not the column is marked as categorical.
The key steps in this example are:

- Generate a synthetic dataset with both numerical and categorical features
- Split the data into training and test sets
- Create `HistGradientBoostingClassifier` models with different `categorical_features` configurations
- Train each model and evaluate its accuracy on the test set
Some tips and heuristics for setting `categorical_features`:
- Use domain knowledge to identify truly categorical features
- Consider explicitly specifying categorical features for better control and interpretability
- Be cautious with high-cardinality categorical features, as they may need special handling (see the encoding sketch after this list)
- Experiment with different configurations to find the optimal performance for your dataset
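
One common approach for string-valued or high-cardinality categories is to map them to integer codes before fitting. The sketch below (an illustration with made-up column names, not part of the example above) uses `OrdinalEncoder` for this; note that `HistGradientBoostingClassifier` expects the cardinality of each categorical feature to stay within `max_bins` (255 by default):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one numerical column and one string-valued
# categorical column with 50 distinct levels
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "amount": rng.random(500),
    "city": rng.choice([f"city_{i}" for i in range(50)], size=500),
})
y = rng.integers(0, 2, size=500)

# Encode string categories as integer codes; categories unseen during
# fit are mapped to NaN, which the model treats as a missing value
encode = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                            unknown_value=np.nan), ["city"])],
    remainder="passthrough",
)

# The ColumnTransformer puts the encoded "city" column first, so the
# boolean mask marks column 0 as categorical
model = make_pipeline(
    encode,
    HistGradientBoostingClassifier(categorical_features=[True, False],
                                   random_state=42),
)
model.fit(X, y)
```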
Issues to consider:
- Automatic detection may not always correctly identify categorical features (see the sketch after this list)
- Treating numerical features as categorical can lead to loss of information
- Treating categorical features as numerical may result in incorrect assumptions about feature relationships
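
To ground the first point: with `categorical_features="from_dtype"` (available in scikit-learn 1.4+), detection relies entirely on the input carrying an explicit pandas categorical dtype, so integer-coded categories in a plain array or a regular integer column will not be picked up. A minimal sketch, assuming scikit-learn 1.4 or later:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x_num": rng.random(200),
    "x_cat": rng.integers(0, 5, size=200),  # integer codes, plain int dtype
})
y = (df["x_cat"] >= 2).astype(int)

# A plain integer column carries no categorical dtype, so nothing is detected
model = HistGradientBoostingClassifier(categorical_features="from_dtype",
                                       random_state=42)
model.fit(df, y)
print(model.is_categorical_)  # None: no categorical features found

# Casting to a pandas categorical dtype makes detection work
df["x_cat"] = df["x_cat"].astype("category")
model.fit(df, y)
print(model.is_categorical_)  # [False  True]
```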