
Configure HistGradientBoostingClassifier "categorical_features" Parameter

The categorical_features parameter in scikit-learn’s HistGradientBoostingClassifier specifies which features should be treated as categorical.

Histogram-based Gradient Boosting is an efficient implementation of gradient boosting that can handle both numerical and categorical features. The categorical_features parameter allows you to explicitly define which features should be treated as categorical.

By default, categorical_features is set to None, which means no features are treated as categorical; the default does not infer anything on its own. To enable native categorical handling, pass a boolean mask or a list of column indices identifying the categorical features (recent scikit-learn releases also accept feature names for DataFrame input, and 'from_dtype' to detect pandas categorical dtypes).
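For instance, here is a minimal sketch of the accepted forms, using a hypothetical layout in which columns 3 and 4 are categorical:

from sklearn.ensemble import HistGradientBoostingClassifier

# As a list of column indices (columns 3 and 4 are categorical):
model = HistGradientBoostingClassifier(categorical_features=[3, 4])

# As a boolean mask with one entry per feature:
model = HistGradientBoostingClassifier(
    categorical_features=[False, False, False, True, True]
)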

In practice, it’s often beneficial to explicitly specify categorical features, especially when dealing with high-cardinality categorical variables or when you want fine-grained control over feature handling.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Generate synthetic dataset
np.random.seed(42)  # seed for reproducible results
n_samples = 1000
n_features = 5

# Create numerical features
X_numerical = np.random.rand(n_samples, n_features)

# Create integer-coded categorical features with values in {0, ..., 4}
X_categorical = np.random.randint(0, 5, size=(n_samples, 2))

# Combine features
X = np.hstack([X_numerical, X_categorical])

# Generate target variable from one numerical and one categorical feature
y = (X[:, 0] + X[:, 5] > 1).astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different categorical_features configurations
configs = [
    ('Default (None)', None),
    ('All numerical', [False] * 7),
    ('Correct specification', [False] * 5 + [True] * 2)
]

for name, cat_features in configs:
    model = HistGradientBoostingClassifier(random_state=42, categorical_features=cat_features)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name}: Accuracy = {accuracy:.4f}")

Running the example gives an output like:

Default (None): Accuracy = 1.0000
All numerical: Accuracy = 1.0000
Correct specification: Accuracy = 1.0000

All three configurations reach perfect accuracy because this synthetic target is an easy function of its features; on real data, especially with high-cardinality categories, the specification matters far more.

The key steps in this example are:

  1. Generate a synthetic dataset with both numerical and categorical features
  2. Split the data into training and test sets
  3. Create HistGradientBoostingClassifier models with different categorical_features configurations
  4. Train each model and evaluate its accuracy on the test set
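In real datasets, categorical columns often arrive as strings. With NumPy array input, they must first be encoded as non-negative integers; a minimal sketch using OrdinalEncoder, with made-up data:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical raw data: one numerical column, one string-valued category
X_num = np.random.rand(100, 1)
X_cat = np.random.choice(['red', 'green', 'blue'], size=(100, 1))
y = np.random.randint(0, 2, size=100)

# Map each string category to an integer in [0, n_categories)
X_cat_encoded = OrdinalEncoder().fit_transform(X_cat)
X = np.hstack([X_num, X_cat_encoded])

# Column 1 now holds integer-coded categories
model = HistGradientBoostingClassifier(categorical_features=[1])
model.fit(X, y)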

Some tips and heuristics for setting categorical_features:

  - Specify categorical features explicitly; the None default simply disables native categorical handling rather than detecting it.
  - When passing NumPy arrays, encode categories as non-negative integers first, for example with OrdinalEncoder (see the sketch above).
  - Prefer native categorical support over one-hot encoding for high-cardinality features: it is faster and lets the trees evaluate splits over groups of categories.
  - With pandas input, columns with a categorical dtype can be picked up automatically via categorical_features='from_dtype' in scikit-learn 1.4+ (see the sketch after the next list).

Issues to consider:

  - The cardinality of each categorical feature must be less than max_bins (255 by default), or fitting raises an error.
  - Categories not seen during training are treated as missing values at prediction time.
  - Marking a genuinely numerical feature as categorical discards its ordering, which can hurt performance.
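If the data lives in a pandas DataFrame, recent scikit-learn releases (1.4 and later) can also infer categorical columns from their dtype; a sketch under that version assumption, with made-up column names:

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

# Hypothetical DataFrame where 'color' has an explicit categorical dtype
df = pd.DataFrame({
    'amount': np.random.rand(100),
    'color': pd.Categorical(np.random.choice(['red', 'green', 'blue'], 100)),
})
y = np.random.randint(0, 2, size=100)

# 'from_dtype' (scikit-learn 1.4+) treats categorical-dtype columns as categorical
model = HistGradientBoostingClassifier(categorical_features='from_dtype')
model.fit(df, y)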


