The categorical_features parameter in scikit-learn's HistGradientBoostingRegressor specifies which features should be treated as categorical.
HistGradientBoostingRegressor is a gradient boosting algorithm that builds histogram-based decision trees. It is designed for efficiency and can handle large, high-dimensional datasets.
The categorical_features parameter lets the algorithm handle categorical variables natively, without manual one-hot encoding. It accepts a boolean mask, a list of feature indices, or (with DataFrame input) a list of column names.
By default, categorical_features is set to None, which means all features are treated as numerical. When specified, it enables the algorithm to use dedicated split-finding for the categorical features.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset with mixed feature types
n_samples = 1000
n_features = 5
n_categorical = 2
X, y = make_regression(n_samples=n_samples, n_features=n_features, noise=0.1, random_state=42)
# Convert some features to categorical (seed so the run is reproducible)
np.random.seed(42)
for i in range(n_categorical):
    X[:, i] = np.random.randint(0, 5, size=n_samples)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different categorical_features configurations
configurations = [
    None,
    [0, 1],
    [True, True, False, False, False],
]
for config in configurations:
    hgbr = HistGradientBoostingRegressor(categorical_features=config, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"categorical_features={config}, MSE: {mse:.3f}")
Running the example gives an output like:
categorical_features=None, MSE: 3428.551
categorical_features=[0, 1], MSE: 3627.036
categorical_features=[True, True, False, False, False], MSE: 3627.036
The key steps in this example are:
- Generate a synthetic regression dataset with mixed numerical and categorical features
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different categorical_features configurations
- Evaluate the mean squared error of each model on the test set
Some tips for setting categorical_features:
- Use a boolean mask or list of indices to specify categorical features
- Ensure categorical features are encoded as non-negative integers starting from 0 and below max_bins (255 by default)
- Consider the number of unique values in each categorical feature
Issues to consider:
- Incorrectly specifying numerical features as categorical can lead to suboptimal performance
- Very high cardinality categorical features may still benefit from encoding techniques
- The effectiveness of automatic categorical handling depends on the dataset characteristics
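For the high-cardinality caveat, a quick pre-fit check can flag problem columns. The check_cardinality helper below is a hypothetical name, not part of scikit-learn; the 255 threshold mirrors the estimator's default max_bins.

```python
import numpy as np

def check_cardinality(X, categorical_idx, max_bins=255):
    """Return {column index: cardinality} for categorical columns
    whose number of unique values exceeds max_bins."""
    counts = {i: len(np.unique(X[:, i])) for i in categorical_idx}
    return {i: n for i, n in counts.items() if n > max_bins}

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 1000, size=2000),  # high cardinality: hundreds of levels
    rng.integers(0, 5, size=2000),     # low cardinality: 5 levels
]).astype(float)

over = check_cardinality(X, [0, 1])
print(over)  # only column 0 is flagged
```

Columns flagged this way could have rare levels grouped together, or be handled with an encoding technique such as target encoding, before fitting.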