
Configure HistGradientBoostingRegressor "categorical_features" Parameter

The categorical_features parameter in scikit-learn’s HistGradientBoostingRegressor specifies which features should be treated as categorical.

HistGradientBoostingRegressor is a gradient boosting algorithm that uses histogram-based decision trees. It’s designed for efficiency and can handle large datasets with high-dimensional features.

The categorical_features parameter allows the algorithm to properly handle categorical variables without the need for manual encoding. It expects a boolean mask or a list of indices indicating which features are categorical.

By default, categorical_features is set to None, so every feature is treated as numerical. When the parameter is specified, the marked features are split with dedicated categorical splits instead of having their category codes treated as ordered numeric values.
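As a quick illustration (with a made-up toy DataFrame), the forms below are equivalent ways of flagging a single categorical column. The column-name variant is an assumption that only holds on newer scikit-learn releases when the training data carries feature names, for example a pandas DataFrame:

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy data: "color" holds integer category codes, "size" is numerical
df = pd.DataFrame({
    "color": [0, 1, 2, 0, 1, 2, 0, 1],
    "size": [1.0, 2.5, 3.1, 0.7, 2.2, 3.3, 1.1, 2.8],
})
y = [10.0, 12.5, 15.1, 9.7, 12.2, 15.3, 10.1, 12.8]

by_index = HistGradientBoostingRegressor(categorical_features=[0])            # by column index
by_mask = HistGradientBoostingRegressor(categorical_features=[True, False])   # by boolean mask
by_name = HistGradientBoostingRegressor(categorical_features=["color"])       # by name; newer releases only

by_index.fit(df, y)  # each estimator is fit the same way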

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset with mixed feature types
n_samples = 1000
n_features = 5
n_categorical = 2

X, y = make_regression(n_samples=n_samples, n_features=n_features, noise=0.1, random_state=42)

# Overwrite the first two features with integer category codes in {0, ..., 4}
np.random.seed(42)  # make the category assignment reproducible
for i in range(n_categorical):
    X[:, i] = np.random.randint(0, 5, size=n_samples)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different categorical_features configurations
configurations = [
    None,                               # default: all features treated as numerical
    [0, 1],                             # categorical features given by column index
    [True, True, False, False, False]   # the same two features given as a boolean mask
]

for config in configurations:
    hgbr = HistGradientBoostingRegressor(categorical_features=config, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"categorical_features={config}, MSE: {mse:.3f}")

Running the example gives an output like:

categorical_features=None, MSE: 3428.551
categorical_features=[0, 1], MSE: 3627.036
categorical_features=[True, True, False, False, False], MSE: 3627.036
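The index list and the boolean mask mark the same two columns, which is why their MSE values are identical. Assuming the fitted is_categorical_ attribute exposed by HistGradientBoostingRegressor, the short sketch below (reusing X_train and y_train from the example) shows that both specifications resolve to the same boolean mask:

hgbr_idx = HistGradientBoostingRegressor(
    categorical_features=[0, 1], random_state=42
).fit(X_train, y_train)
hgbr_mask = HistGradientBoostingRegressor(
    categorical_features=[True, True, False, False, False], random_state=42
).fit(X_train, y_train)

# Both fitted models report the same per-feature categorical mask
print(hgbr_idx.is_categorical_)
print(hgbr_mask.is_categorical_)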

The key steps in this example are:

  1. Generate a synthetic regression dataset with mixed numerical and categorical features
  2. Split the data into train and test sets
  3. Train HistGradientBoostingRegressor models with different categorical_features configurations
  4. Evaluate the mean squared error of each model on the test set
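For context on the claim that no manual encoding is needed, the sketch below shows the conventional alternative: one-hot encode columns 0 and 1 yourself and leave categorical_features unset. It reuses the train/test arrays from the example and assumes a scikit-learn release where OneHotEncoder accepts the sparse_output argument (1.2 or later):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Manual route: one-hot encode the two categorical columns, pass the rest through
manual = make_pipeline(
    ColumnTransformer(
        [("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), [0, 1])],
        remainder="passthrough",
    ),
    HistGradientBoostingRegressor(random_state=42),
)
manual.fit(X_train, y_train)
print(f"Manual one-hot, MSE: {mean_squared_error(y_test, manual.predict(X_test)):.3f}")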

Some tips for setting categorical_features:

  1. Pass either a list of feature indices or a boolean mask with exactly one entry per feature; both describe the same configuration.
  2. When the input is a plain NumPy array, encode categories as non-negative integers (for example with OrdinalEncoder) before fitting.
  3. Leave the parameter as None only when every feature is genuinely numerical; otherwise integer category codes are split as if they were ordered numbers.

Issues to consider:

  1. The cardinality of each categorical feature should stay below max_bins (255 by default); see the sketch after this list.
  2. A boolean mask whose length does not match the number of features, or an index outside the valid range, raises an error at fit time.
  3. Marking a genuinely continuous feature as categorical discards its ordering information and usually hurts accuracy.
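As a rough illustration of the cardinality point, the hypothetical helper below (not part of scikit-learn) keeps only the candidate columns whose number of unique values stays below max_bins before they are handed to categorical_features:

def indices_within_max_bins(X, candidate_indices, max_bins=255):
    # Hypothetical helper: drop any candidate column whose cardinality
    # reaches max_bins, the limit categorical features are expected to stay under.
    return [i for i in candidate_indices if len(np.unique(X[:, i])) < max_bins]

# Columns 0 and 1 hold only five distinct values each, so both indices are kept
print(indices_within_max_bins(X_train, [0, 1]))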


