The categorical_features parameter in scikit-learn's HistGradientBoostingRegressor specifies which features should be treated as categorical.
HistGradientBoostingRegressor is a gradient boosting algorithm that builds histogram-based decision trees. It is designed for efficiency and can handle large, high-dimensional datasets.
The categorical_features parameter lets the algorithm handle categorical variables natively, without manual one-hot encoding. It accepts a boolean mask, a list of feature indices, or (with DataFrame input) a list of column names.
By default, categorical_features is set to None, which means all features are treated as numerical. When specified, it enables the algorithm to use dedicated split-finding for the categorical features.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset with mixed feature types
n_samples = 1000
n_features = 5
n_categorical = 2
X, y = make_regression(n_samples=n_samples, n_features=n_features, noise=0.1, random_state=42)
# Convert some features to categorical (seed so the run is reproducible)
np.random.seed(42)
for i in range(n_categorical):
    X[:, i] = np.random.randint(0, 5, size=n_samples)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different categorical_features configurations
configurations = [
    None,
    [0, 1],
    [True, True, False, False, False],
]
for config in configurations:
    hgbr = HistGradientBoostingRegressor(categorical_features=config, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"categorical_features={config}, MSE: {mse:.3f}")
Running the example gives an output like:
categorical_features=None, MSE: 3428.551
categorical_features=[0, 1], MSE: 3627.036
categorical_features=[True, True, False, False, False], MSE: 3627.036
The key steps in this example are:
- Generate a synthetic regression dataset with mixed numerical and categorical features
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different categorical_features configurations
- Evaluate the mean squared error of each model on the test set
Some tips for setting categorical_features:
- Use a boolean mask or list of indices to specify categorical features
- Ensure categorical features are encoded as non-negative integers starting from 0 and below max_bins (255 by default)
- Consider the number of unique values in each categorical feature
Issues to consider:
- Incorrectly specifying numerical features as categorical can lead to suboptimal performance
- Very high cardinality categorical features may still benefit from encoding techniques
- The effectiveness of automatic categorical handling depends on the dataset characteristics
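For the high-cardinality caveat, a quick pre-fit check can flag problem columns. The check_cardinality helper below is a hypothetical name, not part of scikit-learn; the 255 threshold mirrors the estimator's default max_bins.

```python
import numpy as np

def check_cardinality(X, categorical_idx, max_bins=255):
    """Return {column index: cardinality} for categorical columns
    whose number of unique values exceeds max_bins."""
    counts = {i: len(np.unique(X[:, i])) for i in categorical_idx}
    return {i: n for i, n in counts.items() if n > max_bins}

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 1000, size=2000),  # high cardinality: hundreds of levels
    rng.integers(0, 5, size=2000),     # low cardinality: 5 levels
]).astype(float)

over = check_cardinality(X, [0, 1])
print(over)  # only column 0 is flagged
```

Columns flagged this way could have rare levels grouped together, or be handled with an encoding technique such as target encoding, before fitting.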