The `categorical_features` parameter in scikit-learn's `HistGradientBoostingClassifier` specifies which features should be treated as categorical.

Histogram-based gradient boosting is an efficient implementation of gradient boosting that can handle both numerical and categorical features, and the `categorical_features` parameter lets you explicitly define which features should be treated as categorical.
By default, `categorical_features` is set to `None`, which means no features are treated as categorical. Recent scikit-learn releases (1.4+) also accept `"from_dtype"`, which infers categorical features from the categorical dtypes of a pandas DataFrame and has since become the default; it does nothing for a plain NumPy array. You can also provide a boolean mask, a list of column indices, or, for DataFrame input, a list of feature names to specify categorical features manually.
In practice, it’s often beneficial to explicitly specify categorical features, especially when dealing with high-cardinality categorical variables or when you want fine-grained control over feature handling.
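For instance, these are equivalent ways to mark the last two of seven features as categorical (the column names in the last form are placeholders and assume a pandas DataFrame input):

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Boolean mask: one entry per feature, True marks a categorical column
clf = HistGradientBoostingClassifier(categorical_features=[False] * 5 + [True] * 2)

# Integer indices of the categorical columns
clf = HistGradientBoostingClassifier(categorical_features=[5, 6])

# Feature names (scikit-learn >= 1.2, DataFrame input only; names here
# are hypothetical)
clf = HistGradientBoostingClassifier(categorical_features=["color", "city"])
```

The complete example below compares several `categorical_features` configurations on a synthetic dataset: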
```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Fix the seed for reproducibility
np.random.seed(42)

# Generate synthetic dataset
n_samples = 1000
n_features = 5

# Create numerical features
X_numerical = np.random.rand(n_samples, n_features)

# Create integer-coded categorical features
X_categorical = np.random.randint(0, 5, size=(n_samples, 2))

# Combine features: columns 0-4 are numerical, columns 5-6 are categorical
X = np.hstack([X_numerical, X_categorical])

# Generate target variable
y = (X[:, 0] + X[:, 5] > 1).astype(int)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define different categorical_features configurations
configs = [
    ('Default (None)', None),
    ('All numerical', [False] * 7),
    ('Correct specification', [False] * 5 + [True] * 2)
]

for name, cat_features in configs:
    model = HistGradientBoostingClassifier(random_state=42, categorical_features=cat_features)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name}: Accuracy = {accuracy:.4f}")
```
Running the example gives an output like:

```
Default (None): Accuracy = 1.0000
All numerical: Accuracy = 1.0000
Correct specification: Accuracy = 1.0000
```

All three configurations reach perfect accuracy here because the target reduces to a simple threshold on column 5 (`y` is 1 whenever the integer value is at least 1), which the trees learn easily whether or not the column is marked as categorical.
The key steps in this example are:

- Generate a synthetic dataset with both numerical and categorical features
- Split the data into training and test sets
- Create `HistGradientBoostingClassifier` models with different `categorical_features` configurations
- Train each model and evaluate its accuracy on the test set
Some tips and heuristics for setting `categorical_features`:
- Use domain knowledge to identify truly categorical features
- Consider explicitly specifying categorical features for better control and interpretability
- Be cautious with high-cardinality categorical features, as they may need special handling (see the encoding sketch after this list)
- Experiment with different configurations to find the optimal performance for your dataset
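
One common approach for string-valued or high-cardinality categories is to map them to integer codes before fitting. The sketch below (an illustration with made-up column names, not part of the example above) uses `OrdinalEncoder` for this; note that `HistGradientBoostingClassifier` expects the cardinality of each categorical feature to stay within `max_bins` (255 by default):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one numerical column and one string-valued
# categorical column with 50 distinct levels
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "amount": rng.random(500),
    "city": rng.choice([f"city_{i}" for i in range(50)], size=500),
})
y = rng.integers(0, 2, size=500)

# Encode string categories as integer codes; categories unseen during
# fit are mapped to NaN, which the model treats as a missing value
encode = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                            unknown_value=np.nan), ["city"])],
    remainder="passthrough",
)

# The ColumnTransformer puts the encoded "city" column first, so the
# boolean mask marks column 0 as categorical
model = make_pipeline(
    encode,
    HistGradientBoostingClassifier(categorical_features=[True, False],
                                   random_state=42),
)
model.fit(X, y)
```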
Issues to consider:
- Automatic detection may not always correctly identify categorical features (see the sketch after this list)
- Treating numerical features as categorical can lead to loss of information
- Treating categorical features as numerical may result in incorrect assumptions about feature relationships
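
To ground the first point: with `categorical_features="from_dtype"` (available in scikit-learn 1.4+), detection relies entirely on the input carrying an explicit pandas categorical dtype, so integer-coded categories in a plain array or a regular integer column will not be picked up. A minimal sketch, assuming scikit-learn 1.4 or later:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x_num": rng.random(200),
    "x_cat": rng.integers(0, 5, size=200),  # integer codes, plain int dtype
})
y = (df["x_cat"] >= 2).astype(int)

# A plain integer column carries no categorical dtype, so nothing is detected
model = HistGradientBoostingClassifier(categorical_features="from_dtype",
                                       random_state=42)
model.fit(df, y)
print(model.is_categorical_)  # None: no categorical features found

# Casting to a pandas categorical dtype makes detection work
df["x_cat"] = df["x_cat"].astype("category")
model.fit(df, y)
print(model.is_categorical_)  # [False  True]
```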