Efficiently converting categorical features into numerical format for machine learning models can be challenging, especially with high-cardinality data. FeatureHasher addresses this by transforming categorical data into a hashed numerical representation.

FeatureHasher is a scikit-learn transformer that encodes categorical features into a sparse matrix using the hashing trick. Key hyperparameters include n_features, which sets the number of columns in the output sparse matrix, and input_type, which specifies the form of each input sample: a dict mapping feature names to values ('dict'), a sequence of (name, value) pairs ('pair'), or a sequence of strings ('string'). The algorithm suits any machine learning problem requiring feature encoding, and it is particularly useful in text processing and with high-cardinality categorical data.
from sklearn.feature_extraction import FeatureHasher
# Example data: list of dictionaries
data = [
{'feature_1': 'A', 'feature_2': '1'},
{'feature_1': 'B', 'feature_2': '2'},
{'feature_1': 'A', 'feature_2': '3'}
]
# Create the hasher
hasher = FeatureHasher(n_features=10, input_type='dict')
# Transform the data
hashed_features = hasher.transform(data)
# Convert to dense array for display purposes
hashed_features_dense = hashed_features.toarray()
print(hashed_features_dense)
Running the example gives an output like:
[[ 0. 0. 0. 0. 0. -1. 0. 1. 0. 0.]
[ 0. 1. 0. -1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. -1. 0. 0. 0. 0. 1. 0. 0.]]
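The negative entries come from the sign component of the hashing trick, which helps hash collisions cancel out on average rather than accumulate. scikit-learn exposes this via the alternate_sign parameter; a minimal sketch of disabling it so every hashed value is non-negative:

```python
from sklearn.feature_extraction import FeatureHasher

data = [
    {'feature_1': 'A', 'feature_2': '1'},
    {'feature_1': 'B', 'feature_2': '2'},
]

# alternate_sign=False drops the random sign bit, so colliding
# features add up instead of partially cancelling
hasher = FeatureHasher(n_features=10, input_type='dict', alternate_sign=False)
X = hasher.transform(data).toarray()
print(X.min())  # no negative entries without the sign bit
```

Leaving alternate_sign at its default (True) is usually preferable, since the cancellation keeps the hashed representation closer to an unbiased projection of the original features.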
- Prepare a sample dataset, here a list of dictionaries representing categorical features.
- Instantiate FeatureHasher with a specified number of output features (n_features=10) and input type (input_type='dict').
- Transform the data with the transform method to encode the categorical features into a sparse matrix.
- Convert the sparse matrix to a dense array for display purposes.
This example demonstrates how to use FeatureHasher to efficiently encode high-cardinality categorical features into a numerical format suitable for machine learning models. FeatureHasher is particularly useful for text processing tasks or datasets with a large number of categorical variables.
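Because FeatureHasher is stateless, it drops straight into a scikit-learn pipeline ahead of an estimator. A minimal sketch with invented high-cardinality records and labels (the field names and data here are illustrative, not from the original example):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical records with high-cardinality categorical fields
X_raw = [
    {"user_id": "u1", "city": "Paris"},
    {"user_id": "u2", "city": "Tokyo"},
    {"user_id": "u3", "city": "Paris"},
    {"user_id": "u4", "city": "Lima"},
]
y = [0, 1, 0, 1]

# String-valued dict entries are hashed as "key=value" features
pipe = make_pipeline(
    FeatureHasher(n_features=2**8, input_type="dict"),
    LogisticRegression(),
)
pipe.fit(X_raw, y)
preds = pipe.predict(X_raw)
print(preds)
```

A power-of-two n_features is a common convention; the value trades memory against the risk of hash collisions, so it is typically set well above the expected number of distinct categories.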