Efficiently converting categorical features into numerical format for machine learning models can be challenging, especially with high-cardinality data. FeatureHasher addresses this by transforming categorical data into a hashed numerical representation.

FeatureHasher is a scikit-learn transformer that encodes categorical features into a sparse matrix using the hashing trick. Key hyperparameters include n_features, which sets the number of columns in the output sparse matrix, and input_type, which specifies the form of each input sample: a dict mapping feature names to values ('dict'), a sequence of (name, value) pairs ('pair'), or a sequence of strings ('string'). The algorithm suits any machine learning problem requiring feature encoding, and it is particularly useful in text processing and with high-cardinality categorical data.
from sklearn.feature_extraction import FeatureHasher
# Example data: list of dictionaries
data = [
{'feature_1': 'A', 'feature_2': '1'},
{'feature_1': 'B', 'feature_2': '2'},
{'feature_1': 'A', 'feature_2': '3'}
]
# Create the hasher
hasher = FeatureHasher(n_features=10, input_type='dict')
# Transform the data
hashed_features = hasher.transform(data)
# Convert to dense array for display purposes
hashed_features_dense = hashed_features.toarray()
print(hashed_features_dense)
Running the example gives an output like:
[[ 0. 0. 0. 0. 0. -1. 0. 1. 0. 0.]
[ 0. 1. 0. -1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. -1. 0. 0. 0. 0. 1. 0. 0.]]
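The negative entries come from the sign component of the hashing trick, which helps hash collisions cancel out on average rather than accumulate. scikit-learn exposes this via the alternate_sign parameter; a minimal sketch of disabling it so every hashed value is non-negative:

```python
from sklearn.feature_extraction import FeatureHasher

data = [
    {'feature_1': 'A', 'feature_2': '1'},
    {'feature_1': 'B', 'feature_2': '2'},
]

# alternate_sign=False drops the random sign bit, so colliding
# features add up instead of partially cancelling
hasher = FeatureHasher(n_features=10, input_type='dict', alternate_sign=False)
X = hasher.transform(data).toarray()
print(X.min())  # no negative entries without the sign bit
```

Leaving alternate_sign at its default (True) is usually preferable, since the cancellation keeps the hashed representation closer to an unbiased projection of the original features.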
- Prepare a sample dataset, here a list of dictionaries representing categorical features.
- Instantiate FeatureHasher with a specified number of output features (n_features=10) and input type (input_type='dict').
- Transform the data with the transform method to encode the categorical features into a sparse matrix.
- Convert the sparse matrix to a dense array for display purposes.
This example demonstrates how to use FeatureHasher to efficiently encode high-cardinality categorical features into a numerical format suitable for machine learning models. FeatureHasher is particularly useful for text processing tasks or datasets with a large number of categorical variables.
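Because FeatureHasher is stateless, it drops straight into a scikit-learn pipeline ahead of an estimator. A minimal sketch with invented high-cardinality records and labels (the field names and data here are illustrative, not from the original example):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical records with high-cardinality categorical fields
X_raw = [
    {"user_id": "u1", "city": "Paris"},
    {"user_id": "u2", "city": "Tokyo"},
    {"user_id": "u3", "city": "Paris"},
    {"user_id": "u4", "city": "Lima"},
]
y = [0, 1, 0, 1]

# String-valued dict entries are hashed as "key=value" features
pipe = make_pipeline(
    FeatureHasher(n_features=2**8, input_type="dict"),
    LogisticRegression(),
)
pipe.fit(X_raw, y)
preds = pipe.predict(X_raw)
print(preds)
```

A power-of-two n_features is a common convention; the value trades memory against the risk of hash collisions, so it is typically set well above the expected number of distinct categories.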