Scikit-Learn HashingVectorizer for Feature Extraction

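HashingVectorizer converts a collection of text documents into a sparse matrix of token occurrences, which is the numerical input most machine learning algorithms require. Instead of building a vocabulary, it applies a hash function to each token to choose a column index (the hashing trick), so the output always has a fixed number of columns and the vectorizer stores no state. The trade-off is that distinct tokens can collide in the same column, and the mapping cannot be inverted to recover the original tokens.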
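As a minimal sketch of this behaviour, the snippet below transforms a couple of toy sentences (invented for illustration, as is the small n_features=16) without any fitting step, and shows that unseen text maps into the same fixed number of columns.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, invented for illustration
docs = [
    "the rocket launched into orbit",
    "the debate about belief continued",
]

# No fit step and no stored vocabulary: each token is hashed to a column index
vectorizer = HashingVectorizer(n_features=16)
X_small = vectorizer.transform(docs)

print(X_small.shape)                       # (2, 16)
print(hasattr(vectorizer, "vocabulary_"))  # False

# Unseen words need no refitting; they hash into the same 16 columns
X_new = vectorizer.transform(["a completely unseen sentence"])
print(X_new.shape)                         # (1, 16)
```

With only 16 columns, collisions between tokens are very likely, which is why real uses keep n_features much larger.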

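The key hyperparameters of HashingVectorizer include n_features (the number of columns in the output matrix, 2**20 by default), alternate_sign (whether an alternating sign is applied to hashed values so that inner products are approximately preserved even when collisions occur, True by default), and norm (the norm used to normalize each document vector: 'l1', 'l2', or None, with 'l2' as the default).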
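The effect of these hyperparameters can be checked on a tiny example. The sketch below uses two made-up documents: first the defaults, then alternate_sign=False with norm=None, which yields raw non-negative counts.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, invented for illustration
docs = ["space space mission", "mission control to space station"]

# Defaults: n_features=2**20, alternate_sign=True, norm='l2'
default_vec = HashingVectorizer()
X_default = default_vec.transform(docs)
print(X_default.shape)  # (2, 1048576)

# With the default l2 norm, every non-empty row has unit length
row_norms = np.sqrt(np.asarray(X_default.multiply(X_default).sum(axis=1)))
print(np.allclose(row_norms, 1.0))  # True

# alternate_sign=False with norm=None gives raw non-negative counts
count_vec = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
X_counts = count_vec.transform(docs)
print(X_counts.min() >= 0)  # True
print(X_counts[0].sum())    # 3.0 (three tokens in the first document)
```

Non-negative output matters for estimators that reject negative feature values, such as MultinomialNB.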

It is well suited to text classification and clustering tasks where memory efficiency is crucial, such as very large or streaming corpora, because no vocabulary has to be held in memory.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
data = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
X, y = data.data, data.target

# Vectorize text data (no fit step is needed: HashingVectorizer is stateless)
vectorizer = HashingVectorizer(n_features=10000)
X_vectorized = vectorizer.transform(X)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=1)

# Create and fit model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# Make a prediction
sample_text = ["NASA's new space mission is the most exciting news this year."]
sample_vectorized = vectorizer.transform(sample_text)
predicted_class = model.predict(sample_vectorized)
print('Predicted class: %d' % predicted_class[0])

Running the example gives an output like:

Accuracy: 0.967
Predicted class: 1

The steps are as follows:

Load sample text data from the fetch_20newsgroups dataset, selecting two categories for binary classification. This dataset provides a variety of newsgroup documents, making it suitable for demonstrating text classification.
Transform the text data using HashingVectorizer with a specified number of features (n_features=10000). Because the vectorizer is stateless, transform() can be called directly without fitting. This step converts the raw text into a sparse numerical matrix that machine learning algorithms can use.
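Split the vectorized data into training and testing sets using train_test_split(). This allows us to train the model on one portion of the data and test its performance on another. Vectorizing before the split leaks no information from the test set here, because HashingVectorizer learns nothing from the data.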
Instantiate and fit a LogisticRegression model on the training data. The LogisticRegression model is used here due to its effectiveness in binary classification tasks.
Evaluate the model’s accuracy on the test data. This involves making predictions on the test set and comparing them to the actual labels to compute an accuracy score.
Make a prediction on a new sample text using the same vectorizer and the fitted model. This demonstrates how the model can be used to predict the class of unseen text data.
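The vectorizing and modelling steps above can also be bundled so that raw text goes in and predictions come out. The following is only a sketch of that idea using make_pipeline on a toy corpus invented for illustration; it is not part of the original example, and the labels are arbitrary.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 0 = religion-like, 1 = space-like
texts = [
    "arguments about belief and religion",
    "the shuttle reached a stable orbit",
    "a long debate on faith and reason",
    "nasa launched a new satellite",
]
labels = [0, 1, 0, 1]

# The pipeline hashes the text and then fits the classifier in one call
text_clf = make_pipeline(
    HashingVectorizer(n_features=10000),
    LogisticRegression(max_iter=1000),
)
text_clf.fit(texts, labels)

# Raw strings can be passed directly at prediction time
pred = text_clf.predict(["a new orbit for the satellite"])
print(pred.shape)  # (1,)
```

Keeping both steps in one object means the same n_features setting is always used at training and prediction time.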

This example demonstrates how to use HashingVectorizer for text feature extraction in a scikit-learn classification task. Its memory efficiency comes from storing no vocabulary, which matters most on corpora far larger than the one used here.
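One setting where the lack of a vocabulary pays off is out-of-core learning, where documents arrive in mini-batches and never sit in memory all at once. The sketch below assumes a hypothetical iter_batches() generator standing in for a real data stream, and swaps in SGDClassifier because it supports partial_fit; neither is part of the example above.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches():
    """Hypothetical stand-in for a stream of (texts, labels) mini-batches."""
    yield ["the shuttle reached orbit", "god and belief debated"], [1, 0]
    yield ["nasa launched a new satellite", "atheism and religion argued"], [1, 0]
    yield ["the rocket left the launch pad", "faith versus reason again"], [1, 0]


vectorizer = HashingVectorizer(n_features=2**12)
clf = SGDClassifier(random_state=0)

# partial_fit needs the full set of classes up front, since any single
# batch may not contain all of them
all_classes = [0, 1]
for texts, labels in iter_batches():
    X_batch = vectorizer.transform(texts)  # same columns for every batch
    clf.partial_fit(X_batch, labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["a satellite in orbit"]))
print(pred.shape)  # (1,)
```

Because every batch is hashed into the same fixed set of columns, the classifier's weights stay aligned across batches without any shared vocabulary.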
