Scikit-Learn TfidfTransformer for Feature Extraction

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.

The TfidfTransformer in scikit-learn is used to transform a count matrix (bag-of-words) into a normalized TF-IDF representation.

Key hyperparameters of TfidfTransformer include norm, which specifies the normalization method (default is 'l2'), and use_idf, which indicates whether to enable inverse-document-frequency reweighting (default is True).
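
As a rough illustration of these two parameters, the sketch below applies three configurations to a small made-up count matrix. The counts array is hypothetical and only serves to make the effect of each setting visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0]])

# Default settings: IDF reweighting followed by L2 row normalization
default_tfidf = TfidfTransformer().fit_transform(counts).toarray()

# norm='l1': each row is scaled so that its absolute values sum to 1
l1_tfidf = TfidfTransformer(norm="l1").fit_transform(counts).toarray()

# use_idf=False: no IDF reweighting, so the result is the L2-normalized counts
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

print(np.linalg.norm(default_tfidf, axis=1))  # Euclidean length of each row
print(l1_tfidf.sum(axis=1))                   # sum of each row
print(tf_only)
```

With the default 'l2' norm every row has unit Euclidean length, which makes the dot product of two rows equal to their cosine similarity.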

The resulting TF-IDF features are well suited to text classification and clustering problems.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Sample text data
text_data = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are friends"]

# Convert text data to a matrix of token counts
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(text_data)

# Initialize the TfidfTransformer
tfidf_transformer = TfidfTransformer()

# Fit and transform the count matrix to a TF-IDF representation
X_tfidf = tfidf_transformer.fit_transform(X_counts)

# Show the TF-IDF values
print(X_tfidf.toarray())

# Example of transforming new text data
new_text = ["the cat and the dog"]
new_counts = count_vect.transform(new_text)
new_tfidf = tfidf_transformer.transform(new_counts)
print(new_tfidf.toarray())

Running the example gives an output like:

[[0.         0.         0.42755362 0.         0.         0.
  0.         0.         0.42755362 0.32516555 0.32516555 0.6503311 ]
 [0.         0.         0.         0.         0.42755362 0.
  0.         0.42755362 0.         0.32516555 0.32516555 0.6503311 ]
 [0.4472136  0.4472136  0.         0.4472136  0.         0.4472136
  0.4472136  0.         0.         0.         0.         0.        ]]
[[0.43381609 0.         0.43381609 0.         0.43381609 0.
  0.         0.         0.         0.         0.         0.65985664]]
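
To see where these numbers come from, the sketch below recomputes the nonzero entries of the first row by hand. It assumes the default settings (smooth_idf=True, sublinear_tf=False, norm='l2'), under which scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The variable names are illustrative only.

```python
import numpy as np

# Token counts for "the cat sat on the mat", restricted to the terms it contains
terms = ["cat", "mat", "on", "sat", "the"]
tf = np.array([1, 1, 1, 1, 2], dtype=float)

# Number of the 3 sample documents that contain each term
df = np.array([1, 1, 2, 2, 2], dtype=float)
n_docs = 3

# Smoothed IDF, as used when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale the row to unit Euclidean length
weights = tf * idf
row = weights / np.linalg.norm(weights)

for term, value in zip(terms, row):
    print(f"{term}: {value:.8f}")
```

The printed values should agree with the nonzero entries in the first row of the output above: terms that appear in only one document (cat, mat) receive a higher weight than terms shared by two documents (on, sat), and "the" scores highest in this row only because it occurs twice.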

The steps are as follows:

First, a small set of sample text data is defined. This data is converted into a matrix of token counts using the CountVectorizer from sklearn.feature_extraction.text.
The TfidfTransformer is instantiated with default parameters.
The count matrix (X_counts) is transformed into a TF-IDF representation (X_tfidf) using the fit_transform() method of the TfidfTransformer.
The TF-IDF values are printed to show the transformation results.
A new text sample is then transformed using the same fitted CountVectorizer and TfidfTransformer to demonstrate how to apply the transformation to new data.

This example illustrates how to use TfidfTransformer, together with CountVectorizer, to convert raw text into TF-IDF weighted numerical features that machine learning models can consume.
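
As one possible way to feed these features into a model, the sketch below chains CountVectorizer, TfidfTransformer, and a classifier in a Pipeline. The training texts, labels, step names, and the choice of LogisticRegression are all assumptions made for illustration; any scikit-learn estimator could take the final step.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 1 = about pets, label 0 = about weather
train_texts = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "rain and wind are expected today",
    "the weather is sunny and warm",
]
train_labels = [1, 1, 0, 0]

# Each step is fitted on the training texts only, then reused for new text
clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", LogisticRegression()),
])
clf.fit(train_texts, train_labels)

predictions = clf.predict(["the dog sat on the mat", "sunny weather today"])
print(predictions)
```

Wrapping the steps in a Pipeline keeps the vocabulary and IDF weights tied to the training data, so new text is always transformed with the same fitted objects.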
