TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents.
The TfidfTransformer in scikit-learn transforms a count matrix (bag-of-words) into a normalized TF-IDF representation.
Key hyperparameters of TfidfTransformer include norm, which specifies the normalization method (default is 'l2'), and use_idf, which indicates whether to enable inverse-document-frequency reweighting (default is True).
The resulting TF-IDF features are well suited to text classification and clustering problems.
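To make the effect of norm and use_idf concrete, the following is a minimal sketch on a made-up count matrix (the values in counts are illustrative assumptions, not data from this article). It compares plain L1-normalized term frequencies with the default IDF-weighted, L2-normalized output.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 2 documents x 3 terms (values chosen for illustration)
counts = np.array([[3, 0, 1],
                   [2, 1, 0]])

# use_idf=False with norm='l1': each row becomes term frequencies that sum to 1
tf_l1 = TfidfTransformer(norm="l1", use_idf=False).fit_transform(counts).toarray()

# Default settings: IDF reweighting, then L2 normalization (each row has unit length)
tfidf_l2 = TfidfTransformer().fit_transform(counts).toarray()

print(tf_l1)
print(np.linalg.norm(tfidf_l2, axis=1))
```

With use_idf=False the transformer only rescales the counts, so the first row becomes 0.75, 0 and 0.25. With the defaults, terms that appear in fewer documents receive more weight before each row is scaled to unit length.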
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Sample text data
text_data = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are friends"]
# Convert text data to a matrix of token counts
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(text_data)
# Initialize the TfidfTransformer
tfidf_transformer = TfidfTransformer()
# Fit and transform the count matrix to a TF-IDF representation
X_tfidf = tfidf_transformer.fit_transform(X_counts)
# Show the TF-IDF values
print(X_tfidf.toarray())
# Example of transforming new text data
new_text = ["the cat and the dog"]
new_counts = count_vect.transform(new_text)
new_tfidf = tfidf_transformer.transform(new_counts)
print(new_tfidf.toarray())
Running the example gives an output like:
[[0. 0. 0.42755362 0. 0. 0.
0. 0. 0.42755362 0.32516555 0.32516555 0.6503311 ]
[0. 0. 0. 0. 0.42755362 0.
0. 0.42755362 0. 0.32516555 0.32516555 0.6503311 ]
[0.4472136 0.4472136 0. 0.4472136 0. 0.4472136
0.4472136 0. 0. 0. 0. 0. ]]
[[0.43381609 0. 0.43381609 0. 0.43381609 0.
0. 0. 0. 0. 0. 0.65985664]]
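To see where these numbers come from, the sketch below recomputes the first matrix by hand with NumPy, assuming the default TfidfTransformer settings (smoothed IDF, no sublinear term-frequency scaling, L2 normalization). The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text_data = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are friends"]
counts = CountVectorizer().fit_transform(text_data).toarray()

n_docs = counts.shape[0]
# Document frequency: number of documents in which each term appears
df = (counts > 0).sum(axis=0)
# Smoothed IDF, as used when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + df)) + 1
# Weight the raw counts by IDF, then scale every row to unit L2 length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))  # True
```

For instance, "the" occurs in two of the three documents, so its smoothed IDF is ln(4/3) + 1, roughly 1.29. It appears twice in the first document, which is why it ends up with the largest weight in that row (0.6503311).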
The steps are as follows:
1. First, a small set of sample text data is defined. This data is converted into a matrix of token counts using the CountVectorizer from sklearn.feature_extraction.text.
2. The TfidfTransformer is instantiated with default parameters.
3. The count matrix (X_counts) is transformed into a TF-IDF representation (X_tfidf) using the fit_transform() method of the TfidfTransformer.
4. The TF-IDF values are printed to show the transformation results.
5. A new text sample is then transformed using the same fitted CountVectorizer and TfidfTransformer to demonstrate how to apply the transformation to new data.
This example illustrates how to use TfidfTransformer, together with CountVectorizer, to convert raw text data into numerical features for machine learning models as part of text preprocessing.
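As a compact alternative to the two-step workflow above, the counting and reweighting stages can be chained. The sketch below is one possible arrangement, not the only one: it wraps CountVectorizer and TfidfTransformer in a scikit-learn pipeline and checks the result against TfidfVectorizer, which combines both stages in a single estimator. The names two_step and one_step are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

text_data = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs are friends"]
new_text = ["the cat and the dog"]

# Chain counting and TF-IDF reweighting so they are fitted and applied together
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_pipeline = two_step.fit_transform(text_data)

# TfidfVectorizer performs both stages in one estimator
one_step = TfidfVectorizer()
X_vectorizer = one_step.fit_transform(text_data)

# With default settings the two approaches should produce the same matrix
print(np.allclose(X_pipeline.toarray(), X_vectorizer.toarray()))

# New text goes through the already-fitted pipeline in a single call
print(two_step.transform(new_text).toarray())
```

Keeping TfidfTransformer as a separate step remains useful when the count matrix already exists or is needed for other purposes. The single-estimator form is convenient when only the TF-IDF features are required.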