Scikit-Learn TfidfVectorizer for Feature Extraction

Transforming text data into a numerical representation is a necessary step before applying machine learning models to text.

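The TfidfVectorizer in scikit-learn converts a collection of raw documents into a matrix of TF-IDF features. Each term is weighted by how often it occurs in a document, discounted by how many documents in the collection contain it, so terms that appear everywhere carry less weight than distinctive ones.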
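
To make the weighting concrete, the sketch below recomputes TF-IDF by hand and compares it with the vectorizer. It assumes the default settings (raw counts for term frequency, smoothed IDF, L2 row normalization); the three-sentence corpus and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus (hypothetical, for illustration only)
docs = ["the cat sat", "the dog sat", "the cat ran"]

# raw term counts: rows are documents, columns are terms
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
# document frequency: number of documents containing each term
df = (counts > 0).sum(axis=0)

# smoothed inverse document frequency (the smooth_idf=True default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# weight the counts by idf, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# the same corpus through TfidfVectorizer with default settings
auto = TfidfVectorizer().fit_transform(docs).toarray()

# check that the hand-computed matrix matches the vectorizer's
print(np.allclose(manual, auto))
```

A term such as "the", which occurs in every document, gets the smallest possible IDF of 1, while a term found in a single document gets a larger one.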

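The key hyperparameters of TfidfVectorizer include max_features (the maximum vocabulary size, keeping the terms with the highest counts across the corpus), ngram_range (the lower and upper bounds on n for the n-grams to extract), and stop_words (common words to drop, such as the built-in 'english' list).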
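
As a rough illustration of how these settings change the learned vocabulary, the sketch below fits three vectorizers on the same two sentences. The variable names are hypothetical and the corpus is deliberately tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Machine learning is fascinating.",
    "Machine learning models require data.",
]

# defaults: unigrams only, no stop-word removal, no vocabulary cap
default_vec = TfidfVectorizer().fit(docs)
print(sorted(default_vec.vocabulary_))

# drop English stop words (here "is") and add bigrams
bigram_vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit(docs)
print(sorted(bigram_vec.vocabulary_))

# cap the vocabulary at the 3 terms with the highest corpus-wide counts
capped_vec = TfidfVectorizer(
    max_features=3, ngram_range=(1, 2), stop_words="english"
).fit(docs)
print(sorted(capped_vec.vocabulary_))
```

In this corpus only "learning", "machine", and "machine learning" occur twice, so they are the three terms that survive the cap. When many terms share the same count, which of them are kept is less predictable.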

TfidfVectorizer is appropriate for text classification, clustering, and retrieval tasks.

from sklearn.feature_extraction.text import TfidfVectorizer

# sample text documents
docs = [
    "Machine learning is fascinating.",
    "The field of machine learning is evolving quickly.",
    "Artificial intelligence and machine learning are closely related.",
    "I love learning about new advancements in machine learning.",
    "Machine learning models require data."
]

# create the tfidf vectorizer
vectorizer = TfidfVectorizer(max_features=10, ngram_range=(1, 2), stop_words='english')

# fit and transform the documents
X = vectorizer.fit_transform(docs)

# print the tf-idf features
print(X.toarray())

# new sample document
new_doc = ["Machine learning is a branch of artificial intelligence."]

# transform the new document
new_X = vectorizer.transform(new_doc)

# print the tf-idf features of the new document
print(new_X.toarray())

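Running the example prints a 5-by-10 matrix for the sample documents (one row per document, one column per feature), followed by a single row for the new document: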

[[0.         0.57735027 0.57735027 0.57735027 0.         0.
  0.         0.         0.         0.        ]
 [0.         0.36750369 0.36750369 0.36750369 0.         0.
  0.         0.         0.77124776 0.        ]
 [0.         0.36750369 0.36750369 0.36750369 0.         0.
  0.         0.         0.         0.77124776]
 [0.47878445 0.45628671 0.22814336 0.22814336 0.         0.
  0.47878445 0.47878445 0.         0.        ]
 [0.         0.29100835 0.29100835 0.29100835 0.61071369 0.61071369
  0.         0.         0.         0.        ]]
[[0.         0.57735027 0.57735027 0.57735027 0.         0.
  0.         0.         0.         0.        ]]

First, a small set of sample text documents is created.
A TfidfVectorizer is instantiated with max_features=10, ngram_range=(1, 2) (unigrams and bigrams), and stop_words='english' to control the feature extraction process.
The vectorizer is fit on the sample documents and transforms them into a matrix of TF-IDF features in a single fit_transform call.
The resulting sparse matrix is converted to a dense array with toarray() and printed.
A new sample document is then transformed with the same fitted vectorizer, which reuses the learned vocabulary and IDF weights; terms that were not seen during fitting are ignored.

This example demonstrates the use of TfidfVectorizer to convert text documents into TF-IDF features, which can then be used in machine learning applications such as text classification and clustering.
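
As one possible way to use the features for classification, the sketch below chains the vectorizer with a classifier in a pipeline. The choice of LogisticRegression, the four training sentences, and the labels are all assumptions made for illustration, not part of the example above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny made-up dataset; labels: 1 = about machine learning, 0 = about cooking
texts = [
    "Machine learning models require data.",
    "Neural networks learn from training data.",
    "Bake the bread in a hot oven.",
    "Simmer the soup and season with salt.",
]
labels = [1, 1, 0, 0]

# the pipeline fits the vectorizer and the classifier together, so raw
# strings can be passed directly to fit() and predict()
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
model.fit(texts, labels)

prediction = model.predict(["Training data for learning models"])
print(prediction)
```

Wrapping both steps in one pipeline keeps the vocabulary and IDF weights tied to the training data, which avoids accidentally refitting the vectorizer on new text.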

See Also