SKLearner Home | About | Contact | Examples

Scikit-Learn CountVectorizer for Feature Extraction

CountVectorizer converts a collection of text documents into a matrix of token counts. It’s often used as a preprocessing step in text classification or clustering.

Key hyperparameters include max_features (maximum number of unique words to consider), ngram_range (range of n-values for different n-grams), and stop_words (words to be removed).

The algorithm is appropriate for text classification and clustering problems.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# sample text data
texts = ["I love machine learning", "Machine learning is great", "Python is my favorite language", "I love coding in Python"]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(texts)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(texts)

# summarize encoded vector
print(vector.toarray())

Running the example gives an output like:

{'love': 7, 'machine': 8, 'learning': 6, 'is': 4, 'great': 2, 'python': 10, 'my': 9, 'favorite': 1, 'language': 5, 'coding': 0, 'in': 3}
[[0 0 0 0 0 0 1 1 1 0 0]
 [0 0 1 0 1 0 1 0 1 0 0]
 [0 1 0 0 1 1 0 0 0 1 1]
 [1 0 0 1 0 0 0 1 0 0 1]]

The steps are as follows:

  1. Create a sample list of text documents, which includes simple phrases related to machine learning and programming.

  2. Initialize a CountVectorizer instance with default parameters. This object will be used to tokenize the text and build the vocabulary.

  3. Fit the CountVectorizer on the text documents using the fit() method. This step tokenizes the text and builds the vocabulary.

  4. Display the vocabulary built by the CountVectorizer using vocabulary_ attribute. This shows the unique words identified and their corresponding indices.

  5. Transform the text documents into a matrix of token counts using the transform() method. This converts the text into a numerical form that machine learning algorithms can process.

  6. Print the resulting token count matrix with toarray() method to see the numerical representation of the text documents.

This example demonstrates the basic usage of CountVectorizer to convert text data into numerical form, which is essential for further text processing tasks such as classification or clustering.



See Also