
Scikit-Learn LatentDirichletAllocation Model

Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm for discovering the underlying topics in a collection of documents. It is a generative probabilistic model that assumes each document is a mixture of topics and each topic is a mixture of words.

Key hyperparameters of LatentDirichletAllocation include n_components (number of topics), learning_method (method for updating the model), and max_iter (maximum number of iterations).

The algorithm is suitable for unsupervised learning tasks, particularly topic modeling in text data.
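As a quick sketch of how these hyperparameters fit together (the values below are illustrative placeholders, not tuned recommendations), the model might be configured like this:

from sklearn.decomposition import LatentDirichletAllocation

# Illustrative configuration; tune these values for your own corpus
lda = LatentDirichletAllocation(
    n_components=10,           # number of topics to extract
    learning_method='online',  # update the model incrementally on mini-batches
    max_iter=20,               # maximum number of passes over the training data
    random_state=1,            # fix the seed for reproducible topics
)

The worked example below sticks with the default batch learning method and 5 topics.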

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load dataset
data = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = data.data[:200]  # Limit to 200 documents for simplicity

# Prepare the count vectorizer
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = vectorizer.fit_transform(documents)

# Create the LDA model
lda = LatentDirichletAllocation(n_components=5, random_state=1)

# Fit the LDA model
lda.fit(X)

# Display the top words in each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort is ascending, so take the last no_top_words indices in reverse
        print(" ".join(feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]))

no_top_words = 10
display_topics(lda, vectorizer.get_feature_names_out(), no_top_words)

# Transform the document-term matrix into per-document topic distributions
X_topics = lda.transform(X)

# Example transformation of a single document
doc_idx = 0
doc_topic_dist = X_topics[doc_idx]
print(f"Document {doc_idx} topic distribution: {doc_topic_dist}")

Running the example gives an output like:

Topic 0:
just like don know way want say good chip car
Topic 1:
think don like people year just ll good plane need
Topic 2:
key win god know does people good time believe creation
Topic 3:
edu graphics mail send 128 3d com objects file format
Topic 4:
gm israel power drive hard israeli cache know really program
Document 0 topic distribution: [0.00487303 0.0050445  0.00482743 0.00480503 0.98045001]

The steps are as follows:

  1. First, a subset of the 20 Newsgroups dataset is loaded and limited to 200 documents for simplicity. This dataset is a common benchmark in text mining and information retrieval research.

  2. The CountVectorizer converts the text into a document-term matrix, ignoring common English stop words, words that appear in more than 95% of documents (max_df=0.95), and words that appear in fewer than 2 documents (min_df=2).

  3. A LatentDirichletAllocation model is instantiated with 5 topics and fit to the document-term matrix using the fit() method.

  4. A helper function, display_topics(), is defined to print the highest-weighted words in each topic discovered by the LDA model.

  5. The original document-term matrix is transformed into a topic distribution matrix using the fitted LDA model, with one row per document and one column per topic.

  6. The topic distribution for a sample document is printed to show the result of the transformation, illustrating how LDA uncovers hidden thematic structure in text; the sketch after this list extends this step.
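
As a follow-up to steps 5 and 6, here is a minimal sketch (reusing the lda, vectorizer, and X_topics objects from the example; the sample sentence is invented for illustration) that extracts each document's dominant topic and infers the topic distribution of an unseen document:

# The dominant topic of each document is the index of the largest
# value in its row of the topic distribution matrix
dominant_topics = X_topics.argmax(axis=1)
print(f"Document 0 dominant topic: {dominant_topics[0]}")

# An unseen document can be scored with the same fitted pipeline:
# vectorize it first, then project it onto the learned topics
new_doc = ["The program renders 3d graphics objects to an image file."]
new_X = vectorizer.transform(new_doc)
print(f"New document topic distribution: {lda.transform(new_X)[0]}")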

This example shows how to set up and use LatentDirichletAllocation for topic modeling in scikit-learn, providing an efficient way to analyze and interpret large collections of text data.
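
When choosing n_components, the fitted model also exposes score() (approximate log-likelihood) and perplexity() (lower is better), which can be compared across candidate models. A brief sketch, assuming X is the document-term matrix from the example:

# Compare models with different numbers of topics
for k in (5, 10, 20):
    lda_k = LatentDirichletAllocation(n_components=k, random_state=1).fit(X)
    print(f"k={k}: perplexity={lda_k.perplexity(X):.1f}")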


