NLP - Topic Modeling


Overview of Topic Modeling
When dealing with a large dataset, topic modeling helps us analyze it efficiently by clustering the documents. Large amounts of text data arrive mostly unlabeled, and because of the enormous quantity, applying supervised learning approaches doesn't help much: with so many possible categories and no labels, it would be practically impossible to build supervised models for the data.
Instead, we can try to discover labels from unlabeled documents. In the case of text data, that means discovering clusters of documents that are grouped together by topic.
But we don't know the correct topic labels in advance, do we? With that in mind, let's get started with Latent Dirichlet Allocation.
Latent Dirichlet Allocation Overview
Latent Dirichlet Allocation is based on the Dirichlet distribution (named after Johann Peter Gustav Lejeune Dirichlet, a 19th-century German mathematician). LDA was published in 2003 as a graphical model for topic discovery. Keep in mind that Dirichlet did not invent LDA; rather, LDA makes use of the Dirichlet distribution. You should read the original Latent Dirichlet Allocation paper.
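If you're curious what samples from a Dirichlet distribution actually look like, here is a quick NumPy aside (not part of the notebook we build later): each draw is a vector of non-negative numbers that sums to 1, which is exactly the shape of a topic mixture for a document.
import numpy as np
np.random.seed(42)
samples = np.random.dirichlet(alpha=[0.5, 0.5, 0.5], size=3)
print(samples)               # three possible topic mixes over 3 topics
print(samples.sum(axis=1))   # every row sums to 1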
Assumptions of LDA for Topic Modeling:
Documents with similar topics use similar words.
Hidden/Latent topics can be found by looking for groups of words that often appear together in documents across the collection.
Some things you have to understand:
Each document is itself a probability distribution over latent topics
Topics themselves are modeled as probability distributions over words. For example, suppose we have defined topic 1. From its probability distribution, we can see that the words ‘he’ and ‘food’ have a low probability of belonging to topic 1, whereas ‘cat’ and ‘dog’ have a high probability of belonging to it. As users, we can guess that topic 1 represents pets.
LDA, being unsupervised, can’t tell you this directly; it’s up to the user to interpret these probability distributions as topics.
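As a toy illustration (the words and numbers below are invented, not output from any trained model), you can picture a topic as a mapping from words to probabilities:
# A made-up 'topic 1': pet-related words carry most of the probability mass,
# while words like 'he' and 'food' carry very little
topic_1 = {'cat': 0.30, 'dog': 0.28, 'vet': 0.18, 'food': 0.03, 'he': 0.01}
top_words = sorted(topic_1, key=topic_1.get, reverse=True)[:3]
print(top_words)  # ['cat', 'dog', 'vet'] -> a human might label this topic 'pets'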
LDA presents documents as a mixture of topics containing words with certain probability. It assumes that we produce documents in the following way:
First, we decide how many words, N, the document will have.
Next, we choose a mix of topics for the document using a Dirichlet distribution over a set number of K topics. For example, we might say this document is 60% business, 30% politics, and 10% food. That's our topic mix.
Then, we generate each word in the document by picking a topic based on the previously chosen mix: we select 60% of the words from the business topic, 30% from politics, and 10% from food.
Finally, we use the topic to choose the word itself. For instance, if we pick the food topic, we might generate the word "Apple" with a 60% chance and another word like "Home" with a 30% chance, and so on.
Assuming this type of model for a group of documents, LDA will try to work backward from the documents to identify the set of topics that probably created the collection.
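Here is a minimal sketch of that generative story in Python. The vocabulary, topic names, and word choices are made up purely for illustration; real LDA works over far larger vocabularies.
import numpy as np
np.random.seed(0)
# A made-up three-topic vocabulary to act out the steps above
vocab = {'business': ['market', 'profit', 'tax'],
         'politics': ['president', 'vote', 'law'],
         'food':     ['apple', 'recipe', 'home']}
topics = list(vocab)
N = 10                                                # 1. decide how many words the document will have
topic_mix = np.random.dirichlet([1.0] * len(topics))  # 2. draw the document's topic mixture
doc = []
for _ in range(N):
    t = np.random.choice(topics, p=topic_mix)         # 3. pick a topic according to the mixture
    doc.append(np.random.choice(vocab[t]))            # 4. pick a word from that topic (uniform here, for simplicity)
print(topic_mix.round(2), doc)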
LDA assumes that's how documents are created. In reality, of course, we don't write documents this way, but it's a helpful way to understand how topics can be mixed in different documents and how words can be mixed in different topics. Now let's think the way LDA thinks and walk through the backtracking process:
First, decide on a fixed number of K topics to find.
To use LDA, you need to decide how many topics you want to discover. Before starting LDA, you should have an idea of how many topics there might be. So, choose a fixed number K of topics to find.
Then, use LDA to learn the topic representation for each document and the words linked to each topic.
Next, go through each document and randomly assign each word to one of the K topics.
Remember, this first random assignment already gives you both the topic representations for all the documents and the word distributions for all the topics, even though it's random at the start.
But these initial random assignments won’t make any sense, because yes, they are assigned randomly.
Now we need to iterate over them to see how to fix these assignments.
Iterate over every word in every document to improve the assignments:
for every word in every doc, and for each topic t, we calculate:
p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t.
we are also going to calculate:
for every word in every document and for each topic t:
p(word w| topic t) = the proportion of assignments to topic t over all documents that come from this particular word w.
reassign w to a new topic, chosen with probability proportional to:
p(topic t| document d) * p(word w| topic t)
This is the probability that topic t generated word w.
If we repeat these steps a large number of times, we will eventually reach a state where the assignments are no longer random but acceptable.
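To make that loop concrete, here is a rough sketch of the reassignment process in plain NumPy, a simplified Gibbs-sampling-style update over a toy corpus of word ids. Scikit-learn's LatentDirichletAllocation actually uses a variational method under the hood, so treat this only as an illustration of the idea, not as the real implementation.
import numpy as np

def toy_lda(docs, K, n_iters=50, seed=0):
    # docs: list of lists of word ids; K: number of topics
    rng = np.random.default_rng(seed)
    V = max(w for d in docs for w in d) + 1
    # Step 1: randomly assign every word in every document to one of the K topics
    z = [rng.integers(K, size=len(d)) for d in docs]
    doc_topic = np.zeros((len(docs), K))   # how many words in doc d are assigned to topic t
    topic_word = np.zeros((K, V))          # how often word w is assigned to topic t
    for d, words in enumerate(docs):
        for w, t in zip(words, z[d]):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
    # Step 2: repeatedly reassign each word using p(topic t | document d) * p(word w | topic t)
    for _ in range(n_iters):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                t_old = z[d][i]
                doc_topic[d, t_old] -= 1               # drop the word's current assignment
                topic_word[t_old, w] -= 1
                p = (doc_topic[d] + 0.1) * (topic_word[:, w] + 0.1) / (topic_word.sum(axis=1) + 0.1 * V)
                p = p / p.sum()
                t_new = rng.choice(K, p=p)             # reassign w to a new topic
                z[d][i] = t_new
                doc_topic[d, t_new] += 1
                topic_word[t_new, w] += 1
    return doc_topic, topic_word

# Toy usage: four tiny 'documents' over a vocabulary of 6 word ids, looking for 2 topics
docs = [[0, 1, 0, 2], [1, 0, 1, 2], [3, 4, 5, 4], [4, 5, 3, 5]]
doc_topic, topic_word = toy_lda(docs, K=2)
print(doc_topic)   # each row shows how many of that document's words ended up in each topic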
After going through all the documents, LDA might say something like, "I think this document belongs to topic 5." But what is topic 5? We don't know, and LDA doesn't know either. What we can do is ask LDA for the most common words in topic 5 or which words are most likely to appear in topic 5. Suppose you get a list like 'cat, vet, bird, food, home'. Based on this list, it's up to you to guess what the topic might be. For this list, it's reasonable to think topic 5 is about pets. Is this the correct answer? We can't be sure because we didn't have a correct answer to start with. It's all about assumptions and probabilities.
Two important things to remember:
The user must decide on the number of topics present in the documents
The user must interpret what the topics are
Latent Dirichlet Allocation in Python - Part 1
Let’s perform Latent Dirichlet Allocation - LDA
[ Get your data set and resources from here ]
import pandas as pd
npr = pd.read_csv('npr.csv')
npr.head()
Right now we have articles only. We don't have any labels here. We can check an article by grabbing the column name and passing the index.
# Printing article
npr['Article'][0]
# Checking the number of articles
len(npr)
There are 11,992 articles here.
We don't know yet which article belongs to which topic, so we'll do the assignment process discussed earlier. But before that, we need to do a little bit of data preprocessing.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df = 0.95, min_df = 2, stop_words='english' )
# max_df = 0.95 discards words that show up in more than 95% of the documents (very frequent/common words). Range (0-1)
# min_df = minimum document frequency. You can pass an int too.
# min_df=2 means a word must show up in at least 2 documents, otherwise it is discarded
# fit_transform the data set, because this is unsupervised learning
dtm = cv.fit_transform(npr['Article'])
# dtm = document term matrix
dtm
# Output: <Compressed Sparse Row sparse matrix of dtype 'int64'
# with 3033388 stored elements and shape (11992, 54777)>
# 54777 terms
# LDA is built-in in sklearn, import it
from sklearn.decomposition import LatentDirichletAllocation
After assigning it to a variable, you'll see it has many parameters, but make sure to set the number of components and the random state. As described earlier in how LDA assumes a document is created, we first assign topics randomly and then iterate over the assignments again and again.
There is no right or wrong number of components. It depends on your experience and how well you know the dataset. For us, let's say we want 7 general topics returned.
LDA = LatentDirichletAllocation(n_components=7, random_state = 42)
# n_components -> how many topics we want
# fit LDA into dtm
LDA.fit(dtm) # This should take a while
We are working with a large number of documents. LDA is an iterative process, so it will continue updating the weights for each word and topic repeatedly until they stabilize.
Latent Dirichlet Allocation in Python - Part 2
Now we have 3 steps left:
Grab the vocabulary of words
get_feature_names_out() holds an instance of every single word. If you print the len, you will get 54777.
len(cv.get_feature_names_out())
type(cv.get_feature_names_out()) # An array of all the words in the npr articles
cv.get_feature_names_out()[43000] # You can grab a word by index
Grab the topics
len(LDA.components_) # It will return the number of topics
type(LDA.components_) # This is actually a NumPy array
Grab the highest probability words per topic
LDA.components_.shape # It's an array of 7 topics by 54777 words # output: (7, 54777)
LDA.components_ # checking the actual components
Now we will use this information, along with our ability to access the vocabulary (cv.get_feature_names_out()), to get the highest probability words per topic.
single_topic = LDA.components_[0] # The very first topic
single_topic.argsort() # argsort() returns the indices that would sort this array
Let me make it clear with an example
import numpy as np
arr = np.array([10,200,2])
arr.argsort() # Output: array([2, 0, 1])
# This is the index position that will sort this array (lowest -> highest)
# ARGSORT --> INDEX POSITIONS SORTED FROM LEAST--> GREATEST
# WE'RE LOOKING FOR TOP 10 VALUES (10 GREATEST VALUES)
# LAST 10 VALUES FROM THE LIST ARGSORT
single_topic.argsort()[-10:] #GRAB THE LAST 10 VALUES OF .ARGSORT() RETURNS
top_ten_words = single_topic.argsort()[-10:]
for i in top_ten_words:
    print(cv.get_feature_names_out()[i])
Looking at these words, it feels like the topic has something to do with government, new rules, the military, and healthcare services. So maybe it's about a country's affairs. Let's clarify by identifying more words.
top_twenty_words = single_topic.argsort()[-20:]
for i in top_twenty_words:
    print(cv.get_feature_names_out()[i])
I can see words like president, state, tax, insurance, federal, government, and percent. Now it's becoming clearer that the article might be about health insurance and politics.
LDA is now revealing an underlying topic. It's indicating that these specific words are likely to appear in this particular single_topic. Let's set up a simple loop to print out the top 15 words for each of the seven topics.
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')
Code explanation: for each topic in LDA.components_ (which is an array of 7 topics by 54777 words), we grab the indices of the top 15 words with argsort(), then use a list comprehension to map those index positions back to actual words via get_feature_names_out().
Notice we only printed seven topics, zero through six. You decide how many topics to have beforehand. If topics aren't clear or seem too similar, choose fewer topics. If you want more clarity, ask for more topics from the start.
So, the final step is to attach these topic numbers to the original articles. It's quite simple.
We need to create a new column that holds each article's topic number.
# GETTING THE TOPIC RESULTS FOR EACH ARTICLE FROM THE dtm (document term matrix)
topic_results = LDA.transform(dtm)
topic_results.shape
npr['Topic'] = topic_results.argmax(axis=1)
npr
If you want to see what topic 5 represents, scroll up and check its top words. It makes sense, then, that this story about YouTube stars 2026, where Adele is singing, ended up under that topic.
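One optional finishing touch: since LDA only hands back topic numbers, you can map them to human-readable labels you pick yourself after reading each topic's top words. The label strings below are purely hypothetical examples, not something LDA produces.
# Hypothetical labels based on our own reading of the top 15 words per topic
topic_labels = {0: 'health/politics', 1: 'military/security', 2: 'family/lifestyle',
                3: 'politics/election', 4: 'government', 5: 'music/arts', 6: 'education'}
npr['Topic Label'] = npr['Topic'].map(topic_labels)
npr.head()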
Non-negative Matrix Factorization Overview
Non-negative matrix factorization (NMF) is an unsupervised algorithm that simultaneously performs dimensionality reduction and clustering. We use it in conjunction with TF-IDF to model topics across documents.
We are going to approximate each object by a linear combination of k reduced dimensions, or ‘basis vectors’, in W. Each basis vector will act as a cluster.
We provide as input a non-negative data matrix (A), which is going to be our TF-IDF matrix; the number of basis vectors (k); and initial values for the factors W and H (random matrices).
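To make the shapes concrete, here is a small NumPy sketch of the factorization A ≈ W x H, with sizes invented purely for illustration:
import numpy as np
n_terms, n_docs, k = 500, 100, 7             # assumed sizes, just for this sketch
A = np.random.rand(n_terms, n_docs)          # non-negative term-document matrix (stand-in for TF-IDF)
W = np.random.rand(n_terms, k)               # basis vectors: one column per topic (cluster)
H = np.random.rand(k, n_docs)                # coefficients: each document's membership weight per topic
print((W @ H).shape)                         # (500, 100) -- same shape as A; NMF updates W and H so that W @ H ≈ A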
Summary:
Build a vector space model for documents (after removing stop words), creating a term-document matrix A
Use TF-IDF to adjust term weights in A
Normalize TF-IDF vectors to have unit length
Use NNDSVD to initialize factors on A
Use Projected Gradient NMF on A
Basis vectors: the topics (clusters) in the data
Coefficient matrix: the membership weights for documents relative to each topic (cluster)
Similar to LDA, we need to choose the number of expected topics and interpret the topics based on the coefficient values of the words for each topic. Keep in mind that these matrix coefficient values are not probabilities and can't be interpreted as easily as in LDA. However, we'll still be able to understand them.
However, it's not a probability value like what LDA gives us; it's simply a coefficient value that the non-negative matrix factorization provides. Fortunately, thanks to scikit-learn's consistent syntax, replacing LDA with NMF is quite straightforward: we will just adapt the previous LDA workflow to NMF.
Non-negative Matrix Factorization with Python
Check out the GitHub repo for this model. It's easy: if you understand LDA, NMF shouldn't be hard for you.
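The repo itself isn't reproduced here, but a minimal sketch of the swap on the same npr data would look roughly like this (assuming we use TfidfVectorizer in place of CountVectorizer, with the same max_df/min_df choices as before; those are not the only reasonable settings):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = tfidf.fit_transform(npr['Article'])

nmf_model = NMF(n_components=7, random_state=42)
nmf_model.fit(dtm)

# Same interpretation loop as with LDA, but the values are coefficients, not probabilities
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

topic_results = nmf_model.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)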