Natural Language Processing Topics: Topic Modeling LDA
LDA: Latent Dirichlet Allocation
- Problem: we have a certain number of documents and we want to discover the topics they cover. Following are the steps we take:
- We need to decide on the number of topics, K, beforehand.
- LDA starts by going through every document and randomly assigning each word to one of the K topics.
- This first random pass is not meaningful on its own; we then iterate over every word in every document and try to improve its topic assignment.
- For each word we use a Bayesian probability formula to score the candidate topics: roughly, the share of words in the document currently assigned to a topic, times the share of that topic's word assignments that are this particular word (see the sketch below).
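To make that update rule concrete, here is a minimal Python sketch of one per-word reassignment step. This is an illustrative, collapsed-Gibbs-style update, not what scikit-learn does internally (it uses online variational Bayes); the count matrices doc_topic_counts and topic_word_counts and the smoothing constants alpha and beta are assumptions made for the sketch.

import numpy as np

def reassign_word(doc_topic_counts, topic_word_counts, d, w, K, alpha=0.1, beta=0.01):
    # p(topic | document d): share of the words in doc d currently assigned to each topic
    p_topic_given_doc = (doc_topic_counts[d] + alpha) / (doc_topic_counts[d].sum() + K * alpha)
    # p(word w | topic): share of each topic's word assignments that are word w
    n_words = topic_word_counts.shape[1]
    p_word_given_topic = (topic_word_counts[:, w] + beta) / (topic_word_counts.sum(axis=1) + n_words * beta)
    # Bayes-style score: the product, normalized into a distribution over the K topics
    p = p_topic_given_doc * p_word_given_topic
    p = p / p.sum()
    # sample a new topic for this occurrence of word w in document d
    return np.random.choice(K, p=p)

A full Gibbs sampler would also subtract the word's current assignment from the counts before computing these probabilities; the sketch omits that bookkeeping for brevity.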
Using scikit-learn for topic modeling of documents
The dataset used in the following (npr.csv, a collection of NPR news articles) can be found at this link.
import pandas as pd

datafile = 'npr.csv'
df = pd.read_csv(datafile)

print(df.shape)
>>> (11992, 1)

print(df.head(2))
>>>                                              Article
0  In the Washington of 2016, even when the polic...
1  Donald Trump has used Twitter — his prefe...
Now let us vectorize the articles with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# max_df = 0.9 discards words that appear in more than 90% of the documents
# min_df = 2 means a word must appear in at least 2 documents to be kept
# a float passed to max_df/min_df is interpreted as a fraction of the total
# .. number of documents, whereas an integer is an absolute document count
# stop_words='english' removes the English stop words
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
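To see the fraction-vs-integer distinction in action, here is a small sketch on a made-up toy corpus (the corpus and parameter values are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the cat sat", "the cat ran", "the dog ran", "the dog barked"]

# integer min_df: keep only words appearing in at least 2 of the 4 documents
cv_int = CountVectorizer(min_df=2)
cv_int.fit(toy)
print(cv_int.get_feature_names_out())   # ['cat' 'dog' 'ran' 'the']

# float max_df: drop words appearing in more than 90% of documents ('the' is in 4/4)
cv_frac = CountVectorizer(max_df=0.9)
cv_frac.fit(toy)
print(cv_frac.get_feature_names_out())  # ['barked' 'cat' 'dog' 'ran' 'sat']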
dtm = cv.fit_transform(df['Article'])

# the vocabulary is only available after fitting
len(cv.get_feature_names_out())
>>> 54777
from sklearn.decomposition import LatentDirichletAllocation

# n_components = 7 means we want to model 7 topics
LDA = LatentDirichletAllocation(n_components=7, random_state=42)
LDA.fit(dtm)
# After fitting the LDA model, we want to:
# 1. grab the vocabulary
# .. cv.get_feature_names_out() gives us the words in the documents
# .. we can grab a single word by indexing, e.g. cv.get_feature_names_out()[9000]
import random
random_word_id = random.randrange(len(cv.get_feature_names_out()))
random_word = cv.get_feature_names_out()[random_word_id]

# 2. grab the topics
# .. the topics are saved as LDA.components_
# .. LDA.components_ is a matrix of shape (no of topics = 7) x (no of words)
# .. so we can pick a topic's info using LDA.components_[row_index],
# .. .. where each value in that row scores how strongly the corresponding
# .. .. word is associated with the topic

# 3. grab the highest probability words per topic
n = 20
# iterate through each row of LDA.components_, i.e. the 7 topics
for i, components in enumerate(LDA.components_):
    print(f"For topic no: {i}, top {n} words are")
    # argsort returns indices in ascending order, so the last n entries are
    # .. the highest-weighted words; map them through cv's feature names
    print([cv.get_feature_names_out()[idx] for idx in components.argsort()[-n:]])
    print()
>>> For topic no: 0, top 20 words are ['president', 'state', 'tax', 'insurance', 'trump', 'companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']
For topic no: 1, top 20 words are ['white', 'according', 'attack', 'reported', 'war', 'military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']
For topic no: 2, top 20 words are ['little', 'know', 'don', 'year', 'make', 'way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']
For topic no: 3, top 20 words are ['world', 'research', 'university', 'percent', 'care', 'time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']
For topic no: 4, top 20 words are ['donald', 'political', 'states', 'law', 'just', 'voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']
(output truncated here; topics 5 and 6 print in the same format)
# now, for each document, let us find the appropriate topic number
# .. and assign it to the document
# .. LDA.transform gives the probability of each document belonging
# .. to each of the topics
document_probability = LDA.transform(dtm)

print(document_probability.shape)
>>> (11992, 7)
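As a quick sanity check, each row of document_probability is a probability distribution over the 7 topics, so it should sum to 1 (a sketch; the exact numbers depend on the fitted model):

# topic mixture for the first article
print(document_probability[0].round(3))
# the probabilities sum to (approximately) 1
print(document_probability[0].sum())
# the most likely topic for document 0
print(document_probability[0].argmax())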
# let us create a 'topic' column in our original df
df['topic'] = document_probability.argmax(axis=1)
And finally,
print(df.head(15))
>>>                                              Article  topic
0   In the Washington of 2016, even when the polic...      1
1   Donald Trump has used Twitter — his prefe...           1
2   Donald Trump is unabashedly praising Russian...        1
3   Updated at 2:50 p. m. ET, Russian President Vl...      1
4   From photography, illustration and video, to d...      2
5   I did not want to join yoga class. I hated tho...      3
6   With a who has publicly supported the debunk...        3
7   I was standing by the airport exit, debating w...      2
8   If movies were trying to be more realistic, pe...      3
9   Eighteen years ago, on New Year’s Eve, David F...      2
10  For years now, some of the best, wildest, most...      5
11  For years now, some of the best, wildest, most...      5
12  The Colorado River is like a giant bank accoun...      2
13  For the last installment of NPR’s holiday reci...      2
14  Being overweight can raise your blood pressure...      3
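Since LDA only returns topic numbers, a common final step is to read the top-20 word lists above and attach human-readable labels. The labels below are my own guesses from those word lists (topics 5 and 6 would need the same inspection), so treat them as hypothetical:

# hypothetical labels inferred from the top-20 words per topic
topic_labels = {
    0: 'economy & health policy',
    1: 'security & world news',
    2: 'everyday life',
    3: 'health & research',
    4: 'elections & politics',
    5: 'topic 5',  # label after inspecting its top words
    6: 'topic 6',  # label after inspecting its top words
}
df['topic_label'] = df['topic'].map(topic_labels)
print(df[['topic', 'topic_label']].head())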