Natural Language Processing Topics: Topic Modeling LDA
LDA: Latent Dirichlet Allocation
- Problem: we have a certain number of documents and we want to discover the topics they cover. Following are the steps we take:
- We need to decide on the number of topics, K, beforehand.
- LDA starts by going through every document and randomly assigning each word to one of the K topics.
- This first random pass is not meaningful on its own; we then iterate over every word in every document and try to improve its topic assignment.
- For each word we use a Bayesian probability formula to score the candidate topics: roughly, the share of words in the document currently assigned to a topic, times the share of that topic's word assignments that are this particular word (see the sketch below).
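To make that update rule concrete, here is a minimal Python sketch of one per-word reassignment step. This is an illustrative, collapsed-Gibbs-style update, not what scikit-learn does internally (it uses online variational Bayes); the count matrices doc_topic_counts and topic_word_counts and the smoothing constants alpha and beta are assumptions made for the sketch.

import numpy as np

def reassign_word(doc_topic_counts, topic_word_counts, d, w, K, alpha=0.1, beta=0.01):
    # p(topic | document d): share of the words in doc d currently assigned to each topic
    p_topic_given_doc = (doc_topic_counts[d] + alpha) / (doc_topic_counts[d].sum() + K * alpha)
    # p(word w | topic): share of each topic's word assignments that are word w
    n_words = topic_word_counts.shape[1]
    p_word_given_topic = (topic_word_counts[:, w] + beta) / (topic_word_counts.sum(axis=1) + n_words * beta)
    # Bayes-style score: the product, normalized into a distribution over the K topics
    p = p_topic_given_doc * p_word_given_topic
    p = p / p.sum()
    # sample a new topic for this occurrence of word w in document d
    return np.random.choice(K, p=p)

A full Gibbs sampler would also subtract the word's current assignment from the counts before computing these probabilities; the sketch omits that bookkeeping for brevity.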
Using scikit-learn for topic modeling of documents
The dataset used in the following (npr.csv, a collection of NPR news articles) can be found at this link.
import pandas as pd

datafile = 'npr.csv'
df = pd.read_csv(datafile)

print(df.shape)
>>> (11992, 1)

print(df.head(2))
>>>                                              Article
0  In the Washington of 2016, even when the polic...
1  Donald Trump has used Twitter — his prefe...
Now let us vectorize the articles with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# max_df = 0.9 discards words that appear in more than 90% of the documents
# min_df = 2 means a word must appear in at least 2 documents to be kept
# a float passed to max_df/min_df is interpreted as a fraction of the total
# .. number of documents, whereas an integer is an absolute document count
# stop_words='english' removes the English stop words
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')
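To see the fraction-vs-integer distinction in action, here is a small sketch on a made-up toy corpus (the corpus and parameter values are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the cat sat", "the cat ran", "the dog ran", "the dog barked"]

# integer min_df: keep only words appearing in at least 2 of the 4 documents
cv_int = CountVectorizer(min_df=2)
cv_int.fit(toy)
print(cv_int.get_feature_names_out())   # ['cat' 'dog' 'ran' 'the']

# float max_df: drop words appearing in more than 90% of documents ('the' is in 4/4)
cv_frac = CountVectorizer(max_df=0.9)
cv_frac.fit(toy)
print(cv_frac.get_feature_names_out())  # ['barked' 'cat' 'dog' 'ran' 'sat']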
dtm = cv.fit_transform(df['Article'])

# the vocabulary is only available after fitting
len(cv.get_feature_names_out())
>>> 54777
from sklearn.decomposition import LatentDirichletAllocation

# n_components = 7 means we want to model 7 topics
LDA = LatentDirichletAllocation(n_components=7, random_state=42)
LDA.fit(dtm)
# After fitting the LDA model, we want to:
# 1. grab the vocabulary
# .. cv.get_feature_names_out() gives us the words in the documents
# .. we can grab a single word by indexing, e.g. cv.get_feature_names_out()[9000]
import random
random_word_id = random.randrange(len(cv.get_feature_names_out()))
random_word = cv.get_feature_names_out()[random_word_id]

# 2. grab the topics
# .. the topics are saved as LDA.components_
# .. LDA.components_ is a matrix of shape (no of topics = 7) x (no of words)
# .. so we can pick a topic's info using LDA.components_[row_index],
# .. .. where each value in that row scores how strongly the corresponding
# .. .. word is associated with the topic

# 3. grab the highest probability words per topic
n = 20
# iterate through each row of LDA.components_, i.e. the 7 topics
for i, components in enumerate(LDA.components_):
    print(f"For topic no: {i}, top {n} words are")
    # argsort returns indices in ascending order, so the last n entries are
    # .. the highest-weighted words; map them through cv's feature names
    print([cv.get_feature_names_out()[idx] for idx in components.argsort()[-n:]])
    print()
>>> For topic no: 0, top 20 words are ['president', 'state', 'tax', 'insurance', 'trump', 'companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']
For topic no: 1, top 20 words are ['white', 'according', 'attack', 'reported', 'war', 'military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']
For topic no: 2, top 20 words are ['little', 'know', 'don', 'year', 'make', 'way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']
For topic no: 3, top 20 words are ['world', 'research', 'university', 'percent', 'care', 'time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']
For topic no: 4, top 20 words are ['donald', 'political', 'states', 'law', 'just', 'voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']
(output truncated here; topics 5 and 6 print in the same format)
# now, for each document, let us find the appropriate topic number
# .. and assign it to the document
# .. LDA.transform gives the probability of each document belonging
# .. to each of the topics
document_probability = LDA.transform(dtm)

print(document_probability.shape)
>>> (11992, 7)
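As a quick sanity check, each row of document_probability is a probability distribution over the 7 topics, so it should sum to 1 (a sketch; the exact numbers depend on the fitted model):

# topic mixture for the first article
print(document_probability[0].round(3))
# the probabilities sum to (approximately) 1
print(document_probability[0].sum())
# the most likely topic for document 0
print(document_probability[0].argmax())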
# let us create a 'topic' column in our original df
df['topic'] = document_probability.argmax(axis=1)
And finally,
print(df.head(15))
>>>                                              Article  topic
0   In the Washington of 2016, even when the polic...      1
1   Donald Trump has used Twitter — his prefe...           1
2   Donald Trump is unabashedly praising Russian...        1
3   Updated at 2:50 p. m. ET, Russian President Vl...      1
4   From photography, illustration and video, to d...      2
5   I did not want to join yoga class. I hated tho...      3
6   With a who has publicly supported the debunk...        3
7   I was standing by the airport exit, debating w...      2
8   If movies were trying to be more realistic, pe...      3
9   Eighteen years ago, on New Year’s Eve, David F...      2
10  For years now, some of the best, wildest, most...      5
11  For years now, some of the best, wildest, most...      5
12  The Colorado River is like a giant bank accoun...      2
13  For the last installment of NPR’s holiday reci...      2
14  Being overweight can raise your blood pressure...      3
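Since LDA only returns topic numbers, a common final step is to read the top-20 word lists above and attach human-readable labels. The labels below are my own guesses from those word lists (topics 5 and 6 would need the same inspection), so treat them as hypothetical:

# hypothetical labels inferred from the top-20 words per topic
topic_labels = {
    0: 'economy & health policy',
    1: 'security & world news',
    2: 'everyday life',
    3: 'health & research',
    4: 'elections & politics',
    5: 'topic 5',  # label after inspecting its top words
    6: 'topic 6',  # label after inspecting its top words
}
df['topic_label'] = df['topic'].map(topic_labels)
print(df[['topic', 'topic_label']].head())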