Natural Language Processing Topics: Non negative matrix factorization
- Unsupervised learning method that performs dimensionality reduction as well as clustering
- We can use it in conjunction with TF-IDF for model topic for documents.
- Mathematically, any given matrix A of dimension nxm is broken down into two matrixes W and H such that W is of dimension nxk and H is of dimension kxm.
- Matrix A is created by column of each object while each row is feautures of the object.
-
Each object, i.e. column of A can be approximated by a linear combination of k basis vectors in W.
- Each basis vector can be interpreted as a cluster, the membership of the objects in the clusters can be encoded by H.
For the algorithm to work, we need to provide following inputs:
- Matrix A. This is our TFIDF matrix.
- no of basis vector, k;
- initial random values for matrix W and H.
- Goal is to minimize the reconstruction error between A and WH.
Lets do in python scikit learn. The entire process is very similar to LDA . The dataset used is derived from this link .
import pandas as pd df = pd.read_csv('npr.csv') print(df.shape) >>> (11992, 1)
print(df.head(2)) >>> Article 0 In the Washington of 2016, even when the polic... 1 Donald Trump has used Twitter — his prefe...
from sklearn.feature_extraction.text import TfidfVectorizer tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english') print(tfidf.get_feature_names_out().shape) >>> (54777,)
dtm = tfidf.fit_transform(df['Article'])
from sklearn.decomposition import NMF nmf_model = NMF(n_components=7,random_state=42) nmf_model.fit(dtm)
Let us print the top words for each topic to see them. They can be used later to determine consensus on what the topic should be.
n = 20 # let us iterate through each row of the NMF components, i.e. 7 topics for i,components in enumerate(nmf_model.components_): print(f"For topic no: {i}, top {n} words are") # let us argsort to get the highest probability words from the components, and find the actual words in the tfidf feature names print([tfidf.get_feature_names_out()[idx] for idx in components.argsort()[(-1*n):]]) print() print()
>>>For topic no: 0, top 20 words are ['years', 'brain', 'researchers', 'university', 'scientists', 'new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says'] For topic no: 1, top 20 words are ['intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump'] For topic no: 2, top 20 words are ['insurers', 'federal', 'said', 'aca', 'repeal', 'senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health'] For topic no: 3, top 20 words are ['killed', 'reported', 'military', 'justice', 'city', 'officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police'] For topic no: 4, top 20 words are ['candidate', 'said', 'win', 'candidates', 'republican', 'primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton'] For topic no: 5, top 20 words are ['going', 'kind', 'book', 'black', 'things', 'love', 've', 'don', 'album', 'way', 'time', 'song', 'life', 'really', 'know', 'people', 'think', 'just', 'music', 'like'] For topic no: 6, top 20 words are ['university', 'colleges', 'public', 'child', 'program', 'teacher', 'state', 'high', 'says', 'parents', 'devos', 'children', 'college', 'kids', 'teachers', 'student', 'education', 'schools', 'school', 'students']
document_probability = nmf_model.transform(dtm) # let us create a colum in our original df with a topic column df['topic'] = document_probability.argmax(axis=1)
print(df.head(15))
>>>Article topic 0 In the Washington of 2016, even when the polic... 1 1 Donald Trump has used Twitter — his prefe... 1 2 Donald Trump is unabashedly praising Russian... 1 3 Updated at 2:50 p. m. ET, Russian President Vl... 3 4 From photography, illustration and video, to d... 6 5 I did not want to join yoga class. I hated tho... 5 6 With a who has publicly supported the debunk... 0 7 I was standing by the airport exit, debating w... 0 8 If movies were trying to be more realistic, pe... 0 9 Eighteen years ago, on New Year’s Eve, David F... 5 10 For years now, some of the best, wildest, most... 5 11 For years now, some of the best, wildest, most... 5 12 The Colorado River is like a giant bank accoun... 0 13 For the last installment of NPR’s holiday reci... 5 14 Being overweight can raise your blood pressure... 0