We will be using articles scraped from NPR (National Public Radio), obtained from their website, www.npr.org.
In [1]:
import pandas as pd
In [2]:
npr = pd.read_csv('npr.csv')
In [3]:
npr.head()
Out[3]:
Notice how we don't have the topic of the articles! Let's use NMF (non-negative matrix factorization) to attempt to figure out clusters of the articles.
In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
max_df: float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.
min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If float, the parameter represents a proportion of documents; if an integer, absolute counts. This parameter is ignored if vocabulary is not None.
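As a quick illustration of these two thresholds, here is a toy corpus (invented for this sketch, not part of the NPR data) where max_df drops a word that appears in every document and min_df drops words that appear in only one:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'the cat chased the dog',
        'the bird flew over the house']

# 'the' appears in all 4 documents (df = 1.0 > 0.75), so max_df drops it;
# words appearing in only one document fall below min_df=2 and are dropped too.
demo = TfidfVectorizer(max_df=0.75, min_df=2)
demo.fit(docs)
print(demo.get_feature_names_out())   # ['cat' 'dog' 'on' 'sat']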
In [5]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
In [6]:
dtm = tfidf.fit_transform(npr['Article'])
In [7]:
dtm
Out[7]:
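dtm comes back as a SciPy sparse matrix: rows are articles, columns are vocabulary terms. A quick way to inspect it, using standard scipy.sparse attributes:

print(dtm.shape)   # (number of articles, number of terms)
print(dtm.nnz)     # count of non-zero entries actually stored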
In [8]:
from sklearn.decomposition import NMF
In [11]:
nmf_model = NMF(n_components=7, random_state=42)
In [12]:
# This can take a while; we're dealing with a large number of documents!
nmf_model.fit(dtm)
Out[12]:
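Under the hood, NMF approximates the TF-IDF matrix as the product of two smaller non-negative matrices. A sketch of how the two factors surface through the scikit-learn API:

# dtm (articles x terms) is approximated as W @ H:
W = nmf_model.transform(dtm)   # (n_articles, 7) article-topic weights
H = nmf_model.components_      # (7, n_terms)    topic-term weights
print(W.shape, H.shape)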
In [13]:
len(tfidf.get_feature_names_out())
Out[13]:
In [14]:
import random
In [15]:
# Print 10 random words from the vocabulary
for i in range(10):
    random_word_id = random.randint(0, 54776)
    print(tfidf.get_feature_names_out()[random_word_id])
In [19]:
len(nmf_model.components_)
Out[19]:
In [21]:
nmf_model.components_
Out[21]:
In [22]:
len(nmf_model.components_[0])
Out[22]:
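Each topic row holds one coefficient per vocabulary term, so its length should match the vocabulary size; a quick sanity check:

# 7 topics, each with one weight per term in the TF-IDF vocabulary
assert nmf_model.components_.shape == (7, len(tfidf.get_feature_names_out()))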
In [23]:
single_topic = nmf_model.components_[0]
In [24]:
# Returns the indices that would sort this array (ascending order).
single_topic.argsort()
Out[24]:
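If argsort is unfamiliar, a tiny example (toy array, invented here) shows the idiom: the sort order is ascending, so the last indices point at the largest values:

import numpy as np

arr = np.array([10, 200, 1])
print(arr.argsort())       # [2 0 1] -> indices ordered from smallest to largest value
print(arr.argsort()[-1])   # 1      -> index of the largest value (200)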
In [25]:
# Word least representative of this topic
single_topic[18302]
Out[25]:
In [26]:
# Word most representative of this topic
single_topic[42993]
Out[26]:
In [49]:
# Indices of the top 10 words for this topic:
single_topic.argsort()[-10:]
Out[49]:
In [27]:
top_word_indices = single_topic.argsort()[-10:]
In [28]:
for index in top_word_indices:
    print(tfidf.get_feature_names_out()[index])
These look like business articles, perhaps... Let's confirm by using .transform() on our vectorized articles to attach a topic number. But first, let's view all 7 topics found.
In [30]:
for index, topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')
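The same loop generalizes into a small helper; the function below is a convenience sketch of ours, not part of the original notebook:

def display_topics(model, feature_names, n_top_words=15):
    # Print the highest-weighted words for each topic of a fitted model
    for index, topic in enumerate(model.components_):
        print(f'THE TOP {n_top_words} WORDS FOR TOPIC #{index}')
        print([feature_names[i] for i in topic.argsort()[-n_top_words:]])
        print('\n')

display_topics(nmf_model, tfidf.get_feature_names_out())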
In [31]:
dtm
Out[31]:
In [32]:
dtm.shape
Out[32]:
In [33]:
len(npr)
Out[33]:
In [34]:
topic_results = nmf_model.transform(dtm)
In [35]:
topic_results.shape
Out[35]:
In [36]:
topic_results[0]
Out[36]:
In [37]:
topic_results[0].round(2)
Out[37]:
In [38]:
topic_results[0].argmax()
Out[38]:
This means that our model thinks that the first article belongs to topic #1.
In [39]:
npr.head()
Out[39]:
In [40]:
topic_results.argmax(axis=1)
Out[40]:
In [41]:
npr['Topic'] = topic_results.argmax(axis=1)
In [42]:
npr.head(10)
Out[42]:
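With the topic numbers attached, a natural last step is mapping them to human-readable labels. The label strings below are hypothetical; substitute whatever your own reading of the top words suggests:

# Hypothetical labels based on eyeballing the top words per topic
topic_labels = {0: 'health', 1: 'election', 2: 'legislation',
                3: 'politics', 4: 'campaign', 5: 'music', 6: 'education'}
npr['Topic Label'] = npr['Topic'].map(topic_labels)
npr.head(10)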