Topic Modeling Assessment Project

Task: Import pandas and read in the quora_questions.csv file.


In [8]:
import pandas as pd

In [52]:
quora = pd.read_csv('quora_questions.csv')

In [53]:
quora.head()


Out[53]:
Question
0 What is the step by step guide to invest in sh...
1 What is the story of Kohinoor (Koh-i-Noor) Dia...
2 How can I increase the speed of my internet co...
3 Why am I mentally very lonely? How can I solve...
4 Which one dissolve in water quikly sugar, salt...

Preprocessing

Task: Use TF-IDF vectorization to create a document-term matrix. You may want to explore the max_df and min_df parameters.


In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [42]:
dtm = tfidf.fit_transform(quora['Question'])

In [43]:
dtm


Out[43]:
<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
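
To see what max_df and min_df do before applying them to the full dataset, here is a minimal sketch on a made-up four-question corpus. The toy_docs list and the loose/strict names are illustrative assumptions, not part of the assessment data, and stop_words is left out so the effect of the two parameters is visible on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, only for illustration.
toy_docs = [
    "how do I learn python programming",
    "how do I learn java programming",
    "how do I lose weight fast",
    "how do I invest money online",
]

# No document-frequency filtering: every token of 2+ characters is kept
# (the default tokenizer drops single-character tokens such as "I").
loose = TfidfVectorizer()
loose.fit(toy_docs)

# max_df=0.95 drops terms found in more than 95% of documents ("how", "do");
# min_df=2 drops terms found in fewer than 2 documents.
strict = TfidfVectorizer(max_df=0.95, min_df=2)
strict.fit(toy_docs)

loose_vocab = sorted(loose.vocabulary_)
strict_vocab = sorted(strict.vocabulary_)

print(len(loose_vocab), loose_vocab)
print(len(strict_vocab), strict_vocab)
```

Only terms shared by at least two questions, but not by nearly all of them, survive the stricter settings. On the real data the same filtering is what keeps the vocabulary at 38,669 columns rather than every distinct token.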

Non-negative Matrix Factorization

TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)


In [44]:
from sklearn.decomposition import NMF

In [48]:
nmf_model = NMF(n_components=20,random_state=42)

In [49]:
nmf_model.fit(dtm)


Out[49]:
NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
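
As a rough sketch of what the fit produces, NMF approximates a non-negative matrix as the product of two smaller non-negative matrices: a document-by-topic matrix and a topic-by-term matrix (components_). The example below uses a small random matrix as a stand-in for dtm; the names toy_X, toy_W and toy_H, and the max_iter value, are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for a document-term matrix: 6 documents, 5 terms.
rng = np.random.default_rng(0)
toy_X = rng.random((6, 5))

toy_model = NMF(n_components=2, random_state=42, max_iter=500)

# fit_transform returns the document-topic weights (one row per document).
toy_W = toy_model.fit_transform(toy_X)

# components_ holds the topic-term weights (one row per topic).
toy_H = toy_model.components_

print(toy_W.shape, toy_H.shape)

# Both factors are non-negative, and their product approximates toy_X.
approx = toy_W @ toy_H
print(approx.shape == toy_X.shape)
```

With the assessment's settings the corresponding shapes would be 404289 x 20 for the document-topic weights and 20 x 38669 for components_.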

TASK: Print out the top 15 highest-weighted words for each of the 20 topics.


In [50]:
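# get_feature_names_out() requires scikit-learn 1.0+; get_feature_names() was removed in 1.2.
feature_names = tfidf.get_feature_names_out()

for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort() is ascending, so the last 15 indices are the highest-weighted words
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')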


THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 'olympics', 'available', 'job', 'spotify', 'war', 'pakistan', 'india']


THE TOP 15 WORDS FOR TOPIC #6
['beginners', 'online', 'english', 'book', 'did', 'hacking', 'want', 'python', 'languages', 'java', 'learning', 'start', 'language', 'programming', 'learn']


THE TOP 15 WORDS FOR TOPIC #7
['happen', 'presidency', 'think', 'presidential', '2016', 'vote', 'better', 'election', 'did', 'win', 'hillary', 'president', 'clinton', 'donald', 'trump']


THE TOP 15 WORDS FOR TOPIC #8
['russia', 'business', 'win', 'coming', 'countries', 'place', 'pakistan', 'happen', 'end', 'country', 'iii', 'start', 'did', 'war', 'world']


THE TOP 15 WORDS FOR TOPIC #9
['indian', 'companies', 'don', 'guy', 'men', 'culture', 'women', 'work', 'girls', 'live', 'girl', 'look', 'sex', 'feel', 'like']


THE TOP 15 WORDS FOR TOPIC #10
['ca', 'departments', 'positions', 'movies', 'songs', 'business', 'read', 'start', 'job', 'work', 'engineering', 'ways', 'bad', 'books', 'good']


THE TOP 15 WORDS FOR TOPIC #11
['money', 'modi', 'currency', 'economy', 'think', 'government', 'ban', 'banning', 'black', 'indian', 'rupee', 'rs', '1000', 'notes', '500']


THE TOP 15 WORDS FOR TOPIC #12
['blowing', 'resolutions', 'resolution', 'mind', 'likes', 'girl', '2017', 'year', 'don', 'employees', 'going', 'day', 'things', 'new', 'know']


THE TOP 15 WORDS FOR TOPIC #13
['aspects', 'fluent', 'skill', 'spoken', 'ways', 'language', 'fluently', 'speak', 'communication', 'pronunciation', 'speaking', 'writing', 'skills', 'improve', 'english']


THE TOP 15 WORDS FOR TOPIC #14
['diet', 'help', 'healthy', 'exercise', 'month', 'pounds', 'reduce', 'quickly', 'loss', 'fast', 'fat', 'ways', 'gain', 'lose', 'weight']


THE TOP 15 WORDS FOR TOPIC #15
['having', 'feel', 'long', 'spend', 'did', 'person', 'machine', 'movies', 'favorite', 'job', 'home', 'sex', 'possible', 'travel', 'time']


THE TOP 15 WORDS FOR TOPIC #16
['marriage', 'make', 'did', 'girlfriend', 'feel', 'tell', 'forget', 'really', 'friend', 'true', 'know', 'person', 'girl', 'fall', 'love']


THE TOP 15 WORDS FOR TOPIC #17
['easy', 'hack', 'prepare', 'quickest', 'facebook', 'increase', 'painless', 'instagram', 'account', 'best', 'commit', 'fastest', 'suicide', 'easiest', 'way']


THE TOP 15 WORDS FOR TOPIC #18
['web', 'java', 'scripting', 'phone', 'mechanical', 'better', 'job', 'use', 'account', 'data', 'software', 'science', 'computer', 'engineering', 'difference']


THE TOP 15 WORDS FOR TOPIC #19
['earth', 'blowing', 'stop', 'use', 'easily', 'mind', 'google', 'flat', 'questions', 'hate', 'believe', 'ask', 'don', 'think', 'people']
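
Each list above runs from lower to higher weight, so the strongest word for a topic is the last one (for example 'best' in topic #0). This follows from argsort() sorting in ascending order. A small sketch with made-up weights and a made-up vocabulary shows the indexing and how to reverse it:

```python
import numpy as np

# Hypothetical weights of five terms within a single topic.
toy_weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
toy_vocab = np.array(['apple', 'book', 'car', 'dog', 'egg'])

# Last three indices of an ascending sort: strongest term comes last.
top3_ascending = toy_vocab[toy_weights.argsort()[-3:]].tolist()

# Reverse the sort first to list the strongest term first.
top3_descending = toy_vocab[toy_weights.argsort()[::-1][:3]].tolist()

print(top3_ascending)   # ['dog', 'book', 'egg']
print(top3_descending)  # ['egg', 'book', 'dog']
```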


TASK: Add a new column to the original quora dataframe that assigns each question to one of the 20 topic categories.


In [54]:
quora.head()


Out[54]:
Question
0 What is the step by step guide to invest in sh...
1 What is the story of Kohinoor (Koh-i-Noor) Dia...
2 How can I increase the speed of my internet co...
3 Why am I mentally very lonely? How can I solve...
4 Which one dissolve in water quikly sugar, salt...

In [55]:
topic_results = nmf_model.transform(dtm)

In [56]:
topic_results.argmax(axis=1)

quora['Topic'] = topic_results.argmax(axis=1)

quora.head(10)


Out[56]:
Question Topic
0 What is the step by step guide to invest in sh... 5
1 What is the story of Kohinoor (Koh-i-Noor) Dia... 16
2 How can I increase the speed of my internet co... 17
3 Why am I mentally very lonely? How can I solve... 11
4 Which one dissolve in water quikly sugar, salt... 14
5 Astrology: I am a Capricorn Sun Cap moon and c... 1
6 Should I buy tiago? 0
7 How can I be a good geologist? 10
8 When do you use シ instead of し? 19
9 Motorola (company): Can I hack my Charter Moto... 17
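
The labeling step takes, for each row of the document-topic matrix, the column index with the largest weight. The sketch below uses a made-up 4 x 3 weight matrix (toy_topic_results is a hypothetical name) and also shows one edge case worth checking: a row of all zeros still gets label 0 from argmax, so a topic number alone does not say how strongly a question belongs to that topic.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents, 3 topics.
toy_topic_results = np.array([
    [0.02, 0.30, 0.01],
    [0.00, 0.05, 0.40],
    [0.25, 0.10, 0.05],
    [0.00, 0.00, 0.00],  # e.g. a document with no weight on any topic
])

# Index of the largest weight in each row.
toy_labels = toy_topic_results.argmax(axis=1)

# The winning weight itself, useful for spotting weak or empty assignments.
toy_strength = toy_topic_results.max(axis=1)
has_signal = toy_strength > 0

print(toy_labels)   # [1 2 0 0]
print(has_signal)   # [ True  True  True False]
```

Keeping the maximum weight next to the label, as an optional extra column, is one way to flag questions whose assignment may not be meaningful.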

Great job!