Topic Modeling for Twitter Accounts using Bayesian Nonnegative Matrix Factorization

Burak Suyunu & Şemsi Yiğit Özgümüş

Advisor: Assoc. Prof. Ali Taylan Cemgil

Boğaziçi University Department of Computer Engineering

Makers, scientists, influencers, and many other people share their ideas, products, and innovations on Twitter. Finding information about a specific topic in such a giant network is hard. Our aim is to find users who tweet about the same topic and, in doing so, to bring people interested in the same subjects together as a community. In this project we focused on maker communities and influencers in the context of computer science, such as ML, Robotics, 3D Printing, and Arduino. We worked on 1,118 users and approximately 3,250,000 tweets.

There are potential methods such as LDA and NMF to tackle this problem; we want to investigate KL-BNMF in addition and see whether it is an applicable solution candidate for this problem.
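
As a brief reminder (our own summary, matching the matrix shapes and the Gamma parameterization used in the code later in this notebook), KL-BNMF assumes a Gamma-Poisson generative model:

$$
x_{wk} \sim \mathrm{Poisson}\big([TV]_{wk}\big), \qquad
t_{wi} \sim \mathrm{Gamma}\big(a^{tm},\; b^{tm}/a^{tm}\big), \qquad
v_{ik} \sim \mathrm{Gamma}\big(a^{ve},\; b^{ve}/a^{ve}\big),
$$

where X is the (users x words) count matrix, T the (users x topics) matrix, V the (topics x words) matrix, and Gamma(a, b/a) is shape-scale parameterized so that its mean is b. Maximum-likelihood estimation of T and V under the Poisson model is equivalent to NMF with the KL (I-divergence) cost; KL-BNMF adds the Gamma priors and infers T and V with variational Bayes.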

Natural Language Processing (NLP)

The language of Twitter is generally close to daily language. People share their ideas and emotions at any time of the day. Besides plain text, tweets can include hashtags, emoticons, pictures, videos, GIFs, URLs, etc. Even the plain-text part of a tweet may contain misspelled words. Apart from that, a single user may tweet in several languages; for example, one tweet may be in Turkish and another one in English. So we need to clean up the tweets before using them. The list of applied processes:

  • Remove Twitter accounts that have fewer than 2000 words in their tweets
  • Remove URLs
  • Tokenization
  • Stop words
  • Remove non-English accounts
  • Delete accounts whose number of remaining tokens is less than 200
  • Stemming
  • Remove words that appear at most 10 times in the whole corpus

Importing the necessary libraries.


In [2]:
import langid
import logging
import nltk
import numpy as np
import re
import os
import sys
import time
from collections import defaultdict
from string import digits
import pyLDAvis.gensim
import pyLDAvis.sklearn
from gensim import corpora, models, similarities, matutils
import networkx as nx
import string
import math
import pickle

from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans

from collections import Counter

import scipy.io
from scipy import sparse



In [2]:
def totalWordCount(tList):
    # Sum of len() over all documents: characters while the documents are still raw strings,
    # tokens once the documents have been tokenized
    totalWords = 0
    for tt in tList:
        totalWords += len(tt)
    return totalWords

def totalWordCount2(corpus):
    # Total word count of a bag-of-words corpus made of (word_id, count) tuples
    totalWords = 0
    for corp in corpus:
        for c in corp:
            totalWords += c[1]
    return totalWords

Read and Remove Twitter Accounts that have fewer than 2000 words in their tweets

We have already collected the tweets of 900 random followers of TRTWorld's Twitter account. You can also find the Twitter API code in this repo.

Here we read each user's tweets from file and save them into a list (tweetsList) if the file contains more than 2000 words.


In [5]:
tweetsList = []
userList = []

for file in os.listdir("tweets3"):
    path = "tweets3\\" + file
    f = open(path, 'r', encoding='utf-8')
    fread = f.read()
    if (len(fread.split()) > 2000):
        tweetsList.append(fread)
        userList.append(file[0:len(file)-4])
    f.close()

print("Number of Users: %d" %(len(tweetsList)))
print("Total Number of Words: %d" %(totalWordCount(tweetsList)))


Number of Users: 1117
Total Number of Words: 328154028

Remove URLs

We removed all URLs starting with "http://" or "https://", so all pictures, videos, GIFs, etc. are excluded from the text (the same regular expression also strips @-mentions).


In [6]:
def remove_urls(text):
    # Strip @-mentions and http/https URLs (this also removes links to pictures, videos, gifs, etc.)
    text = re.sub(r"(?:\@|https?\://)\S+", "", text)
    return text

def doc_rm_urls():
    return [ remove_urls(tweets) for tweets in tweetsList]

tweetsList = doc_rm_urls()

print("Total Number of Words: %d" %(totalWordCount(tweetsList)))


Total Number of Words: 236316529

Tokenization

Tokenization is the process of splitting text into words, phrases or other meaningful elements called tokens. We used words as our tokens. To better process the text and to create a dictionary and a corpus, we tokenized all the tweets and converted them to lower case. We used the nltk library with a regular-expression tokenizer.
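
A quick illustration (hypothetical input, not part of the pipeline) of what the regexp tokenizer does: it keeps runs of word characters only, so punctuation disappears, hashtags lose their '#', and contractions are split:

import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
print(tokenizer.tokenize("Don't miss our #3DPrinting demo!".lower()))
# ['don', 't', 'miss', 'our', '3dprinting', 'demo']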


In [7]:
# This returns a list of tokens / single words for each user
def tokenize_tweet():
    '''
        Tokenizes the raw text of each document
    '''
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return [ tokenizer.tokenize(t.lower()) for t in tweetsList]

tweetsList = tokenize_tweet()

print("Total Number of Words: " + str(totalWordCount(tweetsList)))


Total Number of Words: 38119510

Stop words

Stop words are usually the most common words in a language. Being so common makes them uninformative and sometimes misleading when making decisions, so they are generally filtered out. We used the nltk library to obtain the standard English stop words, added some Twitter-specific words we determined ourselves, and also dropped single-character tokens from the tweets.


In [8]:
# Remove stop words
stoplist_tw=['amp','get','got','hey','hmm','hoo','hop','iep','let','ooo','par',
            'pdt','pln','pst','wha','yep','yer','aest','didn','nzdt','via',
            'one','com','new','like','great','make','top','awesome','best',
            'good','wow','yes','say','yay','would','thanks','thank','going',
            'new','use','should','could','best','really','see','want','nice',
            'while','know', 'rt', 'http', 'https']

stoplist  = set(nltk.corpus.stopwords.words("english") + stoplist_tw)

## if we ever filter out numeric tokens, watch out for tokens like '3d'
tweetsList = [[token for token in tweets if token not in stoplist and len(token) > 1]
                for tweets in tweetsList]

print("Total Number of Words: " + str(totalWordCount(tweetsList)))


Total Number of Words: 22240226

Remove non-English accounts

This extends the cleanup from words to whole accounts: after cleaning the tweets, we removed from our corpus the accounts whose tweets are mostly not in English. We used a library called langid to detect English accounts.
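
langid.classify returns a (language_code, score) pair, so checking its first element against 'en' keeps only accounts classified as English. A small illustrative call (hypothetical input string):

import langid

lang, score = langid.classify("machine learning with arduino and raspberry pi")
print(lang)  # 'en' for English text; the second value is a confidence score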


In [9]:
# Delete Accounts whose tweets are not majorly in English
tweetsList2 = [tweets for tweets in tweetsList if langid.classify(' '.join(tweets))[0] == 'en']

print("Number of Users: " + str(len(tweetsList2)))
print("Total Number of Words: " + str(totalWordCount(tweetsList2)))


Number of Users: 998
Total Number of Words: 19530500

Delete accounts whose number of remaining tokens is less than 200

All this preprocessing removed many words from the original tweets. Accounts that are not mostly in English but still include some English words were affected more, yet still remained in the corpus. To eliminate these misleading accounts, we deleted the accounts whose number of remaining tokens is less than 200.


In [10]:
# Delete Accounts whose length of tokenized tweets are less than 200
tweetsList2 = [tweets for tweets in tweetsList2 if len(tweets) > 200]
print("Number of Users: " + str(len(tweetsList2)))


Number of Users: 998

Remap the User IDs to tweets

In the last two steps we deleted some accounts, so we need to remap the user ids to the tweets. You can access the tweets via tweetsList2 and the users via userList2.


In [11]:
userList2 = []
# tweetsList2 is a filtered subsequence of tweetsList (order preserved),
# so the match for element i can only be at position j >= i of the original list
for i in range(len(tweetsList2)):
    for j in range(i, len(tweetsList)):
        if tweetsList2[i] == tweetsList[j]:
            userList2.append(userList[j])
            break

Stemming

For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. The goal of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. The nltk library offers mainly three stemmers for English: Lancaster, Porter and Snowball. We chose the Snowball stemmer because it uses a more refined algorithm than the Porter stemmer (Snowball is also called Porter2) and is less aggressive than Lancaster.
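
A quick, illustrative comparison (not part of the pipeline) of the three stemmers on a few words; Lancaster typically produces the shortest, most aggressive stems, while Snowball stays close to Porter:

import nltk

porter = nltk.stem.PorterStemmer()
snowball = nltk.stem.SnowballStemmer('english')
lancaster = nltk.stem.LancasterStemmer()

for w in ['organizing', 'democratization', 'maker']:
    print(w, '->', porter.stem(w), '|', snowball.stem(w), '|', lancaster.stem(w))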


In [12]:
# Porter Stemmer and Snowball Stemmer (Porter2) - We used Snowball Stemmer
# http://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg

sno = nltk.stem.SnowballStemmer('english')

tweetsList2 = [[sno.stem(token) for token in tweets]
          for tweets in tweetsList2]

print("Total Number of Words: " + str(totalWordCount(tweetsList2)))


Total Number of Words: 19530500

Dictionary and Corpus

To use the preprocessed Twitter data properly, we need to put it into a shape that topic modeling algorithms can work with. The bag-of-words representation is a perfect fit for this kind of algorithm. We first created a dictionary that assigns an id to every word in our preprocessed Twitter data. Then we created our corpus: each element of the corpus corresponds to one Twitter account and consists of tuples holding the dictionary id of a word and the number of occurrences of that word in that account. We used a very useful Python library called Gensim to create our dictionary and corpus.
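
A tiny illustration of the representation (toy tokens, not from our data): doc2bow turns a token list into (word_id, count) pairs.

from gensim import corpora

toy_docs = [['robot', 'arduino', 'robot'], ['data', 'robot']]
toy_dict = corpora.Dictionary(toy_docs)
print(toy_dict.token2id)              # word -> id mapping (exact ids may vary)
print(toy_dict.doc2bow(toy_docs[0]))  # (id, count) pairs: 'robot' with count 2, 'arduino' with count 1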


In [13]:
# Build a dictionary where for each document each word has its own id
dictionary = corpora.Dictionary(tweetsList2)
dictionary.compactify()

# Build the corpus: vectors with occurence of each word for each document
# convert tokenized documents to vectors
corpus = [dictionary.doc2bow(tweets) for tweets in tweetsList2]

print(dictionary)


Dictionary(375149 unique tokens: ['fcpsnew', 'projectaspir', 'malagamak', 'gwendolyn', 'amputeefit']...)

In [16]:
# Removing words that appear at most 10 times in the whole corpus

dictCtr = np.zeros(len(dictionary))

for c in corpus:
    for tuples in c:
        dictCtr[tuples[0]] = dictCtr[tuples[0]] + tuples[1]
        
badids = []
for i in range(len(dictCtr)):
    if dictCtr[i] < 11:
        badids.append(i)
        
        
dictionary.filter_tokens(bad_ids=badids)
dictionary.compactify()

corpus = [dictionary.doc2bow(tweets) for tweets in tweetsList2]

print(dictionary)


Dictionary(45262 unique tokens: ['generalelect', 'pva', 'busker', '60fps', 'anxieti']...)

In [17]:
tweetList = []

# Rebuild a plain-text document per user from the filtered corpus ('doc' avoids shadowing the built-in str)
for c in corpus:
    doc = ''
    for tokens in c:
        doc = doc + ((dictionary[tokens[0]]+' ') * tokens[1])
    tweetList.append(doc)

print("Number of Users: %d"  %(len(tweetList)))
print("Total Number of Words: %d" %(totalWordCount2(corpus)))


Number of Users: 998
Total Number of Words: 18874173

Word2Vec

  • Word2Vec uses word embeddings to map each word to a vector of real numbers.
  • We applied k-means clustering to the vectors to group related words together.
  • We chose the word closest to each cluster center to represent all the words of that cluster in the corpus.
  • We normalized the merged occurrence counts by the cluster size so that pooling many words into one cluster does not artificially inflate its frequency (see the small worked example below).
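
As a worked example (hypothetical numbers): suppose five words fall into one cluster whose center word is 'neural', and a user used those five words 12 times in total. Relabeling first credits 'neural' with all 12 occurrences, which the normalization step then reduces as follows:

import math

cluster_size = 5    # hypothetical cluster, e.g. {'neural', 'network', 'deep', 'cnn', 'lstm'}, center word 'neural'
merged_count = 12   # total occurrences of these five words for one user after relabeling
print(math.ceil(merged_count / cluster_size))  # -> 3, the normalized count kept for 'neural'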

Training word2vec


In [18]:
wordModel = models.Word2Vec(tweetsList2, size=30, window=5, min_count=11, workers=4)

print(wordModel)


Word2Vec(vocab=45262, size=30, alpha=0.025)

In [19]:
#print(len(wordModel.wv.index2word))
vocab = wordModel.wv.index2word
wordvectors = wordModel.wv[vocab]

K-means clustering with 2000 clusters


In [20]:
kmeansList = np.asarray(wordvectors).astype(np.float64)

kmeans = KMeans(n_clusters=2000).fit(kmeansList)

In [21]:
clusters = {}
labels = {}
centers = []
inVocab = {}

for i in range(0,2000):
    clusters[i] = []

for i, label in enumerate(kmeans.labels_):
    clusters[label].append(vocab[i])
    labels[vocab[i]] = label
    
for c in kmeans.cluster_centers_:
    centers.append(wordModel.similar_by_vector(c)[0][0])
    
for v in vocab:
    inVocab[v] = 1

Relabeling


In [22]:
# Change words in tweets with their cluster center words
tweets2 = [[centers[labels[r]] for r in row if r in inVocab]
          for row in tweetsList2]

Creating the dictionary and corpus for word2vec


In [23]:
# Build a dictionary where for each document each word has its own id
dictionaryVW = corpora.Dictionary(tweets2)
dictionaryVW.compactify()

# Build the corpus: vectors with occurence of each word for each document
# convert tokenized documents to vectors
corpusVW = [dictionaryVW.doc2bow(tweets) for tweets in tweets2]

print(dictionaryVW)


Dictionary(1991 unique tokens: ['sale', 'star', 'tutori', 'note', 'construct']...)

Normalizing


In [24]:
# Normalize word counts by dividing it to the number of elements in its cluster
corpusVW = [[(r[0], int(math.ceil(r[1]/ len(clusters[labels[dictionaryVW[r[0]]]]))) ) for r in row]
          for row in corpusVW]

In [25]:
tweetListVW = []

# Rebuild a plain-text document per user from the word2vec-relabeled corpus
for c in corpusVW:
    doc = ''
    for tokens in c:
        doc = doc + ((dictionaryVW[tokens[0]]+' ') * tokens[1])
    tweetListVW.append(doc)

print("Number of Users: %d"  %(len(tweetListVW)))
print("Total Number of Words: %d" %(totalWordCount2(corpusVW)))


Number of Users: 998
Total Number of Words: 10401128

KLBNMF Implementation

This code is a Python port of Cemgil's MATLAB code: https://www.cmpe.boun.edu.tr/~cemgil/bnmf/index.html

KLBNMF Process
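
For orientation, here is our condensed summary of what one epoch of the variational Bayes loop below computes (notation follows the code: T corresponds to E_t and V to E_v; L_T and L_V are the exponentiated expectations of log T and log V; M is the mask of observed entries; \circ and \oslash denote elementwise product and division, reciprocals are elementwise, and \psi is the digamma function):

$$
\Sigma_T = L_T \circ \Big[\big(X \oslash (L_T L_V)\big)\, L_V^\top\Big], \qquad
\Sigma_V = L_V \circ \Big[L_T^\top \big(X \oslash (L_T L_V)\big)\Big]
$$

$$
\alpha_T = a^{tm} + \Sigma_T, \quad
\beta_T = \Big(\tfrac{a^{tm}}{b^{tm}} + M\,E[V]^\top\Big)^{-1}, \quad
E[T] = \alpha_T \circ \beta_T, \quad
L_T = e^{\psi(\alpha_T)} \circ \beta_T
$$

$$
\alpha_V = a^{ve} + \Sigma_V, \quad
\beta_V = \Big(\tfrac{a^{ve}}{b^{ve}} + E[T]^\top M\Big)^{-1}, \quad
E[V] = \alpha_V \circ \beta_V, \quad
L_V = e^{\psi(\alpha_V)} \circ \beta_V
$$

After the first Update epochs, the hyperparameters a and b can also be re-estimated (here tied to a single scalar via 'tie_all'), and the variational lower bound ('Bound') is printed periodically.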


In [5]:
# %load gnmf_solvebynewton.py
from __future__ import division
import numpy as np
import scipy as sp
from scipy import special
import numpy.matlib as M

def gnmf_solvebynewton(c, a0 = None):

    if a0 is None:
        a0 = 0.1 * np.ones(np.shape(c))

    M, N = np.shape(a0)
    if len(np.shape(c)) == 0:
        Mc , Nc = 1,1
    else:
        Mc, Nc = np.shape(c)



    a = None
    cond = 0

    if (M == Mc and N == Nc):
        a = a0
        cond = 1

    elif (Mc == 1 and Nc >1):
        cond = 2
        a = a0[0,:]
    elif (Mc > 1 and Nc == 1):
        cond = 3
        a = a0[:,0]
    elif (Mc == 1 and Nc == 1):
        cond = 4
        a = a0[0,0]

    a2 = None
    for index in range(10):
        a2 = a - (np.log(a) - special.polygamma(0,a) + 1 - c) / (1/a - special.polygamma(1,a))
        idx = np.where(a2<0)
        if len(idx[0]) > 0:
            if isinstance(a, float):
                a2 = a / 2
            else:
                a2[idx] = a[idx] / 2
        a = a2

    # M and N here are the target shape (note that they shadow the numpy.matlib alias),
    # so use np.tile to broadcast the solution back to the full (M, N) shape
    if(cond == 2):
        a = np.tile(a, (M, 1))
    elif(cond == 3):
        a = np.tile(a.reshape(-1, 1), (1, N))
    elif(cond == 4):
        a = a * np.ones([M,N])

    return a

In [6]:
# %load gnmf_vb_poisson_mult_fast.py
from __future__ import division
import numpy as np
import scipy as sp
import math
from scipy import special
import numpy.matlib as M

def gnmf_vb_poisson_mult_fast(x,
                            a_tm,
                            b_tm,
                            a_ve,
                            b_ve,
                            EPOCH =1000,
                            Method = 'vb',
                            Update = np.inf,
                            tie_a_ve = 'clamp',
                            tie_b_ve = 'clamp',
                            tie_a_tm = 'clamp',
                            tie_b_tm = 'clamp',
                            print_period = 500
                            ):

    # Result initialization
    g = dict()
    g['E_T'] = None
    g['E_logT'] = None
    g['E_V'] = None
    g['E_logV'] = None
    g['Bound'] = None
    g['a_ve'] = None
    g['b_ve'] = None
    g['a_tm'] = None
    g['b_tm'] = None

    logm = np.vectorize(math.log)
    W = x.shape[0]
    K = x.shape[1]
    I = b_tm.shape[1]

    M = ~np.isnan(x)  # observation mask (note: this shadows the numpy.matlib alias imported above)
    X = np.zeros(x.shape)
    X[M] = x[M]

    t_init = np.random.gamma(a_tm, b_tm/a_tm)
    v_init = np.random.gamma(a_ve, b_ve/a_ve)
    L_t = t_init
    L_v = v_init
    E_t = t_init
    E_v = v_init
    Sig_t = t_init
    Sig_v = v_init

    B = np.zeros([1,EPOCH])
    gammalnX = special.gammaln(X+1)

    for e in range(1,EPOCH+1):

        LtLv = L_t.dot(L_v)
        tmp = X / (LtLv)
        # sufficient statistics (note the transposes)
        Sig_t = L_t * (tmp.dot(L_v.T))
        Sig_v = L_v * (L_t.T.dot(tmp))

        alpha_tm = a_tm + Sig_t
        beta_tm = 1/((a_tm/b_tm) + M.dot(E_v.T))
        E_t = alpha_tm * (beta_tm)

        alpha_ve = a_ve + Sig_v
        beta_ve = 1/((a_ve/b_ve) + E_t.T.dot(M))

        E_v = alpha_ve * (beta_ve)
        # Compute the bound
        if(e%10 == 1):
            print("*", end='')
        if(e%print_period == 1 or e == EPOCH):
            g['E_T'] = E_t
            g['E_logT'] = logm(L_t)
            g['E_V'] = E_v
            g['E_logV'] = logm(L_v)

            g['Bound'] = -np.sum(np.sum(M * (g['E_T'].dot(g['E_V'])) + gammalnX))\
                        + np.sum(np.sum(-X * ( ((L_t * g['E_logT']).dot(L_v) + L_t.dot(L_v * g['E_logV']))/(LtLv) - logm(LtLv) ) ))\
                        + np.sum(np.sum((-a_tm/b_tm)* g['E_T'] - special.gammaln(a_tm) + a_tm * logm(a_tm /b_tm)))\
                        + np.sum(np.sum((-a_ve/b_ve)* g['E_V'] - special.gammaln(a_ve) + a_ve * logm(a_ve /b_ve)))\
                        + np.sum(np.sum( special.gammaln(alpha_tm) + alpha_tm * logm(beta_tm) + 1))\
                        + np.sum(np.sum(special.gammaln(alpha_ve) + alpha_ve * logm(beta_ve) + 1 ))

            g['a_ve'] = a_ve
            g['b_ve'] = b_ve
            g['a_tm'] = a_tm
            g['b_tm'] = b_tm

            print()
            print( g['Bound'], a_ve.flatten()[0], b_ve.flatten()[0], a_tm.flatten()[0], b_tm.flatten()[0])
        if (e == EPOCH):
            break
        L_t = np.exp(special.psi(alpha_tm)) * beta_tm
        L_v = np.exp(special.psi(alpha_ve)) * beta_ve

        Z = None
        if( e> Update):
            if(not tie_a_tm == 'clamp' ):
                Z = (E_t / b_tm) - (logm(L_t) - logm(b_tm))
                if(tie_a_tm == 'free'):   # per-element update (this branch was unreachable as 'clamp' in the port)
                    a_tm = gnmf_solvebynewton(Z,a0=a_tm)
                elif(tie_a_tm == 'rows'):
                    a_tm = gnmf_solvebynewton(np.sum(Z,0)/W, a0=a_tm)
                elif(tie_a_tm == 'cols'):
                    a_tm = gnmf_solvebynewton(np.sum(Z,1)/I, a0=a_tm)
                elif(tie_a_tm == 'tie_all'):
                    #print(np.sum(Z)/(W * I))
                    #print(a_tm)
                    a_tm = gnmf_solvebynewton(np.sum(Z)/(W * I), a0=a_tm)

            if(tie_b_tm == 'free'):
                b_tm = E_t
            elif(tie_b_tm == 'rows'):
                # the numpy.matlib alias M is shadowed by the mask, so replicate with np.tile instead
                b_tm = np.tile(np.sum(a_tm * E_t,0)/np.sum(a_tm,0), (W,1))
            elif(tie_b_tm == 'cols'):
                b_tm = np.tile((np.sum(a_tm * E_t,1)/np.sum(a_tm,1)).reshape(-1,1), (1,I))
            elif(tie_b_tm == 'tie_all'):
                b_tm = (np.sum(a_tm*E_t)/ np.sum(a_tm)) * np.ones([W,I])

            if(not tie_a_ve == 'clamp' ):
                Z = (E_v / b_ve) - (logm(L_v) - logm(b_ve))
                if(tie_a_ve == 'free'):   # per-element update (this branch was unreachable as 'clamp' in the port)
                    a_ve = gnmf_solvebynewton(Z,a0=a_ve)
                elif(tie_a_ve == 'rows'):
                    a_ve = gnmf_solvebynewton(np.sum(Z,0)/I, a0=a_ve)
                elif(tie_a_ve == 'cols'):
                    a_ve = gnmf_solvebynewton(np.sum(Z,1)/K, a0=a_ve)
                elif(tie_a_ve == 'tie_all'):
                    a_ve = gnmf_solvebynewton(np.sum(Z)/(I * K), a0=a_ve)

            if(tie_b_ve == 'free'):
                b_ve = E_v
            elif(tie_b_ve == 'rows'):
                b_ve = np.tile(np.sum(a_ve * E_v,0)/np.sum(a_ve,0), (I,1))
            elif(tie_b_ve == 'cols'):     # was mistakenly checking tie_b_tm
                b_ve = np.tile((np.sum(a_ve * E_v,1)/np.sum(a_ve,1)).reshape(-1,1), (1,K))
            elif(tie_b_ve == 'tie_all'):  # was mistakenly checking tie_b_tm
                b_ve = (np.sum(a_ve*E_v)/ np.sum(a_ve)) * np.ones([I,K])
    return g

In [8]:
n_topics = 17
n_top_words = 7
n_top_topics = 3

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        sm = sum(topic)
        print("Topic #%d:" % topic_idx)
        for i in topic.argsort()[:-n_top_words - 1:-1]:
            print("(%s, %lf)  " %(feature_names[i], topic[i]/sm), end='')
        print()
    print()
    
def print_top_words2(H, feature_names, n_top_words):
    for topic_idx, topic in enumerate(H):
        sm = sum(topic)
        print("Topic #%d:" % topic_idx)
        for i in topic.argsort()[:-n_top_words - 1:-1]:
            print("(%s, %lf)  " %(feature_names[i], topic[i]/sm), end='')
        print()
    print()
    
def print_top_topics(doc_topic, user, n_top_topics):
    for i in doc_topic[user].argsort()[:-n_top_topics - 1:-1]:
        print("(%d, %lf)  " %(i, doc_topic[user][i]), end='')
    
def topicAndWords(model, doc_topic, user, feature_names):
    model_comp = model.components_
    for i in doc_topic[user].argsort()[:-3 - 1:-1]:
        print("(%d, %lf)  " %(i, doc_topic[user][i]), end='')
        sm = sum(model_comp[i])
        for j in model_comp[i].argsort()[:-3 - 1:-1]:
            print("(%s, %lf)  " %(feature_names[j], model_comp[i][j]/sm), end='')
        print()

Sci-kit NMF


In [8]:
n_samples = len(tweetList)
n_features = len(dictionary)

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")

tfidf_vectorizer = TfidfVectorizer(max_features=n_features)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(tweetList)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer)
pyLDAvis.display(nmf_vis_data)


Extracting tf-idf features for NMF...
done in 11.119s.
Fitting the NMF model with tf-idf features, n_samples=998 and n_features=45262...
done in 9.387s.

Topics in NMF model:
Topic #0:
(work, 0.004914)  (time, 0.004630)  (look, 0.004345)  (think, 0.003899)  (peopl, 0.003432)  (day, 0.003389)  (need, 0.003280)  
Topic #1:
(data, 0.241154)  (scienc, 0.044569)  (analyt, 0.036594)  (learn, 0.036334)  (big, 0.031454)  (scientist, 0.026678)  (machin, 0.022504)  
Topic #2:
(bigdata, 0.154639)  (analyt, 0.068607)  (data, 0.043607)  (iot, 0.042267)  (big, 0.025896)  (busi, 0.017473)  (market, 0.015943)  
Topic #3:
(learn, 0.040745)  (deep, 0.024388)  (neural, 0.022234)  (paper, 0.018215)  (deeplearn, 0.017165)  (machin, 0.015572)  (network, 0.013196)  
Topic #4:
(3dprint, 0.249377)  (3d, 0.125076)  (print, 0.101585)  (printer, 0.035551)  (3dprinter, 0.033668)  (design, 0.022039)  (makerbot, 0.016994)  
Topic #5:
(arduino, 0.336034)  (maker, 0.035116)  (raspberrypi, 0.033419)  (shield, 0.026470)  (kit, 0.023299)  (iot, 0.021990)  (project, 0.021461)  
Topic #6:
(datasci, 0.281868)  (machinelearn, 0.194243)  (bigdata, 0.070380)  (deeplearn, 0.070199)  (python, 0.031149)  (learn, 0.022705)  (datascientist, 0.016012)  
Topic #7:
(robot, 0.348339)  (autom, 0.022113)  (robohub, 0.015528)  (manufactur, 0.015457)  (omgrobot, 0.014365)  (drone, 0.014315)  (industri, 0.014309)  
Topic #8:
(edtech, 0.058550)  (stem, 0.056664)  (maker, 0.037456)  (student, 0.028015)  (edchat, 0.026841)  (learn, 0.026525)  (code, 0.026166)  
Topic #9:
(python, 0.108389)  (ipython, 0.071362)  (notebook, 0.039440)  (pydata, 0.039071)  (conda, 0.033065)  (jupyt, 0.031849)  (panda, 0.023957)  
Topic #10:
(kuka, 0.614207)  (ukmfg, 0.070367)  (kr, 0.050475)  (iiwa, 0.049722)  (agilus, 0.049576)  (lbr, 0.032453)  (robot, 0.022532)  
Topic #11:
(drone, 0.379757)  (uav, 0.114230)  (ua, 0.047400)  (dji, 0.034248)  (aerial, 0.024703)  (fpv, 0.021805)  (faa, 0.021736)  
Topic #12:
(tableau, 0.165024)  (makeovermonday, 0.112358)  (data16, 0.072469)  (viz, 0.068485)  (ironviz, 0.037251)  (data, 0.031616)  (dataviz, 0.030257)  
Topic #13:
(hadoop, 0.123041)  (cloudera, 0.068899)  (apach, 0.066658)  (kafka, 0.042116)  (stratahadoop, 0.040809)  (spark, 0.033302)  (data, 0.027932)  
Topic #14:
(ai, 0.189789)  (iot, 0.049728)  (machinelearn, 0.042668)  (artificialintellig, 0.030763)  (intellig, 0.027982)  (artifici, 0.026995)  (fintech, 0.019612)  
Topic #15:
(rstat, 0.417512)  (rstudio, 0.044314)  (datasci, 0.037892)  (packag, 0.037259)  (ggplot2, 0.031521)  (datadc, 0.029394)  (rstudioconf, 0.025596)  
Topic #16:
(trump, 0.156611)  (clinton, 0.026745)  (gop, 0.022901)  (russia, 0.021104)  (presid, 0.020789)  (comey, 0.019030)  (elect, 0.018357)  

Out[8]: (interactive pyLDAvis topic visualization; not rendered in this export)

Sci-kit NMF with word2vec


In [20]:
n_samples = len(tweetListVW)
n_features = len(dictionaryVW)

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")

tfidf_vectorizerWV = TfidfVectorizer(max_features=n_features)
t0 = time()
tfidfWV = tfidf_vectorizerWV.fit_transform(tweetListVW)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
nmfWV = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidfWV)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizerWV.get_feature_names()
print_top_words(nmfWV, tfidf_feature_names, n_top_words)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(nmfWV, tfidfWV, tfidf_vectorizerWV)
pyLDAvis.display(nmf_vis_data)


Extracting tf-idf features for NMF...
done in 6.401s.
Fitting the NMF model with tf-idf features, n_samples=998 and n_features=1991...
done in 2.732s.

Topics in NMF model:
Topic #0:
(day, 0.008128)  (time, 0.007678)  (peopl, 0.006849)  (done, 0.006358)  (thing, 0.006102)  (us, 0.005997)  (use, 0.005837)  
Topic #1:
(data, 0.183956)  (analyt, 0.041112)  (scienc, 0.028836)  (big, 0.025990)  (learn, 0.021398)  (scientist, 0.017654)  (bigdata, 0.015501)  
Topic #2:
(bigdata, 0.232107)  (machinelearn, 0.118212)  (analyt, 0.073407)  (datamin, 0.051580)  (artificialintellig, 0.037845)  (iot, 0.031193)  (big, 0.021858)  
Topic #3:
(learn, 0.055295)  (deep, 0.029912)  (neural, 0.025559)  (machinelearn, 0.024601)  (machin, 0.024289)  (paper, 0.022535)  (artificialintellig, 0.020335)  
Topic #4:
(3dprint, 0.273420)  (3d, 0.138683)  (print, 0.121548)  (printer, 0.044875)  (3dprinter, 0.036580)  (design, 0.030340)  (manufactur, 0.019320)  
Topic #5:
(arduino, 0.399388)  (kit, 0.042451)  (shield, 0.033335)  (project, 0.026665)  (diy, 0.025277)  (electron, 0.023424)  (raspberrypi, 0.019443)  
Topic #6:
(edtech, 0.069536)  (stem, 0.055575)  (code, 0.055182)  (learn, 0.037266)  (edchat, 0.032892)  (teacher, 0.029682)  (kid, 0.025035)  
Topic #7:
(robot, 0.362028)  (kuka, 0.023951)  (autom, 0.019504)  (kit, 0.019292)  (artificialintellig, 0.017896)  (robounivers, 0.016750)  (omgrobot, 0.014956)  
Topic #8:
(python, 0.175032)  (pydata, 0.053101)  (jupyt, 0.050146)  (data, 0.031754)  (scikit, 0.029542)  (anaconda, 0.027339)  (panda, 0.026546)  
Topic #9:
(trump, 0.169608)  (peopl, 0.026274)  (vote, 0.019955)  (obama, 0.018103)  (us, 0.018064)  (white, 0.015940)  (hillari, 0.014115)  
Topic #10:
(startup, 0.022973)  (busi, 0.020746)  (innov, 0.019758)  (tech, 0.018304)  (iot, 0.012634)  (entrepreneur, 0.012587)  (artificialintellig, 0.011731)  
Topic #11:
(drone, 0.477388)  (uav, 0.162813)  (aerial, 0.038346)  (fpv, 0.028588)  (dji, 0.028420)  (fli, 0.026127)  (dronenew, 0.025871)  
Topic #12:
(pi, 0.191908)  (raspberri, 0.147220)  (raspberrypi, 0.120222)  (zero, 0.036511)  (kit, 0.035686)  (project, 0.020166)  (board, 0.020116)  
Topic #13:
(rstat, 0.476965)  (data, 0.052888)  (datamin, 0.047756)  (packag, 0.042576)  (machinelearn, 0.030565)  (use, 0.028144)  (statist, 0.026831)  
Topic #14:
(tableau, 0.231891)  (makeovermonday, 0.152681)  (viz, 0.102653)  (data16, 0.100979)  (ironviz, 0.054593)  (dataviz, 0.046260)  (data, 0.041919)  
Topic #15:
(apach, 0.102599)  (kafka, 0.078088)  (cloudera, 0.074239)  (spark, 0.056054)  (nosql, 0.049354)  (data, 0.032301)  (stream, 0.029791)  
Topic #16:
(maker, 0.187175)  (makerspac, 0.062676)  (makerfair, 0.046608)  (learn, 0.024537)  (fair, 0.023221)  (project, 0.023045)  (kit, 0.022005)  

Out[20]: (interactive pyLDAvis topic visualization; not rendered in this export)

KLBNMF with term frequency


In [9]:
n_samples = len(tweetList)
n_features = len(dictionary)

# Use tf-idf features for NMF.
print("Extracting tf features for NMF...")

tf_vectorizer = CountVectorizer(max_features=n_features)
t0 = time()
tf = tf_vectorizer.fit_transform(tweetList)

#tfidf_vectorizer = TfidfVectorizer(max_features=n_features)
#t0 = time()
#tfidf = tfidf_vectorizer.fit_transform(tweetList)
print("done in %0.3fs." % (time() - t0))


Extracting tf features for NMF...
done in 11.969s.

In [11]:
tfDense = tf.todense()
tfDense2 = tfDense
#idx = np.where(tfidfDense2>100)
#print(len(idx[0]))

W = tf.shape[0]
K = tf.shape[1]
I = n_topics

a_tm = 10 * np.ones([W,I])
b_tm = np.ones([W,I])
a_ve = np.ones([I,K])
b_ve = 10 * np.ones([I,K])

#T = np.random.gamma(a_tm,b_tm)
#V = np.random.gamma(a_ve,b_ve)

#x = np.random.poisson(T.dot(V))

#idx = np.where(x>100)
#print(len(idx[0]))

t0 = time()

klbnmf = gnmf_vb_poisson_mult_fast(np.asarray(tfDense2),a_tm,b_tm,a_ve,b_ve,
                                EPOCH=500,
                                Update =10,
                                tie_a_ve='tie_all',
                                tie_b_ve='tie_all',
                                tie_a_tm='tie_all',
                                tie_b_tm='tie_all')

print("done in %0.3fs." % (time() - t0))


*
-72038208.7544 1.0 10.0 10.0 1.0
*************************************************
-60658179.8308 0.0330212241114 9.1040319085 0.201508692109 0.00265750213046
done in 703.456s.

We can directly obtain the word distributions of the topics from the factorized matrices generated by the KLBNMF algorithm; the output is shown below. However, to visualize the output we wanted to use the LDAvis library. To do that, we reconstructed our initial matrix by multiplying the KLBNMF outputs (klbnmf['E_T'], klbnmf['E_V']) and fed it into the regular scikit-learn NMF. Then we passed the resulting NMF model to LDAvis to visualize the word-topic distribution. Inspecting both results, there is nearly no difference, and with LDAvis we get the visualization. After this example we follow the same visualization process for all the outputs. If you prefer, you can use the function below to obtain the word-topic distribution directly.


In [12]:
print_top_words2(klbnmf['E_V'], tf_vectorizer.get_feature_names(), n_top_words)


Topic #0:
(3dprint, 0.051522)  (3d, 0.051316)  (print, 0.046890)  (design, 0.018694)  (printer, 0.011010)  (fashion, 0.009989)  (maker, 0.009214)  
Topic #1:
(stori, 0.010387)  (innov, 0.009948)  (busi, 0.007428)  (global, 0.007254)  (world, 0.006177)  (daili, 0.005670)  (startup, 0.005619)  
Topic #2:
(us, 0.031361)  (game, 0.020392)  (influenc, 0.014002)  (includ, 0.013034)  (check, 0.012780)  (odsc, 0.012566)  (free, 0.011940)  
Topic #3:
(work, 0.009236)  (think, 0.009111)  (time, 0.007206)  (peopl, 0.007199)  (look, 0.006239)  (thing, 0.005589)  (need, 0.005499)  
Topic #4:
(learn, 0.028403)  (python, 0.018148)  (data, 0.014590)  (deep, 0.011802)  (machin, 0.011490)  (use, 0.009740)  (model, 0.009695)  
Topic #5:
(data, 0.076920)  (bigdata, 0.068380)  (datasci, 0.049201)  (analyt, 0.026873)  (machinelearn, 0.025399)  (big, 0.022977)  (hadoop, 0.013304)  
Topic #6:
(stem, 0.025063)  (code, 0.022266)  (learn, 0.019078)  (student, 0.017609)  (edtech, 0.015508)  (educ, 0.015171)  (school, 0.012714)  
Topic #7:
(market, 0.026355)  (twitter, 0.017855)  (follow, 0.013087)  (busi, 0.012010)  (startup, 0.010916)  (social, 0.010887)  (digit, 0.010534)  
Topic #8:
(love, 0.012852)  (day, 0.008063)  (look, 0.005976)  (today, 0.005925)  (time, 0.005880)  (happi, 0.004870)  (book, 0.004837)  
Topic #9:
(data, 0.027918)  (talk, 0.013966)  (us, 0.011710)  (join, 0.010495)  (learn, 0.009439)  (check, 0.008125)  (help, 0.007715)  
Topic #10:
(data, 0.026577)  (analyt, 0.025800)  (cloud, 0.014582)  (busi, 0.012207)  (custom, 0.007546)  (ibm, 0.007268)  (today, 0.007207)  
Topic #11:
(manufactur, 0.019522)  (de, 0.012234)  (la, 0.007425)  (job, 0.007012)  (ddj, 0.006784)  (en, 0.005823)  (industri, 0.005749)  
Topic #12:
(ai, 0.052142)  (iot, 0.030243)  (tech, 0.014939)  (technolog, 0.010586)  (machinelearn, 0.010118)  (intellig, 0.009229)  (robot, 0.008495)  
Topic #13:
(scienc, 0.012454)  (research, 0.009733)  (student, 0.008013)  (today, 0.005633)  (us, 0.004692)  (open, 0.004548)  (work, 0.004524)  
Topic #14:
(robot, 0.051806)  (us, 0.009226)  (look, 0.008898)  (check, 0.007433)  (team, 0.006791)  (day, 0.005858)  (product, 0.005854)  
Topic #15:
(drone, 0.025118)  (trump, 0.010933)  (news, 0.005309)  (us, 0.005306)  (appl, 0.004308)  (uav, 0.003844)  (first, 0.003046)  
Topic #16:
(arduino, 0.030326)  (maker, 0.014935)  (project, 0.012441)  (kit, 0.010000)  (build, 0.009642)  (pi, 0.008443)  (use, 0.007306)  


In [14]:
tfKL = np.dot(klbnmf['E_T'],klbnmf['E_V'])

# Fit the NMF model
print("Fitting the KLBNMF model with tf-idf(klbnmf['E_T']*klbnmf['E_V']) features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
klbnmf2tf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfKL)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(klbnmf2tf, tfidf_feature_names, n_top_words)

tfKLsparse = sparse.csr_matrix(tfKL)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(klbnmf2tf, tfKLsparse, tf_vectorizer)
pyLDAvis.display(nmf_vis_data)


Fitting the KLBNMF model with tf-idf(klbnmf['E_T']*klbnmf['E_V']) features, n_samples=998 and n_features=45262...
done in 15.380s.

Topics in NMF model:
Topic #0:
(data, 0.077000)  (bigdata, 0.068454)  (datasci, 0.049255)  (analyt, 0.026900)  (machinelearn, 0.025428)  (big, 0.022999)  (hadoop, 0.013318)  
Topic #1:
(work, 0.009170)  (think, 0.009043)  (time, 0.007189)  (peopl, 0.007150)  (look, 0.006232)  (thing, 0.005561)  (need, 0.005478)  
Topic #2:
(3dprint, 0.051537)  (3d, 0.051330)  (print, 0.046903)  (design, 0.018700)  (printer, 0.011013)  (fashion, 0.009991)  (maker, 0.009216)  
Topic #3:
(ai, 0.052181)  (iot, 0.030266)  (tech, 0.014949)  (technolog, 0.010592)  (machinelearn, 0.010126)  (intellig, 0.009236)  (robot, 0.008505)  
Topic #4:
(stem, 0.025053)  (code, 0.022259)  (learn, 0.019070)  (student, 0.017602)  (edtech, 0.015502)  (educ, 0.015166)  (school, 0.012710)  
Topic #5:
(learn, 0.028273)  (python, 0.018056)  (data, 0.014552)  (deep, 0.011744)  (machin, 0.011435)  (use, 0.009719)  (model, 0.009652)  
Topic #6:
(robot, 0.051889)  (us, 0.009240)  (look, 0.008902)  (check, 0.007444)  (team, 0.006801)  (product, 0.005862)  (day, 0.005862)  
Topic #7:
(arduino, 0.030395)  (maker, 0.014973)  (project, 0.012463)  (kit, 0.010018)  (build, 0.009650)  (pi, 0.008462)  (use, 0.007314)  
Topic #8:
(data, 0.026595)  (analyt, 0.025810)  (cloud, 0.014586)  (busi, 0.012210)  (custom, 0.007549)  (ibm, 0.007271)  (today, 0.007206)  
Topic #9:
(market, 0.026373)  (twitter, 0.017867)  (follow, 0.013095)  (busi, 0.012018)  (startup, 0.010924)  (social, 0.010894)  (digit, 0.010541)  
Topic #10:
(stori, 0.010385)  (innov, 0.009949)  (busi, 0.007428)  (global, 0.007254)  (world, 0.006177)  (daili, 0.005671)  (startup, 0.005619)  
Topic #11:
(us, 0.031420)  (game, 0.020430)  (influenc, 0.014030)  (includ, 0.013060)  (check, 0.012802)  (odsc, 0.012591)  (free, 0.011962)  
Topic #12:
(drone, 0.025118)  (trump, 0.010933)  (news, 0.005312)  (us, 0.005307)  (appl, 0.004308)  (uav, 0.003844)  (first, 0.003048)  
Topic #13:
(data, 0.027646)  (talk, 0.013834)  (us, 0.011623)  (join, 0.010398)  (learn, 0.009356)  (check, 0.008080)  (help, 0.007666)  
Topic #14:
(scienc, 0.012460)  (research, 0.009738)  (student, 0.008018)  (today, 0.005627)  (us, 0.004691)  (open, 0.004551)  (work, 0.004533)  
Topic #15:
(manufactur, 0.019541)  (de, 0.012245)  (la, 0.007432)  (job, 0.007018)  (ddj, 0.006790)  (en, 0.005828)  (industri, 0.005755)  
Topic #16:
(love, 0.012877)  (day, 0.008073)  (look, 0.005960)  (today, 0.005939)  (time, 0.005857)  (happi, 0.004880)  (book, 0.004843)  

Out[14]: (interactive pyLDAvis topic visualization; not rendered in this export)

KLBNMF with term frequency - inverse document frequency

In the tf-idf approach we needed to multiply the corpus values by 10000 to get sensible results: the model assumes Poisson-distributed observations, which expects count-like values, while raw tf-idf weights are small fractions. Compare this with the term-frequency approach above, whose values are integer counts by nature, so we did not need to scale them at all.


In [9]:
tfidfDense = tfidf.todense()
tfidfDense2 = tfidfDense*10000
#idx = np.where(tfidfDense2>100)
#print(len(idx[0]))

W = tfidf.shape[0]
K = tfidf.shape[1]
I = n_topics

a_tm = 1 * np.ones([W,I])
b_tm = np.ones([W,I])
a_ve = np.ones([I,K])
b_ve = 8 * np.ones([I,K])

#T = np.random.gamma(a_tm,b_tm)
#V = np.random.gamma(a_ve,b_ve)

#x = np.random.poisson(T.dot(V))

#idx = np.where(x>100)
#print(len(idx[0]))

t0 = time()

klbnmf = gnmf_vb_poisson_mult_fast(np.asarray(tfidfDense2),a_tm,b_tm,a_ve,b_ve,
                                EPOCH=500,
                                Update =10,
                                tie_a_ve='tie_all',
                                tie_b_ve='tie_all',
                                tie_a_tm='tie_all',
                                tie_b_tm='tie_all')

print("done in %0.3fs." % (time() - t0))


*
-736129414.445 1.0 8.0 1.0 1.0
*************************************************
-640927794.57 0.0520008057635 7.97555284127 0.0665246518457 0.0299543017417
done in 643.631s.

In [11]:
tfidfKL = np.dot(klbnmf['E_T'],klbnmf['E_V'])

# Fit the NMF model
print("Fitting the KLBNMF model with tf-idf(klbnmf['E_T']*klbnmf['E_V']) features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
klbnmf2 = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidfKL)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(klbnmf2, tfidf_feature_names, n_top_words)

tfidfKLsparse = sparse.csr_matrix(tfidfKL)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(klbnmf2, tfidfKLsparse, tfidf_vectorizer)
pyLDAvis.display(nmf_vis_data)


Fitting the KLBNMF model with tf-idf(klbnmf['E_T']*klbnmf['E_V']) features, n_samples=998 and n_features=45262...
done in 20.213s.

Topics in NMF model:
Topic #0:
(data, 0.006582)  (think, 0.005456)  (work, 0.005071)  (time, 0.004175)  (peopl, 0.004022)  (look, 0.003577)  (thing, 0.003496)  
Topic #1:
(data, 0.035618)  (bigdata, 0.027927)  (datasci, 0.022010)  (analyt, 0.017453)  (machinelearn, 0.013725)  (ai, 0.010039)  (iot, 0.008771)  
Topic #2:
(arduino, 0.033056)  (robot, 0.014797)  (maker, 0.006043)  (kit, 0.005478)  (iot, 0.005437)  (project, 0.004892)  (raspberrypi, 0.004839)  
Topic #3:
(learn, 0.012937)  (ai, 0.007429)  (deep, 0.006982)  (neural, 0.006111)  (paper, 0.005890)  (deeplearn, 0.005710)  (machin, 0.005583)  
Topic #4:
(3dprint, 0.035776)  (3d, 0.019450)  (print, 0.017560)  (printer, 0.006301)  (design, 0.005387)  (3dprinter, 0.004917)  (makerbot, 0.004126)  
Topic #5:
(robot, 0.013779)  (drone, 0.013642)  (ai, 0.005539)  (tech, 0.005433)  (iot, 0.004645)  (wearabl, 0.004533)  (vr, 0.004264)  
Topic #6:
(stem, 0.011202)  (edtech, 0.011067)  (code, 0.009961)  (learn, 0.008700)  (student, 0.008148)  (maker, 0.007695)  (teacher, 0.006222)  
Topic #7:
(python, 0.018802)  (ipython, 0.009899)  (pydata, 0.006602)  (notebook, 0.006354)  (jupyt, 0.005351)  (conda, 0.004800)  (panda, 0.004571)  
Topic #8:
(trump, 0.008943)  (us, 0.002730)  (peopl, 0.002481)  (year, 0.002473)  (world, 0.002001)  (presid, 0.001957)  (say, 0.001839)  
Topic #9:
(hadoop, 0.010350)  (apach, 0.007553)  (data, 0.006457)  (kafka, 0.006431)  (spark, 0.006018)  (stratahadoop, 0.006011)  (talk, 0.005803)  
Topic #10:
(startup, 0.008221)  (market, 0.006151)  (busi, 0.005180)  (entrepreneur, 0.005127)  (tech, 0.003163)  (social, 0.003066)  (innov, 0.002901)  
Topic #11:
(tableau, 0.014755)  (makeovermonday, 0.009816)  (viz, 0.007770)  (data16, 0.006633)  (data, 0.006548)  (dataviz, 0.005001)  (3dexperi, 0.004083)  
Topic #12:
(pi, 0.005444)  (raspberri, 0.003846)  (look, 0.003673)  (lego, 0.002754)  (us, 0.002711)  (build, 0.002710)  (time, 0.002672)  
Topic #13:
(manufactur, 0.007536)  (innov, 0.005379)  (robot, 0.005299)  (ai, 0.004995)  (healthcar, 0.004837)  (industri, 0.003953)  (technolog, 0.003660)  
Topic #14:
(scienc, 0.005965)  (research, 0.003975)  (opendata, 0.002777)  (climatechang, 0.002605)  (today, 0.002421)  (student, 0.002310)  (day, 0.002254)  
Topic #15:
(look, 0.004635)  (uk, 0.003442)  (day, 0.003161)  (us, 0.003121)  (today, 0.003073)  (retrogam, 0.003027)  (love, 0.002975)  
Topic #16:
(day, 0.004204)  (love, 0.003371)  (time, 0.003229)  (today, 0.002466)  (look, 0.002152)  (kid, 0.002051)  (work, 0.002026)  

Out[11]: (interactive pyLDAvis topic visualization; not rendered in this export)

KLBNMF with term frequency - inverse document frequency with word2vec


In [21]:
tfidfDenseWV = tfidfWV.todense()
tfidfDense2WV = tfidfDenseWV*10000
#idx = np.where(tfidfDense2>100)
#print(len(idx[0]))

W = tfidfWV.shape[0]
K = tfidfWV.shape[1]
I = n_topics

a_tm = 1 * np.ones([W,I])
b_tm = np.ones([W,I])
a_ve = np.ones([I,K])
b_ve = 8 * np.ones([I,K])

#T = np.random.gamma(a_tm,b_tm)
#V = np.random.gamma(a_ve,b_ve)

#x = np.random.poisson(T.dot(V))

#idx = np.where(x>100)
#print(len(idx[0]))

klbnmfWV = gnmf_vb_poisson_mult_fast(np.asarray(tfidfDense2WV),a_tm,b_tm,a_ve,b_ve,
                                EPOCH=1000,
                                Update =10,
                                tie_a_ve='tie_all',
                                tie_b_ve='tie_all',
                                tie_a_tm='tie_all',
                                tie_b_tm='tie_all')


*
-348689703.074 1.0 8.0 1.0 1.0
**************************************************
-301662516.133 0.178982062587 7.98024126501 0.0853115805522 0.441200384068
*************************************************
-301660493.234 0.177644143738 7.96804094317 0.0844545488278 0.441191024482

In [22]:
tfidfKLWV = np.dot(klbnmfWV['E_T'],klbnmfWV['E_V'])

# Fit the NMF model
print("Fitting the KLBNMF model with tf-idf(klbnmf['E_T']*klbnmf['E_V']) features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
klbnmfWV2 = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidfKLWV)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizerWV.get_feature_names()
print_top_words(klbnmfWV2, tfidf_feature_names, n_top_words)

tfidfKLsparseWV = sparse.csr_matrix(tfidfKLWV)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(klbnmfWV2, tfidfKLsparseWV, tfidf_vectorizerWV)
pyLDAvis.display(nmf_vis_data)


Fitting the KLBNMF model with tf-idf(klbnmf['E_T']*klbnmf['E_V']) features, n_samples=998 and n_features=1991...
done in 0.780s.

Topics in NMF model:
Topic #0:
(time, 0.013646)  (thing, 0.013603)  (peopl, 0.012276)  (day, 0.011863)  (done, 0.011262)  (look, 0.009092)  (take, 0.007558)  
Topic #1:
(bigdata, 0.081065)  (machinelearn, 0.059408)  (data, 0.032551)  (artificialintellig, 0.031902)  (analyt, 0.027554)  (datamin, 0.020429)  (iot, 0.019658)  
Topic #2:
(robot, 0.075723)  (drone, 0.044026)  (uav, 0.011424)  (wearabl, 0.011403)  (tech, 0.010894)  (car, 0.009853)  (omgrobot, 0.009625)  
Topic #3:
(learn, 0.042553)  (deep, 0.025870)  (neural, 0.023508)  (paper, 0.018302)  (machin, 0.015586)  (model, 0.014856)  (use, 0.012216)  
Topic #4:
(3dprint, 0.113673)  (3d, 0.068146)  (print, 0.061105)  (manufactur, 0.026201)  (printer, 0.022611)  (design, 0.019339)  (3dprinter, 0.014940)  
Topic #5:
(data, 0.081904)  (python, 0.057536)  (rstat, 0.034199)  (scienc, 0.023673)  (learn, 0.019013)  (statist, 0.014533)  (pydata, 0.014239)  
Topic #6:
(edtech, 0.032878)  (stem, 0.029990)  (code, 0.029706)  (learn, 0.023071)  (teacher, 0.019558)  (school, 0.017575)  (kid, 0.016763)  
Topic #7:
(arduino, 0.056200)  (maker, 0.032523)  (kit, 0.022393)  (pi, 0.021406)  (robot, 0.016548)  (raspberri, 0.016265)  (project, 0.015941)  
Topic #8:
(data, 0.062562)  (analyt, 0.036807)  (bigdata, 0.022004)  (apach, 0.015627)  (big, 0.013670)  (spark, 0.012503)  (nosql, 0.012152)  
Topic #9:
(us, 0.026720)  (day, 0.017518)  (join, 0.017088)  (check, 0.014357)  (team, 0.013084)  (help, 0.011212)  (excit, 0.010578)  
Topic #10:
(trump, 0.032157)  (us, 0.011915)  (peopl, 0.011671)  (twitter, 0.009045)  (news, 0.006651)  (vote, 0.006046)  (report, 0.005748)  
Topic #11:
(data, 0.043081)  (tableau, 0.042927)  (dataviz, 0.027350)  (viz, 0.025625)  (makeovermonday, 0.025147)  (visual, 0.020215)  (indiedev, 0.018898)  
Topic #12:
(startup, 0.031082)  (busi, 0.022587)  (tech, 0.020072)  (innov, 0.016951)  (entrepreneur, 0.015301)  (compani, 0.013626)  (digit, 0.010756)  
Topic #13:
(data, 0.022629)  (scienc, 0.017825)  (research, 0.016573)  (comput, 0.011928)  (paper, 0.009507)  (learn, 0.009301)  (talk, 0.008680)  
Topic #14:
(art, 0.011043)  (game, 0.009231)  (design, 0.008478)  (watch, 0.008126)  (music, 0.008032)  (book, 0.007939)  (stori, 0.007483)  
Topic #15:
(app, 0.014039)  (use, 0.013762)  (googl, 0.013576)  (releas, 0.011230)  (code, 0.010975)  (open, 0.010648)  (post, 0.009227)  
Topic #16:
(health, 0.011468)  (healthcar, 0.010648)  (innov, 0.009932)  (help, 0.009773)  (world, 0.009567)  (global, 0.008044)  (chang, 0.007934)  

Out[22]: (interactive pyLDAvis topic visualization; not rendered in this export)

Results

Comparison of the methods on selected users


In [13]:
ids = []
chosenNames = ['andrewyng', 'radbuzzz', 'RoboticsEU', 'karpathy', 'polar3d', 'thearduinoguy']

# @andrewyng => 216939636
ids.append(userList2.index('216939636'))
# @radbuzzz => 28953366 
ids.append(userList2.index('28953366'))
# @RoboticsEU => 335419621
ids.append(userList2.index('335419621'))
# @karpathy => 33836629
ids.append(userList2.index('33836629'))
# @polar3d => 2875670213
ids.append(userList2.index('2875670213'))
# @thearduinoguy => 15392736
ids.append(userList2.index('15392736'))

In [23]:
from operator import itemgetter
klbnmf2Topics = klbnmf2.transform(tfidfKLsparse)
klbnmfWV2Topics = klbnmfWV2.transform(tfidfKLsparseWV)

In [26]:
for i, id in enumerate(ids):
    print(chosenNames[i])
    print("KLBNMF")
    topicAndWords(klbnmf2, klbnmf2Topics, id, tfidf_vectorizer.get_feature_names())
    print()
    print("KLBNMF (w2v)")
    topicAndWords(klbnmfWV2, klbnmfWV2Topics, id, tfidf_vectorizerWV.get_feature_names())
    print()


andrewyng
KLBNMF
(3, 12.496066)  (learn, 0.012937)  (ai, 0.007429)  (deep, 0.006982)  
(13, 9.892293)  (manufactur, 0.007536)  (innov, 0.005379)  (robot, 0.005299)  
(6, 6.061536)  (stem, 0.011202)  (edtech, 0.011067)  (code, 0.009961)  

KLBNMF (w2v)
(13, 22.401302)  (data, 0.022629)  (scienc, 0.017825)  (research, 0.016573)  
(12, 6.457801)  (startup, 0.031082)  (busi, 0.022587)  (tech, 0.020072)  
(3, 5.859447)  (learn, 0.042553)  (deep, 0.025870)  (neural, 0.023508)  

radbuzzz
KLBNMF
(13, 14.581962)  (manufactur, 0.007536)  (innov, 0.005379)  (robot, 0.005299)  
(4, 9.780511)  (3dprint, 0.035776)  (3d, 0.019450)  (print, 0.017560)  
(5, 6.421823)  (robot, 0.013779)  (drone, 0.013642)  (ai, 0.005539)  

KLBNMF (w2v)
(4, 16.783202)  (3dprint, 0.113673)  (3d, 0.068146)  (print, 0.061105)  
(16, 12.949837)  (health, 0.011468)  (healthcar, 0.010648)  (innov, 0.009932)  
(1, 4.218421)  (bigdata, 0.081065)  (machinelearn, 0.059408)  (data, 0.032551)  

RoboticsEU
KLBNMF
(14, 18.949887)  (scienc, 0.005965)  (research, 0.003975)  (opendata, 0.002777)  
(13, 6.710685)  (manufactur, 0.007536)  (innov, 0.005379)  (robot, 0.005299)  
(5, 4.656647)  (robot, 0.013779)  (drone, 0.013642)  (ai, 0.005539)  

KLBNMF (w2v)
(2, 11.468050)  (robot, 0.075723)  (drone, 0.044026)  (uav, 0.011424)  
(16, 7.160622)  (health, 0.011468)  (healthcar, 0.010648)  (innov, 0.009932)  
(9, 6.578083)  (us, 0.026720)  (day, 0.017518)  (join, 0.017088)  

karpathy
KLBNMF
(3, 21.955956)  (learn, 0.012937)  (ai, 0.007429)  (deep, 0.006982)  
(0, 12.272301)  (data, 0.006582)  (think, 0.005456)  (work, 0.005071)  
(5, 3.099935)  (robot, 0.013779)  (drone, 0.013642)  (ai, 0.005539)  

KLBNMF (w2v)
(3, 21.397635)  (learn, 0.042553)  (deep, 0.025870)  (neural, 0.023508)  
(0, 11.062898)  (time, 0.013646)  (thing, 0.013603)  (peopl, 0.012276)  
(15, 8.414304)  (app, 0.014039)  (use, 0.013762)  (googl, 0.013576)  

polar3d
KLBNMF
(4, 21.352156)  (3dprint, 0.035776)  (3d, 0.019450)  (print, 0.017560)  
(6, 5.922573)  (stem, 0.011202)  (edtech, 0.011067)  (code, 0.009961)  
(7, 2.638872)  (python, 0.018802)  (ipython, 0.009899)  (pydata, 0.006602)  

KLBNMF (w2v)
(4, 28.723592)  (3dprint, 0.113673)  (3d, 0.068146)  (print, 0.061105)  
(6, 7.948516)  (edtech, 0.032878)  (stem, 0.029990)  (code, 0.029706)  
(9, 1.579168)  (us, 0.026720)  (day, 0.017518)  (join, 0.017088)  

thearduinoguy
KLBNMF
(2, 10.993086)  (arduino, 0.033056)  (robot, 0.014797)  (maker, 0.006043)  
(12, 7.266771)  (pi, 0.005444)  (raspberri, 0.003846)  (look, 0.003673)  
(15, 1.517093)  (look, 0.004635)  (uk, 0.003442)  (day, 0.003161)  

KLBNMF (w2v)
(7, 18.842614)  (arduino, 0.056200)  (maker, 0.032523)  (kit, 0.022393)  
(0, 1.212057)  (time, 0.013646)  (thing, 0.013603)  (peopl, 0.012276)  
(6, 0.561698)  (edtech, 0.032878)  (stem, 0.029990)  (code, 0.029706)