New Term Topics Methods and Document Coloring


In [20]:
from gensim.corpora import Dictionary
from gensim.models import ldamodel
import numpy
%matplotlib inline

import logging
logging.basicConfig(level=logging.INFO)

We're setting up our corpus now. We want to show off the new get_term_topics and get_document_topics functionality, and a good way to do so is to play around with words which might have different meanings in different contexts.

For this we use a subset of the 20 Newsgroups dataset, keeping only the posts from the soc.religion.christian and talk.politics.guns newsgroups. That gives us 1,907 documents covering two clearly distinct subjects, religion and gun politics. Later on we'll also probe a few short toy documents built around the classic ambiguous word bank.


In [21]:
import gensim.downloader
corpus = gensim.downloader.load("20-newsgroups")

In [22]:
import collections
from gensim.parsing.preprocessing import preprocess_string

texts = [
    preprocess_string(text['data'])
    for text in corpus
    if text['topic'] in ('soc.religion.christian', 'talk.politics.guns')
]


WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead

In [23]:
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_above=0.1, no_below=10)
corpus = [dictionary.doc2bow(text) for text in texts]


INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17455 unique tokens: ['accept', 'action', 'adulter', 'adulteri', 'adventur']...) from 1907 documents (total 318505 corpus positions)
INFO:gensim.corpora.dictionary:discarding 14137 tokens: [('accept', 259), ('adventur', 6), ('annia', 3), ('believ', 649), ('bibl', 280), ('calvinist', 5), ('cannanit', 4), ('case', 317), ('chastis', 4), ('christian', 527)]...
INFO:gensim.corpora.dictionary:keeping 3318 tokens which were in no less than 10 and no more than 190 (=10.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(3318 unique tokens: ['action', 'adulter', 'adulteri', 'affect', 'andi']...)

We set up the LDA model on the corpus. We set the number of topics to 2, and expect to see one to do with religion and one to do with gun politics.


In [27]:
numpy.random.seed(1) # setting random seed to get the same results each time.
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, alpha='asymmetric', minimum_probability=1e-8)


INFO:gensim.models.ldamodel:using asymmetric alpha [0.63060194, 0.36939806]
INFO:gensim.models.ldamodel:using symmetric eta at 0.5
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (single-pass) LDA training, 2 topics, 1 passes over the supplied corpus of 1907 documents, updating model once every 1907 documents, evaluating perplexity every 1907 documents, iterating 50x with a convergence threshold of 0.001000
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
INFO:gensim.models.ldamodel:-8.511 per-word bound, 364.7 perplexity estimate based on a held-out corpus of 1907 documents with 180557 words
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #1907/1907
INFO:gensim.models.ldamodel:topic #0 (0.631): 0.004*"homosexu" + 0.004*"hell" + 0.003*"paul" + 0.002*"firearm" + 0.002*"batf" + 0.002*"cathol" + 0.002*"natur" + 0.002*"crime" + 0.002*"lord" + 0.002*"shall"
INFO:gensim.models.ldamodel:topic #1 (0.369): 0.004*"firearm" + 0.004*"homosexu" + 0.003*"file" + 0.003*"author" + 0.003*"paul" + 0.003*"scriptur" + 0.002*"stratu" + 0.002*"amend" + 0.002*"koresh" + 0.002*"crimin"
INFO:gensim.models.ldamodel:topic diff=0.737151, rho=1.000000

In [25]:
model.show_topics()


Out[25]:
[(0,
  '0.004*"homosexu" + 0.004*"hell" + 0.003*"paul" + 0.002*"firearm" + 0.002*"batf" + 0.002*"cathol" + 0.002*"natur" + 0.002*"crime" + 0.002*"lord" + 0.002*"shall"'),
 (1,
  '0.004*"firearm" + 0.004*"homosexu" + 0.003*"file" + 0.003*"author" + 0.003*"paul" + 0.003*"scriptur" + 0.002*"stratu" + 0.002*"amend" + 0.002*"koresh" + 0.002*"crimin"')]

And roughly as we expected: topic_0 is dominated by religious vocabulary (hell, paul, cathol, lord), while topic_1 leans towards gun politics (firearm, amend, koresh, crimin). With only a single training pass the separation isn't perfect (the training log even warns that the model may not have converged), but it is good enough to tell the two themes apart. Let's now see where our new methods fit in.

get_term_topics

The function get_term_topics returns the probability of a particular word belonging to each topic. A few examples:


In [28]:
model.get_term_topics('hell')


Out[28]:
[(0, 0.0035053839), (1, 0.0011557308)]

Makes sense: the value for 'hell' belonging to topic_0, the religion topic, is a lot higher.


In [30]:
model.get_term_topics('firearm')


Out[30]:
[(0, 0.002482554), (1, 0.0036967357)]

This also works out well: the word 'firearm' is more likely to belong to topic_1, the topic to do with gun politics.


In [35]:
model.get_term_topics('car')


Out[35]:
[(0, 0.000701838), (1, 0.0006635987)]

And this is particularly interesting. Since the word 'car' is topically neutral in this corpus, the values returned for the two topics are very similar.
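If we want to compare several words at once, we can loop over them and print each word's per-topic probabilities side by side. A minimal sketch (these stemmed tokens all appear in our filtered dictionary; a word missing from the dictionary would raise an error inside get_term_topics):

# compare the static topic affinities of a few dictionary words
for word in ['hell', 'firearm', 'car']:
    print(word, model.get_term_topics(word))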

get_document_topics and Document Word-Topic Coloring

get_document_topics is an already existing gensim functionality which uses the inference function to get the sufficient statistics and figure out the topic distribution of the document.
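In its basic form (without per_word_topics), it simply returns the topic proportions of a document. A minimal sketch, using stemmed tokens we know are in our dictionary:

# infer the topic distribution for a small unseen document
bow = model.id2word.doc2bow(['hell', 'lord', 'scriptur'])
print(model.get_document_topics(bow))  # a list of (topic_id, probability) pairs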

The new addition is the ability to learn the topic distribution for each word in the document. Let us test this with two different toy documents which contain the word bank, one in a finance context and one in a river context.

When per_word_topics is set to True, the get_document_topics method returns (along with the standard document-topic proportions) each word_type in the document, followed by a list of topic ids sorted from most to least likely.


In [8]:
bow_water = ['bank','water','bank']
bow_finance = ['bank','finance','bank']

In [9]:
bow = model.id2word.doc2bow(bow_water) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)

word_topics


Out[9]:
[(0, [0, 1]), (3, [0, 1])]

Now what does that output mean? It means that both word types in this document, 0 and 3 (the words bank and water), are more likely to belong to topic_0 than to topic_1.
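To make such output easier to read, we can map the word ids back to their tokens via the model's id2word mapping. A small sketch:

# translate word ids into tokens for readability
readable = [(model.id2word[word_id], topic_ids) for word_id, topic_ids in word_topics]
print(readable)  # e.g. [('bank', [0, 1]), ('water', [0, 1])]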

You must have noticed that while we unpacked into doc_topics and word_topics, there is another variable: phi_values. As the name suggests, phi_values contains the phi values for each topic for that particular word, scaled by the word's frequency in the document. Phi is essentially the probability of that word occurrence in that document belonging to a particular topic. The next few lines should illustrate this.


In [10]:
phi_values


Out[10]:
[(0, [(0, 1.8300905), (1, 0.16990812)]),
 (3, [(0, 0.8581231), (1, 0.14187533)])]

This means that each word_type has the listed phi_values for each of the topics. What is interesting to note is word_type 0: because it has 2 occurrences (i.e., the word bank appears twice in the bow), the scaling by frequency is very evident, and the sum of its phi_values is 2, not 1.
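We can check this scaling directly: summing a word type's phi values over the topics should (approximately) give that word's count in the bag of words. A quick sketch:

# phi values for a word sum to its count in the document
for word_id, topic_phis in phi_values:
    total = sum(phi for _, phi in topic_phis)
    print(model.id2word[word_id], round(total, 2))  # 2.0 for bank, 1.0 for water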

Now that we know exactly what get_document_topics does, let us now do the same with our second document, bow_finance.


In [11]:
bow = model.id2word.doc2bow(bow_finance) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)

word_topics


Out[11]:
[(0, [0, 1]), (10, [0, 1])]

Interestingly, the ranking does not flip this time: even in the financial context, bank remains more likely to be associated with topic_0. That is not too surprising, since neither of our two newsgroups topics is about finance, so the word finance carries little evidence either way in this model.

The important point is the mechanism: get_document_topics ranks a word's topics based on the surrounding words in the document, so the most likely topic associated with a word can change from document to document. This differs from our previous method, get_term_topics, which gives a 'static', corpus-level topic distribution.

It must also be noted that because the gensim implementation of LDA uses variational Bayes inference rather than Gibbs sampling, a word_type in a document is given only one topic distribution. For example, in the sentence 'the bank by the river bank', both occurrences of bank receive the same per-word topic distribution.

get_document_topics for entire corpus

You can get doc_topics, word_topics and phi_values for all the documents in the corpus in the following manner:


In [12]:
all_topics = model.get_document_topics(corpus, per_word_topics=True)

for doc_topics, word_topics, phi_values in all_topics:
    print('New Document \n')
    print('Document topics:', doc_topics)
    print('Word topics:', word_topics)
    print('Phi values:', phi_values)
    print(" ")
    print('-------------- \n')


New Document 

Document topics: [(0, 0.73633265), (1, 0.26366737)]
Word topics: [(0, [0, 1]), (1, [0, 1]), (2, [0, 1]), (3, [0, 1])]
Phi values: [(0, [(0, 0.8527051), (1, 0.14729421)]), (1, [(0, 0.795473), (1, 0.20452519)]), (2, [(0, 0.7709577), (1, 0.2290356)]), (3, [(0, 0.76475203), (1, 0.23524645)])]
 
-------------- 

New Document 

Document topics: [(0, 0.7539386), (1, 0.2460614)]
Word topics: [(1, [0, 1]), (3, [0, 1]), (4, [0, 1]), (5, [0, 1]), (6, [0, 1])]
Phi values: [(1, [(0, 0.80670816), (1, 0.19329017)]), (3, [(0, 0.77720207), (1, 0.22279654)]), (4, [(0, 0.90712374), (1, 0.092870995)]), (5, [(0, 0.7294175), (1, 0.27057865)]), (6, [(0, 0.805549), (1, 0.19444753)])]
 
-------------- 

New Document 

Document topics: [(0, 0.17039819), (1, 0.82960176)]
Word topics: [(0, [1, 0]), (3, [1, 0]), (5, [1, 0]), (7, [1, 0])]
Phi values: [(0, [(0, 0.15414648), (1, 0.8458525)]), (3, [(0, 0.09283445), (1, 0.90716416)]), (5, [(0, 0.073286586), (1, 0.92671037)]), (7, [(0, 0.031027067), (1, 0.96896887)])]
 
-------------- 

New Document 

Document topics: [(0, 0.8758601), (1, 0.124139935)]
Word topics: [(0, [0, 1]), (1, [0, 1]), (3, [0, 1]), (8, [0, 1])]
Phi values: [(0, [(0, 1.9142816), (1, 0.085716955)]), (1, [(0, 0.9375137), (1, 0.06248462)]), (3, [(0, 0.92614746), (1, 0.073851064)]), (8, [(0, 0.97765744), (1, 0.022337979)])]
 
-------------- 

New Document 

Document topics: [(0, 0.1712567), (1, 0.8287433)]
Word topics: [(1, [1, 0]), (3, [1, 0]), (6, [1, 0]), (9, [1, 0])]
Phi values: [(1, [(0, 0.11006477), (1, 0.8899334)]), (3, [(0, 0.09368856), (1, 0.9063102)]), (6, [(0, 0.109341085), (1, 0.8906553)]), (9, [(0, 0.04248068), (1, 0.9575148)])]
 
-------------- 

New Document 

Document topics: [(0, 0.16402602), (1, 0.8359739)]
Word topics: [(0, [1, 0]), (10, [1, 0]), (11, [1, 0]), (12, [1, 0])]
Phi values: [(0, [(0, 0.14435525), (1, 0.85564375)]), (10, [(0, 0.07960652), (1, 0.92039144)]), (11, [(0, 0.07176139), (1, 0.92823654)]), (12, [(0, 0.023786588), (1, 0.97620964)])]
 
-------------- 

New Document 

Document topics: [(0, 0.80524194), (1, 0.19475807)]
Word topics: [(0, [0, 1]), (11, [0, 1]), (13, [0, 1])]
Phi values: [(0, [(0, 0.9217699), (1, 0.07822937)]), (11, [(0, 0.84373295), (1, 0.1562643)]), (13, [(0, 0.95610404), (1, 0.043893255)])]
 
-------------- 

New Document 

Document topics: [(0, 0.65283513), (1, 0.3471648)]
Word topics: [(0, [0, 1]), (10, [0, 1])]
Phi values: [(0, [(0, 0.79454756), (1, 0.20545153)]), (10, [(0, 0.6647264), (1, 0.33527094)])]
 
-------------- 

New Document 

Document topics: [(0, 0.7917557), (1, 0.20824426)]
Word topics: [(0, [0, 1]), (10, [0, 1]), (11, [0, 1]), (14, [0, 1])]
Phi values: [(0, [(0, 0.9003985), (1, 0.099600814)]), (10, [(0, 0.8225217), (1, 0.17747577)]), (11, [(0, 0.80554056), (1, 0.19445676)]), (14, [(0, 0.9310509), (1, 0.0689471)])]
 
-------------- 

New Document 

Document topics: [(0, 0.80841804), (1, 0.19158193)]
Word topics: [(13, [0, 1]), (14, [0, 1])]
Phi values: [(13, [(0, 0.9664923), (1, 0.033504896)]), (14, [(0, 0.95886075), (1, 0.041137177)])]
 
-------------- 

New Document 

Document topics: [(0, 0.84000635), (1, 0.15999362)]
Word topics: [(0, [0, 1]), (14, [0, 1]), (15, [0, 1])]
Phi values: [(0, [(0, 0.94806826), (1, 0.051930968)]), (14, [(0, 0.9646261), (1, 0.035372045)]), (15, [(0, 0.9475713), (1, 0.052422963)])]
 
-------------- 

In case you want to store doc_topics, word_topics and phi_values for all the documents in the corpus in a variable and later access details of a particular document using its index, it can be done in the following manner:


In [13]:
topics = model.get_document_topics(corpus, per_word_topics=True)
all_topics = list(topics)  # materialize the generator so individual documents can be indexed

Now we can access the details of a particular document, say Document #3, as follows:


In [14]:
doc_topics, word_topics, phi_values = all_topics[2]
print('Document topic:', doc_topics, "\n")
print('Word topic:', word_topics, "\n")
print('Phi value:', phi_values)


Document topic: [(0, 0.17031126), (1, 0.8296888)] 

Word topic: [(0, [1, 0]), (3, [1, 0]), (5, [1, 0]), (7, [1, 0])] 

Phi value: [(0, [(0, 0.1540126), (1, 0.8459863)]), (3, [(0, 0.09274801), (1, 0.90725076)]), (5, [(0, 0.07321687), (1, 0.9267802)]), (7, [(0, 0.030996205), (1, 0.96899974)])]

We can print the details for all the documents (as shown above) in the following manner:


In [15]:
for doc in all_topics:
    print('New Document \n')
    print('Document topic:', doc[0])
    print('Word topic:', doc[1])
    print('Phi value:', doc[2])
    print(" ")
    print('-------------- \n')


New Document 

Document topic: [(0, 0.7339641), (1, 0.26603588)]
Word topic: [(0, [0, 1]), (1, [0, 1]), (2, [0, 1]), (3, [0, 1])]
Phi value: [(0, [(0, 0.850583), (1, 0.14941624)]), (1, [(0, 0.7927268), (1, 0.20727131)]), (2, [(0, 0.7679785), (1, 0.23201483)]), (3, [(0, 0.76171696), (1, 0.23828152)])]
 
-------------- 

New Document 

Document topic: [(0, 0.7540561), (1, 0.24594393)]
Word topic: [(1, [0, 1]), (3, [0, 1]), (4, [0, 1]), (5, [0, 1]), (6, [0, 1])]
Phi value: [(1, [(0, 0.8068402), (1, 0.19315805)]), (3, [(0, 0.77734876), (1, 0.2226498)]), (4, [(0, 0.9071951), (1, 0.09279961)]), (5, [(0, 0.7295848), (1, 0.27041143)]), (6, [(0, 0.80568177), (1, 0.1943148)])]
 
-------------- 

New Document 

Document topic: [(0, 0.17031126), (1, 0.8296888)]
Word topic: [(0, [1, 0]), (3, [1, 0]), (5, [1, 0]), (7, [1, 0])]
Phi value: [(0, [(0, 0.1540126), (1, 0.8459863)]), (3, [(0, 0.09274801), (1, 0.90725076)]), (5, [(0, 0.07321687), (1, 0.9267802)]), (7, [(0, 0.030996205), (1, 0.96899974)])]
 
-------------- 

New Document 

Document topic: [(0, 0.87583715), (1, 0.12416287)]
Word topic: [(0, [0, 1]), (1, [0, 1]), (3, [0, 1]), (8, [0, 1])]
Phi value: [(0, [(0, 1.9142504), (1, 0.08574835)]), (1, [(0, 0.9374913), (1, 0.062507026)]), (3, [(0, 0.9261214), (1, 0.07387724)]), (8, [(0, 0.97764915), (1, 0.022346335)])]
 
-------------- 

New Document 

Document topic: [(0, 0.17121859), (1, 0.8287814)]
Word topic: [(1, [1, 0]), (3, [1, 0]), (6, [1, 0]), (9, [1, 0])]
Phi value: [(1, [(0, 0.11002095), (1, 0.8899771)]), (3, [(0, 0.093650565), (1, 0.906348)]), (6, [(0, 0.109297544), (1, 0.8906989)]), (9, [(0, 0.04246249), (1, 0.95753306)])]
 
-------------- 

New Document 

Document topic: [(0, 0.16401243), (1, 0.83598757)]
Word topic: [(0, [1, 0]), (10, [1, 0]), (11, [1, 0]), (12, [1, 0])]
Phi value: [(0, [(0, 0.14433442), (1, 0.85566455)]), (10, [(0, 0.07959415), (1, 0.92040366)]), (11, [(0, 0.07175015), (1, 0.9282477)]), (12, [(0, 0.023782672), (1, 0.9762135)])]
 
-------------- 

New Document 

Document topic: [(0, 0.8052399), (1, 0.19476008)]
Word topic: [(0, [0, 1]), (11, [0, 1]), (13, [0, 1])]
Phi value: [(0, [(0, 0.9217683), (1, 0.07823093)]), (11, [(0, 0.84373015), (1, 0.15626718)]), (13, [(0, 0.95610315), (1, 0.043894168)])]
 
-------------- 

New Document 

Document topic: [(0, 0.6528616), (1, 0.34713835)]
Word topic: [(0, [0, 1]), (10, [0, 1])]
Phi value: [(0, [(0, 0.7945762), (1, 0.20542288)]), (10, [(0, 0.6647654), (1, 0.3352318)])]
 
-------------- 

New Document 

Document topic: [(0, 0.79167444), (1, 0.20832555)]
Word topic: [(0, [0, 1]), (10, [0, 1]), (11, [0, 1]), (14, [0, 1])]
Phi value: [(0, [(0, 0.9003313), (1, 0.09966788)]), (10, [(0, 0.82241255), (1, 0.17758493)]), (11, [(0, 0.80542344), (1, 0.19457388)]), (14, [(0, 0.931003), (1, 0.06899511)])]
 
-------------- 

New Document 

Document topic: [(0, 0.8083355), (1, 0.19166456)]
Word topic: [(13, [0, 1]), (14, [0, 1])]
Phi value: [(13, [(0, 0.9664568), (1, 0.033540305)]), (14, [(0, 0.9588177), (1, 0.04118031)])]
 
-------------- 

New Document 

Document topic: [(0, 0.8399969), (1, 0.1600031)]
Word topic: [(0, [0, 1]), (14, [0, 1]), (15, [0, 1])]
Phi value: [(0, [(0, 0.9480616), (1, 0.051937718)]), (14, [(0, 0.9646214), (1, 0.03537672)]), (15, [(0, 0.9475646), (1, 0.052429777)])]
 
-------------- 

Coloring topic-terms

These methods can come in handy when we want to color the words in a corpus or a document. If we wish to color the words in a corpus (i.e., color all the words in the dictionary of the corpus), then get_term_topics would be the better choice. If not, get_document_topics would do the trick.

We'll now attempt to color these words and plot them using matplotlib. This is just one way to go about plotting words; there are more and better ways.

WordCloud is one such Python package that can also do this.
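As a rough sketch of that route (assuming the third-party wordcloud package is installed; the color helper and word list below are our own illustration, not part of gensim):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

topic_colors = {1: 'red', 0: 'blue'}

def topic_color_func(word, **kwargs):
    # color each word by its most likely topic according to get_term_topics
    probs = model.get_term_topics(word)
    best_topic = max(probs, key=lambda x: x[1])[0] if probs else 0
    return topic_colors[best_topic]

# stemmed tokens we know are in the dictionary, with dummy equal frequencies
words = {'hell': 1, 'lord': 1, 'firearm': 1, 'amend': 1, 'crime': 1}
wc = WordCloud(color_func=topic_color_func, background_color='white')
wc.generate_from_frequencies(words)
plt.imshow(wc)
plt.axis('off')
plt.show()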

For our simple illustration, let's keep topic_1 as red, and topic_0 as blue.


In [16]:
# this is a sample method to color words. As mentioned before, there are many ways to do this.

def color_words(model, doc):
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    
    # make into bag of words
    doc = model.id2word.doc2bow(doc)
    # get word_topics
    doc_topics, word_topics, phi_values = model.get_document_topics(doc, per_word_topics=True)

    # color-topic matching
    topic_colors = { 1:'red', 0:'blue'}
    
    # set up fig to plot
    fig = plt.figure()
    ax = fig.add_axes([0,0,1,1])

    # a sort of hack to make sure the words are well spaced out.
    word_pos = 1/len(doc)
    
    # use matplotlib to plot words
    for word, topics in word_topics:
        ax.text(word_pos, 0.8, model.id2word[word],
                horizontalalignment='center',
                verticalalignment='center',
                fontsize=20, color=topic_colors[topics[0]],  # choose just the most likely topic
                transform=ax.transAxes)
        word_pos += 0.2 # to move the word for the next iter

    ax.set_axis_off()
    plt.show()

Let us revisit our earlier examples to demonstrate document coloring.


In [17]:
# our river bank document

bow_water = ['bank','water','bank']
color_words(model, bow_water)



In [18]:
bow_finance = ['bank','finance','bank']
color_words(model, bow_finance)


What is worth looking for here is whether bank changes color between the two examples. With a model whose topics cleanly separate the two senses of the word, it would be blue in the river document and red in the finance one; with our newsgroups-trained model, the word_topics output above suggests it stays blue (topic_0) in both contexts.


In [19]:
# sample doc with a somewhat even distribution of words among the likely topics

doc = ['bank', 'water', 'bank', 'finance', 'money','sell','river','fast','tree']
color_words(model, doc)


We see that the document word coloring is done just the way we expected. :)

Word-coloring a dictionary

We can do the same for the entire vocabulary, statically. The only difference would be in using get_term_topics, and iterating over the dictionary.

We will use a modified version of the coloring code when passing an entire dictionary.


In [20]:
def color_words_dict(model, dictionary):
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches

    word_topics = []
    for word_id in dictionary:
        word = str(dictionary[word_id])
        # get_term_topics returns static topics, as mentioned before
        probs = model.get_term_topics(word)
        # we are creating word_topics which is similar to the one created by get_document_topics
        try:
            if probs[0][1] >= probs[1][1]:
                word_topics.append((word_id, [0, 1]))
            else:
                word_topics.append((word_id, [1, 0]))
        # this is for the case where only one topic is returned
        except IndexError:
            word_topics.append((word_id, [probs[0][0]]))
            
    # color-topic matching
    topic_colors = { 1:'red', 0:'blue'}
    
    # set up fig to plot
    fig = plt.figure()
    ax = fig.add_axes([0,0,1,1])

    # a sort of hack to make sure the words are well spaced out.
    word_pos = 1/len(dictionary)
         
    # use matplotlib to plot words
    for word, topics in word_topics:
        ax.text(word_pos, 0.8, model.id2word[word],
                horizontalalignment='center',
                verticalalignment='center',
                fontsize=20, color=topic_colors[topics[0]],  # choose just the most likely topic
                transform=ax.transAxes)
        word_pos += 0.2 # to move the word for the next iter

    ax.set_axis_off()
    plt.show()

In [21]:
color_words_dict(model, dictionary)


As we can see, the red words lean towards topic_1, the gun-politics topic, and the blue ones towards topic_0, the religion topic.

You may also notice that some words seem to be incorrectly colored; they are, however, colored correctly according to the LDA model used for coloring. A single training pass over a modest corpus means that the LDA algorithm may not assign 'ideal' topic proportions to every word. Fine-tuning the model and training on a larger corpus would improve the model, and with it the results of the word coloring.
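As a rough sketch of that direction (the hyperparameter values here are illustrative, not tuned):

# retrain with more passes and iterations, as the training log suggested
numpy.random.seed(1)
better_model = ldamodel.LdaModel(
    corpus, id2word=dictionary, num_topics=2,
    alpha='asymmetric', minimum_probability=1e-8,
    passes=10, iterations=200,
)
color_words_dict(better_model, dictionary)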