In [20]:
from gensim.corpora import Dictionary
from gensim.models import ldamodel
import numpy
%matplotlib inline
import logging
logging.basicConfig(level=logging.INFO)
We're setting up our corpus now. We want to show off the new get_term_topics
and get_document_topics
functionalities, and a good way to do so is to play around with words which might have different meanings in different context.
The word bank
is a good candidate here, where it can mean either the financial institution or a river bank.
In the toy corpus presented, there are 11 documents, 5 river
related and 6 finance
related.
In [21]:
import gensim.downloader
corpus = gensim.downloader.load("20-newsgroups")
In [22]:
import collections
from gensim.parsing.preprocessing import preprocess_string
texts = [
preprocess_string(text['data'])
for text in corpus
if text['topic'] in ('soc.religion.christian', 'talk.politics.guns')
]
In [23]:
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_above=0.1, no_below=10)
corpus = [dictionary.doc2bow(text) for text in texts]
We set up the LDA model in the corpus. We set the number of topics to be 2, and expect to see one which is to do with river banks, and one to do with financial banks.
In [27]:
numpy.random.seed(1) # setting random seed to get the same results each time.
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, alpha='asymmetric', minimum_probability=1e-8)
In [25]:
model.show_topics()
Out[25]:
And like we expected, the LDA model has given us near perfect results. Bank is the most influential word in both the topics, as we can see. The other words help define what kind of bank we are talking about. Let's now see where our new methods fit in.
The function get_term_topics
returns the odds of that particular word belonging to a particular topic.
A few examples:
In [28]:
model.get_term_topics('hell')
Out[28]:
Makes sense, the value for it belonging to topic_0
is a lot more.
In [30]:
model.get_term_topics('firearm')
Out[30]:
This also works out well, the word finance is more likely to be in topic_1 to do with financial banks.
In [35]:
model.get_term_topics('car')
Out[35]:
And this is particularly interesting. Since the word bank is likely to be in both the topics, the values returned are also very similar.
get_document_topics
is an already existing gensim functionality which uses the inference
function to get the sufficient statistics and figure out the topic distribution of the document.
The addition to this is the ability for us to now know the topic distribution for each word in the document. Let us test this with two different documents which have the word bank in it, one in the finance context and one in the river context.
The get_document_topics
method returns (along with the standard document topic proprtion) the word_type followed by a list sorted with the most likely topic ids, when per_word_topics
is set as true.
In [8]:
bow_water = ['bank','water','bank']
bow_finance = ['bank','finance','bank']
In [9]:
bow = model.id2word.doc2bow(bow_water) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
word_topics
Out[9]:
Now what does that output mean? It means that like word_type 1
, our word_type
3
, which is the word bank
, is more likely to be in topic_0
than topic_1
.
You must have noticed that while we unpacked into doc_topics
and word_topics
, there is another variable - phi_values
. Like the name suggests, phi_values contains the phi values for each topic for that particular word, scaled by feature length. Phi is essentially the probability of that word in that document belonging to a particular topic. The next few lines should illustrate this.
In [10]:
phi_values
Out[10]:
This means that word_type
0 has the following phi_values for each of the topics.
What is intresting to note is word_type
3 - because it has 2 occurences (i.e, the word bank
appears twice in the bow), we can see that the scaling by feature length is very evident. The sum of the phi_values is 2, and not 1.
Now that we know exactly what get_document_topics
does, let us now do the same with our second document, bow_finance
.
In [11]:
bow = model.id2word.doc2bow(bow_finance) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
word_topics
Out[11]:
And lo and behold, because the word bank is now used in the financial context, it immedietly swaps to being more likely associated with topic_1
.
We've seen quite clearly that based on the context, the most likely topic associated with a word can change.
This differs from our previous method, get_term_topics
, where it is a 'static' topic distribution.
It must also be noted that because the gensim implementation of LDA uses Variational Bayes sampling, a word_type
in a document is only given one topic distribution. For example, the sentence 'the bank by the river bank' is likely to be assigned to topic_0
, and each of the bank word instances have the same distribution.
In [12]:
all_topics = model.get_document_topics(corpus, per_word_topics=True)
for doc_topics, word_topics, phi_values in all_topics:
print('New Document \n')
print('Document topics:', doc_topics)
print('Word topics:', word_topics)
print('Phi values:', phi_values)
print(" ")
print('-------------- \n')
In case you want to store doc_topics
, word_topics
and phi_values
for all the documents in the corpus in a variable and later access details of a particular document using its index, it can be done in the following manner:
In [13]:
topics = model.get_document_topics(corpus, per_word_topics=True)
all_topics = [(doc_topics, word_topics, word_phis) for doc_topics, word_topics, word_phis in topics]
Now, I can access details of a particular document, say Document #3, as follows:
In [14]:
doc_topic, word_topics, phi_values = all_topics[2]
print('Document topic:', doc_topics, "\n")
print('Word topic:', word_topics, "\n")
print('Phi value:', phi_values)
We can print details for all the documents (as shown above), in the following manner:
In [15]:
for doc in all_topics:
print('New Document \n')
print('Document topic:', doc[0])
print('Word topic:', doc[1])
print('Phi value:', doc[2])
print(" ")
print('-------------- \n')
These methods can come in handy when we want to color the words in a corpus or a document. If we wish to color the words in a corpus (i.e, color all the words in the dictionary of the corpus), then get_term_topics
would be a better choice. If not, get_document_topics
would do the trick.
We'll now attempt to color these words and plot it using matplotlib
.
This is just one way to go about plotting words - there are more and better ways.
WordCloud is such a python package which also does this.
For our simple illustration, let's keep topic_1
as red, and topic_0
as blue.
In [16]:
# this is a sample method to color words. Like mentioned before, there are many ways to do this.
def color_words(model, doc):
import matplotlib.pyplot as plt
import matplotlib.patches as patches
# make into bag of words
doc = model.id2word.doc2bow(doc)
# get word_topics
doc_topics, word_topics, phi_values = model.get_document_topics(doc, per_word_topics=True)
# color-topic matching
topic_colors = { 1:'red', 0:'blue'}
# set up fig to plot
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
# a sort of hack to make sure the words are well spaced out.
word_pos = 1/len(doc)
# use matplotlib to plot words
for word, topics in word_topics:
ax.text(word_pos, 0.8, model.id2word[word],
horizontalalignment='center',
verticalalignment='center',
fontsize=20, color=topic_colors[topics[0]], # choose just the most likely topic
transform=ax.transAxes)
word_pos += 0.2 # to move the word for the next iter
ax.set_axis_off()
plt.show()
Let us revisit our old examples to show some examples of document coloring
In [17]:
# our river bank document
bow_water = ['bank','water','bank']
color_words(model, bow_water)
In [18]:
bow_finance = ['bank','finance','bank']
color_words(model, bow_finance)
What is fun to note here is that while bank was colored blue in our first example, it is now red because of the financial context - something which the numbers proved to us before.
In [19]:
# sample doc with a somewhat even distribution of words among the likely topics
doc = ['bank', 'water', 'bank', 'finance', 'money','sell','river','fast','tree']
color_words(model, doc)
We see that the document word coloring is done just the way we expected. :)
We can do the same for the entire vocabulary, statically. The only difference would be in using get_term_topics
, and iterating over the dictionary.
We will use a modified version of the coloring code when passing an entire dictionary.
In [20]:
def color_words_dict(model, dictionary):
import matplotlib.pyplot as plt
import matplotlib.patches as patches
word_topics = []
for word_id in dictionary:
word = str(dictionary[word_id])
# get_term_topics returns static topics, as mentioned before
probs = model.get_term_topics(word)
# we are creating word_topics which is similar to the one created by get_document_topics
try:
if probs[0][1] >= probs[1][1]:
word_topics.append((word_id, [0, 1]))
else:
word_topics.append((word_id, [1, 0]))
# this in the case only one topic is returned
except IndexError:
word_topics.append((word_id, [probs[0][0]]))
# color-topic matching
topic_colors = { 1:'red', 0:'blue'}
# set up fig to plot
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
# a sort of hack to make sure the words are well spaced out.
word_pos = 1/len(doc)
# use matplotlib to plot words
for word, topics in word_topics:
ax.text(word_pos, 0.8, model.id2word[word],
horizontalalignment='center',
verticalalignment='center',
fontsize=20, color=topic_colors[topics[0]], # choose just the most likely topic
transform=ax.transAxes)
word_pos += 0.2 # to move the word for the next iter
ax.set_axis_off()
plt.show()
In [21]:
color_words_dict(model, dictionary)
As we can see, the red words are to do with finance, and the blue ones are to do with water.
You can also notice that some words, like mud, shore and borrow seem to be incorrectly colored - however, they are correctly colored according to the LDA model used for coloring. A small corpus means that the LDA algorithm might not assign 'ideal' topic proportions to each word. Fine tuning the model and having a larger corpus would improve the model, and improve the results of the word coloring.