Imagine you have a gigantic corpus which spans a couple of years. You want to find semantically similar documents: one from the very beginning of your timeline and one from the very end. How would you do it? This is where Dynamic Topic Models come in. By adding a time-based element to topics, context is preserved while key-words may change.
Dynamic Topic Models are used to model the evolution of topics in a corpus over time. The Dynamic Topic Model is part of a class of probabilistic topic models, like LDA.
While most traditional topic mining algorithms do not expect time-tagged data or take any prior ordering into account, the Dynamic Topic Model (DTM) leverages the knowledge that different documents belong to different time-slices in an attempt to map how the words in a topic change over time.
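Concretely, in Blei and Lafferty's formulation the per-topic word distributions are chained across time-slices by a Gaussian random walk on their natural parameters (restated here from the original paper for context):

$$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I)$$

so topic $k$ at time $t$ is a smooth perturbation of the same topic at time $t-1$, which is what lets the top words of a topic drift while the topic itself remains identifiable.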
David Blei does a good job explaining the theory behind this in this Google talk. If you prefer to directly read the paper on DTM by Blei and Lafferty, that should get you up to speed too.
But why even undertake this, especially when gensim itself already has a wrapper? The main motivation was the lack of documentation in the original code, and the fact that a Python-only version makes it easier to use gensim building blocks. For example, for setting up the Sufficient Statistics to initialize the DTM, you can just pass a pre-trained gensim LDA model!
There is now some clarity on how the original code works: Variational Inference using Kalman Filters. I've tried to make things as clear as possible in the code, but it still needs some polishing.
Any help through PRs would be greatly appreciated!
I have been regularly blogging about my progress with implementing this, which you can find here.
If you have seen the video or read the paper, its use case should be pretty clear, and the example of modelling it on Science research papers gives some pretty interesting results. It was used not only to track how various themes of research such as Physics or Neuroscience evolved over the decades, but also to identify similar documents in a way not many other modelling algorithms can. While words may change over time, the fact that DTM can identify topics across time helps us find semantically similar documents over a long time-period.
This blog post is also useful in breaking down the ideas and theory behind DTM.
Gensim already has a wrapper for the original C++ DTM code, but the LdaSeqModel class is an effort to have a pure Python implementation of the same.
Using it is very similar to using any other gensim topic-modelling algorithm: all you need to get started is an iterable gensim corpus, an id2word mapping and a list with the number of documents in each of your time-slices.
In [1]:
# setting up our imports
from gensim.models import ldaseqmodel
from gensim.corpora import Dictionary, bleicorpus
import numpy
from gensim.matutils import hellinger
We will be loading the corpus and dictionary from disk. Here our corpus is in the Blei corpus format, but it can be any iterable corpus. The data set consists of news reports over 3 months, downloaded from here and cleaned.
TODO: better, more interesting data-set.
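If you don't already have a serialised corpus, a minimal sketch of building one from your own tokenised documents could look like the following; `my_texts` and the file names are hypothetical, while Dictionary and bleicorpus are the classes imported above.
In [ ]:
# hypothetical example: build a dictionary and bag-of-words corpus from tokenised documents
my_texts = [['economy', 'bank', 'markets'], ['football', 'united', 'giggs']]  # toy token lists
my_dictionary = Dictionary(my_texts)
my_corpus = [my_dictionary.doc2bow(text) for text in my_texts]
# optionally serialise to disk in the Blei format used below
bleicorpus.BleiCorpus.serialize('Corpus/my_news_corpus', my_corpus)
my_dictionary.save('Corpus/my_news_dictionary')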
A very important input for DTM to work is the time_slice input. It should be a list which contains the number of documents in each time-slice. In our case, the first month had 438 articles, the second 430 and the last month 456. This means we need an input which looks like this: time_slice = [438, 430, 456].
Once you have your corpus, id2word and time_slice ready, we're good to go!
In [2]:
# loading our corpus and dictionary
dictionary = Dictionary.load('Corpus/news_dictionary')
corpus = bleicorpus.BleiCorpus('Corpus/news_corpus')
# it's very important that your corpus is saved in order of your time-slices!
time_slice = [438, 430, 456]
For DTM to work it first needs the Sufficient Statistics from an LDA model trained on the same dataset. By default LdaSeqModel trains its own model and passes those values on, but it can also accept a pre-trained gensim LDA model, or a numpy matrix which contains the Sufficient Statistics.
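For reference, a rough sketch of the two non-default initialisations is below; the initialize, lda_model and sstats keyword arguments are taken from the LdaSeqModel signature, so treat them as assumptions and check them against your gensim version.
In [ ]:
# a sketch of initialising LdaSeqModel from a pre-trained LDA model instead of the default
from gensim.models import LdaModel

# option 1: hand over a pre-trained gensim LDA model
pre_trained_lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=20)
ldaseq_from_lda = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
                                          time_slice=time_slice, num_topics=5,
                                          initialize='ldamodel', lda_model=pre_trained_lda)

# option 2: hand over the sufficient statistics directly
# (`my_sstats` is a hypothetical num_terms x num_topics numpy matrix)
# ldaseq_from_sstats = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
#                                              time_slice=time_slice, num_topics=5,
#                                              initialize='own', sstats=my_sstats)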
We will be training our model in default mode, so LDA will first be performed on the dataset. The passes parameter controls the number of passes the underlying LdaModel makes over the corpus.
In [10]:
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5, passes=20)
Now that our model is trained, let's see what our results look like.
Much like LDA, the points of interest would be in what the topics are and how the documents are made up of these topics. In DTM we have the added interest of seeing how these topics evolve over time.
Let's go through some of the functions to print Topics and analyse documents.
In [11]:
# to print all topics, use `print_topics`.
# `print_topics` takes a time-slice as its input. By passing `0` we see the topics in the 1st time-slice.
ldaseq.print_topics(time=0)
Out[11]:
In [22]:
# to fix a topic and see it evolve, use `print_topic_times`
ldaseq.print_topic_times(topic=1) # evolution of topic 1
Out[22]:
If you look at the lower frequencies, the word broadband is creeping up into prominence in topic number 1. We've had our fun looking at topics; now let us see how to analyse documents.
The function doc_topics checks the topic proportions of documents the model was trained on. It accepts the document's index in the corpus as input.
Let's pick up document number 558 arbitrarily and have a look.
In [37]:
# to check Document - Topic proportions, use `doc_topics`
words = [dictionary[word_id] for word_id, count in ldaseq.corpus.corpus[558]]
print (words)
It's pretty clear that it's a news article about football. What topics will it likely be comprised of?
In [49]:
doc_1 = ldaseq.doc_topics(558) # check the topic distribution of document number 558 in the corpus
print (doc_1)
It's largely made of topics 3 and 5 - and if we go back and inspect our topics, it's quite a good match.
If we wish to analyse a document not in our training set, we can simply pass the document to the model, similar to the __getitem__ function of LdaModel.
Let our document be a hypothetical news article about Ryan Giggs buying mobile phones and its effect on the British economy.
In [53]:
doc_2 = ['economy', 'bank', 'mobile', 'phone', 'markets', 'buy', 'football', 'united', 'giggs']
doc_2 = dictionary.doc2bow(doc_2)
doc_2 = ldaseq[doc_2]
print (doc_2)
Pretty neat! Topics 2 and 3 are about technology, the market and football, so this works well for us.
One of the handier uses of DTM is that we can compare documents across different time-frames and see how similar they are topic-wise. This is very useful when the words themselves may not overlap across those time-periods.
The current dataset doesn't provide the diversity for this to be an effective example, but we will nevertheless illustrate how to do it.
In [54]:
hellinger(doc_1, doc_2)
Out[54]:
Since hellinger returns a distance, a lower value means the two topic distributions are more similar. For more information on how to use the gensim distance metrics, check out this notebook.
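Other distance helpers that live alongside hellinger in gensim.matutils can be swapped in the same way; here is a quick sketch with kullback_leibler (chosen purely as an illustration) on the same two distributions.
In [ ]:
from gensim.matutils import kullback_leibler

# Kullback-Leibler divergence between the two topic distributions;
# unlike Hellinger it is not symmetric, so the argument order matters
print (kullback_leibler(doc_1, doc_2))
print (kullback_leibler(doc_2, doc_1))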
The code currently runs 5 to 7 times slower than the original C++ DTM code. The bottleneck is in the scipy optimize.fmin_cg method used for updating obs. Speeding this up would fix things up!
Since it uses iterable gensim corpora, the memory footprint is also smaller, and the corpus size doesn't matter.
The advantages of the Python port are that, unlike the C++ code, we needn't treat it like a black-box; PRs to help make the code better are welcome, as is help to make the documentation clearer and improve performance. It is also in pure Python and doesn't need any dependency outside of what gensim already needs. The added functionality of being able to analyse new documents is also a plus!
Let's now compare these results with the DTM wrapper.
In [56]:
from gensim.models.wrappers.dtmmodel import DtmModel
dtm_path = "/Users/bhargavvader/Downloads/dtm_release/dtm/main"
dtm_model = DtmModel(dtm_path, corpus, time_slice, num_topics=5, id2word=dictionary, initialize_lda=True)
dtm_model.save('dtm_news')
ldaseq.save('ldaseq_news')
# if we've saved before simply load the model
dtm_model = DtmModel.load('dtm_news')
In [58]:
# setting up the DTM wrapper output for visualization with pyLDAvis
from gensim import matutils
num_topics = 5
topic_term = dtm_model.lambda_[:, :, 0] # the lambda matrix contains the topic-term weights (in log space) for each time-slice

def validate(topic_term):
    # turn the log weights into normalised topic-term distributions
    topic_term = numpy.exp(topic_term)
    topic_term = topic_term / topic_term.sum()
    topic_term = topic_term * num_topics
    return topic_term

def get_topics(topic_terms, topic_number):
    # return the top 20 words for the chosen topic
    topic_terms = topic_terms[topic_number]
    bestn = matutils.argsort(topic_terms, 20, reverse=True)
    beststr = [dictionary[id_] for id_ in bestn]
    return beststr

topic_term = validate(topic_term)
# next is doc_topic_dist
doc_topic = dtm_model.gamma_
# next is the vocabulary, which we already have
vocab = []
for i in range(0, len(dictionary)):
    vocab.append(dictionary[i])

# we now need term-frequency and doc_lengths
def term_frequency(corpus, dictionary):
    term_frequency = [0] * len(dictionary)
    doc_lengths = []
    for doc in corpus:
        doc_lengths.append(len(doc))
        for pair in doc:
            term_frequency[pair[0]] += pair[1]
    return term_frequency, doc_lengths

topics_wrapper = []
for i in range(0, num_topics):
    topics_wrapper.append(get_topics(topic_term, i))
term_frequency, doc_lengths = term_frequency(corpus, dictionary)
In [59]:
import pyLDAvis
vis_wrapper = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
pyLDAvis.display(vis_wrapper)
Out[59]:
In [60]:
# now let us visualize the DTM python port.
# getting a list of just the words for each topic
dtm_tp = ldaseq.print_topics()
dtm_topics = []
for topic in dtm_tp:
    topics = []
    for prob, word in topic:
        topics.append(word)
    dtm_topics.append(topics)

# getting dtm python doc-topic proportions
doc_topic = numpy.copy(ldaseq.gammas)
doc_topic /= doc_topic.sum(axis=1)[:, numpy.newaxis]

# getting dtm topic-term proportions for the first time_slice
def get_topic_term(ldaseq, topic, time=0):
    topic = numpy.transpose(ldaseq.topic_chains[topic].e_log_prob)
    topic = topic[time]
    topic = numpy.exp(topic)
    topic = topic / topic.sum()
    return topic

# get_topic_term(ldaseq, 0).shape
topic_term = numpy.array(numpy.split(numpy.concatenate((get_topic_term(ldaseq, 0), get_topic_term(ldaseq, 1), get_topic_term(ldaseq, 2), get_topic_term(ldaseq, 3), get_topic_term(ldaseq, 4))), 5))
vis_dtm = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
pyLDAvis.display(vis_dtm)
Out[60]:
In [61]:
from gensim.models.coherencemodel import CoherenceModel
import pickle
cm_wrapper = CoherenceModel(topics=topics_wrapper, corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm_DTM = CoherenceModel(topics=dtm_topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print (cm_wrapper.get_coherence())
print (cm_DTM.get_coherence())
# to use 'c_v' we need texts, which we have saved to disk.
texts = pickle.load(open('Corpus/texts', 'rb'))
cm_wrapper = CoherenceModel(topics=topics_wrapper, texts=texts, dictionary=dictionary, coherence='c_v')
cm_DTM = CoherenceModel(topics=dtm_topics, texts=texts, dictionary=dictionary, coherence='c_v')
print (cm_wrapper.get_coherence())
print (cm_DTM.get_coherence())
So while u_mass coherence prefers the wrapper's topics, c_v seems to favor our Python port. :)
So while there is already a Python wrapper for DTM, a pure Python implementation is useful for better understanding what goes on under the hood and for improving the code. When it comes to performance, the C++ is undoubtedly faster, but we can keep working on ours to make it just as fast. As for evaluating the results, our topics are on par with, if not better than, the wrapper's!