Generally speaking, a topic model describes documents in terms of probability distributions over words. But what does this mean? In topic modeling, we start from the assumption that there exist a number of topics $N$, where a topic is a random variable that generates words from a vocabulary with certain probabilities. A document is assumed to be produced by mixing samples from one or more topics. A document would be generated as follows:
Given $N$ topics and weights $W = \{ w_n \}$, one for each topic:
for i in lengthOfDocument:
    n = selectTopicFrom(W)          # pick a topic index according to the weights W
    doc[i] = selectAWord(topic[n])  # draw a word from the chosen topic's distribution
Notice that a document here is expressed as a bag-of-words, and therefore the ordering of words is not taken into account.
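A minimal runnable sketch of this generative process follows; the toy vocabulary, the two topic distributions, and the helper name sample_document are our own illustrative assumptions, not part of any library.
In [ ]:
import numpy as np

rng = np.random.default_rng(0)

vocab = ["economy", "trade", "bank", "gene", "cell", "protein"]
# Each row is one topic: a probability distribution over the vocabulary.
topic_word = np.array([
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # a "finance" topic
    [0.02, 0.02, 0.01, 0.35, 0.30, 0.30],  # a "biology" topic
])
weights = np.array([0.7, 0.3])  # W: the per-document topic weights

def sample_document(length):
    doc = []
    for _ in range(length):
        n = rng.choice(len(weights), p=weights)         # selectTopicFrom(W)
        doc.append(rng.choice(vocab, p=topic_word[n]))  # selectAWord(topic[n])
    return doc

print(sample_document(10))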
In [1]:
%pylab inline
In [2]:
import scipy.stats as ss
In [67]:
n = 10                     # assume there are 10 topics
w = random.random(size=n)  # random topic weights...
w = w / sum(w)             # ...normalized so they sum to 1
topics = []
for i in range(n):
    mu = random.uniform(-5,5)
    t = ss.norm(mu,1)      # a "topic": a Gaussian over a continuous "vocabulary"
    topics.append(t)
So what we observe from a document is (more or less) a single distribution over words, $p(w) = \sum_n w_n \, p(w \mid \mathrm{topic}_n)$:
In [85]:
x = linspace(-10,10,1000)
y = np.zeros(x.shape)
for ti,wi in zip(topics,w):
    y += wi * ti.pdf(x)    # weighted sum of all the topic densities
plot(x,y);
But in reality the document is generated by a more complicated process which we cannot observe directly, namely a mixture of distributions:
In [84]:
for ti,wi in zip(topics,w):
    yi = wi * ti.pdf(x)    # each topic's individual weighted contribution
    plot(x,yi)
Traditionally, data preprocessing takes place before applying topic modeling. In these examples, the preprocessing transforms a sentence from this format (in Dutch):
Hoe kan je, binnen de context van mondiale veranderingsprocessen, inclusieve ontwikkeling bevorderen? Is het opportuun om te blijven focussen op Nederland, in deze tijden van massale migratie, wereldwijde instabiliteit en economische onzekerheid? 2015 is uitgeroepen tot 'het Europees Jaar voor Ontwikkeling'. Hoe kan Nederland bijdragen aan inclusieve mondiale ontwikkeling?
(In English: "How can you, within the context of global change processes, promote inclusive development? Is it opportune to keep focusing on the Netherlands, in these times of mass migration, worldwide instability and economic uncertainty? 2015 has been declared 'the European Year for Development'. How can the Netherlands contribute to inclusive global development?")
to a bag-of-words like this:
[u'context', u'mondiale', u'veranderingsprocessen', u'inclusieve', u'ontwikkeling', u'bevorderen', u'opportuun', u'focussen', u'nederland', u'tijden', u'massale', u'migratie', u'wereldwijde', u'instabiliteit', u'economische', u'onzekerheid', u'uitgeroepen', u'europees', u'jaar', u'ontwikkeling', u'nederland', u'bijdragen', u'inclusieve', u'mondiale', u'ontwikkeling']
In this case, comparing the input sentence with the resulting bag-of-words, the preprocessing involves the following: tokenization, lowercasing, removal of punctuation and numbers (e.g. "2015"), and removal of Dutch stopwords ("hoe", "kan", "de", and so on). Note that no stemming or lemmatization is applied; "inclusieve", for example, keeps its inflected form.
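A minimal sketch of such a pipeline, assuming a regex tokenizer and a deliberately tiny stopword list (a real pipeline would use a complete Dutch stopword list, e.g. the one shipped with NLTK):
In [ ]:
import re

# Deliberately incomplete stopword list, for illustration only.
DUTCH_STOPWORDS = {"hoe", "kan", "je", "binnen", "de", "van", "is", "het",
                   "om", "te", "blijven", "op", "in", "deze", "en", "tot", "aan"}

def preprocess(text):
    # Lowercase and keep alphabetic tokens only; this also drops
    # punctuation and numbers such as "2015".
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in DUTCH_STOPWORDS]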
The problem is that preprocessing requires a priori knowledge of what we are looking for; deciding which words are uninformative stopwords, for instance, already presupposes what counts as signal.
In the example above, we assumed that there are 10 topics present in the corpus. But there is no guarantee that exactly 10 topics exist in the data.
There is a paper which addresses this issue and attempts to automatically determine the number of topics present in a given data set.
Determining whether a topic is interesting or useful seems like a subjective question. It would be interesting to identify an objective measure of topic quality.
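One candidate for such an objective measure, which also bears on the number-of-topics question above, is topic coherence. Below is a sketch using gensim; the toy corpus and the choice of u_mass coherence are our assumptions, not a prescription from the literature.
In [ ]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy corpus standing in for the real preprocessed documents.
texts = [["mondiale", "ontwikkeling", "inclusieve", "ontwikkeling"],
         ["migratie", "instabiliteit", "economische", "onzekerheid"],
         ["nederland", "europees", "jaar", "ontwikkeling"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Fit models for a range of topic counts and keep the most coherent one.
scores = {}
for k in range(2, 6):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, corpus=corpus,
                        dictionary=dictionary, coherence="u_mass")
    scores[k] = cm.get_coherence()
best_k = max(scores, key=scores.get)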
In the surveyed literature, it is not exactly clear how to visualize topics obtained from topic modeling. One possibility is to come up with some way to measure the distance between two topics. The correct way of visualizing topics is still an open question.
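As a sketch of the distance idea, reusing the Gaussian topics and the grid x from the cells above, one could discretize two topic densities and compare them with the Jensen-Shannon distance (this particular metric is our choice, not something fixed in the literature):
In [ ]:
from scipy.spatial.distance import jensenshannon

# Discretize two topic densities on the grid x and normalize them
# into proper probability distributions.
p = topics[0].pdf(x); p = p / p.sum()
q = topics[1].pdf(x); q = q / q.sum()
print(jensenshannon(p, q))  # 0 means identical topics; larger means further apart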
Sometimes it may be useful to work on different splits of a data set. In such cases, a separate topic model needs to be trained for each split. It would be interesting to investigate whether a topic model trained on one split can be used as a starting point for another.
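With gensim this could be prototyped directly, since LdaModel supports continued (online) training via update(); the splits below are hypothetical stand-ins.
In [ ]:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical splits: two lists of token lists sharing one dictionary.
split_a = [["mondiale", "ontwikkeling"], ["europees", "jaar"]]
split_b = [["migratie", "instabiliteit"], ["economische", "onzekerheid"]]
dictionary = Dictionary(split_a + split_b)
corpus_a = [dictionary.doc2bow(t) for t in split_a]
corpus_b = [dictionary.doc2bow(t) for t in split_b]

lda = LdaModel(corpus_a, num_topics=2, id2word=dictionary)  # train on split A
lda.update(corpus_b)  # continue training on split B, warm-started from split A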