Complex probabilistic models are increasingly prevalent in many domains. However, their open-ended nature raises several challenges: data sets often grow over time, and as they grow they bring new entities and new structures to the fore. Take the problem of learning a topic hierarchy from data. Given a collection of documents, each of which contains a set of words, the goal is to discover common usage patterns, or topics, in the documents and to organize these topics into a hierarchy.
This paper proposes a new method that specifies a generative probabilistic model for hierarchical structures and adopts a Bayesian perspective to learn such structures from data. The hierarchies in this case are random variables that are specified procedurally. The underlying building block is the Chinese restaurant process (CRP), a distribution on partitions of integers. The paper extends the CRP to a hierarchy of partitions, known as the nested Chinese restaurant process (nCRP), and applies it as a prior (and, after inference, posterior) distribution over topic hierarchies. More specifically, each node in the hierarchy is associated with a topic, where a topic is a distribution over words. A document is generated by choosing a path from the root to a leaf, repeatedly sampling topics along that path, and sampling the words from the selected topics. The organization of topics into a hierarchy aims to capture the breadth of usage of topics across the corpus, reflecting underlying syntactic and semantic notions of generality and specificity.
The CRP is analogous to seating customers at tables in a Chinese restaurant. Imagine a Chinese restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 sits at the first table. The next customer either sits at the same table as customer 1 or at the next table. In general, the $m$th customer sits at a table drawn from the following distribution:
\begin{align*} p(\text{occupied table } i \mid \text{previous customers}) &= \frac{m_i}{\gamma+m-1}\\ p(\text{next unoccupied table} \mid \text{previous customers}) &= \frac{\gamma}{\gamma + m - 1} \end{align*}where $m_i$ is the number of previous customers at table $i$, and $\gamma$ is a parameter. After $M$ customers sit down, the seating plan gives a partition of $M$ items. This distribution gives the same partition structure as draws from a Dirichlet process.
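As a minimal sketch of this seating rule (the helper crp_seating and its arguments are illustrative, not code from the paper or the notebook), the growth of the partition can be simulated directly:
import numpy as np

def crp_seating(n_customers, gamma, seed=0):
    # Simulate CRP seating; returns the table index chosen by each customer.
    rng = np.random.default_rng(seed)
    tables = []          # tables[i] = number of customers already at table i
    assignments = []
    for m in range(n_customers):
        # Occupied table i has weight m_i; a new table has weight gamma.
        weights = np.array(tables + [gamma], dtype=float)
        probs = weights / (gamma + m)          # m customers are already seated
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)                   # open the next unoccupied table
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments

print(crp_seating(10, gamma=1.0))              # e.g. [0, 0, 1, 0, 0, 2, ...]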
The nested Chinese restaurant process (nCRP) is an extension of the CRP. Suppose there are an infinite number of infinite-table Chinese restaurants in a city. One restaurant is designated the root restaurant, and on each of its infinite tables is a card with the name of another restaurant. On each of the tables in those restaurants, in turn, are cards that refer to other restaurants, and this structure repeats infinitely. Each restaurant is referred to exactly once; as a result, the whole process can be imagined as an infinitely branched tree.
Now consider a tourist who arrives in the city for a culinary vacation. On the first day, he enters the root Chinese restaurant and selects a table according to the equation above. On the second day, he enters the restaurant referred to by the card on that table and again selects a table from the same equation. This process is repeated for $L$ days, so that at the end the tourist has sat in $L$ restaurants that constitute a path from the root to a restaurant at the $L$th level of the infinite tree. After $M$ tourists take $L$-day vacations, the collection of paths describes a particular $L$-level subtree of the infinite tree.
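A rough sketch of one tourist's $L$-day itinerary, assuming a hypothetical ncrp_path helper that stores the infinite tree lazily as a dictionary of per-node table counts, could look like this:
import numpy as np

def ncrp_path(tree_counts, L, gamma, rng):
    # Draw one root-to-leaf path of length L through a nested CRP.
    # tree_counts maps a node (tuple of table choices from the root) to the
    # list of customer counts for the tables instantiated at that node.
    path, node = [], ()
    for _ in range(L):
        counts = tree_counts.setdefault(node, [])
        m = sum(counts)
        weights = np.array(counts + [gamma], dtype=float)
        choice = rng.choice(len(weights), p=weights / (gamma + m))
        if choice == len(counts):
            counts.append(1)                   # sit at a new table, i.e. a new restaurant
        else:
            counts[choice] += 1
        path.append(choice)
        node = node + (choice,)
    return path

rng = np.random.default_rng(1)
tree = {}
print([ncrp_path(tree, L=3, gamma=1.0, rng=rng) for _ in range(5)])
# five length-3 paths that tend to share prefixes near the root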
The hierarchical latent Dirichlet allocation (hLDA) model combines LDA with the nested Chinese restaurant process (nCRP) to describe how the words in a collection of documents are generated. There are three steps for each document: (1) draw a path from the root node to a leaf; (2) given that path, draw a vector of topic proportions over its levels; (3) draw the words from the topics along the path. Note that all documents share the topic associated with the root restaurant.
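As an illustrative sketch of this generative story (not the notebook's code; generate_document and its arguments are hypothetical), a document of n_words tokens on a fixed depth-$L$ path could be produced as follows:
import numpy as np

def generate_document(path_topics, alpha, n_words, rng):
    # path_topics: list of L word distributions, one per node on the chosen path
    #              (the first entry is the root topic shared by all documents).
    # alpha:       Dirichlet parameter for the per-document level proportions.
    L = len(path_topics)
    theta = rng.dirichlet([alpha] * L)          # mixing proportions over the L levels
    words = []
    for _ in range(n_words):
        level = rng.choice(L, p=theta)          # pick a level on the path for this token
        word = rng.choice(len(path_topics[level]), p=path_topics[level])
        words.append(word)                      # word is a vocabulary index
    return words

rng = np.random.default_rng(0)
vocab_size = 20
path = [rng.dirichlet([0.5] * vocab_size) for _ in range(3)]   # toy 3-level path
print(generate_document(path, alpha=1.0, n_words=15, rng=rng))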
Gibbs sampling is used to sample from the posterior of the nCRP and the corresponding topics in the hLDA model. The sampler is divided into two parts, one for the level assignments $z_{m,n}$ and one for the paths $c_{m,\ell}$. The variables $\theta$ and $\beta$ are integrated out.
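For reference, the collapsed update for the level assignments has the standard form of Griffiths and Steyvers (a reference formula in this notation, not necessarily the exact expression coded in the Z function below):
\begin{align*} p(z_{m,n}=j \mid {\bf z}_{-(m,n)}, {\bf w}) \;\propto\; \frac{n^{(w_{m,n})}_{-(m,n),\,j} + \beta}{n^{(\cdot)}_{-(m,n),\,j} + V\beta} \cdot \frac{n^{(m)}_{-(m,n),\,j} + \alpha}{n^{(m)}_{-(m,n),\,\cdot} + L\alpha} \end{align*}where $n^{(w)}_{-(m,n),\,j}$ counts how often word $w$ is assigned to topic $j$ excluding the current token, $n^{(m)}_{-(m,n),\,j}$ counts the tokens of document $m$ assigned to topic $j$, $V$ is the vocabulary size, and $L$ is the number of topics on the path.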
In [7]:
import numpy as np
from scipy.special import gammaln
import random
from collections import Counter
import string
import graphviz
import pygraphviz
import pydot
In [8]:
def CRP(topic, phi):
'''
CRP gives the probability of topic assignment for a specific word
Returns a 1 * j array, where j is the number of topics
Parameter
---------
topic: a list of lists, contains assigned words in each sublist (topic)
phi: double, parameter for CRP
Return
------
p_crp: the probability of topic assignments for new word
'''
p_crp = np.empty(len(topic)+1)
m = sum([len(x) for x in topic])
p_crp[0] = phi / (phi + m)
for i, word in enumerate(topic):
p_crp[i+1] = len(word) / (phi + m)
return p_crp
In [9]:
def node_sampling(corpus_s, phi):
'''
Node sampling samples the number of topics, L
return a j-layer list of lists, where j is the number of topics
Parameter
---------
corpus_s: a list of lists, contains words in each sublist (document)
phi: double, parameter for CRP
Return
------
topic: a list of lists, contains assigned words in each sublist (topic)
'''
topic = []
for corpus in corpus_s:
for word in corpus:
cm = CRP(topic, phi)
theta = np.random.multinomial(1, (cm/sum(cm))).argmax()
if theta == 0:
topic.append([word])
else:
topic[theta-1].append(word)
return topic
In [4]:
def Z(corpus_s, topic, alpha, beta):
'''
Z samples topic assignments from the LDA model
Returns two j-layer lists of lists, where j is the number of topics
Parameter
---------
corpus_s: a list of lists, contains words in each sublist (document)
topic: a L-dimensional list of lists, sample from node_sampling
alpha: double, parameter
beta: double, parameter
Return
------
z_topic: a j-dimensional list of lists, drawn from the L-dimensional topic list, j < L
z_doc: a j-dimensional list of lists, reporting from which document each word assigned to the topic came
'''
n_vocab = sum([len(x) for x in corpus_s])
t_zm = np.zeros(n_vocab).astype('int')
z_topic = [[] for _ in topic]
z_doc = [[] for _ in topic]
z_tmp = np.zeros((n_vocab, len(topic)))
assigned = np.zeros((len(corpus_s), len(topic)))
n = 0
for i in range(len(corpus_s)):
for d in range(len(corpus_s[i])):
wi = corpus_s[i][d]
for j in range(len(topic)):
lik = (z_topic[j].count(wi) + beta) / (assigned[i, j] + n_vocab * beta)
pri = (len(z_topic[j]) + alpha) / ((len(corpus_s[i]) - 1) + len(topic) * alpha)
z_tmp[n, j] = lik * pri
t_zm[n] = np.random.multinomial(1, (z_tmp[n,:]/sum(z_tmp[n,:]))).argmax()
z_topic[t_zm[n]].append(wi)
z_doc[t_zm[n]].append(i)
assigned[i, t_zm[n]] += 1
n += 1
z_topic = [x for x in z_topic if x != []]
z_doc = [x for x in z_doc if x != []]
return z_topic, z_doc
In [5]:
def CRP_prior(corpus_s, doc, phi):
'''
CRP_prior gives the prior implied by the nCRP
Returns an m*j array, where m is the number of documents and j is the number of topics
Parameter
---------
corpus_s: a list of lists, contains words in each sublist (document)
doc: a j-dimensional list of lists, drawn from the Z function (z_doc)
phi: double, parameter for CRP
Return
------
c_p: a m*j array, for each document the probability of the topics
'''
c_p = np.empty((len(corpus_s), len(doc)))
for i, corpus in enumerate(corpus_s):
p_topic = [[x for x in doc[j] if x != i] for j in range(len(doc))]
tmp = CRP(p_topic, phi)
c_p[i,:] = tmp[1:]
return c_p
In [6]:
def likelihood(corpus_s, topic, eta):
'''
likelihood gives the probability of the data given a particular choice of c
Returns an m*j array, where m is the number of documents and j is the number of topics
Parameter
---------
corpus_s: a list of lists, contains words in each sublist (document)
topic: a j-dimensional list of lists, drawn from Z function (z_assigned)
eta: double, parameter
Return
------
w_m: a m*j array
'''
w_m = np.empty((len(corpus_s), len(topic)))
allword_topic = [word for t in topic for word in t]
n_vocab = sum([len(x) for x in corpus_s])
for i, corpus in enumerate(corpus_s):
prob_result = []
for j in range(len(topic)):
current_topic = topic[j]
n_word_topic = len(current_topic)
prev_dominator = 1
later_numerator = 1
prob_word = 1
overlap = [val for val in set(corpus) if val in current_topic]
prev_numerator = gammaln(len(current_topic) - len(overlap) + n_vocab * eta)
later_dominator = gammaln(len(current_topic) + n_vocab * eta)
for word in corpus:
corpus_list = corpus
if current_topic.count(word) - corpus_list.count(word) < 0 :
a = 0
else:
a = current_topic.count(word) - corpus_list.count(word)
prev_dominator += gammaln(a + eta)
later_numerator += gammaln(current_topic.count(word) + eta)
prev = prev_numerator - prev_dominator
later = later_numerator - later_dominator
like = prev + later
w_m[i, j] = like
w_m[i, :] = w_m[i, :] + abs(min(w_m[i, :]) + 0.1)
w_m = w_m/w_m.sum(axis = 1)[:, np.newaxis]
return w_m
In [7]:
def post(w_m, c_p):
'''
Parameter
---------
w_m: likelihood, drawn from likelihood function
c_p: prior, drawn from CRP_prior function
Return
------
c_m, a m*j list of lists
'''
c_m = (w_m * c_p) / (w_m * c_p).sum(axis = 1)[:, np.newaxis]
return np.array(c_m)
In [8]:
def wn(c_m, corpus_s):
'''
wn returns the assignment of words to topics, drawn from a multinomial distribution
Returns an n*1 array, where n is the total number of words
Parameter
---------
c_m: a m*j list of lists, drawn from post function
corpus_s: a list of lists, contains words in each sublist (document)
Return
------
wn_ass: a n*1 array, report the topic assignment for each word
'''
wn_ass = []
for i, corpus in enumerate(corpus_s):
for word in corpus:
theta = np.random.multinomial(1, c_m[i]).argmax()
wn_ass.append(theta)
return np.array(wn_ass)
In [9]:
most_common = lambda x: Counter(x).most_common(1)[0][0]
In [10]:
def gibbs(corpus_s, topic, alpha, beta, phi, eta, ite):
'''
gibbs will return the distribution of words for topics
Return a j-dimensional list of lists, where j is the number of topics
Parameter
---------
corpus_s: a list of lists, contains words in each sublist (document)
topic: a j-dimensional list of lists, drawn from Z function (z_assigned)
alpha: double, parameter for Z function
beta: double, parameter for Z function
phi: double, parameter for the CRP_prior function
eta: double, parameter for w_n function
ite: int, number of iteration
Return
------
wn_topic: a j-dimensional list of lists, the distribution of words for topics
'''
n_vocab = sum([len(x) for x in corpus_s])
gibbs = np.empty((n_vocab, ite)).astype('int')
for i in range(ite):
z_topic, z_doc = Z(corpus_s, topic, alpha, beta)
c_p = CRP_prior(corpus_s, z_doc, phi)
w_m = likelihood(corpus_s, z_topic, eta)
c_m = post(w_m, c_p)
gibbs[:, i] = wn(c_m, corpus_s)
# drop first 1/10 data
gibbs = gibbs[:, int(ite/10):]
theta = [most_common(gibbs[x]) for x in range(n_vocab)]
n_topic = max(theta)+1
wn_topic = [[] for _ in range(n_topic)]
wn_doc_topic = [[] for _ in range(n_topic)]
doc = 0
n = 0
for i, corpus_s in enumerate(corpus_s):
if doc == i:
for word in corpus_s:
wn_doc_topic[theta[n]].append(word)
n += 1
for j in range(n_topic):
if wn_doc_topic[j] != []:
wn_topic[j].append(wn_doc_topic[j])
wn_doc_topic = [[] for _ in range(n_topic)]
doc += 1
wn_topic = [x for x in wn_topic if x != []]
return wn_topic
The Gibbs sampling in Section IV distributes the vocabulary of the documents in the corpus among the available topics, which are sampled from the $L$-dimensional set of topics. In Section V, the resulting $n$-level tree is displayed as a tree plot, in which the root node is more general and the leaves are more specific. In addition, the tree plot reports the words of each node sorted by their frequencies.
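The per-node frequency sorting mentioned above is just collections.Counter; a tiny illustration:
from collections import Counter

node_words = ['the', 'of', 'the', 'model', 'the', 'of', 'data']
# top words of a node, sorted by frequency, as used for the tree-plot labels
print([w for w, _ in Counter(node_words).most_common(3)])   # ['the', 'of', 'model']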
In [11]:
def hLDA(corpus_s, alpha, beta, phi, eta, ite, level):
'''
hLDA generates an n*1 list of lists, where n is the number of level
Parameter
---------
corpus_s: a list of lists, contains words in each sublist (document)
alpha: double, parameter for Z function
beta: double, parameter for Z function
phi: double, parameter for the CRP_prior function
eta: double, parameter for w_n function
ite: int, number of iteration
level: int, number of level
Return
hLDA_tree: an n*1 list of lists, each sublist represents a level, the sublist in each level represents a topic
node: an n*1 list of lists, returns how many nodes there are in each level
'''
topic = node_sampling(corpus_s, phi)
print(len(topic))
hLDA_tree = [[] for _ in range(level)]
tmp_tree = []
node = [[] for _ in range(level+1)]
node[0].append(1)
for i in range(level):
if i == 0:
wn_topic = gibbs(corpus_s, topic, alpha, beta, phi, eta, ite)
node_topic = [x for word in wn_topic[0] for x in word]
hLDA_tree[0].append(node_topic)
tmp_tree.append(wn_topic[1:])
tmp_tree = tmp_tree[0]
node[1].append(len(wn_topic[1:]))
else:
for j in range(sum(node[i])):
if tmp_tree == []:
break
wn_topic = gibbs(tmp_tree[0], topic, alpha, beta, phi, eta, ite)
node_topic = [x for word in wn_topic[0] for x in word]
hLDA_tree[i].append(node_topic)
tmp_tree.remove(tmp_tree[0])
if wn_topic[1:] != []:
tmp_tree.extend(wn_topic[1:])
node[i+1].append(len(wn_topic[1:]))
return hLDA_tree, node[:level]
In [12]:
def HLDA_plot(hLDA_object, Len = 8, save = False):
from IPython.display import Image, display
def viewPydot(pdot):
plt = Image(pdot.create_png())
display(plt)
words = hLDA_object[0]
struc = hLDA_object[1]
graph = pydot.Dot(graph_type='graph')
end_index = [np.insert(np.cumsum(i),0,0) for i in struc]
for level in range(len(struc)-1):
leaf_level = level + 1
leaf_word = words[leaf_level]
leaf_struc = struc[leaf_level]
word = words[level]
end_leaf_index = end_index[leaf_level]
for len_root in range(len(word)):
root_word = '\n'.join([x[0] for x in Counter(word[len_root]).most_common(Len)])
leaf_index = leaf_struc[len_root]
start = end_leaf_index[len_root]
end = end_leaf_index[len_root+1]
lf = leaf_word[start:end]
for l in lf:
leaf_w = '\n'.join([x[0] for x in Counter(list(l)).most_common(Len)])
edge = pydot.Edge(root_word, leaf_w)
graph.add_edge(edge)
if save == True:
graph.write_png('graph.png')
viewPydot(graph)
For the simulated-data example, each document $d$ in the corpus is generated from a normal distribution with a different number of words $w_{d,n}$, where $n\in\{10,...,200\}$ and ${\bf w}_{d}\sim N(0, 1)$. In this example, generating 35 documents in the corpus, we expect the simulated tree to place words near the mean $0$, such as {w0, w1, w-1}, in the root node and words far from the mean, such as {w10, w-10, w15}, in the leaves.
In [13]:
def sim_corpus(n):
n_rows = n
corpus = [[] for _ in range(n_rows)]
for i in range(n_rows):
n_cols = np.random.randint(10, 200, 1, dtype = 'int')[0]
for j in range(n_cols):
num = np.random.normal(0, 1, n_cols)
word = 'w%s' % int(round(num[j], 1)*10)
corpus[i].append(word)
return corpus
In [14]:
corpus_0 = sim_corpus(35)
In [15]:
tree_0 = hLDA(corpus_0, 0.1, 0.01, 2, 0.01, 100, 3)
In [16]:
HLDA_plot(tree_0, 5, False)
For the real-data example, the corpus of documents is taken from Blei's sample data. The documents are split by paragraph; that is, each paragraph represents one document. We take the first 11 documents to form the sample corpus used in the hLDA model. To form the corpus, we read it as a large list of lists: the sublists of the nested list represent the documents, and the elements of each sublist represent the words of a specific document. Note that punctuation is removed from the corpus.
In [1]:
def read_corpus(corpus_path):
punc = ['`', ',', "'", '.', '!', '?']
corpus = []
with open(corpus_path, 'r') as f:
for line in f:
for x in punc:
line = line.replace(x, '')
line = line.strip('\n')
word = line.split(' ')
corpus.append(word)
return(corpus)
In [2]:
corpus_1 = read_corpus('sample.txt')
In [19]:
tree_1 = hLDA(corpus_1, 0.1, 0.01, 1, 0.01, 100, 3)
In [20]:
HLDA_plot(tree_1, 5, False)
The hLDA code for the paper Hierarchical Topic Models and the Nested Chinese Restaurant Process is released on GitHub as a package named hLDA. One can download it and install it by running python setup.py install. The package provides four functions:
- hLDA.sim_corpus(n): returns a simulated corpus with $n$ documents.
  - n: int, number of documents in the corpus
- hLDA.read_corpus(corpus_path): returns the corpus as a list of lists of length $n$, where $n$ is the number of documents.
- hLDA.hLDA(corpus, alpha, beta, phi, eta, iteration, level): returns an $n$-level tree, where $n$ is the input level.
  - corpus: list, read from hLDA.read_corpus or simulated from sim_corpus
  - alpha: double, parameter for the Z function
  - beta: double, parameter for the Z function
  - phi: double, parameter for the CRP_prior function
  - eta: double, parameter for the w_n function
  - iteration: int, number of iterations for Gibbs sampling
  - level: int, number of levels
- hLDA.HLDA_plot(hLDA_result, n_words, save): returns a tree plot of the hLDA topic model.
  - hLDA_result: result from hLDA.hLDA
  - n_words: int, how many words to show in each node (sorted by frequency), default 5
  - save: boolean, whether to save the plot, default False
Note that the packages required by hLDA are: (1) numpy; (2) scipy; (3) collections; (4) string; (5) pygraphviz; (6) pydot.
In [1]:
import hLDA
In [2]:
sim = hLDA.sim_corpus(5)
print(sim[0])
In [3]:
corpus = hLDA.read_corpus('sample.txt')
print(corpus[0])
In [4]:
tree = hLDA.hLDA(corpus, 0.1, 0.01, 1, 0.01, 10, 3)
In [5]:
hLDA.HLDA_plot(tree)
To optimize the hLDA model we turned to Cython to speed the functions up, since the only matrix calculation, the computation of c_m in the post function, was already vectorized. However, applying Cython does not speed the code up appreciably. The possible reasons are as follows.
First, when speeding up a single function, Cython does well. Take the first function, node_sampling, for example: the run time decreases from 52.2 ms to 47.2 ms, which means the Cython version is about 10% faster than the Python code. On the other hand, if we try to speed up all the functions used in the Gibbs sampling function, gibbs, the run time is similar or even slower, since it has to import the external Cython functions to complete the work.
Second, most of the variables used in hLDA are lists. When writing the Cython version we were not able to declare efficient static types for these list variables.
In [4]:
%load_ext Cython
In [ ]:
%%cython -a
cimport cython
cimport numpy as np
import numpy as np
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
def CRP_c(list topic, double phi):
cdef double[:] cm = np.empty(len(topic)+1)
cdef int m = sum([len(x) for x in topic])
cm[0] = phi / (phi + m)
cdef int i
cdef list word
for i, word in enumerate(topic):
cm[i+1] = len(word) / (phi + m)
return np.array(cm)
def node_sampling_c(list corpus_s, double phi):
cdef list topic = []
cdef int theta
cdef list corpus
cdef str word
for corpus in corpus_s:
for word in corpus:
cm = CRP_c(topic, phi)
theta = np.random.multinomial(1, (cm/sum(cm))).argmax()
if theta == 0:
topic.append([word])
else:
topic[theta-1].append(word)
return topic
In [6]:
%timeit node_sampling_c(corpus_1, 1)
In [10]:
%timeit node_sampling(corpus_1, 1)
This section introduces the LDA model as a comparison with the hLDA model. The LDA model requires the user to specify the number of topics and returns the probability of the words in each topic, which are the most important differences from the hLDA model. The hLDA model applies a nonparametric prior which allows an arbitrary number of factors and readily accommodates growing data collections. That is, the hLDA model samples the number of topics through the nCRP and returns a topic hierarchy tree.
The lda_topic function returns a single layer of word distributions for the topics, whose number is specified as a parameter of the function. For each topic, the LDA model gives a probability distribution over possible words. The LDA model also treats the corpus as one big document instead of considering each document on its own. Furthermore, the model is not able to illustrate the relationships between topics and words that the hLDA model provides.
In [2]:
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
In [3]:
def lda_topic(corpus_s, dic, n_topics, ite):
lda = gensim.models.ldamodel.LdaModel(corpus = corpus_s,
id2word = dic,
num_topics = n_topics,
update_every = 1,
chunksize = 1,
passes = 1,
iterations = ite)
return lda.print_topics()
In [4]:
corpus = read_corpus('sample.txt')
In [5]:
def lda_corpus(corpus_s):
texts = []
tokenizer = RegexpTokenizer(r'\w+')
for doc in corpus_s:
for word in doc:
raw = word.lower()
tokens = tokenizer.tokenize(raw)
texts.append(tokens)
dictionary = corpora.Dictionary(texts)
n_corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('sample.mm', n_corpus)
sample = gensim.corpora.MmCorpus('sample.mm')
return sample, dictionary
In [6]:
sample, dic = lda_corpus(corpus)
In [7]:
lda_topic(sample, dic, 3, 5000)
Out[7]:
By introducing the nCRP as a nonparametric prior for a hierarchical extension of LDA, we obtain hLDA. First, in the hLDA topic model the words are allocated by Gibbs sampling of two critical variables, ${\bf z}$ and ${\bf c}_{m}$. The former, ${\bf z}$, describes how words are allocated to each topic, and thus determines the number of topics under each parent node. The latter, ${\bf c}_{m}$, the posterior combining the likelihood (${\bf w}_{m}$) and the nCRP prior, is the set of candidate paths corresponding to the topics simulated from ${\bf z}$ for each document $m$. After setting up ${\bf z}$ and ${\bf c}_{m}$, hLDA runs the Gibbs sampler to draw $w_{n}$, the assignment of the words to the topics drawn from ${\bf z}$ and ${\bf c}_{m}$. Finally, we write the hLDA function and the HLDA_plot function to print the result as a list and to plot it as a topic tree.
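Concretely, the quantity computed in the post function corresponds to the path posterior from the paper,
\begin{align*} p({\bf c}_{m} \mid {\bf w}, {\bf c}_{-m}, {\bf z}) \;\propto\; p({\bf w}_{m} \mid {\bf c}, {\bf w}_{-m}, {\bf z})\, p({\bf c}_{m} \mid {\bf c}_{-m}), \end{align*}i.e., the likelihood of document $m$'s words under a candidate path times the nCRP prior on that path given the other documents' paths.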
[1] Griffiths, Thomas L., and Mark Steyvers. "A probabilistic approach to semantic representation." Proceedings of the 24th Annual Conference of the Cognitive Science Society. 2002.
[2] Blei, David M., Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. "Hierarchical topic models and the nested Chinese restaurant process." Advances in Neural Information Processing Systems 16 (2004): 17.
[3] Blei, David M., Thomas L. Griffiths, and Michael I. Jordan. "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies." Journal of the ACM (JACM) 57.2 (2010): 7.