In this lecture, we discuss variational inference for LDA.
Let's assume that we have a Bayesian model with a set of observations $X$, latent variables $Z$, and parameters $\Phi$. We are usually interested in the posterior distribution of this model
$$ p(Z|X,\Phi) = \frac{p(X|Z)\,p(Z|\Phi)}{p(X|\Phi)} = \frac{p(X|Z)\,p(Z|\Phi)}{\int p(X, Z|\Phi)\, dZ}. $$
However, for many interesting models the posterior is intractable because of the normalization constant $p(X|\Phi)$.
The key idea of variational inference is to approximate the posterior over the latent variables by introducing a variational distribution with its own variational parameters
$$ p(Z|X,\Phi) \approx q(Z|\Psi), $$where $\Psi$ is a set of parameters for the variational distribution $q$.
Then, we minimize the difference between the true posterior and the variational distribution, measured by the KL divergence.
$$ KL(q\|p) = \sum_Z q(Z)\log\frac{q(Z)}{p(Z|X)} = \sum_Z q(Z)\log\frac{q(Z)}{p(Z,X)} + \log p(X) = E_q\!\left[\log\frac{q(Z)}{p(Z,X)}\right] + \log p(X) $$
(we omit the parameters $\Phi$ and $\Psi$ for notational simplicity). The first term is the negative ELBO (Evidence Lower BOund), and the second term does not depend on the variational distribution, so we can ignore it during optimization. Therefore, minimizing the KL divergence is equivalent to maximizing the ELBO, $E_q[\log p(Z,X)] - E_q[\log q(Z)]$.
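To make this identity concrete, here is a small numerical check on a toy discrete model. The distributions and variable names below are made up purely for illustration; the point is that, for any $q$, the gap between $\log p(X)$ and the ELBO is exactly $KL(q\|p)$.
In [ ]:
import numpy as np

# toy discrete model (made-up numbers, for illustration only):
# Z takes three values; p_joint[k] = p(Z=k, X=x) for one fixed observation x
p_joint = np.array([0.10, 0.25, 0.15])
p_x = p_joint.sum()                      # evidence p(X)
p_post = p_joint / p_x                   # true posterior p(Z|X)

q = np.array([0.3, 0.4, 0.3])            # an arbitrary variational distribution q(Z)

kl = np.sum(q * np.log(q / p_post))          # KL(q || p(Z|X))
neg_elbo = np.sum(q * np.log(q / p_joint))   # E_q[log q(Z) - log p(Z,X)] = -ELBO
print(kl)
print(neg_elbo + np.log(p_x))                # same value: KL(q||p) = -ELBO + log p(X)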
To maximize the ELBO, we can use standard optimization techniques such as coordinate ascent or Newton's method.
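As a quick reminder of the coordinate-ascent pattern (the toy objective below is invented for illustration and has nothing to do with LDA), we repeatedly maximize over one variable while holding the other fixed; the objective can only increase at each sweep.
In [ ]:
import numpy as np

# toy concave objective (made up for illustration): f(x, y) = -(x-1)^2 - 2(y+2)^2 - x*y
def f(x, y):
    return -(x - 1) ** 2 - 2 * (y + 2) ** 2 - x * y

x, y = 0.0, 0.0
for it in range(20):
    x = 1.0 - y / 2.0        # argmax_x f(x, y): set df/dx = -2(x-1) - y = 0
    y = -2.0 - x / 4.0       # argmax_y f(x, y): set df/dy = -4(y+2) - x = 0
    print(it, round(x, 4), round(y, 4), round(f(x, y), 4))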
Now we discuss how variational inference can be applied to the LDA model. In this section, we use the original (unsmoothed) LDA model, whose joint likelihood is:
$$p(x,z,\theta|\alpha,\beta) = p(x|z,\beta)\, p(z|\theta)\, p(\theta|\alpha).$$
We place fully factorized variational distributions over the latent variables $z$ and $\theta$:
$$q(z|\phi) = \prod_{dn} q(z_{dn}|\phi_{dn})$$$$q(\theta|\gamma) = \prod_{d} q(\theta_{d}|\gamma_{d}).$$Then, the ELBO is
$$\text{ELBO}(\phi,\gamma) = E_q[\log p(\theta|\alpha)] + E_q[\log p(z|\theta)] + E_q[\log p(x|z,\beta)] - E_q[\log q(\theta|\gamma)] - E_q[\log q(z|\phi)].$$
We obtain the update rule for $\phi$ by taking the partial derivative with respect to $\phi_{dnk}$, subject to the constraint $\sum_k \phi_{dnk} = 1$, and setting it to zero (let's derive the update rule!):
$$\phi_{dnk} \propto \beta_{k,x_{dn}} \exp\!\left(\psi(\gamma_{dk}) - \psi\!\left(\textstyle\sum_{k'} \gamma_{dk'}\right)\right),$$
where $\psi(\cdot)$ is the digamma function. Similarly, we obtain the update rule for $\gamma$ by taking the partial derivative with respect to $\gamma_{dk}$ and setting it to zero (let's derive this one too!):
$$\gamma_{dk} = \alpha_k + \sum_n \phi_{dnk}.$$
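Before the vectorized implementation below, here is a minimal un-vectorized sketch of these two updates for a single document, written directly from the equations above. The function `update_single_doc` and its argument names are hypothetical; it assumes `word_idx` holds the vocabulary index of each token in the document, each row of `beta` is a topic-word distribution, and `alpha` is a symmetric scalar prior.
In [ ]:
import numpy as np
from scipy.special import psi   # digamma function

def update_single_doc(word_idx, beta, gamma_d, alpha):
    """One coordinate-ascent sweep of phi and gamma for a single document.

    word_idx : vocabulary index of each token in the document (length N)
    beta     : ntopics x nvoca matrix; each row is a topic-word distribution
    gamma_d  : current variational Dirichlet parameter for this document (length ntopics)
    alpha    : symmetric scalar Dirichlet prior on theta
    """
    N, K = len(word_idx), len(gamma_d)
    phi = np.zeros((N, K))
    # exp(E_q[log theta_dk]) = exp(psi(gamma_dk) - psi(sum_k gamma_dk))
    exp_elog_theta = np.exp(psi(gamma_d) - psi(np.sum(gamma_d)))
    for n, v in enumerate(word_idx):
        # phi_{dnk} is proportional to beta[k, x_dn] * exp(E_q[log theta_dk])
        phi[n, :] = beta[:, v] * exp_elog_theta
        phi[n, :] /= np.sum(phi[n, :])        # normalize over topics
    # gamma_{dk} = alpha + sum_n phi_{dnk}
    gamma_d_new = alpha + np.sum(phi, axis=0)
    return phi, gamma_d_new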
In [111]:
# read a sample corpus from nltk.corpus.brown
# install the nltk package, import nltk, and run nltk.download() to get the corpora provided by nltk
import numpy as np
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
from scipy.special import gammaln

st = set(stopwords.words())
st.add(u'.')
st.add(u',')
st.add(u'\'\'')
st.add(u'``')
st.add(u':')
st.add(u'--')

ndoc = 500
_docs = brown.sents()
docs = list()
for di in range(ndoc):
    doc = _docs[di]
    new_doc = list()
    for word in doc:
        if word.lower() not in st:
            new_doc.append(word.lower())
    docs.append(new_doc)

# construct vocabulary
_voca = set()
for doc in docs:
    _voca = _voca.union(set(doc))
nvoca = len(_voca)
voca = dict()
for word in _voca:
    voca[word] = len(voca)
voca_list = np.array(list(_voca))

# convert each word list to fixed-length vectors over the vocabulary
word_ids = list()  # word appearance (boolean mask over the vocabulary)
word_cnt = list()  # word count
for doc in docs:
    ids = np.zeros(nvoca, dtype=bool)
    cnt = np.zeros(nvoca, dtype=int)
    for word in doc:
        ids[voca[word]] = True
        cnt[voca[word]] += 1
    word_ids.append(ids)
    word_cnt.append(cnt)
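A quick sanity check on the preprocessed corpus (the exact output depends on the Brown corpus and the stopword list, so the numbers are not shown here):
In [ ]:
# quick sanity check on the preprocessed corpus
print('documents: %d, vocabulary size: %d' % (len(docs), nvoca))
print('tokens in first document:', int(np.sum(word_cnt[0])))
print('most frequent words in first document:',
      ' '.join(voca_list[np.argsort(word_cnt[0])[::-1][:5]]))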
In [112]:
# set parameters
alpha = 0.1
ntopics = 10
eps = 1e-100
# initialize topic
beta = np.random.dirichlet([0.1]*nvoca, size=ntopics)
# initialize variational parameters
gamma = np.random.dirichlet([1]*ntopics, size=ndoc)
# accumulate phi for updating beta
sstat = np.zeros([ntopics, nvoca]) + eps
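A small check that the initialization looks reasonable: each row of `beta` and each row of `gamma` is drawn from a Dirichlet, so the rows should sum to one.
In [ ]:
# shapes: (ntopics, nvoca) and (ndoc, ntopics); Dirichlet rows sum to one
print(beta.shape, gamma.shape)
print(np.allclose(beta.sum(1), 1.0), np.allclose(gamma.sum(1), 1.0))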
In [113]:
from scipy.special import gammaln, psi

# keep it simple: run a fixed number of coordinate-ascent sweeps (no convergence check)
max_iter = 100
for it in range(max_iter):
    for di in range(ndoc):
        # E-step for document di:
        # phi_{dnk} is proportional to beta[k, x_dn] * exp(psi(gamma_dk) - psi(sum_k gamma_dk))
        phi = beta * np.exp(psi(gamma[di, :]) - psi(np.sum(gamma[di, :])))[:, np.newaxis]
        # normalize phi over topics, only for the words that appear in this document
        phi[:, word_ids[di]] /= np.sum(phi[:, word_ids[di]], 0)
        # weight by word counts (words not in the document have count 0 and drop out)
        phi *= word_cnt[di]
        # gamma_{dk} = alpha + sum_n phi_{dnk}
        gamma[di, :] = np.sum(phi, 1) + alpha
        # accumulate sufficient statistics for beta
        sstat += phi
    # M-step: set beta to the accumulated (unnormalized) sufficient statistics
    beta = sstat
    sstat = np.zeros([ntopics, nvoca]) + eps
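One property we can check after running the loop, which follows from the $\gamma$ update above (the $\phi_{dn\cdot}$ of each token sum to one over topics): $\sum_k \gamma_{dk}$ should equal `ntopics * alpha` plus the number of tokens in document $d$.
In [ ]:
# sum_k gamma_dk = K * alpha + N_d, since the phi's of each token sum to one over topics
doc_lengths = np.array([np.sum(cnt) for cnt in word_cnt])
print(np.allclose(np.sum(gamma, 1), ntopics * alpha + doc_lengths))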
In [116]:
# print top words for each topic
for topic in range(ntopics):
    print('topic %d : %s' % (topic, ' '.join(voca_list[np.argsort(beta[topic, :])[::-1][0:10]])))
Are there any problems with the above code? Let's improve it, and derive the variational inference algorithm for smoothed LDA (placing a prior over $\beta$)!
In [ ]: