Variational Inference for LDA

In this lecture, we discuss variational inference for LDA.

Variational inference in general

Let's assume that we have a Bayesian model with a set of observations $X$, latent variables $Z$, and parameters of the latent variables $\Phi$. We are usually interested in the posterior distribution of this model:

$$ p(Z|X,\Phi) = \frac{p(X|Z)p(Z|\Phi)}{p(X | \Phi)} \\ = \frac{p(X|Z)p(Z|\Phi)}{\int p(X, Z| \Phi) dZ}. $$

However, for many interesting models the posterior is intractable because of the normalization constant $p(X|\Phi)$.

The key idea of variational inference is to approximate the posterior over the latent variables by introducing a variational distribution with its own variational parameters:

$$ p(Z|X,\Phi) \approx q(Z|\Psi), $$

where $\Psi$ is a set of parameters for the variational distribution $q$.

Then, we minimize the difference between the true posterior and the variational distribution in terms of the KL divergence:

$$ KL(q||p) = \sum_Z q(Z)\log\frac{q(Z)}{p(Z|X)} \\ = \sum_Z q(Z)\log\frac{q(Z)}{p(Z,X)} + \log p(X) \\ = E_q\left[\log\frac{q(Z)}{p(Z,X)}\right] + \log p(X). $$

(We omit the parameters $\Phi$ and $\Psi$ for notational simplicity.) The first term is the negative of the ELBO (Evidence Lower BOund), and we can ignore the second term since it does not depend on the variational distribution. Therefore, minimizing the KL divergence is equivalent to maximizing the ELBO, $E_q[\log p(Z,X)] - E_q[\log q(Z)]$.

To maximize the ELBO, we can use standard optimization techniques such as coordinate ascent or Newton's method.
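
For example, under a fully factorized (mean-field) family $q(Z) = \prod_j q_j(Z_j)$, the coordinate ascent update that maximizes the ELBO with respect to a single factor $q_j$, holding the other factors fixed, takes the form

$$ q_j^*(Z_j) \propto \exp\left( E_{q_{-j}}[\log p(X, Z)] \right), $$

where the expectation is taken over all factors except $q_j$. Iterating this update over the factors is the scheme we will use for LDA below.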

Variational inference for LDA

Now we discuss how variational inference can be applied to the LDA model. In this section, we use the original (unsmoothed) LDA model, whose joint likelihood is:

$$p(w,z,\theta|\alpha,\beta) = p(w|z,\beta)\, p(z|\theta)\, p(\theta|\alpha).$$

We place fully factorized variational distributions over the latent variables $z$ and $\theta$, where $q(z_{dn}|\phi_{dn})$ is a multinomial over topics and $q(\theta_d|\gamma_d)$ is a Dirichlet:

$$q(z|\phi) = \prod_{dn} q(z_{dn}|\phi_{dn}),$$
$$q(\theta|\gamma) = \prod_{d} q(\theta_{d}|\gamma_{d}).$$

Then, the ELBO is

$$\text{ELBO}(\phi,\gamma) = E_q[\log p(\theta|\alpha)] + E_q[\log p(z|\theta)] + E_q[\log p(w|z,\beta)]\\ -E_q[\log q(\theta|\gamma)] - E_q[\log q(z|\phi)] $$
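
To evaluate these expectations we only need the standard result that, for $\theta_d \sim \mathrm{Dirichlet}(\gamma_d)$,

$$ E_q[\log \theta_{dk}] = \Psi(\gamma_{dk}) - \Psi\Big(\sum_{k'} \gamma_{dk'}\Big), $$

where $\Psi(\cdot)$ is the digamma function (computed by scipy.special.psi in the code below; not to be confused with the generic variational parameters $\Psi$ above).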

We can obtain the update rule for $\phi$ by taking the partial derivative of the ELBO with respect to $\phi_{dnk}$ (adding a Lagrange multiplier for the constraint $\sum_k \phi_{dnk} = 1$) and setting it to zero (Let's derive the update rule!!!):

$$\phi_{dnk} \propto \beta_{kv} \exp\Big(\Psi(\gamma_{dk}) - \Psi\Big(\sum_{k'} \gamma_{dk'}\Big)\Big), \quad v = w_{dn},$$

and obtain the update rule for $\gamma$ by taking the partial derivative with respect to $\gamma_{dk}$ and setting it to zero (Let's derive the update rule!!!):

$$\gamma_{dk} = \alpha_k + \sum_n \phi_{dnk}.$$
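
Before turning to the corpus and the vectorized implementation below, here is a minimal per-document sketch of these two updates with explicit loops. The function and argument names (update_document, doc_words, n_inner) are hypothetical and only illustrate the update rules; beta is assumed to be a (ntopics, nvoca) topic-word matrix, gamma_d the variational Dirichlet parameter of one document, and alpha a symmetric scalar prior.

In [ ]:
import numpy as np
from scipy.special import psi

def update_document(doc_words, beta, gamma_d, alpha, n_inner=20):
    # doc_words : list of word ids, one per token (hypothetical input format)
    # beta      : (ntopics, nvoca) topic-word probabilities
    # gamma_d   : (ntopics,) variational Dirichlet parameter for this document
    # alpha     : scalar symmetric Dirichlet prior
    ntopics = beta.shape[0]
    phi_d = np.zeros((len(doc_words), ntopics))
    for _ in range(n_inner):
        Elogtheta = psi(gamma_d) - psi(np.sum(gamma_d))  # E_q[log theta_dk]
        for n, v in enumerate(doc_words):
            phi_d[n] = beta[:, v] * np.exp(Elogtheta)    # phi_dnk propto beta_kv exp(E_q[log theta_dk])
            phi_d[n] /= np.sum(phi_d[n])                 # normalize over topics
        gamma_d = alpha + np.sum(phi_d, 0)               # gamma_dk = alpha_k + sum_n phi_dnk
    return phi_d, gamma_d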

In [111]:
# read sample corpus from nltk.corpus.brown corpus
# install nltk package, import nltk, and run nltk.download() to get corpora provided by nltk
import numpy as np
import nltk
from nltk.corpus import brown
from nltk.corpus import stopwords
from scipy.special import gammaln

st = set(stopwords.words())
st.add(u'.')
st.add(u',')
st.add(u'\'\'')
st.add(u'``')
st.add(u':')
st.add(u'--')

ndoc = 500

_docs = brown.sents()
docs = list()
for di in range(ndoc):
    doc = _docs[di]
    new_doc = list()
    for word in doc:
        if word.lower() not in st:
            new_doc.append(word.lower())
    docs.append(new_doc)
    
# construct vocabulary
_voca = set()
for doc in docs:
    _voca = _voca.union(set(doc))
    
nvoca = len(_voca)
voca = dict()

for word in _voca:
    voca[word] = len(voca)
voca_list = np.array(list(_voca))

# convert word list to vector
word_ids = list()   # word appearance
word_cnt = list()   # word count
for doc in docs:
    ids = np.zeros(nvoca, dtype=int)
    cnt = np.zeros(nvoca, dtype=int)
    for word in doc:
        ids[voca[word]] = 1
        cnt[voca[word]] += 1
    word_ids.append(ids)
    word_cnt.append(cnt)
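
As a quick usage check of the bag-of-words structures just built (a sketch added here; it only reads word_cnt and voca_list), we can list the words of the first document together with their counts:

In [ ]:
# sanity check: print the words of document 0 with their counts
d0 = word_cnt[0]
order = np.argsort(d0)[::-1]
print(' '.join('%s(%d)' % (voca_list[v], d0[v]) for v in order if d0[v] > 0))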

In [112]:
# set parameters
alpha = 0.1
ntopics = 10
eps = 1e-100

# initialize topic
beta = np.random.dirichlet([0.1]*nvoca, size=ntopics)
# initialize variational parameters
gamma = np.random.dirichlet([1]*ntopics, size=ndoc)

# accumulate phi (sufficient statistics) for updating beta
sstat = np.zeros([ntopics, nvoca]) + eps

In [113]:
from scipy.special import gammaln, psi

#make it simple!
max_iter = 100
for it in range(max_iter):
    for di in range(ndoc):
        phi = beta[:, word_ids[di]] * np.exp(psi(gamma[di,:]) - psi(np.sum(gamma[di,:])))[:,np.newaxis]
        phi[:,word_ids[di]] /= np.sum(phi[:,word_ids[di]],0)
        phi *= word_cnt[di]
        gamma[di,:] = np.sum(phi,1) + alpha
        sstat += phi
    beta = sstat
    sstat = np.zeros([ntopics, nvoca]) + eps
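
To monitor convergence we would normally track the ELBO across iterations. The following per-document ELBO is a minimal sketch added here (not part of the original loop); the name doc_elbo is just for illustration, and it assumes a symmetric scalar prior alpha and rows of beta normalized to sum to one. It reuses psi and gammaln from scipy.special imported above.

In [ ]:
def doc_elbo(cnt, gamma_d, beta, alpha, eps=1e-100):
    # cnt     : (nvoca,) word counts of one document
    # gamma_d : (ntopics,) variational Dirichlet parameter of that document
    # beta    : (ntopics, nvoca) topic-word probabilities (rows sum to one)
    # alpha   : scalar symmetric Dirichlet prior
    K = beta.shape[0]
    Elogtheta = psi(gamma_d) - psi(np.sum(gamma_d))        # E_q[log theta_dk]
    r = beta * np.exp(Elogtheta)[:, np.newaxis] + eps      # per-word responsibilities
    r /= np.sum(r, 0)
    # E_q[log p(theta|alpha)] for a symmetric Dirichlet prior
    elbo = gammaln(K * alpha) - K * gammaln(alpha) + np.sum((alpha - 1.0) * Elogtheta)
    # E_q[log p(z|theta)] + E_q[log p(w|z,beta)] - E_q[log q(z)]
    elbo += np.sum(cnt * np.sum(r * (Elogtheta[:, np.newaxis]
                                     + np.log(beta + eps) - np.log(r)), 0))
    # -E_q[log q(theta|gamma)]
    elbo -= (gammaln(np.sum(gamma_d)) - np.sum(gammaln(gamma_d))
             + np.sum((gamma_d - 1.0) * Elogtheta))
    return elbo

Summing doc_elbo(word_cnt[di], gamma[di, :], beta, alpha) over documents after each iteration should increase monotonically for a correct implementation, which is a useful check when improving the code below.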

In [116]:
# print top words for each topic
for topic in range(ntopics):
    print('topic %d : %s' % (topic, ' '.join(voca_list[np.argsort(beta[topic,:])[::-1][0:10]])))


topic 0 : even communists limited nato despite alliance today capital heart line
topic 1 : said would state president mr. city administration year one jury
topic 2 : said would state president mr. city administration year one jury
topic 3 : said would state president mr. city administration year one jury
topic 4 : said would state president mr. city administration year one jury
topic 5 : said would state president mr. city administration year one jury
topic 6 : limited provide needs community essential basis vital volume shopping operate
topic 7 : said would state president mr. city administration year one jury
topic 8 : limited two cases adc problems reported community martin employment discrimination
topic 9 : said would state president mr. city administration year one jury

Are there any problems with the above code? Let's improve it!

Also, derive the variational algorithm for smoothed LDA (place a prior over $\beta$).
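
As a hint (the standard result for the smoothed model, stated here without derivation): placing a Dirichlet prior $\beta_k \sim \mathrm{Dirichlet}(\eta)$ on each topic and adding a factorized variational Dirichlet $q(\beta_k|\lambda_k)$ leads to the extra update

$$ \lambda_{kv} = \eta + \sum_d \sum_n \phi_{dnk}\, \mathbf{1}[w_{dn} = v], $$

while the $\phi$ update uses $\exp\big(\Psi(\lambda_{kv}) - \Psi(\sum_{v'}\lambda_{kv'})\big)$ in place of $\beta_{kv}$.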


In [ ]: