$$ \newcommand{\cond}{{\mkern+2mu} \vert {\mkern+2mu}} \newcommand{\SetDiff}{\mathrel{\backslash}} \DeclareMathOperator{\BetaFunc}{Β} \DeclareMathOperator{\GammaFunc}{Γ} \DeclareMathOperator{\prob}{p} \DeclareMathOperator{\dcategorical}{Categorical} \DeclareMathOperator{\ddirichlet}{Dirichlet} $$

Collapsed Gibbs Sampling for LDA

Notation

  • $\ddirichlet_N(\alpha)$ is the symmetric $N$–dimensional Dirichlet distribution, all of whose $N$ parameters are $\alpha$

  • $\dcategorical_N(\alpha)$ is the (asymmetric) $N$–dimensional categorical distribution whose $N$ parameters are given by the vector $\alpha$

  • $\BetaFunc_N(\alpha)$, with a vector argument, is the (asymmetric) $N$–dimensional beta function defined by

$$\BetaFunc_N(\alpha) = \frac{\prod_n \GammaFunc(\alpha_n)}{\GammaFunc(\sum_n \alpha_n)}$$
  • $\BetaFunc_N(\alpha)$, with a scalar argument, is the symmetric $N$–dimensional beta function defined by
$$\BetaFunc_N(\alpha) = \frac{\prod_n \GammaFunc(\alpha)}{\GammaFunc(\sum_n \alpha)} = \frac{\GammaFunc^N(\alpha)}{\GammaFunc(N\alpha)}$$
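When these beta functions are needed numerically (as in the collapsed joint further below), they are best evaluated on the log scale via the log-gamma function. A minimal sketch, assuming NumPy and SciPy are available; the function names are mine, not from the text:

```python
import numpy as np
from scipy.special import gammaln


def log_beta(alpha):
    """Log of the asymmetric beta function B_N(alpha) for a parameter vector alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))


def log_beta_symmetric(alpha, n):
    """Log of the symmetric n-dimensional beta function with scalar parameter alpha."""
    return n * gammaln(alpha) - gammaln(n * alpha)
```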

Glossary

  • $D$ documents

  • $K$ topics

  • $W$ words in the vocabulary

  • $\theta_d$ is a per document $K$–vector of probabilities over the topics

  • $\phi_k$ is a per topic $W$–vector of probabilities over the words in the vocabulary

  • $z_{id}$ is the unobserved latent variable giving the topic assignment for word $i$ of document $d$

  • $x_{id}$ is the observed variable giving the vocabulary choice for word $i$ of document $d$

  • $Z$ is the collection of all the latent topic assignment variables

  • $X$ is the collection of all the observed vocabulary assignment variables

  • $\Theta$ is the collection of per document topic distributions

  • $\Phi$ is the collection of per topic word distributions

  • $Z^{\SetDiff id}$ is the collection of all the latent topic assignment variables excepting that for topic assignment $z_{id}$

  • $\Theta^{\SetDiff d}$ is the collection of per document topic distributions excepting that for document $d$

  • $\Phi^{\SetDiff k}$ is the collection of per topic word distributions excepting that for topic $k$

Statistics

  • $N_d$, the number of instances of words in document $d$

  • $A_{dk}$, the number of $z_{id}$ variables assigned to topic $k$ in document $d$

$$ A_{dk} = \sum_{i=1}^{N_d} \delta(z_{id} = k) $$
  • $B_{kw}$, the number of times an instance of word $w$ is assigned to topic $k$
$$ B_{kw} = \sum_{d=1}^D \sum_{i=1}^{N_d} \delta(x_{id} = w) \delta(z_{id} = k) $$
  • $A_{dk}^{\SetDiff id}$, the $A_{dk}$ statistic taken over all variables except $z_{id}$
$$ \begin{align} A_{dk}^{\SetDiff id} &= \sum_{i' \ne i} \delta(z_{i'd} = k) \\ &= A_{dk} - \delta(z_{id} = k) \end{align} $$
  • $B_{kw}^{\SetDiff id}$, the $B_{kw}$ statistic taken over all variables except $z_{id}$
$$ \begin{align} B_{kw}^{\SetDiff id} &= B_{kw} - \delta(x_{id} = w) \delta(z_{id} = k) \end{align} $$
  • $M_k$, the total number of instances of words assigned to topic $k$, and the matching statistic, $M_k^{\SetDiff id}$
$$ \begin{align} M_k &= \sum_{w=1}^W B_{kw} \\ M_k^{\SetDiff id} &= \sum_{w=1}^W B_{kw}^{\SetDiff id} \end{align} $$
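All of these statistics can be accumulated in one pass over the assignments. A sketch, assuming the corpus is stored as ragged lists `x[d][i]` and `z[d][i]` of vocabulary and topic indices; this representation and the variable names are my own, not from the text:

```python
import numpy as np


def count_statistics(x, z, D, K, W):
    """Accumulate the A, B and M count statistics from the assignments."""
    A = np.zeros((D, K), dtype=int)   # A[d, k]: words in document d assigned to topic k
    B = np.zeros((K, W), dtype=int)   # B[k, w]: instances of word w assigned to topic k
    for d in range(D):
        for x_id, z_id in zip(x[d], z[d]):
            A[d, z_id] += 1
            B[z_id, x_id] += 1
    M = B.sum(axis=1)                 # M[k]: total words assigned to topic k
    return A, B, M
```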

Probabilities

The LDA model is then given by

$$ \begin{align} \theta_d \cond \alpha &\sim \ddirichlet_K(\alpha) \\ \phi_k \cond \beta &\sim \ddirichlet_W(\beta) \\ z_{id} \cond \theta_d &\sim \dcategorical_K(\theta_d) \\ x_{id} \cond z_{id}, \phi_{z_{id}} &\sim \dcategorical_W(\phi_{z_{id}}) \end{align} $$
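Read as a generative process: draw one topic distribution per document and one word distribution per topic, then for each word position draw a topic from the document's distribution and a word from that topic's distribution. A minimal sketch of sampling from this process with NumPy; all sizes and hyperparameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, W = 5, 3, 20           # documents, topics, vocabulary size (illustrative)
alpha, beta = 0.1, 0.01      # symmetric Dirichlet hyperparameters (illustrative)
N = [50] * D                 # number of words in each document (illustrative)

theta = rng.dirichlet(np.full(K, alpha), size=D)   # theta[d]: topic distribution of document d
phi = rng.dirichlet(np.full(W, beta), size=K)      # phi[k]: word distribution of topic k

# For every word slot, first a topic assignment, then a vocabulary choice.
z = [rng.choice(K, size=N[d], p=theta[d]) for d in range(D)]
x = [np.array([rng.choice(W, p=phi[k]) for k in z[d]]) for d in range(D)]
```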

Assuming that the only dependencies are those given above, the joint probability of the latent and observed variables along with the parameters is given by

$$ \prob(Z, X, \Theta, \Phi) = \frac{1}{\BetaFunc_K^D(\alpha)} \prod_{d=1}^D \prod_{k=1}^K \theta_{dk}^{A_{dk} + \alpha - 1} \cdot \frac{1}{\BetaFunc_W^K(\beta)} \prod_{k=1}^K \prod_{w=1}^W \phi_{kw}^{B_{kw}+ \beta - 1}. $$

Probabilities required for Gibbs sampling

The required conditional probabilities for this model are

$$ \begin{align} \prob(z_{id} \cond Z^{\SetDiff id}, X, \Theta, \Phi) &= \frac{\theta_{d,z_{id}} \phi_{z_{id}, x_{id}}}{\sum_{k=1}^K \theta_{dk} \phi_{k,x_{id}}} \\ \theta_d \cond Z, X, \Theta^{\SetDiff d}, \Phi &\sim \ddirichlet_K(A_d + \alpha) \\ \phi_k \cond Z, X, \Theta, \Phi^{\SetDiff k} &\sim \ddirichlet_W(B_k + \beta) \end{align} $$

where $A_d = (A_{d1}, \dotsc, A_{dK})$ and $B_k = (B_{k1}, \dotsc, B_{kW})$.
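These three conditionals give an uncollapsed Gibbs sampler directly: resample every $z_{id}$ from the first, then every $\theta_d$ and $\phi_k$ from the two Dirichlet conditionals. A sketch of one sweep, using the same ragged `x`, `z` representation as in the earlier sketches; illustrative, not optimised:

```python
import numpy as np


def gibbs_sweep(x, z, theta, phi, alpha, beta, rng):
    """One full sweep: resample every z_id, then every theta_d and every phi_k."""
    D, K = theta.shape
    W = phi.shape[1]
    # Resample each topic assignment: p(z_id = k | ...) is proportional to theta[d, k] * phi[k, x_id].
    for d in range(D):
        for i, x_id in enumerate(x[d]):
            p = theta[d] * phi[:, x_id]
            z[d][i] = rng.choice(K, p=p / p.sum())
    # Recount A and B, then resample theta_d and phi_k from their Dirichlet conditionals.
    A = np.zeros((D, K))
    B = np.zeros((K, W))
    for d in range(D):
        for x_id, z_id in zip(x[d], z[d]):
            A[d, z_id] += 1
            B[z_id, x_id] += 1
    for d in range(D):
        theta[d] = rng.dirichlet(A[d] + alpha)
    for k in range(K):
        phi[k] = rng.dirichlet(B[k] + beta)
    return z, theta, phi
```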

Probabilities required for collapsed Gibbs sampling

We need to integrate out the $\theta_d$'s and the $\phi_k$'s from the model's joint probability. This gives us

$$ \prob(Z, X) = \prod_{d=1}^D \frac{\BetaFunc_K(A_d + \alpha)}{\BetaFunc_K(\alpha)} \cdot \prod_{k=1}^K \frac{\BetaFunc_W(B_k + \beta)}{\BetaFunc_W(\beta)} $$
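Because this joint factorises over documents and topics, $\log \prob(Z, X)$ is a sum of log beta-function ratios, which makes it a convenient quantity to monitor while the sampler runs. A sketch, assuming the $A$ and $B$ count matrices from the Statistics section; the function names are mine:

```python
import numpy as np
from scipy.special import gammaln


def log_beta(a):
    """Log of the multivariate beta function."""
    return np.sum(gammaln(a)) - gammaln(np.sum(a))


def log_joint(A, B, alpha, beta):
    """log p(Z, X) for the collapsed model."""
    D, K = A.shape
    _, W = B.shape
    lp = 0.0
    for d in range(D):
        lp += log_beta(A[d] + alpha) - log_beta(np.full(K, alpha))
    for k in range(K):
        lp += log_beta(B[k] + beta) - log_beta(np.full(W, beta))
    return lp
```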

Then the conditional probability from which we need to sample is

$$ \prob(z_{id} \cond Z^{\SetDiff id}, X) \propto \frac{ \left( A_{d,z_{id}}^{\SetDiff id} + \alpha \right) \left( B_{z_{id},x_{id}}^{\SetDiff id} + \beta \right) }{ M_{z_{id}}^{\SetDiff id} + W\beta } $$

Since this is a discrete distribution over $k = 1, \dotsc, K$, it is easy to normalise as required, either explicitly or implicitly.
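Operationally, the collapsed update for a single $z_{id}$ is: remove its contribution from the counts (giving the $\SetDiff id$ statistics), compute the $K$ unnormalised weights above, sample a new topic, and add the counts back. A sketch of one full sweep, assuming the `A`, `B`, `M` count matrices defined in the Statistics section; the names and representation are illustrative:

```python
import numpy as np


def collapsed_gibbs_sweep(x, z, A, B, M, alpha, beta, rng):
    """One sweep of collapsed Gibbs: resample every z_id given all other assignments."""
    D, K = A.shape
    W = B.shape[1]
    for d in range(D):
        for i, w in enumerate(x[d]):
            k_old = z[d][i]
            # Remove z_id from the statistics to obtain the "\ id" counts.
            A[d, k_old] -= 1
            B[k_old, w] -= 1
            M[k_old] -= 1
            # Unnormalised p(z_id = k | Z^{\ id}, X) for all k at once.
            p = (A[d] + alpha) * (B[:, w] + beta) / (M + W * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # Put z_id back with its newly sampled value.
            z[d][i] = k_new
            A[d, k_new] += 1
            B[k_new, w] += 1
            M[k_new] += 1
    return z, A, B, M
```

Maintaining the counts incrementally in this way keeps each update at $O(K)$ work per token, rather than recomputing the statistics from scratch after every assignment.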