Collapsed Gibbs sampler for Generalized Relational Topic Models with Data Augmentation

$ \DeclareMathOperator{\dir}{Dirichlet} \DeclareMathOperator{\dis}{Discrete} \DeclareMathOperator{\normal}{Normal} \DeclareMathOperator{\ber}{Bernoulli} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\Betaf}{B} \DeclareMathOperator{\Gammaf}{\Gamma} \DeclareMathOperator{\PG}{PG} \DeclareMathOperator{\v}{vec} \newcommand{\norm}[1]{\left\| #1 \right\|} \newcommand{\cp}[2]{p \left( #1 \middle| #2 \right)} \newcommand{\cN}[2]{\mathscr{N} \left( #1 \middle| #2 \right)} \newcommand{\cpsi}[2]{\psi \left( #1 \middle| #2 \right)} \newcommand{\cPsi}[2]{\Psi \left( #1 \middle| #2 \right)} \newcommand{\etd}[1]{\mathbf{z}^{(#1)}} \newcommand{\etdT}[1]{\left. \mathbf{z}^{(#1)} \right.^T} \newcommand{\Etd}[2]{\mathbf{z}^{(#1, #2)}} \newcommand{\sumetd}{\mathbf{z}} \newcommand{\one}{\mathbf{1}} \newcommand{\Eta}{H} \newcommand{\eHe}{\etdT{d} \Eta \etd{d'}} $

Here is the collapsed Gibbs sampler for Chen et al.'s generalized relational topic models with data augmentation. I am building on the collapsed Gibbs sampler I wrote for binary logistic supervised latent Dirichlet allocation.

The generative model for RTMs is as follows:

$$\begin{align} \theta^{(d)} &\sim \dir(\alpha) &\text{(topic distribution for document $d \in \{1, \ldots, D\}$)} \\ \phi^{(k)} &\sim \dir(\beta) &\text{(term distribution for topic $k \in \{1, \ldots, K\}$)} \\ z_n^{(d)} \mid \theta^{(d)} &\sim \dis \left( \theta^{(d)} \right) &\text{(topic of $n$th token of document $d$, $n \in \{1, \ldots, N^{(d)}\}$)} \\ w_n^{(d)} \mid \phi^{(z_n^{(d)})} &\sim \dis \left( \phi^{(z_n^{(d)})} \right) &\text{(term of $n$th token of document $d$, $n \in \{1, \ldots, N^{(d)}\}$)} \\ \Eta_{k, k'} &\sim \normal \left( \mu, \nu^2 \right) &\text{(regression coefficients for topic pairs $k, k' \in \{1, \ldots, K\}$)} \\ y^{(d, d')} \mid \Eta, \etd{d}, \etd{d'} &\sim \ber \left( \frac{ \exp \left( \eHe \right) }{ 1 + \exp \left( \eHe \right) } \right) &\text{(link indicator for documents $d, d' \in \{1, \ldots, D\}$)} \end{align}$$

where each token can be any one of $V$ terms in our vocabulary and $\etd{d}$ is the empirical topic distribution of document $d$ (the $K$-vector whose $k$th entry is the fraction of tokens in document $d$ assigned to topic $k$).

[Figure: plate notation for relational topic models. In the diagram, $\beta_k$ should be read as $\phi^{(k)}$, with each $\phi^{(k)}$ depending on a single hyperparameter $\beta$.]

Following Chen et al. 2013, the regularized pseudo-likelihood for the link variable $y^{(d, d')}$, with regularization parameter $b \ge 0$, can be written

$$\begin{align} \cpsi{y^{(d, d')}}{\Eta, \etd{d}, \etd{d'}, b} &= \cp{y^{(d, d')}}{\Eta, \etd{d}, \etd{d'}}^b \\ &= \left( \frac{\exp \left( \eHe \right)^{y^{(d, d')}}}{ 1 + \exp \left( \eHe \right)} \right)^b \\ &= \frac{\exp \left( b y^{(d, d')} \eHe \right)} { \left( \exp \left( -\frac{\eHe}{2} \right) + \exp \left( \frac{\eHe}{2} \right) \right)^b \exp \left( \frac{b}{2} \eHe \right) } \\ &= 2^{-b} \exp \left( b \left( y^{(d, d')} - \frac{1}{2} \right) \left( \eHe \right) \right) \cosh \left( \frac{ \eHe }{2} \right)^{-b} \\ &= 2^{-b} \exp \left( b \left( y^{(d, d')} - \frac{1}{2} \right) \left( \eHe \right) \right) \int_0^\infty \exp \left( -\frac{ \left( \eHe \right)^2 }{2} \omega^{(d, d')} \right) \cp{\omega^{(d, d')}}{b, 0} d\omega^{(d, d')} \end{align}$$

where $\omega^{(d, d')}$ is a Polya-Gamma distributed variable with parameters $b = b$ and $c = 0$ (see Polson et al. 2012 for details). This means that, for each pair of documents $d$ and $d'$, the pseudo-likelihood of $y^{(d, d')}$ is a scale mixture of Gaussians in $\eHe$, with mixing distribution $\PG(b, 0)$. Therefore, the joint pseudo-likelihood of $y^{(d, d')}$ and $\omega^{(d, d')}$ can be written

$$\cPsi{y^{(d, d')}, \omega^{(d, d')}}{\Eta, \etd{d}, \etd{d'}, b} = 2^{-b} \exp \left( \kappa^{(d, d')} \zeta^{(d, d')} - \frac{ \omega^{(d, d')} }{2} (\zeta^{(d, d')})^2 \right) \cp{\omega^{(d, d')}}{b, 0}$$

where $\kappa^{(d, d')} = b(y^{(d, d')} - 1/2)$ and $\zeta^{(d, d')} = \eHe$. The joint probability distribution can therefore be factored as follows:

$$\begin{align} \cp{\theta, \phi, z, w, \Eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b} &= \prod_{k=1}^{K} \cp{\phi^{(k)}}{\beta} \prod_{d=1}^{D} \cp{\theta^{(d)}}{\alpha} \prod_{n=1}^{N^{(d)}} \cp{z_n^{(d)}}{\theta^{(d)}} \cp{w_n^{(d)}}{\phi^{(z_n^{(d)})}} \\ & \quad \times \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cp{\Eta_{k_1, k_2}}{\mu, \nu^2} \prod_{d_1=1}^D \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^D \cPsi{y^{(d_1, d_2)}, \omega^{(d_1, d_2)}}{\Eta, \etd{d_1}, \etd{d_2}, b} \\ &= \prod_{k=1}^{K} \frac{\Betaf(b^{(k)} + \beta)}{\Betaf(\beta)} \cp{\phi^{(k)}}{b^{(k)} + \beta} \prod_{d=1}^{D} \frac{\Betaf(a^{(d)} + \alpha)}{\Betaf(\alpha)} \cp{\theta^{(d)}}{a^{(d)} + \alpha} \\ &\quad \times \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cN{\Eta_{k_1, k_2}}{\mu, \nu^2} \prod_{d_1=1}^D \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^D 2^{-b} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \cp{\omega^{(d_1, d_2)}}{b, 0} \end{align}$$

where $a_k^{(d)}$ is the number of tokens in document $d$ assigned to topic $k$, $b_v^{(k)}$ is the number of tokens equal to term $v$ and assigned to topic $k$, and $\Betaf$ is the multivariate Beta function. Marginalizing out $\theta$ and $\phi$ by integrating with respect to each $\theta^{(d)}$ and $\phi^{(k)}$ over their respective sample spaces yields

$$\begin{align} \cp{z, w, \Eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b} &= \prod_{k=1}^{K} \frac{\Betaf(b^{(k)} + \beta)}{\Betaf(\beta)} \prod_{d=1}^{D} \frac{\Betaf(a^{(d)} + \alpha)}{\Betaf(\alpha)} \\ &\quad\quad \times \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cN{\Eta_{k_1, k_2}}{\mu, \nu^2} \prod_{d_1=1}^D \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^D 2^{-b} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \cp{\omega^{(d_1, d_2)}}{b, 0} \\ &= \cp{w}{z, \beta} \cp{z}{\alpha} \cp{\Eta}{\mu, \nu^2} \cPsi{y, \omega}{\Eta, z, b}. \end{align}$$

See my LDA notebook for step-by-step details of the previous two calculations.

Our goal is to calculate the posterior distribution

$$\cp{z, \Eta, \omega}{w, y, \alpha, \beta, \mu, \nu^2, b} = \frac{\cp{z, w, \Eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b}} {\sum_{z'} \iint \cp{z', w, \Eta', y, \omega{'}}{\alpha, \beta, \mu, \nu^2, b} d\Eta' d\omega{'}}$$

in order to infer the topic assignments $z$ and regression coefficients $\Eta$ from the given term assignments $w$ and link data $y$. Since calculating this directly is infeasible, we resort to collapsed Gibbs sampling. The sampler is "collapsed" because we marginalized out $\theta$ and $\phi$, and will estimate them from the topic assignments $z$:

$$\hat\theta_k^{(d)} = \frac{a_k^{(d)} + \alpha_k}{\sum_{k'=1}^K \left(a_{k'}^{(d)} + \alpha_{k'} \right)},\quad \hat\phi_v^{(k)} = \frac{b_v^{(k)} + \beta_v}{\sum_{v'=1}^V \left(b_{v'}^{(k)} + \beta_{v'} \right)}.$$
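
For concreteness, here is a minimal sketch of these two estimates, assuming dense count arrays $a$ (shape $D \times K$, entries $a_k^{(d)}$) and $b$ (shape $K \times V$, entries $b_v^{(k)}$); these names are my own, not those of the GRTM implementation used below.

In [ ]:
import numpy as np

def estimate_theta_phi(a, b, alpha, beta):
    """Posterior point estimates of theta (D x K) and phi (K x V)
    from the topic-assignment counts a and b."""
    theta_hat = (a + alpha) / (a + alpha).sum(axis=1, keepdims=True)
    phi_hat = (b + beta) / (b + beta).sum(axis=1, keepdims=True)
    return theta_hat, phi_hat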

Gibbs sampling requires us to compute the full conditionals for each $z_n^{(d)}$, $\omega^{(d, d')}$ and $\Eta_{k, k'}$. For example, we need to calculate, for all $n$, $d$ and $k$,

$$\begin{align} \cp{z_n^{(d)} = k}{z \setminus z_n^{(d)}, w, H, y, \omega, \alpha, \beta, \mu, \nu^2, b} &\propto \cp{z_n^{(d)} = k, z \setminus z_n^{(d)}, w, H, y, \omega}{\alpha, \beta, \mu, \nu^2, b} \\ &\propto \frac{b_{w_n^{(d)}}^{(k)} \setminus z_n^{(d)} + \beta_{w_n^{(d)}}}{ \sum_{v=1}^V \left( b_v^{(k)} \setminus z_n^{(d)} + \beta_v\right)} \left( a_k^{(d)} \setminus z_n^{(d)} + \alpha_k \right) \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \\ &\propto \frac{b_{w_n^{(d)}}^{(k)} \setminus z_n^{(d)} + \beta_{w_n^{(d)}}}{ \sum_{v=1}^V \left( b_v^{(k)} \setminus z_n^{(d)} + \beta_v\right)} \left( a_k^{(d)} \setminus z_n^{(d)} + \alpha_k \right) \\ &\quad\quad\times \exp \left( \sum_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \left[ \left( \kappa^{(d_1, d)} - \omega^{(d_1, d)} ( \zeta^{(d_1, d)} \setminus z_n^{(d)}) \right) \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} - \frac{ \omega^{(d_1, d)} }{2} \left( \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} \right)^2 \right] \right. \\ &\quad\quad\quad\quad + \left. \sum_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \left[ \left( \kappa^{(d, d_2)} - \omega^{(d, d_2)} ( \zeta^{(d, d_2)} \setminus z_n^{(d)}) \right) \frac{H_{k, :} \etd{d_2}}{N^{(d)}} - \frac{ \omega^{(d, d_2)} }{2} \left( \frac{H_{k, :} \etd{d_2}}{N^{(d)}} \right)^2 \right] \right) \end{align}$$

where the "set-minus" notation $\cdot \setminus z_n^{(d)}$ denotes the variable the notation is applied to with the entry $z_n^{(d)}$ removed (again, see my LDA notebook for details). This final proportionality is true since

$$\begin{align} \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) &= \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \left( \zeta^{(d_1, d_2)} \setminus z_n^{(d)} + \Delta_{d, d_1, d_2}^{(k)} \right) - \frac{ \omega^{(d_1, d_2)} }{2} \left( \zeta^{(d_1, d_2)} \setminus z_n^{(d)} + \Delta_{d, d_1, d_2}^{(k)} \right)^2 \right) \\ &\propto \prod_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \exp \left( \kappa^{(d_1, d)} \left( \zeta^{(d_1, d)} \setminus z_n^{(d)} + \Delta_{d, d_1, d}^{(k)} \right) - \frac{ \omega^{(d_1, d)} }{2} \left( \zeta^{(d_1, d)} \setminus z_n^{(d)} + \Delta_{d, d_1, d}^{(k)} \right)^2 \right) \\ &\quad\quad\times \prod_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \exp \left( \kappa^{(d, d_2)} \left( \zeta^{(d, d_2)} \setminus z_n^{(d)} + \Delta_{d, d, d_2}^{(k)} \right) - \frac{ \omega^{(d, d_2)} }{2} \left( \zeta^{(d, d_2)} \setminus z_n^{(d)} + \Delta_{d, d, d_2}^{(k)} \right)^2 \right) \\ &\propto \exp \left( \sum_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \left[ \left( \kappa^{(d_1, d)} - \omega^{(d_1, d)} ( \zeta^{(d_1, d)} \setminus z_n^{(d)}) \right) \frac{H_{:, k}^T (\etd{d_1} \setminus z_n^{(d)})}{N^{(d)}} - \frac{ \omega^{(d_1, d)} }{2} \left( \frac{H_{:, k}^T (\etd{d_1} \setminus z_n^{(d)})}{N^{(d)}} \right)^2 \right] \right. \\ &\quad\quad + \left. \sum_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \left[ \left( \kappa^{(d, d_2)} - \omega^{(d, d_2)} ( \zeta^{(d, d_2)} \setminus z_n^{(d)}) \right) \frac{H_{k, :} (\etd{d_2} \setminus z_n^{(d)})}{N^{(d)}} - \frac{ \omega^{(d, d_2)} }{2} \left( \frac{H_{k, :} (\etd{d_2} \setminus z_n^{(d)})}{N^{(d)}} \right)^2 \right] \right) \\ &= \exp \left( \sum_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \left[ \left( \kappa^{(d_1, d)} - \omega^{(d_1, d)} ( \zeta^{(d_1, d)} \setminus z_n^{(d)}) \right) \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} - \frac{ \omega^{(d_1, d)} }{2} \left( \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} \right)^2 \right] \right. \\ &\quad\quad + \left. \sum_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \left[ \left( \kappa^{(d, d_2)} - \omega^{(d, d_2)} ( \zeta^{(d, d_2)} \setminus z_n^{(d)}) \right) \frac{H_{k, :} \etd{d_2}}{N^{(d)}} - \frac{ \omega^{(d, d_2)} }{2} \left( \frac{H_{k, :} \etd{d_2}}{N^{(d)}} \right)^2 \right] \right) \end{align}$$

where

$$\Delta_{d, d_1, d_2}^{(k)} = \delta_{d, d_1} \frac{H_{k, :} (\etd{d_2} \setminus z_n^{(d)})}{N^{(d)}} + \delta_{d, d_2} \frac{H_{:, k}^T (\etd{d_1} \setminus z_n^{(d)})}{N^{(d)}},$$

$\delta_{d, d'}$ is the Kronecker delta, and $H_{k, :}$ and $H_{:, k}$ are the $k$th row and column of $H$, respectively. The first proportionality is a result of the fact that $\Delta_{d, d_1, d_2}^{(k)}$ is nonzero only when $d = d_1$ or $d = d_2$. The last equality follows from the fact that $d \neq d_1$ in the first summation and $d \neq d_2$ in the second.
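
To make the sampling step concrete, here is a minimal single-token sketch of this full conditional. It is not the GRTM implementation used below, and the array layout is my own assumption: $a$ is the $D \times K$ matrix of document-topic counts, $b$ the $K \times V$ matrix of topic-term counts, and kappa, omega and zeta_minus are dense $D \times D$ arrays holding $\kappa^{(d_1, d_2)}$, $\omega^{(d_1, d_2)}$ and $\zeta^{(d_1, d_2)} \setminus z_n^{(d)}$ for the ordered document pairs (diagonals unused).

In [ ]:
import numpy as np

def resample_token(d, n, z, w, a, b, H, kappa, omega, zeta_minus,
                   alpha, beta, N, rng):
    """One collapsed Gibbs update of z[d][n] from its full conditional."""
    D, K = a.shape
    old_k, v = z[d][n], w[d][n]

    # remove the token's current assignment from the counts
    a[d, old_k] -= 1
    b[old_k, v] -= 1

    # LDA part of the full conditional (in log space)
    log_p = np.log(a[d] + alpha) + np.log(b[:, v] + beta[v]) \
        - np.log((b + beta).sum(axis=1))

    # link part: empirical topic distributions of the other documents
    zbar = a / a.sum(axis=1, keepdims=True)                      # (D, K)
    for d1 in range(D):
        if d1 == d:
            continue
        # links into d: assigning topic k adds H[:, k] . zbar[d1] / N[d] to zeta
        s_in = H.T.dot(zbar[d1]) / N[d]                          # (K,)
        log_p += (kappa[d1, d] - omega[d1, d] * zeta_minus[d1, d]) * s_in \
            - 0.5 * omega[d1, d] * s_in ** 2
        # links out of d: assigning topic k adds H[k, :] . zbar[d1] / N[d] to zeta
        s_out = H.dot(zbar[d1]) / N[d]                           # (K,)
        log_p += (kappa[d, d1] - omega[d, d1] * zeta_minus[d, d1]) * s_out \
            - 0.5 * omega[d, d1] * s_out ** 2

    p = np.exp(log_p - log_p.max())
    new_k = rng.choice(K, p=p / p.sum())

    # add the token back under its new assignment
    a[d, new_k] += 1
    b[new_k, v] += 1
    z[d][n] = new_k
    # (a full sampler would now also update zeta_minus for the pairs involving d)
    return new_k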

In order to calculate the full conditional for $H$, let $\eta = (H_{:,1}^T \cdots H_{:, K}^T)^T$ be the vector of concatenated columns of $H$, let $Z$ be the matrix whose columns are the vectors $\etd{d, d'} = \etd{d'} \otimes \etd{d}$ for all ordered pairs of documents $d \neq d'$, where $\otimes$ is the Kronecker product, let $\Omega$ be the diagonal matrix whose diagonal entries are the corresponding $\omega^{(d, d')}$ and $\kappa$ the vector whose entries are the corresponding $\kappa^{(d, d')}$ (both in the same pair order as the columns of $Z$), let $I$ be the identity matrix, and let $\one$ be the vector of ones, and note that

$$\prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cN{H_{k_1, k_2}}{\mu, \nu^2} = \cN{\eta}{\mu \one, \nu^2 I}$$

and

$$\prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) = \exp \left( \eta^T Z \kappa - \frac{1}{2} \eta^T Z \Omega Z^T \eta \right) \propto \cN{\eta}{(Z \Omega Z^T)^{-1} Z \kappa, (Z \Omega Z^T)^{-1}}.$$

Therefore

$$\begin{align} \cp{\eta}{z, w, y, \omega, \alpha, \beta, \mu, \nu^2, b} &\propto \cp{z, w, \eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b} \\ &\propto \cN{\eta}{\mu \one, \nu^2 I} \cN{\eta}{(Z \Omega Z^T)^{-1} Z \kappa, (Z \Omega Z^T)^{-1}} \\ &\propto \cN{\eta}{\Sigma \left( \frac{\mu}{\nu^2} \one + Z \kappa \right), \Sigma} \end{align}$$

where $\Sigma^{-1} = \nu^{-2} I + Z \Omega Z^T$ (see Section 8.1.8 of the Matrix Cookbook).
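
Here is a minimal sketch of this draw, assuming $Z$ is stored as a $K^2 \times P$ array over the $P$ ordered document pairs, with omega and kappa as length-$P$ vectors in the same pair order (these names and this layout are my own, not the GRTM implementation's). It samples the Gaussian full conditional via a Cholesky factorization of the posterior precision.

In [ ]:
import numpy as np

def sample_eta(Z, omega, kappa, mu, nu2, rng):
    """Draw eta from N(Sigma (mu / nu2 * 1 + Z kappa), Sigma),
    where Sigma^{-1} = nu^{-2} I + Z diag(omega) Z^T."""
    K2 = Z.shape[0]
    prec = np.eye(K2) / nu2 + (Z * omega).dot(Z.T)   # posterior precision Sigma^{-1}
    L = np.linalg.cholesky(prec)                     # prec = L L^T
    mean = np.linalg.solve(prec, mu / nu2 * np.ones(K2) + Z.dot(kappa))
    eps = rng.standard_normal(K2)
    return mean + np.linalg.solve(L.T, eps)          # L^{-T} eps has covariance Sigma

Since $\eta$ stacks the columns of $H$, the matrix can be recovered afterwards by reshaping the draw into a $K \times K$ matrix in column-major order.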

We also need the full conditional for $\omega$:

$$\begin{align} \cp{\omega}{z, w, H, y, \alpha, \beta, \mu, \nu^2, b} &\propto \cp{z, w, H, y, \omega}{\alpha, \beta, \mu, \nu^2, b} \\ &\propto \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \cp{\omega^{(d_1, d_2)}}{b, 0} \\ &= \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \cp{\omega^{(d_1, d_2)}}{b, \zeta^{(d_1, d_2)}} \end{align}$$

that is, $\omega^{(d_1, d_2)} \sim \PG \left( b, \zeta^{(d_1, d_2)} \right)$ for each pair of documents $d_1$ and $d_2$. We sample from the Polya-Gamma distribution according to the method of Polson et al. 2012, implemented for Python 3 in this code repo.
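
For illustration only (the linked repository implements the exact sampler), here is a minimal approximate $\PG(b, c)$ draw based on the truncated sum-of-Gammas representation from Polson et al.; the function name and truncation level are my own choices.

In [ ]:
import numpy as np

def pg_draw_truncated(b, c, rng, n_terms=200):
    """Approximate PG(b, c) draw via a truncated version of the representation
    omega = 1 / (2 pi^2) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    where the g_k are independent Gamma(b, 1) variables."""
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=n_terms)
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)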

Graphical test


In [22]:
%matplotlib inline

from modules.helpers import plot_images
from functools import partial
from sklearn.metrics import (roc_auc_score, roc_curve)
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

imshow = partial(plt.imshow, cmap='gray', interpolation='nearest', aspect='auto')
sns.set(style='white')

Generate topics

We assume a vocabulary of 25 terms and create ten "topics": arranging the vocabulary in a 5×5 grid, each topic places probability 0.2 on the 5 terms in a single row or column of the grid.


In [48]:
V = 25
K = 10
N = 100
D = 1000

In [49]:
topics = []
topic_base = np.concatenate((np.ones((1, 5)) * 0.2, np.zeros((4, 5))), axis=0).ravel()
for i in range(5):
    topics.append(np.roll(topic_base, i * 5))
topic_base = np.concatenate((np.ones((5, 1)) * 0.2, np.zeros((5, 4))), axis=1).ravel()
for i in range(5):
    topics.append(np.roll(topic_base, i))
topics = np.array(topics)
plt.figure(figsize=(10, 5))
plot_images(plt, topics, (5, 5), layout=(2, 5), figsize=(10, 5))


<matplotlib.figure.Figure at 0x1174a3e80>

Generate documents from topics

We generate 1,000 documents from these 10 topics by sampling a topic distribution for each document from a Dirichlet distribution with parameter $\alpha = (1, \ldots, 1)$, then drawing $N = 100$ tokens per document from the corresponding mixture of topics.


In [50]:
alpha = np.ones(K)
np.random.seed(42)
thetas = np.random.dirichlet(alpha, size=D)
topic_assignments = np.array([np.random.choice(range(K), size=100, p=theta)
                              for theta in thetas])
word_assignments = np.array([[np.random.choice(range(V), size=1, p=topics[topic_assignments[d, n]])[0]
                              for n in range(N)] for d in range(D)])
doc_term_matrix = np.array([np.histogram(word_assignments[d], bins=V, range=(0, V - 1))[0] for d in range(D)])
imshow(doc_term_matrix)


Out[50]:
<matplotlib.image.AxesImage at 0x11c2357b8>

Generate document network

Create a document network from the documents by computing $\zeta^{(d, d')} = {\theta^{(d)}}^T H \theta^{(d')}$ for every ordered pair of documents and linking the pair whenever $\zeta^{(d, d')} \ge 0$ (equivalently, whenever the link probability is at least $1/2$).


In [51]:
from itertools import product
from sklearn.cross_validation import StratifiedKFold

In [52]:
# choose parameter values
mu = 0.
nu2 = 1.
np.random.seed(14)
H = np.random.normal(loc=mu, scale=np.sqrt(nu2), size=(K, K))  # scale is a standard deviation, so take the square root of the variance nu2
zeta = pd.DataFrame([(i, j, np.dot(np.dot(thetas[i], H), thetas[j])) for i, j in product(range(D), repeat=2)],
                    columns=('tail', 'head', 'zeta'))
_ = zeta.zeta.hist(bins=50)



In [53]:
# threshold zeta at 0 to generate binary link indicators y
zeta['y'] = (zeta.zeta >= 0).astype(int)
# plot histogram of responses
print('positive examples {} ({:.1f}%)'.format(zeta.y.sum(), zeta.y.sum() / D / D * 100))
_ = zeta.y.hist()


positive examples 151324 (15.1%)

In [54]:
y = zeta[['tail', 'head', 'y']].values

In [55]:
skf = StratifiedKFold(y[:, 2], n_folds=100)
_, train_idx = next(iter(skf))
train_idx.shape


Out[55]:
(10001,)

Estimate parameters


In [56]:
from slda.topic_models import GRTM

In [58]:
_K = 10
_alpha = alpha[:_K]
_beta = np.repeat(0.01, V)
_mu = mu
_nu2 = nu2
_b = 1.
n_iter = 500
grtm = GRTM(_K, _alpha, _beta, _mu, _nu2, _b, n_iter, seed=42)

In [59]:
%%time
grtm.fit(doc_term_matrix, y[train_idx])


CPU times: user 9min 9s, sys: 917 ms, total: 9min 10s
Wall time: 9min 11s

In [60]:
plot_images(plt, grtm.phi, (5, 5), (2, 5), figsize=(10, 5))



In [ ]:
topic_order = [4, 7, 3, 1, 0, 9, 5, 2, 8, 6]  # a permutation of all ten learned topics
plot_images(plt, grtm.phi[topic_order], (5, 5), (2, 5), figsize=(10, 5))

In [24]:
burnin = -1
mean_final_lL = grtm.loglikelihoods[burnin:].mean()
print(mean_final_lL)
plt.plot(grtm.loglikelihoods, label='mean final LL {:.2f}'.format(mean_final_lL))
_ = plt.legend()


-9814.52603434

In [25]:
imshow(grtm.theta)


Out[25]:
<matplotlib.image.AxesImage at 0x1184e6710>

In [26]:
H_pred = grtm.H[burnin:].mean(axis=0)
_ = plt.hist(H_pred.ravel(), bins=20)



In [27]:
_ = plt.hist(H.ravel(), bins=20)


Predict edges on pairs of test documents

Create 1,000 test documents using the same generative process as our training documents.


In [28]:
np.random.seed(42^2)  # note: ^ is bitwise XOR in Python, so this seeds with 40
thetas_test = np.random.dirichlet(alpha, size=D)
topic_assignments_test = np.array([np.random.choice(range(K), size=100, p=theta)
                                   for theta in thetas_test])
word_assignments_test = np.array([[np.random.choice(range(V), size=1, p=topics[topic_assignments_test[d, n]])[0]
                                   for n in range(N)] for d in range(D)])
doc_term_matrix_test = np.array([np.histogram(word_assignments_test[d], bins=V, range=(0, V - 1))[0] for d in range(D)])
imshow(doc_term_matrix_test)


Out[28]:
<matplotlib.image.AxesImage at 0x118793cc0>

Learn their topic distributions using the model trained on the training documents, then calculate the actual and predicted link probabilities (the Bernoulli parameters above). For the predicted probabilities, estimate $\eta$ as the mean of our samples of $\eta$ after burn-in.


In [29]:
def bern_param(theta1, theta2, H):
    zeta = np.dot(np.dot(theta1, H), theta2)
    return np.exp(zeta) / (1 + np.exp(zeta))

In [30]:
thetas_test_grtm = grtm.transform(doc_term_matrix_test)
p_test = np.zeros(D * D)
p_test_grtm = np.zeros(D * D)
for n, i in enumerate(product(range(D), range(D))):
    p_test[n] = bern_param(thetas_test[i[0]], thetas_test[i[1]], H)
    p_test_grtm[n] = bern_param(thetas_test_grtm[i[0]], thetas_test_grtm[i[1]], H_pred)

Measure the goodness of our prediction by the area under the associated ROC curve.


In [31]:
y_test = (p_test > 0.5).astype(int)
y_grtm = p_test_grtm
fpr, tpr, _ = roc_curve(y_test, y_grtm)
plt.plot(fpr, tpr, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_grtm))))
_ = plt.legend(loc='best')


Learn topics, then learn classifier


In [32]:
from slda.topic_models import LDA

In [33]:
lda = LDA(_K, _alpha, _beta, n_iter, seed=42)

In [34]:
%%time
lda.fit(doc_term_matrix)


CPU times: user 2.24 s, sys: 9.37 ms, total: 2.24 s
Wall time: 2.24 s

In [35]:
plot_images(plt, lda.phi, (5, 5), (2, 5), figsize=(10, 5))



In [36]:
plt.plot(lda.loglikelihoods)


Out[36]:
[<matplotlib.lines.Line2D at 0x11c8a7c50>]

In [37]:
imshow(lda.theta)


Out[37]:
<matplotlib.image.AxesImage at 0x10b7474a8>

In [38]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

Compute Kronecker products between the learned topic distributions for every ordered pair of training documents, and likewise for the test documents; these are the pairwise feature vectors the link classifiers are trained and evaluated on.


In [39]:
%%time
thetas_test_lda = lda.transform(doc_term_matrix_test)


CPU times: user 112 ms, sys: 1.34 ms, total: 114 ms
Wall time: 113 ms

In [40]:
lda_train = np.zeros((D * D, _K * _K))
lda_test = np.zeros((D * D, _K * _K))
for n, i in enumerate(product(range(D), range(D))):
    lda_train[n] = np.kron(lda.theta[i[0]], lda.theta[i[1]])
    lda_test[n] = np.kron(thetas_test_lda[i[0]], thetas_test_lda[i[1]])

Logistic regression

  • Train a logistic regression on the training data,
  • calculate the probability of an edge for each pair of test documents, and
  • measure the goodness of our prediction by computing the area under the ROC curve.

In [41]:
_C_grid = np.arange(1, 202, 10)
roc_auc_scores = []
for _C in _C_grid:
    print('Training Logistic Regression with C = {}'.format(_C))
    _lr = LogisticRegression(fit_intercept=False, C=_C)
    _lr.fit(lda_train, zeta.y)
    _y_lr = _lr.predict_proba(lda_test)[:, 1]
    roc_auc_scores.append(roc_auc_score(y_test, _y_lr))
    print('  roc_auc_score = {}'.format(roc_auc_scores[-1]))
print(_C_grid[np.argmax(roc_auc_scores)], np.max(roc_auc_scores))
plt.plot(_C_grid, roc_auc_scores)


Training Logistic Regression with C = 1
  roc_auc_score = 0.7838142410105514
Training Logistic Regression with C = 11
  roc_auc_score = 0.7838162876311603
Training Logistic Regression with C = 21
  roc_auc_score = 0.7838162539769836
Training Logistic Regression with C = 31
  roc_auc_score = 0.7838162561895421
Training Logistic Regression with C = 41
  roc_auc_score = 0.783816266545313
Training Logistic Regression with C = 51
  roc_auc_score = 0.7838162653891266
Training Logistic Regression with C = 61
  roc_auc_score = 0.78381627423936
Training Logistic Regression with C = 71
  roc_auc_score = 0.7838162677098173
Training Logistic Regression with C = 81
  roc_auc_score = 0.7838162687453946
Training Logistic Regression with C = 91
  roc_auc_score = 0.7838162616086463
Training Logistic Regression with C = 101
  roc_auc_score = 0.783816259533333
Training Logistic Regression with C = 111
  roc_auc_score = 0.7838162552745744
Training Logistic Regression with C = 121
  roc_auc_score = 0.7838162560398197
Training Logistic Regression with C = 131
  roc_auc_score = 0.7838162547089578
Training Logistic Regression with C = 141
  roc_auc_score = 0.7838162548378852
Training Logistic Regression with C = 151
  roc_auc_score = 0.7838162539728246
Training Logistic Regression with C = 161
  roc_auc_score = 0.7838162565264164
Training Logistic Regression with C = 171
  roc_auc_score = 0.7838162545218054
Training Logistic Regression with C = 181
  roc_auc_score = 0.7838162532865988
Training Logistic Regression with C = 191
  roc_auc_score = 0.7838162509825439
Training Logistic Regression with C = 201
  roc_auc_score = 0.7838162537066522
11 0.783816287631
Out[41]:
[<matplotlib.lines.Line2D at 0x11c35c9e8>]

In [42]:
lr = LogisticRegression(fit_intercept=False, C=11)
lr.fit(lda_train, zeta.y)
y_lr = lr.predict_proba(lda_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_lr)
plt.plot(fpr, tpr, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_lr))))
_ = plt.legend(loc='best')


Gradient boosted trees

  • Train gradient boosted trees on the training data,
  • calculate the probability of an edge for each pair of test documents, and
  • measure the goodness of our prediction by computing the area under the ROC curve.

In [43]:
_depth_grid = np.arange(1, 4)
roc_auc_scores = []
for _depth in _depth_grid:
    print('Training GradientBoostingClassifier with max_depth = {}'.format(_depth))
    _gbc = GradientBoostingClassifier(max_depth=_depth)
    _gbc.fit(lda_train, zeta.y)
    _y_gbc = _gbc.predict_proba(lda_test)[:, 1]
    roc_auc_scores.append(roc_auc_score(y_test, _y_gbc))
    print('  roc_auc_score = {}'.format(roc_auc_scores[-1]))
print(_depth_grid[np.argmax(roc_auc_scores)], np.max(roc_auc_scores))
plt.plot(_depth_grid, roc_auc_scores)


Training GradientBoostingClassifier with max_depth = 1
  roc_auc_score = 0.7738177479973433
Training GradientBoostingClassifier with max_depth = 2
  roc_auc_score = 0.7782572102416223
Training GradientBoostingClassifier with max_depth = 3
  roc_auc_score = 0.780067856917805
3 0.780067856918
Out[43]:
[<matplotlib.lines.Line2D at 0x11725f668>]

In [44]:
gbc = GradientBoostingClassifier(max_depth=3)
gbc.fit(lda_train, zeta.y)
y_gbc = gbc.predict_proba(lda_test)[:, 1]
fpr_gbc, tpr_gbc, _ = roc_curve(y_test, y_gbc)
plt.plot(fpr_gbc, tpr_gbc, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_gbc))))
plt.legend(loc='best')


Out[44]:
<matplotlib.legend.Legend at 0x11c394da0>

Use GRTM topics


In [45]:
grtm_train = np.zeros((D * D, _K * _K))
grtm_test = np.zeros((D * D, _K * _K))
for n, i in enumerate(product(range(D), range(D))):
    grtm_train[n] = np.kron(grtm.theta[i[0]], grtm.theta[i[1]])
    grtm_test[n] = np.kron(thetas_test_grtm[i[0]], thetas_test_grtm[i[1]])

In [46]:
_C_grid = np.arange(1, 52, 10)
roc_auc_scores = []
for _C in _C_grid:
    print('Training Logistic Regression with C = {}'.format(_C))
    _lr = LogisticRegression(fit_intercept=False, C=_C)
    _lr.fit(grtm_train, zeta.y)
    _y_lr = _lr.predict_proba(grtm_test)[:, 1]
    roc_auc_scores.append(roc_auc_score(y_test, _y_lr))
    print('  roc_auc_score = {}'.format(roc_auc_scores[-1]))
print(_C_grid[np.argmax(roc_auc_scores)], np.max(roc_auc_scores))
plt.plot(_C_grid, roc_auc_scores)


Training Logistic Regression with C = 1
  roc_auc_score = 0.8536008244937028
Training Logistic Regression with C = 11
  roc_auc_score = 0.853643659256871
Training Logistic Regression with C = 21
  roc_auc_score = 0.8536456248737231
Training Logistic Regression with C = 31
  roc_auc_score = 0.8536463316821341
Training Logistic Regression with C = 41
  roc_auc_score = 0.8536466818652372
Training Logistic Regression with C = 51
  roc_auc_score = 0.8536469080618956
51 0.853646908062
Out[46]:
[<matplotlib.lines.Line2D at 0x118738128>]

In [47]:
lr = LogisticRegression(fit_intercept=False, C=51)
lr.fit(grtm_train, zeta.y)
y_lr = lr.predict_proba(grtm_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_lr)
plt.plot(fpr, tpr, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_lr))))
_ = plt.legend(loc='best')


Conclusion

The GRTM's direct link predictions beat both the logistic regression (AUC ≈ 0.784) and the gradient boosted trees (AUC ≈ 0.780) trained on the LDA topics, but they are in turn beaten by a logistic regression trained on the GRTM topics (AUC ≈ 0.854), which just shows how much better the GRTM topics are!
