Collapsed Gibbs sampler for Generalized Relational Topic Models with Data Augmentation

$ \DeclareMathOperator{\dir}{Dirichlet} \DeclareMathOperator{\dis}{Discrete} \DeclareMathOperator{\normal}{Normal} \DeclareMathOperator{\ber}{Bernoulli} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\Betaf}{B} \DeclareMathOperator{\Gammaf}{\Gamma} \DeclareMathOperator{\PG}{PG} \DeclareMathOperator{\v}{vec} \newcommand{\norm}[1]{\left\| #1 \right\|} \newcommand{\cp}[2]{p \left( #1 \middle| #2 \right)} \newcommand{\cN}[2]{\mathscr{N} \left( #1 \middle| #2 \right)} \newcommand{\cpsi}[2]{\psi \left( #1 \middle| #2 \right)} \newcommand{\cPsi}[2]{\Psi \left( #1 \middle| #2 \right)} \newcommand{\etd}[1]{\mathbf{z}^{(#1)}} \newcommand{\etdT}[1]{\left. \mathbf{z}^{(#1)} \right.^T} \newcommand{\Etd}[2]{\mathbf{z}^{(#1, #2)}} \newcommand{\sumetd}{\mathbf{z}} \newcommand{\one}{\mathbf{1}} \newcommand{\Eta}{H} \newcommand{\eHe}{\etdT{d} \Eta \etd{d'}} $

Here is the collapsed Gibbs sampler for Chen et al.'s generalized relational topic models with data augmentation. I am building on the collapsed Gibbs sampler I wrote for binary logistic supervised latent Dirichlet allocation.

The generative model for RTMs is as follows:

$$\begin{align} \theta^{(d)} &\sim \dir(\alpha) &\text{(topic distribution for document $d \in \{1, \ldots, D\}$)} \\ \phi^{(k)} &\sim \dir(\beta) &\text{(term distribution for topic $k \in \{1, \ldots, K\}$)} \\ z_n^{(d)} \mid \theta^{(d)} &\sim \dis \left( \theta^{(d)} \right) &\text{(topic of $n$th token of document $d$, $n \in \{1, \ldots, N^{(d)}\}$)} \\ w_n^{(d)} \mid \phi^{(z_n^{(d)})} &\sim \dis \left( \phi^{(z_n^{(d)})} \right) &\text{(term of $n$th token of document $d$, $n \in \{1, \ldots, N^{(d)}\}$)} \\ \Eta_{k, k'} &\sim \normal \left( \mu, \nu^2 \right) &\text{(regression coefficients for topic pairs $k, k' \in \{1, \ldots, K\}$)} \\ y^{(d, d')} \mid \Eta, \etd{d}, \etd{d'} &\sim \ber \left( \frac{ \exp \left( \eHe \right) }{ 1 + \exp \left( \eHe \right) } \right) &\text{(link indicator for documents $d, d' \in \{1, \ldots, D\}$)} \end{align}$$

where each token can be any one of $V$ terms in our vocabulary and $\etd{d}$ is the empirical topic distribution of document $d$ (the $K$-vector whose $k$th entry is the fraction of tokens in document $d$ assigned to topic $k$).

[Figure: plate notation for relational topic models. In the diagram, $\beta_k$ should be read as $\phi^{(k)}$, with each $\phi^{(k)}$ depending on a single hyperparameter $\beta$.]

Following Chen et al. 2013, the regularized pseudo-likelihood for the link variable $y^{(d, d')}$, with regularization parameter $b \ge 0$, can be written

$$\begin{align} \cpsi{y^{(d, d')}}{\Eta, \etd{d}, \etd{d'}, b} &= \cp{y^{(d, d')}}{\Eta, \etd{d}, \etd{d'}}^b \\ &= \left( \frac{\exp \left( \eHe \right)^{y^{(d, d')}}}{ 1 + \exp \left( \eHe \right)} \right)^b \\ &= \frac{\exp \left( b y^{(d, d')} \eHe \right)} { \left( \exp \left( -\frac{\eHe}{2} \right) + \exp \left( \frac{\eHe}{2} \right) \right)^b \exp \left( \frac{b}{2} \eHe \right) } \\ &= 2^{-b} \exp \left( b \left( y^{(d, d')} - \frac{1}{2} \right) \left( \eHe \right) \right) \cosh \left( \frac{ \eHe }{2} \right)^{-b} \\ &= 2^{-b} \exp \left( b \left( y^{(d, d')} - \frac{1}{2} \right) \left( \eHe \right) \right) \int_0^\infty \exp \left( -\frac{ \left( \eHe \right)^2 }{2} \omega^{(d, d')} \right) \cp{\omega^{(d, d')}}{b, 0} d\omega^{(d, d')} \end{align}$$

where $\omega^{(d, d')}$ is a Polya-Gamma distributed variable with parameters $b = b$ and $c = 0$ (see Polson et al. 2012 for details). This means that, for each pair of documents $d$ and $d'$, the pseudo-likelihood of $y^{(d, d')}$ is a scale mixture of Gaussians in $\eHe$, with mixing distribution $\PG(b, 0)$. Therefore, the joint pseudo-likelihood of $y^{(d, d')}$ and $\omega^{(d, d')}$ can be written

$$\cPsi{y^{(d, d')}, \omega^{(d, d')}}{\Eta, \etd{d}, \etd{d'}, b} = 2^{-b} \exp \left( \kappa^{(d, d')} \zeta^{(d, d')} - \frac{ \omega^{(d, d')} }{2} (\zeta^{(d, d')})^2 \right) \cp{\omega^{(d, d')}}{b, 0}$$

where $\kappa^{(d, d')} = b(y^{(d, d')} - 1/2)$ and $\zeta^{(d, d')} = \eHe$. The joint probability distribution can therefore be factored as follows:

$$\begin{align} \cp{\theta, \phi, z, w, \Eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b} &= \prod_{k=1}^{K} \cp{\phi^{(k)}}{\beta} \prod_{d=1}^{D} \cp{\theta^{(d)}}{\alpha} \prod_{n=1}^{N^{(d)}} \cp{z_n^{(d)}}{\theta^{(d)}} \cp{w_n^{(d)}}{\phi^{(z_n^{(d)})}} \\ & \quad \times \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cp{\Eta_{k_1, k_2}}{\mu, \nu^2} \prod_{d_1=1}^D \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^D \cPsi{y^{(d_1, d_2)}, \omega^{(d_1, d_2)}}{\Eta, \etd{d_1}, \etd{d_2}, b} \\ &= \prod_{k=1}^{K} \frac{\Betaf(b^{(k)} + \beta)}{\Betaf(\beta)} \cp{\phi^{(k)}}{b^{(k)} + \beta} \prod_{d=1}^{D} \frac{\Betaf(a^{(d)} + \alpha)}{\Betaf(\alpha)} \cp{\theta^{(d)}}{a^{(d)} + \alpha} \\ &\quad \times \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cN{\Eta_{k_1, k_2}}{\mu, \nu^2} \prod_{d_1=1}^D \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^D 2^{-b} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \cp{\omega^{(d_1, d_2)}}{b, 0} \end{align}$$

where $a_k^{(d)}$ is the number of tokens in document $d$ assigned to topic $k$, $b_v^{(k)}$ is the number of tokens equal to term $v$ and assigned to topic $k$, and $\Betaf$ is the multivariate Beta function. Marginalizing out $\theta$ and $\phi$ by integrating with respect to each $\theta^{(d)}$ and $\phi^{(k)}$ over their respective sample spaces yields

$$\begin{align} \cp{z, w, \Eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b} &= \prod_{k=1}^{K} \frac{\Betaf(b^{(k)} + \beta)}{\Betaf(\beta)} \prod_{d=1}^{D} \frac{\Betaf(a^{(d)} + \alpha)}{\Betaf(\alpha)} \\ &\quad\quad \times \prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cN{\Eta_{k_1, k_2}}{\mu, \nu^2} \prod_{d_1=1}^D \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^D 2^{-b} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \cp{\omega^{(d_1, d_2)}}{b, 0} \\ &= \cp{w}{z, \beta} \cp{z}{\alpha} \cp{\Eta}{\mu, \nu^2} \cPsi{y, \omega}{\Eta, z, b}. \end{align}$$

See my LDA notebook for step-by-step details of the previous two calculations.

Our goal is to calculate the posterior distribution

$$\cp{z, \Eta, \omega}{w, y, \alpha, \beta, \mu, \nu^2, b} = \frac{\cp{z, w, \Eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b}} {\sum_{z'} \iint \cp{z', w, \Eta', y, \omega{'}}{\alpha, \beta, \mu, \nu^2, b} d\Eta' d\omega{'}}$$

in order to infer the topic assignments $z$ and regression coefficients $\Eta$ from the given term assignments $w$ and link data $y$. Since calculating this directly is infeasible, we resort to collapsed Gibbs sampling. The sampler is "collapsed" because we marginalized out $\theta$ and $\phi$, and will estimate them from the topic assignments $z$:

$$\hat\theta_k^{(d)} = \frac{a_k^{(d)} + \alpha_k}{\sum_{k'=1}^K \left(a_{k'}^{(d)} + \alpha_{k'} \right)},\quad \hat\phi_v^{(k)} = \frac{b_v^{(k)} + \beta_v}{\sum_{v'=1}^V \left(b_{v'}^{(k)} + \beta_{v'} \right)}.$$
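
For concreteness, here is a minimal sketch of these two estimates, assuming dense count arrays $a$ (shape $D \times K$, entries $a_k^{(d)}$) and $b$ (shape $K \times V$, entries $b_v^{(k)}$); these names are my own, not those of the GRTM implementation used below.

In [ ]:
import numpy as np

def estimate_theta_phi(a, b, alpha, beta):
    """Posterior point estimates of theta (D x K) and phi (K x V)
    from the topic-assignment counts a and b."""
    theta_hat = (a + alpha) / (a + alpha).sum(axis=1, keepdims=True)
    phi_hat = (b + beta) / (b + beta).sum(axis=1, keepdims=True)
    return theta_hat, phi_hat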

Gibbs sampling requires us to compute the full conditionals for each $z_n^{(d)}$, $\omega^{(d, d')}$ and $\Eta_{k, k'}$. For example, we need to calculate, for all $n$, $d$ and $k$,

$$\begin{align} \cp{z_n^{(d)} = k}{z \setminus z_n^{(d)}, w, H, y, \omega, \alpha, \beta, \mu, \nu^2, b} &\propto \cp{z_n^{(d)} = k, z \setminus z_n^{(d)}, w, H, y, \omega}{\alpha, \beta, \mu, \nu^2, b} \\ &\propto \frac{b_{w_n^{(d)}}^{(k)} \setminus z_n^{(d)} + \beta_{w_n^{(d)}}}{ \sum_{v=1}^V \left( b_v^{(k)} \setminus z_n^{(d)} + \beta_v\right)} \left( a_k^{(d)} \setminus z_n^{(d)} + \alpha_k \right) \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \\ &\propto \frac{b_{w_n^{(d)}}^{(k)} \setminus z_n^{(d)} + \beta_{w_n^{(d)}}}{ \sum_{v=1}^V \left( b_v^{(k)} \setminus z_n^{(d)} + \beta_v\right)} \left( a_k^{(d)} \setminus z_n^{(d)} + \alpha_k \right) \\ &\quad\quad\times \exp \left( \sum_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \left[ \left( \kappa^{(d_1, d)} - \omega^{(d_1, d)} ( \zeta^{(d_1, d)} \setminus z_n^{(d)}) \right) \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} - \frac{ \omega^{(d_1, d)} }{2} \left( \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} \right)^2 \right] \right. \\ &\quad\quad\quad\quad + \left. \sum_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \left[ \left( \kappa^{(d, d_2)} - \omega^{(d, d_2)} ( \zeta^{(d, d_2)} \setminus z_n^{(d)}) \right) \frac{H_{k, :} \etd{d_2}}{N^{(d)}} - \frac{ \omega^{(d, d_2)} }{2} \left( \frac{H_{k, :} \etd{d_2}}{N^{(d)}} \right)^2 \right] \right) \end{align}$$

where the "set-minus" notation $\cdot \setminus z_n^{(d)}$ denotes the variable the notation is applied to with the entry $z_n^{(d)}$ removed (again, see my LDA notebook for details). This final proportionality is true since

$$\begin{align} \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) &= \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \left( \zeta^{(d_1, d_2)} \setminus z_n^{(d)} + \Delta_{d, d_1, d_2}^{(k)} \right) - \frac{ \omega^{(d_1, d_2)} }{2} \left( \zeta^{(d_1, d_2)} \setminus z_n^{(d)} + \Delta_{d, d_1, d_2}^{(k)} \right)^2 \right) \\ &\propto \prod_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \exp \left( \kappa^{(d_1, d)} \left( \zeta^{(d_1, d)} \setminus z_n^{(d)} + \Delta_{d, d_1, d}^{(k)} \right) - \frac{ \omega^{(d_1, d)} }{2} \left( \zeta^{(d_1, d)} \setminus z_n^{(d)} + \Delta_{d, d_1, d}^{(k)} \right)^2 \right) \\ &\quad\quad\times \prod_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \exp \left( \kappa^{(d, d_2)} \left( \zeta^{(d, d_2)} \setminus z_n^{(d)} + \Delta_{d, d, d_2}^{(k)} \right) - \frac{ \omega^{(d, d_2)} }{2} \left( \zeta^{(d, d_2)} \setminus z_n^{(d)} + \Delta_{d, d, d_2}^{(k)} \right)^2 \right) \\ &\propto \exp \left( \sum_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \left[ \left( \kappa^{(d_1, d)} - \omega^{(d_1, d)} ( \zeta^{(d_1, d)} \setminus z_n^{(d)}) \right) \frac{H_{:, k}^T (\etd{d_1} \setminus z_n^{(d)})}{N^{(d)}} - \frac{ \omega^{(d_1, d)} }{2} \left( \frac{H_{:, k}^T (\etd{d_1} \setminus z_n^{(d)})}{N^{(d)}} \right)^2 \right] \right. \\ &\quad\quad + \left. \sum_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \left[ \left( \kappa^{(d, d_2)} - \omega^{(d, d_2)} ( \zeta^{(d, d_2)} \setminus z_n^{(d)}) \right) \frac{H_{k, :} (\etd{d_2} \setminus z_n^{(d)})}{N^{(d)}} - \frac{ \omega^{(d, d_2)} }{2} \left( \frac{H_{k, :} (\etd{d_2} \setminus z_n^{(d)})}{N^{(d)}} \right)^2 \right] \right) \\ &= \exp \left( \sum_{\substack{d_1=1 \\ d_1 \neq d}}^{D} \left[ \left( \kappa^{(d_1, d)} - \omega^{(d_1, d)} ( \zeta^{(d_1, d)} \setminus z_n^{(d)}) \right) \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} - \frac{ \omega^{(d_1, d)} }{2} \left( \frac{H_{:, k}^T \etd{d_1}}{N^{(d)}} \right)^2 \right] \right. \\ &\quad\quad + \left. \sum_{\substack{d_2=1 \\ d_2 \neq d}}^{D} \left[ \left( \kappa^{(d, d_2)} - \omega^{(d, d_2)} ( \zeta^{(d, d_2)} \setminus z_n^{(d)}) \right) \frac{H_{k, :} \etd{d_2}}{N^{(d)}} - \frac{ \omega^{(d, d_2)} }{2} \left( \frac{H_{k, :} \etd{d_2}}{N^{(d)}} \right)^2 \right] \right) \end{align}$$

where

$$\Delta_{d, d_1, d_2}^{(k)} = \delta_{d, d_1} \frac{H_{k, :} (\etd{d_2} \setminus z_n^{(d)})}{N^{(d)}} + \delta_{d, d_2} \frac{H_{:, k}^T (\etd{d_1} \setminus z_n^{(d)})}{N^{(d)}},$$

$\delta_{d, d'}$ is the Kronecker delta, and $H_{k, :}$ and $H_{:, k}$ are the $k$th row and column of $H$, respectively. The first proportionality is a result of the fact that $\Delta_{d, d_1, d_2}^{(k)}$ is nonzero only when $d = d_1$ or $d = d_2$. The last equality follows from the fact that $d \neq d_1$ in the first summation and $d \neq d_2$ in the second.
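
To make the sampling step concrete, here is a minimal single-token sketch of this full conditional. It is not the GRTM implementation used below, and the array layout is my own assumption: $a$ is the $D \times K$ matrix of document-topic counts, $b$ the $K \times V$ matrix of topic-term counts, and kappa, omega and zeta_minus are dense $D \times D$ arrays holding $\kappa^{(d_1, d_2)}$, $\omega^{(d_1, d_2)}$ and $\zeta^{(d_1, d_2)} \setminus z_n^{(d)}$ for the ordered document pairs (diagonals unused).

In [ ]:
import numpy as np

def resample_token(d, n, z, w, a, b, H, kappa, omega, zeta_minus,
                   alpha, beta, N, rng):
    """One collapsed Gibbs update of z[d][n] from its full conditional."""
    D, K = a.shape
    old_k, v = z[d][n], w[d][n]

    # remove the token's current assignment from the counts
    a[d, old_k] -= 1
    b[old_k, v] -= 1

    # LDA part of the full conditional (in log space)
    log_p = np.log(a[d] + alpha) + np.log(b[:, v] + beta[v]) \
        - np.log((b + beta).sum(axis=1))

    # link part: empirical topic distributions of the other documents
    zbar = a / a.sum(axis=1, keepdims=True)                      # (D, K)
    for d1 in range(D):
        if d1 == d:
            continue
        # links into d: assigning topic k adds H[:, k] . zbar[d1] / N[d] to zeta
        s_in = H.T.dot(zbar[d1]) / N[d]                          # (K,)
        log_p += (kappa[d1, d] - omega[d1, d] * zeta_minus[d1, d]) * s_in \
            - 0.5 * omega[d1, d] * s_in ** 2
        # links out of d: assigning topic k adds H[k, :] . zbar[d1] / N[d] to zeta
        s_out = H.dot(zbar[d1]) / N[d]                           # (K,)
        log_p += (kappa[d, d1] - omega[d, d1] * zeta_minus[d, d1]) * s_out \
            - 0.5 * omega[d, d1] * s_out ** 2

    p = np.exp(log_p - log_p.max())
    new_k = rng.choice(K, p=p / p.sum())

    # add the token back under its new assignment
    a[d, new_k] += 1
    b[new_k, v] += 1
    z[d][n] = new_k
    # (a full sampler would now also update zeta_minus for the pairs involving d)
    return new_k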

In order to calculate the full conditional for $H$, let $\eta = (H_{:,1}^T \cdots H_{:, K}^T)^T$ be the vector of concatenated columns of $H$, let $Z$ be the matrix whose columns are the vectors $\etd{d, d'} = \etd{d'} \otimes \etd{d}$ for all ordered pairs of documents $d \neq d'$, where $\otimes$ is the Kronecker product, let $\Omega$ be the diagonal matrix whose diagonal entries are the corresponding $\omega^{(d, d')}$ and $\kappa$ the vector whose entries are the corresponding $\kappa^{(d, d')}$ (both in the same pair order as the columns of $Z$), let $I$ be the identity matrix, and let $\one$ be the vector of ones, and note that

$$\prod_{k_1=1}^{K} \prod_{k_2=1}^{K} \cN{H_{k_1, k_2}}{\mu, \nu^2} = \cN{\eta}{\mu \one, \nu^2 I}$$

and

$$\prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( \kappa^{(d_1, d_2)} \zeta^{(d_1, d_2)} - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) = \exp \left( \eta^T Z \kappa - \frac{1}{2} \eta^T Z \Omega Z^T \eta \right) \propto \cN{\eta}{(Z \Omega Z^T)^{-1} Z \kappa, (Z \Omega Z^T)^{-1}}.$$

Therefore

$$\begin{align} \cp{\eta}{z, w, y, \omega, \alpha, \beta, \mu, \nu^2, b} &\propto \cp{z, w, \eta, y, \omega}{\alpha, \beta, \mu, \nu^2, b} \\ &\propto \cN{\eta}{\mu \one, \nu^2 I} \cN{\eta}{(Z \Omega Z^T)^{-1} Z \kappa, (Z \Omega Z^T)^{-1}} \\ &\propto \cN{\eta}{\Sigma \left( \frac{\mu}{\nu^2} \one + Z \kappa \right), \Sigma} \end{align}$$

where $\Sigma^{-1} = \nu^{-2} I + Z \Omega Z^T$ (see Section 8.1.8 of the Matrix Cookbook).
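
Here is a minimal sketch of this draw, assuming $Z$ is stored as a $K^2 \times P$ array over the $P$ ordered document pairs, with omega and kappa as length-$P$ vectors in the same pair order (these names and this layout are my own, not the GRTM implementation's). It samples the Gaussian full conditional via a Cholesky factorization of the posterior precision.

In [ ]:
import numpy as np

def sample_eta(Z, omega, kappa, mu, nu2, rng):
    """Draw eta from N(Sigma (mu / nu2 * 1 + Z kappa), Sigma),
    where Sigma^{-1} = nu^{-2} I + Z diag(omega) Z^T."""
    K2 = Z.shape[0]
    prec = np.eye(K2) / nu2 + (Z * omega).dot(Z.T)   # posterior precision Sigma^{-1}
    L = np.linalg.cholesky(prec)                     # prec = L L^T
    mean = np.linalg.solve(prec, mu / nu2 * np.ones(K2) + Z.dot(kappa))
    eps = rng.standard_normal(K2)
    return mean + np.linalg.solve(L.T, eps)          # L^{-T} eps has covariance Sigma

Since $\eta$ stacks the columns of $H$, the matrix can be recovered afterwards by reshaping the draw into a $K \times K$ matrix in column-major order.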

We also need the full conditional for $\omega$:

$$\begin{align} \cp{\omega}{z, w, H, y, \alpha, \beta, \mu, \nu^2, b} &\propto \cp{z, w, H, y, \omega}{\alpha, \beta, \mu, \nu^2, b} \\ &\propto \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \exp \left( - \frac{ \omega^{(d_1, d_2)} }{2} (\zeta^{(d_1, d_2)})^2 \right) \cp{\omega^{(d_1, d_2)}}{b, 0} \\ &= \prod_{d_1=1}^{D} \prod_{\substack{d_2=1 \\ d_2 \neq d_1}}^{D} \cp{\omega^{(d_1, d_2)}}{b, \zeta^{(d_1, d_2)}} \end{align}$$

that is, $\omega^{(d_1, d_2)} \sim \PG \left( b, \zeta^{(d_1, d_2)} \right)$ for each pair of documents $d_1$ and $d_2$. We sample from the Polya-Gamma distribution according to the method of Polson et al. 2012, implemented for Python 3 in this code repo.
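
For illustration only (the linked repository implements the exact sampler), here is a minimal approximate $\PG(b, c)$ draw based on the truncated sum-of-Gammas representation from Polson et al.; the function name and truncation level are my own choices.

In [ ]:
import numpy as np

def pg_draw_truncated(b, c, rng, n_terms=200):
    """Approximate PG(b, c) draw via a truncated version of the representation
    omega = 1 / (2 pi^2) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    where the g_k are independent Gamma(b, 1) variables."""
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=n_terms)
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)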

Graphical test


In [22]:
%matplotlib inline

from modules.helpers import plot_images
from functools import partial
from sklearn.metrics import (roc_auc_score, roc_curve)
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

imshow = partial(plt.imshow, cmap='gray', interpolation='nearest', aspect='auto')
sns.set(style='white')

Generate topics

We assume a vocabulary of 25 terms and create ten "topics": arranging the vocabulary in a 5×5 grid, each topic places probability 0.2 on the 5 terms in a single row or column of the grid.


In [48]:
V = 25
K = 10
N = 100
D = 1000

In [49]:
topics = []
topic_base = np.concatenate((np.ones((1, 5)) * 0.2, np.zeros((4, 5))), axis=0).ravel()
for i in range(5):
    topics.append(np.roll(topic_base, i * 5))
topic_base = np.concatenate((np.ones((5, 1)) * 0.2, np.zeros((5, 4))), axis=1).ravel()
for i in range(5):
    topics.append(np.roll(topic_base, i))
topics = np.array(topics)
plt.figure(figsize=(10, 5))
plot_images(plt, topics, (5, 5), layout=(2, 5), figsize=(10, 5))


<matplotlib.figure.Figure at 0x1174a3e80>

Generate documents from topics

We generate 1,000 documents from these 10 topics by sampling a topic distribution for each document from a Dirichlet distribution with parameter $\alpha = (1, \ldots, 1)$, then drawing $N = 100$ tokens per document from the corresponding mixture of topics.


In [50]:
alpha = np.ones(K)
np.random.seed(42)
thetas = np.random.dirichlet(alpha, size=D)
topic_assignments = np.array([np.random.choice(range(K), size=100, p=theta)
                              for theta in thetas])
word_assignments = np.array([[np.random.choice(range(V), size=1, p=topics[topic_assignments[d, n]])[0]
                              for n in range(N)] for d in range(D)])
doc_term_matrix = np.array([np.histogram(word_assignments[d], bins=V, range=(0, V - 1))[0] for d in range(D)])
imshow(doc_term_matrix)


Out[50]:
<matplotlib.image.AxesImage at 0x11c2357b8>

Generate document network

Create a document network from the documents by computing $\zeta^{(d, d')} = {\theta^{(d)}}^T H \theta^{(d')}$ for every ordered pair of documents and linking the pair whenever $\zeta^{(d, d')} \ge 0$ (equivalently, whenever the link probability is at least $1/2$).


In [51]:
from itertools import product
from sklearn.cross_validation import StratifiedKFold

In [52]:
# choose parameter values
mu = 0.
nu2 = 1.
np.random.seed(14)
H = np.random.normal(loc=mu, scale=np.sqrt(nu2), size=(K, K))  # scale is a standard deviation, so take the square root of the variance nu2
zeta = pd.DataFrame([(i, j, np.dot(np.dot(thetas[i], H), thetas[j])) for i, j in product(range(D), repeat=2)],
                    columns=('tail', 'head', 'zeta'))
_ = zeta.zeta.hist(bins=50)



In [53]:
# threshold zeta at 0 to generate binary link indicators y
zeta['y'] = (zeta.zeta >= 0).astype(int)
# plot histogram of responses
print('positive examples {} ({:.1f}%)'.format(zeta.y.sum(), zeta.y.sum() / D / D * 100))
_ = zeta.y.hist()


positive examples 151324 (15.1%)

In [54]:
y = zeta[['tail', 'head', 'y']].values

In [55]:
skf = StratifiedKFold(y[:, 2], n_folds=100)
_, train_idx = next(iter(skf))
train_idx.shape


Out[55]:
(10001,)

Estimate parameters


In [56]:
from slda.topic_models import GRTM

In [58]:
_K = 10
_alpha = alpha[:_K]
_beta = np.repeat(0.01, V)
_mu = mu
_nu2 = nu2
_b = 1.
n_iter = 500
grtm = GRTM(_K, _alpha, _beta, _mu, _nu2, _b, n_iter, seed=42)

In [59]:
%%time
grtm.fit(doc_term_matrix, y[train_idx])


CPU times: user 9min 9s, sys: 917 ms, total: 9min 10s
Wall time: 9min 11s

In [60]:
plot_images(plt, grtm.phi, (5, 5), (2, 5), figsize=(10, 5))



In [ ]:
topic_order = [4, 7, 3, 1, 0, 9, 5, 2, 8, 6]  # a permutation of all ten learned topics
plot_images(plt, grtm.phi[topic_order], (5, 5), (2, 5), figsize=(10, 5))

In [24]:
burnin = -1
mean_final_lL = grtm.loglikelihoods[burnin:].mean()
print(mean_final_lL)
plt.plot(grtm.loglikelihoods, label='mean final LL {:.2f}'.format(mean_final_lL))
_ = plt.legend()


-9814.52603434

In [25]:
imshow(grtm.theta)


Out[25]:
<matplotlib.image.AxesImage at 0x1184e6710>

In [26]:
H_pred = grtm.H[burnin:].mean(axis=0)
_ = plt.hist(H_pred.ravel(), bins=20)



In [27]:
_ = plt.hist(H.ravel(), bins=20)


Predict edges on pairs of test documents

Create 1,000 test documents using the same generative process as our training documents.


In [28]:
np.random.seed(42^2)  # note: ^ is bitwise XOR in Python, so this seeds with 40
thetas_test = np.random.dirichlet(alpha, size=D)
topic_assignments_test = np.array([np.random.choice(range(K), size=100, p=theta)
                                   for theta in thetas_test])
word_assignments_test = np.array([[np.random.choice(range(V), size=1, p=topics[topic_assignments_test[d, n]])[0]
                                   for n in range(N)] for d in range(D)])
doc_term_matrix_test = np.array([np.histogram(word_assignments_test[d], bins=V, range=(0, V - 1))[0] for d in range(D)])
imshow(doc_term_matrix_test)


Out[28]:
<matplotlib.image.AxesImage at 0x118793cc0>

Learn their topic distributions using the model trained on the training documents, then calculate the actual and predicted link probabilities (the Bernoulli parameters above). For the predicted probabilities, estimate $\eta$ as the mean of our samples of $\eta$ after burn-in.


In [29]:
def bern_param(theta1, theta2, H):
    zeta = np.dot(np.dot(theta1, H), theta2)
    return np.exp(zeta) / (1 + np.exp(zeta))

In [30]:
thetas_test_grtm = grtm.transform(doc_term_matrix_test)
p_test = np.zeros(D * D)
p_test_grtm = np.zeros(D * D)
for n, i in enumerate(product(range(D), range(D))):
    p_test[n] = bern_param(thetas_test[i[0]], thetas_test[i[1]], H)
    p_test_grtm[n] = bern_param(thetas_test_grtm[i[0]], thetas_test_grtm[i[1]], H_pred)

Measure the goodness of our prediction by the area under the associated ROC curve.


In [31]:
y_test = (p_test > 0.5).astype(int)
y_grtm = p_test_grtm
fpr, tpr, _ = roc_curve(y_test, y_grtm)
plt.plot(fpr, tpr, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_grtm))))
_ = plt.legend(loc='best')


Learn topics, then learn classifier


In [32]:
from slda.topic_models import LDA

In [33]:
lda = LDA(_K, _alpha, _beta, n_iter, seed=42)

In [34]:
%%time
lda.fit(doc_term_matrix)


CPU times: user 2.24 s, sys: 9.37 ms, total: 2.24 s
Wall time: 2.24 s

In [35]:
plot_images(plt, lda.phi, (5, 5), (2, 5), figsize=(10, 5))



In [36]:
plt.plot(lda.loglikelihoods)


Out[36]:
[<matplotlib.lines.Line2D at 0x11c8a7c50>]

In [37]:
imshow(lda.theta)


Out[37]:
<matplotlib.image.AxesImage at 0x10b7474a8>

In [38]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

Compute Kronecker products between the learned topic distributions for every ordered pair of training documents, and likewise for the test documents; these are the pairwise feature vectors the link classifiers are trained and evaluated on.


In [39]:
%%time
thetas_test_lda = lda.transform(doc_term_matrix_test)


CPU times: user 112 ms, sys: 1.34 ms, total: 114 ms
Wall time: 113 ms

In [40]:
lda_train = np.zeros((D * D, _K * _K))
lda_test = np.zeros((D * D, _K * _K))
for n, i in enumerate(product(range(D), range(D))):
    lda_train[n] = np.kron(lda.theta[i[0]], lda.theta[i[1]])
    lda_test[n] = np.kron(thetas_test_lda[i[0]], thetas_test_lda[i[1]])

Logistic regression

  • Train a logistic regression on the training data,
  • calculate the probability of an edge for each pair of test documents, and
  • measure the goodness of our prediction by computing the area under the ROC curve.

In [41]:
_C_grid = np.arange(1, 202, 10)
roc_auc_scores = []
for _C in _C_grid:
    print('Training Logistic Regression with C = {}'.format(_C))
    _lr = LogisticRegression(fit_intercept=False, C=_C)
    _lr.fit(lda_train, zeta.y)
    _y_lr = _lr.predict_proba(lda_test)[:, 1]
    roc_auc_scores.append(roc_auc_score(y_test, _y_lr))
    print('  roc_auc_score = {}'.format(roc_auc_scores[-1]))
print(_C_grid[np.argmax(roc_auc_scores)], np.max(roc_auc_scores))
plt.plot(_C_grid, roc_auc_scores)


Training Logistic Regression with C = 1
  roc_auc_score = 0.7838142410105514
Training Logistic Regression with C = 11
  roc_auc_score = 0.7838162876311603
Training Logistic Regression with C = 21
  roc_auc_score = 0.7838162539769836
Training Logistic Regression with C = 31
  roc_auc_score = 0.7838162561895421
Training Logistic Regression with C = 41
  roc_auc_score = 0.783816266545313
Training Logistic Regression with C = 51
  roc_auc_score = 0.7838162653891266
Training Logistic Regression with C = 61
  roc_auc_score = 0.78381627423936
Training Logistic Regression with C = 71
  roc_auc_score = 0.7838162677098173
Training Logistic Regression with C = 81
  roc_auc_score = 0.7838162687453946
Training Logistic Regression with C = 91
  roc_auc_score = 0.7838162616086463
Training Logistic Regression with C = 101
  roc_auc_score = 0.783816259533333
Training Logistic Regression with C = 111
  roc_auc_score = 0.7838162552745744
Training Logistic Regression with C = 121
  roc_auc_score = 0.7838162560398197
Training Logistic Regression with C = 131
  roc_auc_score = 0.7838162547089578
Training Logistic Regression with C = 141
  roc_auc_score = 0.7838162548378852
Training Logistic Regression with C = 151
  roc_auc_score = 0.7838162539728246
Training Logistic Regression with C = 161
  roc_auc_score = 0.7838162565264164
Training Logistic Regression with C = 171
  roc_auc_score = 0.7838162545218054
Training Logistic Regression with C = 181
  roc_auc_score = 0.7838162532865988
Training Logistic Regression with C = 191
  roc_auc_score = 0.7838162509825439
Training Logistic Regression with C = 201
  roc_auc_score = 0.7838162537066522
11 0.783816287631
Out[41]:
[<matplotlib.lines.Line2D at 0x11c35c9e8>]

In [42]:
lr = LogisticRegression(fit_intercept=False, C=11)
lr.fit(lda_train, zeta.y)
y_lr = lr.predict_proba(lda_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_lr)
plt.plot(fpr, tpr, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_lr))))
_ = plt.legend(loc='best')


Gradient boosted trees

  • Train gradient boosted trees on the training data,
  • calculate the probability of an edge for each pair of test documents, and
  • measure the goodness of our prediction by computing the area under the ROC curve.

In [43]:
_depth_grid = np.arange(1, 4)
roc_auc_scores = []
for _depth in _depth_grid:
    print('Training GradientBoostingClassifier with max_depth = {}'.format(_depth))
    _gbc = GradientBoostingClassifier(max_depth=_depth)
    _gbc.fit(lda_train, zeta.y)
    _y_gbc = _gbc.predict_proba(lda_test)[:, 1]
    roc_auc_scores.append(roc_auc_score(y_test, _y_gbc))
    print('  roc_auc_score = {}'.format(roc_auc_scores[-1]))
print(_depth_grid[np.argmax(roc_auc_scores)], np.max(roc_auc_scores))
plt.plot(_depth_grid, roc_auc_scores)


Training GradientBoostingClassifier with max_depth = 1
  roc_auc_score = 0.7738177479973433
Training GradientBoostingClassifier with max_depth = 2
  roc_auc_score = 0.7782572102416223
Training GradientBoostingClassifier with max_depth = 3
  roc_auc_score = 0.780067856917805
3 0.780067856918
Out[43]:
[<matplotlib.lines.Line2D at 0x11725f668>]

In [44]:
gbc = GradientBoostingClassifier(max_depth=3)
gbc.fit(lda_train, zeta.y)
y_gbc = gbc.predict_proba(lda_test)[:, 1]
fpr_gbc, tpr_gbc, _ = roc_curve(y_test, y_gbc)
plt.plot(fpr_gbc, tpr_gbc, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_gbc))))
plt.legend(loc='best')


Out[44]:
<matplotlib.legend.Legend at 0x11c394da0>

Use GRTM topics


In [45]:
grtm_train = np.zeros((D * D, _K * _K))
grtm_test = np.zeros((D * D, _K * _K))
for n, i in enumerate(product(range(D), range(D))):
    grtm_train[n] = np.kron(grtm.theta[i[0]], grtm.theta[i[1]])
    grtm_test[n] = np.kron(thetas_test_grtm[i[0]], thetas_test_grtm[i[1]])

In [46]:
_C_grid = np.arange(1, 52, 10)
roc_auc_scores = []
for _C in _C_grid:
    print('Training Logistic Regression with C = {}'.format(_C))
    _lr = LogisticRegression(fit_intercept=False, C=_C)
    _lr.fit(grtm_train, zeta.y)
    _y_lr = _lr.predict_proba(grtm_test)[:, 1]
    roc_auc_scores.append(roc_auc_score(y_test, _y_lr))
    print('  roc_auc_score = {}'.format(roc_auc_scores[-1]))
print(_C_grid[np.argmax(roc_auc_scores)], np.max(roc_auc_scores))
plt.plot(_C_grid, roc_auc_scores)


Training Logistic Regression with C = 1
  roc_auc_score = 0.8536008244937028
Training Logistic Regression with C = 11
  roc_auc_score = 0.853643659256871
Training Logistic Regression with C = 21
  roc_auc_score = 0.8536456248737231
Training Logistic Regression with C = 31
  roc_auc_score = 0.8536463316821341
Training Logistic Regression with C = 41
  roc_auc_score = 0.8536466818652372
Training Logistic Regression with C = 51
  roc_auc_score = 0.8536469080618956
51 0.853646908062
Out[46]:
[<matplotlib.lines.Line2D at 0x118738128>]

In [47]:
lr = LogisticRegression(fit_intercept=False, C=51)
lr.fit(grtm_train, zeta.y)
y_lr = lr.predict_proba(grtm_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_lr)
plt.plot(fpr, tpr, label=('AUC = {:.3f}'.format(roc_auc_score(y_test, y_lr))))
_ = plt.legend(loc='best')


Conclusion

The GRTM's direct link predictions beat both the logistic regression (AUC ≈ 0.784) and the gradient boosted trees (AUC ≈ 0.780) trained on the LDA topics, but they are in turn beaten by a logistic regression trained on the GRTM topics (AUC ≈ 0.854), which just shows how much better the GRTM topics are!
