Generative Models: Restricted Boltzmann Machines

The main difference:

  • Previous models: directed graph $x \to z \to \tilde{x}$

  • RBM: undirected graph $x$ -- $z$

Energy Function

  • Notation: here we use $h$ for the latent variable
$$E(x,h) = -h^T Wx - c^Tx - b^Th \\ = -\sum_j \sum_k W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j$$
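
A minimal numpy sketch of this energy function on a toy model (the names `energy`, `W`, `b`, `c` below are my own illustrative assumptions, not part of the original notes):

```python
import numpy as np

def energy(x, h, W, b, c):
    """RBM energy E(x, h) = -h^T W x - c^T x - b^T h."""
    return -h @ W @ x - c @ x - b @ h

# toy model: n = 3 visible units, m = 2 hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))    # W[j, k] couples hidden unit j and visible unit k
b = rng.normal(size=2)         # hidden bias
c = rng.normal(size=3)         # visible bias
x = np.array([1.0, 0.0, 1.0])  # a binary visible vector
h = np.array([0.0, 1.0])       # a binary hidden vector
print(energy(x, h, W, b, c))
```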

Distribution:

  • Joint distribution

    $$p(x,h) = \frac{\exp(-E(x,h))}{Z}$$

  • Partition function: $Z = \sum_{x\in \{0,1\}^n} \sum_{h\in\{0,1\}^m} \exp(-E(x,h))$

  • Intractable: the sum runs over all $2^n$ configurations of $x$ and all $2^m$ configurations of $h$, so there are $2^{n+m}$ terms in total.
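
Continuing the toy sketch above, a brute-force computation of $Z$ makes the $2^{n+m}$-term cost explicit (this reuses the hypothetical `energy`, `W`, `b`, `c`, `x`, `h` defined earlier):

```python
from itertools import product

def partition_function(W, b, c):
    """Brute-force Z = sum over all x and h of exp(-E(x, h)).
    The double sum has 2^(n+m) terms, which is why this is only
    feasible for tiny models."""
    m, n = W.shape
    Z = 0.0
    for xv in product([0, 1], repeat=n):
        for hv in product([0, 1], repeat=m):
            Z += np.exp(-energy(np.array(xv, float), np.array(hv, float), W, b, c))
    return Z

Z = partition_function(W, b, c)               # 2^(3+2) = 32 terms for the toy model
p_joint = np.exp(-energy(x, h, W, b, c)) / Z  # joint probability p(x, h)
print(Z, p_joint)
```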

Graphical model

$$p(x,h) = \frac{\exp(-E(x,h))}{Z} = \frac{\exp(h^TWx + c^Tx + b^Th)}{Z} \\ = \frac{\exp(h^TWx)\exp(c^Tx)\exp(b^Th)}{Z}$$

Connection to physics and nature:

  • Interactions between atoms and molecules
  • Different energy functions

    $$p(x,h) = \frac{1}{Z} \prod_j \prod_k \exp(W_{j,k} h_j x_k) \times \prod_k \exp(c_k x_k) \times \prod_j \exp(b_j h_j)$$
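
A quick numerical check of this factorization on the same toy arrays (again just an illustrative sketch):

```python
# exp(-E(x, h)) is a product of local factors:
# one factor per (j, k) edge, one per visible unit, one per hidden unit
m, n = W.shape
lhs = np.exp(-energy(x, h, W, b, c))
rhs = (np.prod([np.exp(W[j, k] * h[j] * x[k]) for j in range(m) for k in range(n)])
       * np.prod(np.exp(c * x))
       * np.prod(np.exp(b * h)))
print(np.allclose(lhs, rhs))   # True
```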

Inference

  • Conditional distributions

    $$p(h|x) = \prod_j p(h_j|x)$$

    $p(h_j=1|x) = \frac{1}{1 + \exp(-(b_j + W_{j.}x))} = \text{sigmoid}(b_j + W_{j.}x)$

$$p(x|h) = \prod_k p(x_k|h)$$

$p(x_k=1|h) = \frac{1}{1 + \exp(-(c_k + h^T W_{.k}))} = \text{sigmoid}(c_k + h^T W_{.k})$ (both conditionals are sketched in code below)

  • Derivation:
$$p(h|x) = \frac{p(x,h)}{\sum_{\tilde{h}} p(x,\tilde{h})}$$
  • Marginal distribution

$$p(x) = \sum_h p(x,h) = \sum_h \frac{\exp(-E(x,h))}{Z} = \frac{\exp(-F(x))}{Z}$$

$F(x)$ is the free energy:

$$F(x) = -\log \sum_h \exp(-E(x,h)) = -c^Tx - \sum_j \log\left(1 + \exp(b_j + W_{j.}x)\right)$$

The term $\log(1 + \exp(\cdot))$ is called softplus$(\cdot)$. Softplus is a smooth version of ReLU.
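
A small sketch of these conditionals and of the free energy, reusing the toy model and `Z` from the earlier blocks; the helper names (`p_h_given_x`, `p_x_given_h`, `free_energy`) are my own, and the last lines only verify $p(x) = \exp(-F(x))/Z$ by brute force:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, b):
    """Vector of p(h_j = 1 | x) = sigmoid(b_j + W_{j.} x)."""
    return sigmoid(b + W @ x)

def p_x_given_h(h, W, c):
    """Vector of p(x_k = 1 | h) = sigmoid(c_k + h^T W_{.k})."""
    return sigmoid(c + W.T @ h)

def free_energy(x, W, b, c):
    """F(x) = -c^T x - sum_j softplus(b_j + W_{j.} x), so p(x) = exp(-F(x)) / Z."""
    return -c @ x - np.sum(np.logaddexp(0.0, b + W @ x))  # logaddexp(0, z) = softplus(z)

# sanity check: exp(-F(x)) / Z matches the brute-force marginal sum_h p(x, h)
p_x_brute = sum(np.exp(-energy(x, np.array(hv, float), W, b, c))
                for hv in product([0, 1], repeat=W.shape[0])) / Z
print(np.allclose(p_x_brute, np.exp(-free_energy(x, W, b, c)) / Z))   # True
```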

Training

  • Loss function: negative log-likelihood

    $$\frac{1}{T} \sum_{t\in \text{training}} l(f(x^t)) = \frac{1}{T} \sum_t -\log(p(x^t))$$

  • Training: stochastic gradient descent, using the gradient

    $$\frac{\partial \left(-\log p(x^t)\right)}{\partial \theta} = \mathbf{E}_h\!\left[\frac{\partial E(x^t,h)}{\partial \theta} \,\middle|\, x^t\right] - \mathbf{E}_{x,h}\!\left[\frac{\partial E(x,h)}{\partial \theta}\right]$$

    where $\mathbf{E}$ is the expectation under the indicated distribution and $E$ is the energy function. The first term (positive phase) conditions on the training example and is tractable; the second term (negative phase) is an expectation under the model distribution and is intractable.
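
As an illustration of the two phases, here is a sketch that computes the exact gradient with respect to $W$ on the toy model from earlier, enumerating the negative phase by brute force (only possible because the model is tiny; the function name is an assumption):

```python
def nll_grad_W(x_t, W, b, c):
    """Exact gradient of -log p(x^t) with respect to W on the toy model.

    dE/dW[j, k] = -h_j x_k, so the gradient is
    (negative phase) E_{x,h}[h x^T] minus (positive phase) E_{h|x^t}[h x^T].
    The negative phase is enumerated exactly here, which is precisely the
    part that is intractable for realistically sized models."""
    m, n = W.shape
    pos = np.outer(p_h_given_x(x_t, W, b), x_t)        # E_{h | x^t}[h x^T]
    neg = np.zeros((m, n))                              # E_{x, h}[h x^T]
    Z = partition_function(W, b, c)
    for xv in product([0, 1], repeat=n):
        xv = np.array(xv, float)
        p_x = np.exp(-free_energy(xv, W, b, c)) / Z
        neg += p_x * np.outer(p_h_given_x(xv, W, b), xv)
    return neg - pos

# one stochastic gradient descent step on a single training example
lr = 0.1
W -= lr * nll_grad_W(x, W, b, c)
```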

Recall: computing expectation:

  • Compute expectation $\mathbf{E}_p[f(X)]$?

    $$\mathbf{E}[f(x)] = \int_x p(x) f(x) dx$$

    We draw a set of samples $S$ from the distribution of $X$ (i.e. from $p(x)$) and then take the average:

    $$\mathbf{E}[f(x)] = \int_x p(x) f(x)\, dx \approx \frac{1}{|S|} \sum_{s\in S} f(s)$$
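
A quick illustration of this Monte Carlo recipe (the distribution and the choice of $f$ are arbitrary, picked only for the example):

```python
# Monte Carlo estimate of E[f(X)]: draw samples from p(x), average f over them.
# Illustrative choices: X ~ N(2, 1) and f(x) = x^2, so E[f(X)] = 2^2 + 1 = 5.
rng = np.random.default_rng(1)
samples = rng.normal(loc=2.0, scale=1.0, size=100_000)
print(np.mean(samples ** 2))   # close to 5
```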

Contrastive Divergence

An inexpensive way to obtain the negative samples needed for the negative phase.

  • Replace the negative-phase expectation by a point estimate $\tilde{x}$

  • Obtain the point $\tilde{x}$ by Gibbs sampling

  • Start the sampling chain at the training example $x^t$
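
A minimal CD-$k$ sketch under the same toy setup, using the conditional helpers defined earlier; the function name, learning rate, and $k$ are illustrative assumptions:

```python
def sample_bernoulli(p, rng):
    """Draw a binary vector with independent Bernoulli(p) entries."""
    return (rng.random(p.shape) < p).astype(float)

def cd_k_update(x_t, W, b, c, rng, k=1, lr=0.1):
    """One CD-k update: run a Gibbs chain started at the training point x^t
    for k steps and use the final sample x_tilde as the point estimate of
    the negative phase."""
    x_tilde = x_t.copy()
    for _ in range(k):
        h_tilde = sample_bernoulli(p_h_given_x(x_tilde, W, b), rng)
        x_tilde = sample_bernoulli(p_x_given_h(h_tilde, W, c), rng)
    ph_data = p_h_given_x(x_t, W, b)        # positive phase: clamp x to the data
    ph_model = p_h_given_x(x_tilde, W, b)   # negative phase: use the Gibbs sample
    W += lr * (np.outer(ph_data, x_t) - np.outer(ph_model, x_tilde))
    b += lr * (ph_data - ph_model)
    c += lr * (x_t - x_tilde)
    return W, b, c

rng = np.random.default_rng(2)
W, b, c = cd_k_update(x, W, b, c, rng, k=1)
```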

Adding more layers: Deep Boltzmann Machines

Discussion: comparing modeling using directed graph vs. undirected

  • Directed: $z \to x$ we model $p(x|z)$

  • Undirected: $z$ -- $x$, we model the joint $p(x,z)$ (and hence $p(z)$ by marginalization)

    The directed version is easier, since each factor is a conditional that we can evaluate directly from a given input.

    The undirected graph should in theory give a more accurate model, since during the iterative inference process we repeatedly make both $x$ and $z$ better.


Gaussian Bernoulli RBM

  • For the case when input $x$ is real and unbounded:

    • add a quadratic term to the energy function $$E(x,h) = -h^TWx - c^Tx - b^Th + \frac{1}{2} x^Tx$$

    • $p(x|h)$ is now a Gaussian distribution

      • $\mu = c + W^Th$
      • $\Sigma = I$
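
A short sketch of sampling from this Gaussian conditional, reusing the toy `W`, `c`, `h` from the earlier blocks (illustrative only; the hidden units stay binary, so $p(h_j=1|x)$ is unchanged):

```python
def sample_x_given_h_gaussian(h, W, c, rng):
    """Gaussian-Bernoulli RBM: p(x | h) = N(mu, I) with mu = c + W^T h."""
    mu = c + W.T @ h
    return mu + rng.normal(size=mu.shape)   # identity covariance

rng = np.random.default_rng(3)
print(sample_x_given_h_gaussian(h, W, c, rng))   # a real-valued visible vector
```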
