In [38]:
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy.random as rng
np.set_printoptions(precision = 2)
plt.rcParams['figure.figsize'] = (10.0, 5.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

The vanilla RBM.

  • $W_{ji}$ is the weight between the j-th hidden unit and the i-th visible unit.
  • $W_{0i}$ is "bias" into the i-th visible unit.
  • $W_{j0}$ is "bias" into the j-th hidden unit.

The joint probability under the RBM factorisation is: $$ P^\star(h,v) = \prod_i \prod_j e^{h_j W_{ji} v_i} \;\; \times \;\; \prod_{i^\prime} e^{W_{0i^\prime} v_{i^\prime}} \;\; \times \;\; \prod_{j^\prime} e^{W_{j^\prime 0} h_{j^\prime}} $$ and its logarithm is $$ \log P^\star(h,v) = \sum_i \sum_j h_j W_{ji} v_i \;\; + \;\; \sum_i W_{0i} v_i \;\; + \;\; \sum_j W_{j0} h_j $$
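
For concreteness, here's a minimal sketch (not part of the original derivation) of $\log P^\star(h,v)$ in numpy, keeping the biases as separate vectors `b_v` and `b_h` rather than as the $0$-th row and column of $W$:

In [ ]:
def log_p_star(h, v, W, b_v, b_h):
    # unnormalised log-probability log P*(h,v) of one joint configuration
    # h: (n_hidden,) binary, v: (n_visible,) binary, W: (n_hidden, n_visible)
    return h @ W @ v + b_v @ v + b_h @ h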

Gibbs in vanilla RBM

To sample from this distribution we can figure out the Gibbs update step.

The probability $p(h_j=1|v)$ that the j-th hidden unit generates a 1 can be written as $ \sigma(\psi_j) $ with $\sigma(x)=1/(1+e^{-x})$ and $\psi_j = \log P^*(h,v | h_j =1 ) - \log P^*(h,v | h_j = 0)$. And from the $\log P^\star(h,v)$ given above, this is easily seen to be $ \psi_j(v) = \sum_i W_{ji} v_i + W_{j0}$

Similarly, for the visible units the probability $p(v_i=1|h)$ that the i-th visible generates a 1 can be written as $\sigma(\phi_i(h)) $ with $\phi_i(h) = \log P^*(h,v | v_i =1 ) - \log P^*(h,v | v_i = 0)$. And similarly this can be seen to be $ \phi_i(h) = \sum_j W_{ji} h_j + W_{0i}$ for the i-th visible unit.

For convenience we will sometimes write $\sigma(\phi_i(h))$ as just $\sigma_i(h)$.
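
As a sketch (same separate-bias convention as above; the helper name is illustrative), one alternating Gibbs update is then just:

In [ ]:
def gibbs_step(v, W, b_v, b_h):
    # p(h_j=1|v) = sigma(psi_j), with psi_j = sum_i W_ji v_i + W_j0
    p_h = sigmoid(W @ v + b_h)
    h = (rng.rand(W.shape[0]) < p_h) * 1.0
    # p(v_i=1|h) = sigma(phi_i), with phi_i = sum_j W_ji h_j + W_0i
    p_v = sigmoid(W.T @ h + b_v)
    v_new = (rng.rand(W.shape[1]) < p_v) * 1.0
    return h, v_new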

Gibbs in an equivalent network

Note this is equivalent to a 3 layer network in which the top 2 layers form an RBM and the lower one is a sigmoid belief network. Sampling from the latter model involves drawing Bernoulli variables repeatedly (eg. via alternating Gibbs sampling) from the RBM until equilibrium and then sampling $v$ from the visibles given the hiddens $h$, ie. "ancestral sampling" in the belief net, but from an $h$ generated by an RBM.

We know (from above) how to generate the $h$ sample: Gibbs sampling from the RBM will do it. Then, the conditional probability of $v$ under a sigmoid belief net is (by definition) $p(v_i=1|h) = \sigma_i(h)$. Thus Gibbs sampling from a simple RBM ending in a sample for $v$ is the same as sampling $h$ from the same RBM and then using a sigmoid belief net for the last step.

However, there's another way to draw such samples. Write (product rule) $\log P^\star(h,v) = \log P^\star(h) + \log P(v|h)$. We have the second term already: $$ \log P(v|h) = \sum_i v_i \log \sigma_i(h) + (1-v_i) \log (1 - \sigma_i(h))$$

To find $P^\star(h)$ we need to marginalise that joint over all $\mathbf{v}$ configurations: $$\begin{align} P^\star(h) &= \sum_{v_1=0}^1 \cdots \sum_{v_n=0}^1 \exp \bigg[ \log P^{\star}(h,v) \bigg] \\ &= \sum_{v_1=0}^1 \cdots \sum_{v_n=0}^1 \exp \bigg[ \sum_i \sum_j h_j W_{ji} v_i \;\; + \;\; \sum_i W_{0i} v_i \;\; + \;\; \sum_j W_{j0} h_j \bigg] \\ &= \sum_{v_1=0}^1 \cdots \sum_{v_n=0}^1 \exp \bigg[ \sum_i v_i \phi_i(h) \;\; + \;\; \sum_j W_{j0} h_j \bigg] \\ \text{where } \phi_i(h) &= \sum_j W_{ji} h_j + W_{0i} \\ &= \exp\left[ \sum_j h_j W_{j0} \right] \;\; \times \sum_{v_1=0}^1 \cdots \sum_{v_n=0}^1 \prod_i \exp\bigg[ v_i \phi_i(h) \bigg] \\ &= \exp\left[\sum_j h_j W_{j0}\right] \;\; \times \prod_i \bigg( 1 + e^{\phi_i(h) } \bigg) \\ \text{and so} \log P^\star(h) &= \sum_j h_j W_{j0} \;\; + \sum_i \log \bigg( 1 + e^{\phi_i(h) } \bigg) \\ &= \sum_j h_j W_{j0} \;\; + \; \sum_i \phi_i(h) \; - \; \sum_i \log \sigma_i(h) \end{align} $$

So far we've figured out $\log P^\star(h)$ for the RBM that is the "top layer".

Therefore another way to write $\log P^\star(h,v)$ is $$ \log P^\star(h,v) = \underbrace{\sum_j h_j W_{j0} \;\; + \; \sum_i \phi_i(h) \; - \; \sum_i \log \sigma_i(h)}_{\log P^\star(h)} \;\;+\;\; \underbrace{\sum_i v_i \log \sigma_i(h) + (1-v_i) \log (1 - \sigma_i(h))}_{\log P(v \mid h)} $$ By collecting terms and simplifying one can readily check that this matches the earlier form.
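
A brute-force numerical check of the $\log P^\star(h)$ expression (a sketch with made-up tiny sizes, not part of the original notebook):

In [ ]:
import itertools
nv, nh = 4, 3                     # tiny on purpose, so the 2^nv sum is feasible
W = rng.randn(nh, nv)
b_v, b_h = rng.randn(nv), rng.randn(nh)
h = (rng.rand(nh) < 0.5) * 1.0

def log_joint(hh, vv):            # log P*(h,v) as in the first section
    return hh @ W @ vv + b_v @ vv + b_h @ hh

brute = np.log(sum(np.exp(log_joint(h, np.array(v, dtype=float)))
                   for v in itertools.product([0, 1], repeat=nv)))
phi = W.T @ h + b_v               # phi_i(h)
closed = b_h @ h + np.sum(np.log1p(np.exp(phi)))
print(brute, closed)              # the two numbers should agree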

a two-causes model

Now we'd like to change this slightly, so that two independent RBMs are used to model two causes, which are then combined at the last moment via a sigmoid belief net to form $v$. Suppose that at a given moment the 2nd RBM is in a state which contributes an extra activation $\epsilon_i$ to the i-th visible unit. We're interested in how this will affect the Gibbs updates to the hidden units $h$ in the first RBM.

Note: $\epsilon_i$ is in general going to be a weighted sum of inputs from the second RBM's hidden layer, plus a new bias arising from the second RBM. Although the two biases both going into the visible units seem (and might be) redundant, in the generative model it seems sensible that visible activations under the two "causes" would have different background rates if taken separately (ie. different biases). So for now I think we should leave them in, but maybe hope to eliminate / merge them if possible in future, if it helps anything...

Before going on to the Gibbs Sampler version in which visible units are clamped, consider "ancestral" sampling from the model: each RBM independently does alternating Gibbs Sampling for a (longish) period, and then both combine to generate a sample $v$ vector, by adding both their weighted sums ($\phi$) and adding in both their visible biases too. That's a much more efficient way to do the "sleep" phase than doing what follows (which is mandatory for the "wake" phase samples however).
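
A sketch of that final "ancestral" step (the names `WA`, `WB`, `b_vA`, `b_vB` are illustrative, not from the original): each RBM supplies a hidden state from its own Gibbs chain, and the visibles are then drawn from the combined input.

In [ ]:
def ancestral_v(hA, hB, WA, WB, b_vA, b_vB):
    # add both weighted sums (the two phis) and both visible biases, then sample v
    phi = (WA.T @ hA + b_vA) + (WB.T @ hB + b_vB)
    return (rng.rand(len(phi)) < sigmoid(phi)) * 1.0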

The Gibbs update step is given by $p_j = \sigma(\psi_j)$ with $ \psi_j = \log P^*(h,v ; h_j =1 ) - \log P^*(h,v ; h_j = 0)$.

However this time we don't have exact correspondence with an RBM because only the final step involves $\epsilon$, not the reverberations in the RBM above it that generates $h$. So it's not enough to consider just the RBM alone, with its joint being just the product of factors in the first line of math above. We need to incorporate the last step explicitly, with its slight difference in the form of $\phi^B_i$. We know the joint decomposes into this: $$ \log P^* (h,v) = \log P^*(h) + \log P(v|h)$$ where the first term is the vanilla RBM probability but the second is the final layer's probability, now given by $$ \log P(v|h^A,h^B) = \sum_i v_i \log \sigma (\phi^A_i(h) + \phi^B_i) + (1-v_i) \log (1 - \sigma(\phi^A_i(h) + \phi^B_i))$$

To carry out Gibbs sampling in the hidden layer of this architecture we need to calculate $\psi^A_j = \log P^*(h,v ; h^A_j =1 ) - \log P^*(h,v ; h^A_j = 0)$. We'll use the fact that $\phi^A_i(h ; h^A_j=1) = \phi^A_i(h ; h^A_j=0) + W^A_{ji}$.

And we'll abbreviate $\phi^A_i(h ; h_j=0)$ to $\phi^{Aj0}_i$.

We obtain (dropping the hidden bias $W^A_{j0}$, which would simply add on as in the vanilla case): $$\psi^A_j = \sum_i v_i \log \left( \frac{1+ e^{-\phi^{Aj0}_i - \phi^B_i}}{1+e^{-\phi^{Aj0}_i - W_{ji} -\phi^B_i}} \frac{1+ e^{\phi^{Aj0}_i + W_{ji} + \phi^B_i}}{1+e^{\phi_i^{Aj0} + \phi^B_i}}\right) \;\;+ \;\;\sum_i \log \left(\frac{1+e^{\phi_i^{Aj0} + W_{ji}}}{1+ e^{\phi_i^{Aj0}}} \frac{1+e^{\phi_i^{Aj0} + \phi^B_i}}{1+ e^{\phi_i^{Aj0} + W_{ji} + \phi^B_i}} \right)$$

Now $\phi = \log \frac{1+e^{\phi}}{1+e^{-\phi}}$ (Marcus' magic identity), which also equals $\log \frac{\sigma(\phi)}{\sigma(-\phi)}$. So the first term simplifies to $ \sum_i v_i W_{ji}$, which is the same as that in a "vanilla RBM".

The second term can also be simplified, using the identity $\log(1+e^\phi) = \phi - \log\sigma(\phi)$.
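
A quick numerical check of those two identities (a small sketch, not in the original):

In [ ]:
phi_test = np.linspace(-5, 5, 101)
print(np.allclose(np.log((1 + np.exp(phi_test)) / (1 + np.exp(-phi_test))), phi_test))  # Marcus' identity
print(np.allclose(np.log1p(np.exp(phi_test)),
                  phi_test - np.log(sigmoid(phi_test))))  # log(1+e^phi) = phi - log sigma(phi)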

This leads to the following Gibbs Sampler probability of the j-th hidden unit in network $A$ being 1: $p_j = \sigma(\psi_j^A)$ with

$$\psi_j^A = \underbrace{\sum_i W^A_{ji} v_i}_\text{vanilla RBM} \; + \; \underbrace{\sum_i C^A_{ji}}_\text{correction} $$

where $$ \begin{align} C^A_{ji} \; &= \;\log \bigg[ \frac{\sigma (\phi_i^{Aj0})}{\sigma (\phi_i^{Aj0} + W^A_{ji})} \cdot \frac{\sigma (\phi_i^{Aj0} + W_{ji}^A + \phi_i^B) }{\sigma (\phi_i^{Aj0} + \phi_i^B)} \bigg] \\ &= \log \sigma(\phi_i^{Aj0}) \; + \; \log \sigma (\phi_i^{Aj0} + W^A_{ji} + \phi_i^B) \;- \log \sigma (\phi_i^{Aj0} + W^A_{ji}) \; - \; \log \sigma ( \phi_i^{Aj0} + \phi_i^B) \\ &= \log \bigg[ \frac{\sigma(\phi_i^{A} - h^A_j W^A_{ji})}{\sigma (\phi_i^{A} + (1-h^A_j) W^A_{ji})} \bigg] \; - \; \log \bigg[ \frac{ \sigma ( \phi_i^{AB} - h^A_j W^A_{ji})}{\sigma (\phi_i^{AB} + (1-h^A_j) W^A_{ji})} \bigg] \end{align} $$ where $\phi_i^{AB} = \phi_i^{A} + \phi_i^{B}$.

Notice that, weirdly enough, $v$ plays no role in this correction!

It is clear that adding $\phi^B_i$ has introduced a dependency on the whole of $h$ (the correction involves $\phi^A_i$, which sums over every hidden unit), which is "a bit of a worry".
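
Putting the pieces together, here is a sketch (argument names are illustrative, and the hidden bias is dropped exactly as in the formula above) of $\psi^A_j$ computed for every hidden unit of network $A$ at once:

In [ ]:
def psi_A(v, hA, WA, b_vA, phiB):
    # v: (nv,), hA: (nh,), WA: (nh, nv); phiB: (nv,) is the input contributed by network B
    phiA = WA.T @ hA + b_vA                   # phi^A_i (includes A's visible bias)
    phiAB = phiA + phiB                       # phi^{AB}_i
    hW = hA[:, None] * WA                     # h^A_j W^A_ji, shape (nh, nv)
    C = (np.log(sigmoid(phiA - hW)) - np.log(sigmoid(phiA + WA - hW))
         - np.log(sigmoid(phiAB - hW)) + np.log(sigmoid(phiAB + WA - hW)))
    return WA @ v + C.sum(axis=1)             # vanilla term + summed correction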

thinking about that computation

I want to think of the correction "being computed somewhere" in the network - where? The natural place seems to be the connection itself.

Computation of $\psi^A_j$ seems to require the $ji$ connection knowing this:

  • $W^A_{ji}$, the current weight
  • $v_i$, the current output state

So far so good: the vanilla RBM needs those too, in order to compute the usual dot product. What is new is the requirement that the connection also needs to know...

  • $h^A_j$, the activity on the hidden end
  • $\phi^A_i$
  • $\phi^{AB}_i$

On the face of it this isn't so horrible. But it would be great if there were a simpler way to view this gruesome looking expression.

What does the correction look like?

Thought: the "correction" is all hinge functions and should have an approximation as "if../else.." piecewise linear regimes, and this ought to have an intuitive hand-wave type explanation that makes sense.

So let's plot the correction contours on axes $\phi^A_i$ versus $\phi^{AB}_i$, for (say) a positive weight $W^A_{ji}$. We'll need two plots for the two cases $h^A_j=0,1$.


In [39]:
def calc_psi_correction(phiA, phiAB, w, h):
    # note here the Phi is the full weighted sum into the visible node. We're explicitly taking care of the hidden activation too. 
    correction = np.log(sigmoid(phiA - h*w)/sigmoid(phiA + (1-h)*w)) - np.log(sigmoid(phiAB - h*w)/sigmoid(phiAB + (1-h)*w))
    return correction

reach = 40
phiA, phiAB = np.mgrid[-reach:reach:100j, -reach:reach:100j]
wgt = 3.0
levels = np.arange(-abs(wgt)-0.5, abs(wgt)+0.5, abs(wgt)/10.)
cmap = cm.RdYlGn

plt.subplot(121)
psi_correction_h0 = calc_psi_correction(phiA, phiAB, wgt, 0)
C = plt.contourf(phiA, phiAB, psi_correction_h0, levels, origin='lower', cmap=cm.get_cmap(cmap, len(levels)-1))
#plt.colorbar()
plt.title('$h_j=0$')
plt.xlabel(r'$\phi^A_i$ (left network)')
plt.ylabel(r'$\phi^{AB}_i$ (both networks)')
plt.plot([0,0], [-reach, reach],'-k', [-reach, reach], [0,0], '-k', [-reach, reach], [reach,-reach], '-k')
plt.axis([-reach, reach, -reach, reach])
#plt.axis('off')
plt.axis('equal')
plt.xticks([-reach/2,reach/2])
plt.yticks([-reach/2,reach/2])

plt.subplot(122)
psi_correction_h1 = calc_psi_correction(phiA, phiAB, wgt, 1)
plt.title('$h_j=1$')
C = plt.contourf(phiA, phiAB, psi_correction_h1, levels, origin='lower', cmap=cm.get_cmap(cmap, len(levels)-1))
plt.plot([0,0], [-reach, reach],'-k', [-reach, reach], [0,0], '-k', [-reach, reach], [reach,-reach], '-k')
plt.axis('off')
plt.axis('equal')
plt.xticks([-reach/2,reach/2])
plt.yticks([-reach/2,reach/2])

plt.savefig('correction.png', dpi=200)


Trying to find an approximation that is easier for a hidden unit to calculate

The true correction is composed of the sum of two terms of form $\log(\sigma(\phi)/\sigma(\phi+W))$.

The first term is exactly $\log(\sigma(\phi)/\sigma(\phi+W))$, which goes from being $-W$ at $\phi=-\infty$ to $0$ at $\phi=+\infty$. The second term is $\log(\sigma(\phi+\epsilon+W)/\sigma(\phi+\epsilon))$, which goes from being $W$ at $\phi=-\infty$ to $0$ at $\phi=+\infty$.

Each of these is "roughly sigmoid", so my approximation is going to use a sigmoid in its place. Focussing on the 1st term, the best sigmoid has its "switching point" at $\phi = -W/2$. It should also have a "gain" of $\beta = \frac{4(2\sigma(W/2)-1)}{W}$ if you want its slope at the switching point to match that of the true correction.

The 2nd term has switching point at $\phi^A = -\phi^B - W/2$, and so the whole approximation becomes... $$ \tilde{C}_{ji} = W^A_{ji} \bigg[ \sigma(\beta_{ji} \cdot (\phi^{A}_i - \hat{h}^A_j W^A_{ji})) \;\; - \;\; \sigma(\beta_{ji} \cdot (\phi_i^{AB} - \hat{h}^A_j W^A_{ji})) \bigg] $$

where $\hat{h} = h- \small{\frac{1}{2}}$.

Interestingly, the fact that the whole expression is multiplied by $W^A_{ji}$ means our correction can equivalently be thought of as a correction to $v_i$, the activity of the visible unit.
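
As a sketch, here is the approximate correction $\tilde{C}_{ji}$ written with the same arguments as `calc_psi_correction` above, so the two can be overlaid on the same contour axes:

In [ ]:
def calc_psi_correction_approx(phiA, phiAB, w, h):
    beta = 4 * (2 * sigmoid(w / 2.0) - 1) / w   # gain matching the slope at the switching point
    h_hat = h - 0.5
    return w * (sigmoid(beta * (phiA - h_hat * w)) - sigmoid(beta * (phiAB - h_hat * w)))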

With this approximation, computation of $\psi^A_j$ still requires the $ji$ connection knowing the same things, namely:

  • $W^A_{ji}$, the current weight
  • $v_i$, the current output state
  • $h^A_j$, the activity on the hidden end
  • $\phi^A_i$
  • $\phi^{AB}_i$

So one has to ask what the point is, other than the (possibly!) simpler story?


In [42]:
# testing the sigmoid approximation to a single term of the form log(sigmoid(phi)/sigmoid(phi+W))
W = -10.0
alpha = 4*(2*sigmoid(W/2)-1)/W # the gain (beta in the text) that matches the slope at the 'midpoint'
phi = np.linspace(-30, 30, 1001)
plt.plot(phi, np.log(sigmoid(phi)/sigmoid(phi+W)),'k')  # the true term
plt.plot(phi, -W*sigmoid(-alpha*(phi+W/2)),'b')         # the sigmoid approximation
print(alpha)


0.394645719261

The learning algorithm

Stochastic ascent of the log likelihood.

We have that $$ \log P^\star(h^A, h^B, v) = \log P^\star(h^A) + \log P^\star(h^B) + \log P(v \mid h^A, h^B) $$

The first term is: $ \log P^\star(h^A) = \sum_i \log (1 + e^{\phi^A_i}) = -\sum_i \log \sigma(-\phi^A_i)$ (dropping the hidden-bias term $\sum_j h^A_j W^A_{j0}$, which doesn't involve the weights).

Second is the same...

Third is $\sum_i v_i\log \sigma(\phi_i^\text{AB}) \; + \; (1-v_i) \log \sigma(- \phi_i^\text{AB}) $

So now we differentiate it w.r.t. some particular weight $W_{ml}^A$ like this:

First term: $\frac{\partial}{\partial W_{ml}^A} \log P^\star(h^A) = \sigma(\phi^A_l) h_m^A$, which is just the usual RBM Hebbian change with the visible activity replaced by its mean-field value $\sigma(\phi^A_l)$.

Second term: zero, since $\log P^\star(h^B)$ doesn't involve the $A$ weights.

Third term:

$\frac{\partial}{\partial W_{ml}^A} \log P(v \mid h^A, h^B) = (v_l-\sigma(\phi_l^\text{AB})) h_m^A$ which is "just the perceptron rule".
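
A finite-difference check of these derivatives (a sketch with made-up small sizes and illustrative names), comparing the analytic $\sigma(\phi^A_l) h_m^A + (v_l - \sigma(\phi_l^\text{AB})) h_m^A$ against a numerical derivative of the full log joint:

In [ ]:
nv, nh = 4, 3
WA, WB = rng.randn(nh, nv), rng.randn(nh, nv)
b_vA, b_vB = rng.randn(nv), rng.randn(nv)
b_hA, b_hB = rng.randn(nh), rng.randn(nh)
v = (rng.rand(nv) < 0.5) * 1.0
hA = (rng.rand(nh) < 0.5) * 1.0
hB = (rng.rand(nh) < 0.5) * 1.0

def log_p_star_all(WA):
    phiA, phiB = WA.T @ hA + b_vA, WB.T @ hB + b_vB
    phiAB = phiA + phiB
    return (b_hA @ hA + np.sum(np.log1p(np.exp(phiA)))       # log P*(h^A)
            + b_hB @ hB + np.sum(np.log1p(np.exp(phiB)))     # log P*(h^B)
            + np.sum(v * np.log(sigmoid(phiAB))
                     + (1 - v) * np.log(sigmoid(-phiAB))))   # log P(v | h^A, h^B)

m, l, eps = 1, 2, 1e-5
WA_plus, WA_minus = WA.copy(), WA.copy()
WA_plus[m, l] += eps
WA_minus[m, l] -= eps
numeric = (log_p_star_all(WA_plus) - log_p_star_all(WA_minus)) / (2 * eps)
phiA_l = WA[:, l] @ hA + b_vA[l]
phiAB_l = phiA_l + WB[:, l] @ hB + b_vB[l]
analytic = sigmoid(phiA_l) * hA[m] + (v[l] - sigmoid(phiAB_l)) * hA[m]
print(numeric, analytic)   # the two numbers should match closely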

so the learning algorithm is...

WAKE PHASE

use our Gibbs chain to get samples from $h$ for each $v \in \mathcal{D}$, and do

$\Delta W_{ji}^A \;\;\; \propto \;\;\; \underbrace{\sigma(\phi^A_i) h_j^A}_\text{mean field Hebbian} \; + \; \underbrace{(v_i - \sigma(\phi_i^\text{AB})) h_j^A}_\text{Perceptron learning rule}$

and similarly for the other RBM. Note that the $\phi$'s are not the same in the two terms: the first only sums input from the A side, but the second sums from both hidden layers.

SLEEP PHASE

use "regular" Gibbs sampling in each model separately to sample independently from the two RBMs, and do

$\Delta W_{ji}^A \;\;\; \propto \;\;\; - \sigma( \phi^{A}_i) \; h_j^A$

and similarly for the other RBM.
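
A sketch of the two updates above for network $A$ (illustrative names; `phiB` is B's total input to the visibles, and the hiddens are assumed already sampled by the appropriate chain):

In [ ]:
def wake_update_A(v, hA, WA, b_vA, phiB, lr=0.01):
    # hA sampled by the corrected Gibbs chain with v clamped
    phiA = WA.T @ hA + b_vA
    phiAB = phiA + phiB
    dW = np.outer(hA, sigmoid(phiA)) + np.outer(hA, v - sigmoid(phiAB))   # Hebbian + perceptron terms
    return WA + lr * dW

def sleep_update_A(hA_free, WA, b_vA, lr=0.01):
    # hA_free comes from "regular" Gibbs sampling in RBM A on its own
    phiA = WA.T @ hA_free + b_vA
    return WA - lr * np.outer(hA_free, sigmoid(phiA))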

notice the free phase is free

In the FREE PHASE the expected $\psi$ is simply: $$\bar{\psi}_j = \sum_i W_{ji}\, \sigma(\phi^A_i)$$ Does that mean that the free phase in this model corresponds to what it would be if the two RBMs were not even connected? Looks like it (although this is just an approximation, isn't it?). BUT OF COURSE IT DOES! In the "unrolled" version of the net, the visibles are not clamped in the free phase and hence there's no explaining away going on between the two RBMs: they're independent RBMs doing their own thing, in the free phase. It's only the wake phase that needs any update to the vanilla Gibbs $\psi$.

Thought: maybe even the wake phase could be implemented via samples instead of having to calculate and propagate those sigmoid floats? THINK ABOUT THIS SOME MORE.


In [41]:
import sympy as sp
x,y,z = sp.symbols('x y z')
# symbolic checks on log-ratio expressions of the form appearing in psi^A_j above;
# only the last expression's result is displayed by the notebook
sp.simplify(sp.log( (1+sp.exp(-x-y)) * (1+sp.exp(x+y+z)) / (1+sp.exp(-x-y-z)) / (1+sp.exp(x+y))) )
sp.simplify(sp.log( (1+sp.exp(x)) * (1+sp.exp(x+y+z)) / (1+sp.exp(x+y)) / (1+sp.exp(x+z)) ))


Out[41]:
log((exp(x) + 1)*(exp(x + y + z) + 1)/((exp(x + y) + 1)*(exp(x + z) + 1)))