Restricted Boltzmann Machine (RBM): RBMs are shallow neural nets that learn to reconstruct data by themselves in an unsupervised fashion.
Simply put, an RBM takes its inputs and translates them into a set of numbers that represents them. These numbers can then be translated back to reconstruct the inputs. Through several forward and backward passes, the RBM is trained, and a trained RBM can reveal which features are the most important ones for detecting patterns.
It can automatically extract meaningful features from a given input.
RBMs are useful for collaborative filtering, dimensionality reduction, classification, regression, feature learning, topic modeling, and even serve as building blocks of Deep Belief Networks.
The RBM is a generative model. What is a generative model?
First, let's look at the difference between discriminative and generative models:
Discriminative: Consider a classification problem in which we want to learn to distinguish between Sedan cars (y = 1) and SUV cars (y = 0), based on some features of a car. Given a training set, an algorithm like logistic regression tries to find a straight line (a decision boundary) that separates the SUVs from the Sedans. Generative: Looking at Sedans, we can build a model of what Sedan cars look like. Then, looking at SUVs, we can build a separate model of what SUV cars look like. Finally, to classify a new car, we match it against the Sedan model and against the SUV model, to see whether the new car looks more like an SUV or a Sedan.
Generative models specify a probability distribution over a dataset of input vectors. We can do both supervised and unsupervised tasks with generative models: in the unsupervised case we model p(x) directly, while in the supervised case we model p(x|y) for each class along with the prior p(y), and then use Bayes' rule to estimate p(y|x):
$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$$
Can we build a generative model, and then use it to create synthetic data by directly sampling from the modelled probability distributions? Let's see.
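As a minimal numeric sketch of that Bayes'-rule step (the likelihoods and priors below are made-up illustrative values, not estimated from any data), the two class-conditional models can be turned into a classifier like this:
# Hypothetical numbers: classify a car as Sedan (y = 1) or SUV (y = 0)
p_x_given_sedan = 0.30       # assumed likelihood of the observed features under the Sedan model
p_x_given_suv = 0.05         # assumed likelihood under the SUV model
p_sedan, p_suv = 0.4, 0.6    # assumed class priors

p_x = p_x_given_sedan * p_sedan + p_x_given_suv * p_suv   # law of total probability
p_sedan_given_x = p_x_given_sedan * p_sedan / p_x         # Bayes' rule
print(p_sedan_given_x)       # 0.8, so this car looks more like a Sedan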
In [1]:
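# Download utils.py, which provides the tile_raster_images helper used for visualization later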
from urllib import request
response = request.urlopen('http://deeplearning.net/tutorial/code/utils.py')
content = response.read()
with open('utils.py', 'wb') as f:
    f.write(content)
Now, we load in all the packages that we use to create the net including the TensorFlow package:
In [2]:
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
#import Image
from utils import tile_raster_images
import matplotlib.pyplot as plt
%matplotlib inline
We will be using the MNIST dataset to practice the usage of RBMs. The following cell loads the MNIST dataset.
In [3]:
mnist = input_data.read_data_sets("../../data/MNIST/", one_hot=True)
trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels
An RBM has two layers. The first layer of the RBM is called the visible (or input) layer. MNIST images have 784 pixels, so the visible layer must have 784 input nodes. The second layer is the hidden layer. Each hidden unit has a binary state, which we'll call $s_i$, and turns either on or off (i.e., $s_i = 1$ or $s_i = 0$) with a probability that is a logistic function of the input it receives from the $j$ visible units, written $p(s_i = 1)$. In our case, we'll use 500 nodes in the hidden layer.
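Concretely, writing $w_{ij}$ for the weight connecting visible unit $j$ to hidden unit $i$ and $hb_i$ for the bias of hidden unit $i$ (the same quantities defined below), the probability that hidden unit $i$ turns on is the logistic (sigmoid) function of its total input:
$$p(s_i = 1 \mid v) = \sigma\Big(hb_i + \sum_{j} w_{ij}\, v_j\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$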
Each node in the first layer also has a bias. We will denote the visible-layer biases by "vb"; vb is a vector holding one bias value for each of the 784 visible units.
Here we define the biases of the second layer as well. We will denote the hidden-layer biases by "hb", a vector holding one bias value for each of the 500 hidden units.
In [4]:
vb = tf.placeholder("float", [784])
hb = tf.placeholder("float", [500])
We have to define the weights between the input-layer and hidden-layer nodes. In the weight matrix, the number of rows equals the number of visible nodes, and the number of columns equals the number of hidden nodes. Let W be a tensor of shape 784x500 (784 visible neurons, 500 hidden neurons) that represents the weights between the neurons.
In [5]:
W = tf.placeholder("float", [784, 500])
Think of the RBM as a model that has already been trained: it can now calculate the probability of observing a case (e.g., a wet road) given some hidden/latent values (e.g., rain). That is, the RBM can be viewed as a generative model that assigns a probability to each possible binary state vector over its visible units (v).
What are the possible binary state vectors?
For example, if we have 784 units in the visible layer, the model defines a probability distribution over all ${2^{784}}$ possible visible vectors, i.e., p(v).
Now, it would be really useful if a model (after training) could calculate the probability of the visible layer given the hidden layer values.
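It can. By the symmetry of the RBM, the conditional of the visible units given the hidden units has the same sigmoid form (with $vb_j$ the bias of visible unit $j$):
$$p(v_j = 1 \mid h) = \sigma\Big(vb_j + \sum_{i} w_{ij}\, h_i\Big)$$
This is exactly the computation the reconstruction phase below performs.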
RBM has two phases: 1) Forward Pass, and 2) Backward Pass or Reconstruction:
Phase 1) Forward pass: Processing happens in each node of the hidden layer. That is, the input data from all visible nodes is passed to all hidden nodes. The computation begins by making stochastic decisions about whether or not to transmit that input (i.e., by determining the state of each hidden unit). At the hidden layer's nodes, X is multiplied by W and added to hb. The result of these two operations is fed into the sigmoid function, which produces the node's output/state. As a result, one output is produced for each hidden node. So, for each row in the training set, a tensor of probabilities is generated, which in our case is of size [1x500]; over the 55,000 training rows this gives h0 = [55000x500].
Then we take the tensor of probabilities (from the sigmoid activation) and draw samples from all of those distributions, h0. That is, we sample the hidden activation vector from the probability distribution of hidden-layer values. These samples are used to estimate the negative-phase gradient, which will be explained later.
In [6]:
X = tf.placeholder("float", [None, 784])
_h0= tf.nn.sigmoid(tf.matmul(X, W) + hb) #probabilities of the hidden units
h0 = tf.nn.relu(tf.sign(_h0 - tf.random_uniform(tf.shape(_h0)))) #sample_h_given_X
Before we go further, let's look at an example of sampling:
In [7]:
with tf.Session() as sess:
    a = tf.constant([0.7, 0.1, 0.8, 0.2])         # probabilities
    print(sess.run(a))
    b = sess.run(tf.random_uniform(tf.shape(a)))  # uniform random draws in [0, 1)
    print(b)
    print(sess.run(a - b))
    print(sess.run(tf.sign(a - b)))
    print(sess.run(tf.nn.relu(tf.sign(a - b))))   # 1 where a > b, 0 otherwise: a Bernoulli sample
Phase 2) Backward Pass (Reconstruction): The RBM reconstructs data by making several forward and backward passes between the visible and hidden layers.
So, in the second phase (i.e. reconstruction phase), the samples from the hidden layer (i.e. h0) play the role of input. That is, h0 becomes the input in the backward pass. The same weight matrix and visible layer biases are used to go through the sigmoid function. The produced output is a reconstruction which is an approximation of the original input.
In [8]:
_v1 = tf.nn.sigmoid(tf.matmul(h0, tf.transpose(W)) + vb)
v1 = tf.nn.relu(tf.sign(_v1 - tf.random_uniform(tf.shape(_v1)))) #sample_v_given_h
h1 = tf.nn.sigmoid(tf.matmul(v1, W) + hb)
Reconstruction steps: the hidden samples h0 are passed back through the transposed weight matrix plus the visible biases to get the reconstruction probabilities _v1, a reconstructed visible vector v1 is sampled from them, and the hidden probabilities h1 are then computed from this reconstruction.
In order to train an RBM, we have to maximize the product of probabilities assigned to the training set V (a matrix, where each row is treated as a visible vector v):
$$\arg\max_{W} \prod_{v \in V} P(v)$$
Or equivalently, maximize the expected log probability of V:
$$\arg\max_{W} \; \mathbb{E}\left[\sum_{v \in V} \log P(v)\right]$$
Equivalently, we can define the objective function as the average negative log-likelihood and try to minimize it. To achieve this, we need the partial derivatives of this function with respect to all of its parameters. It can be shown that the objective above is a function of the weights and biases, so minimizing it means modifying/optimizing the weight matrix W. We can therefore use stochastic gradient descent to find the optimal weights and, consequently, minimize the objective function. Taking the derivative gives us two terms, called the positive and negative gradient. These positive and negative phases reflect their effect on the probability density defined by the model. The positive term depends on the observations (X), and the negative term depends only on the model.
The positive phase increases the probability of the training data.
The negative phase decreases the probability of samples generated by the model.
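For the weights, these two phases correspond to the familiar expression for the log-likelihood gradient (written here in the notation used above, with $\langle \cdot \rangle$ denoting an average):
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \langle v_j\, h_i \rangle_{\text{data}} - \langle v_j\, h_i \rangle_{\text{model}}$$
The positive term can be computed directly from the training data, while the negative term requires samples from the model; in the code below, w_pos_grad and w_neg_grad estimate these two averages over a mini-batch.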
The negative phase is hard to compute, so we use a method called Contrastive Divergence (CD) to approximate it. It is designed so that at least the direction of the gradient estimate is somewhat accurate, even when its size is not. (In real-world models, more accurate techniques like CD-k or PCD are used to train RBMs.) During the calculation of CD, we use Gibbs sampling to sample from our model distribution.
Contrastive Divergence produces a matrix of values that is used to adjust the weight matrix W. Changing W incrementally leads to the training of the W values. On each step (epoch), W is updated to a new value W' through the equation below: $W' = W + \alpha \cdot CD$
What is Alpha?
Here, alpha is some small step rate and is also known as the "learning rate".
How can we calculate CD?
We can perform single-step Contrastive Divergence (CD-1) taking the following steps:
In the forward pass: we randomly set the value of each hidden unit $h_i$ to 1 with probability $\mathrm{sigmoid}(v W + hb)$.
In reconstruction: we randomly set the value of each visible unit $v_j$ to 1 with probability $\mathrm{sigmoid}(h W^{T} + vb)$.
In [9]:
alpha = 1.0
w_pos_grad = tf.matmul(tf.transpose(X), h0)
w_neg_grad = tf.matmul(tf.transpose(v1), h1)
CD = (w_pos_grad - w_neg_grad) / tf.to_float(tf.shape(X)[0])
update_w = W + alpha * CD
update_vb = vb + alpha * tf.reduce_mean(X - v1, 0)
update_hb = hb + alpha * tf.reduce_mean(h0 - h1, 0)
Goal: Maximize the likelihood of our data being drawn from that distribution
Calculate error:
In each epoch, we compute the error as the mean of the squared differences between the input data and its reconstruction,
i.e., the error shows how far the reconstruction is from the original data.
Note: tf.reduce_mean computes the mean of elements across dimensions of a tensor.
In [10]:
err = tf.reduce_mean(tf.square(X - v1))
Let's start a session and initialize the variables:
In [11]:
cur_w = np.zeros([784, 500], np.float32)
cur_vb = np.zeros([784], np.float32)
cur_hb = np.zeros([500], np.float32)
prv_w = np.zeros([784, 500], np.float32)
prv_vb = np.zeros([784], np.float32)
prv_hb = np.zeros([500], np.float32)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
Let's look at the error of the first run:
In [12]:
sess.run(err, feed_dict={X: trX, W: prv_w, vb: prv_vb, hb: prv_hb})
Out[12]:
To recall, the whole training procedure works as follows: for each mini-batch we compute the CD update from a forward pass and a reconstruction, update W, vb, and hb, and repeat for several epochs while tracking the reconstruction error:
In [13]:
#Parameters
epochs = 5
batchsize = 100
weights = []
errors = []
for epoch in range(epochs):
    for start, end in zip(range(0, len(trX), batchsize), range(batchsize, len(trX), batchsize)):
        batch = trX[start:end]
        cur_w = sess.run(update_w, feed_dict={X: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
        cur_vb = sess.run(update_vb, feed_dict={X: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
        cur_hb = sess.run(update_hb, feed_dict={X: batch, W: prv_w, vb: prv_vb, hb: prv_hb})
        prv_w = cur_w
        prv_vb = cur_vb
        prv_hb = cur_hb
        if start % 10000 == 0:
            errors.append(sess.run(err, feed_dict={X: trX, W: cur_w, vb: cur_vb, hb: cur_hb}))
            weights.append(cur_w)
    print('Epoch: %d' % epoch, 'reconstruction error: %f' % errors[-1])
plt.plot(errors)
plt.xlabel("Batch Number")
plt.ylabel("Error")
plt.show()
What is the last weight after training?
In [14]:
uw = weights[-1].T
print(uw) # a weight matrix of shape (500,784)
We can take each hidden unit and visualize the connections between that hidden unit and each element in the input vector.
Let's plot the current weights: tile_raster_images helps generate an easy-to-grasp image from a set of samples or weights. It transforms uw (with one flattened image per row of size 784) into an array (with a $25 \times 20$ tile layout) in which the images are reshaped and laid out like tiles on a floor.
In [15]:
tile_imgs = tile_raster_images(X=cur_w.T, img_shape=(28, 28), tile_shape=(25, 20), tile_spacing=(1, 1))
### Plot image
plt.rcParams['figure.figsize'] = (18.0, 18.0)
imgplot = plt.imshow(tile_imgs)
imgplot.set_cmap('gray')
Each tile in the above visualization corresponds to a vector of connections between a hidden unit and visible layer's units.
Let's look at one of the learned weights, corresponding to one of the hidden units. In this particular square, gray represents a weight of 0; the whiter a pixel is, the more positive the weight, and conversely, the darker a pixel is, the more negative the weight. Positive pixels increase the probability of activation of the hidden unit (after being multiplied by the input/visible pixels), and negative pixels decrease the probability of the hidden unit being 1 (activated). So why is this important? It shows that this specific square (hidden unit) can detect a feature (e.g., a "/" shape) if it exists in the input.
In [16]:
from skimage import io
img = tile_raster_images(X=cur_w.T[10:11], img_shape=(28, 28),tile_shape=(1, 1), tile_spacing=(1, 1))
### Plot image
plt.rcParams['figure.figsize'] = (4.0, 4.0)
imgplot = io.imshow(img)
imgplot.set_cmap('gray')
Let's look at the reconstruction of an image.
First, we plot one of the images:
In [17]:
sample_case = trX[1:2]
img = tile_raster_images(X=sample_case, img_shape=(28, 28),tile_shape=(1, 1), tile_spacing=(1, 1))
plt.rcParams['figure.figsize'] = (2.0, 2.0)
imgplot = io.imshow(img)
imgplot.set_cmap('gray') #you can experiment different colormaps (Greys,winter,autumn)
Now let's pass this image through the net:
In [18]:
hh0 = tf.nn.sigmoid(tf.matmul(X, W) + hb)
vv1 = tf.nn.sigmoid(tf.matmul(hh0, tf.transpose(W)) + vb)
feed = sess.run(hh0, feed_dict={ X: sample_case, W: prv_w, hb: prv_hb})
rec = sess.run(vv1, feed_dict={ hh0: feed, W: prv_w, vb: prv_vb})
Here we plot the reconstructed image:
In [19]:
img = tile_raster_images(X=rec, img_shape=(28, 28),tile_shape=(1, 1), tile_spacing=(1, 1))
plt.rcParams['figure.figsize'] = (2.0, 2.0)
imgplot = io.imshow(img)
imgplot.set_cmap('gray')
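As a quick follow-up (a sketch that reuses only the tensors and arrays already defined above), we can also evaluate the same reconstruction error on the held-out test images with the trained parameters:
# reconstruction error on the test set, using the trained weights and biases
test_err = sess.run(err, feed_dict={X: teX, W: prv_w, vb: prv_vb, hb: prv_hb})
print('test reconstruction error: %f' % test_err)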
Created by: Saeed Aghabozorgi, Gabriel Garcez Barros Souza