Here you can find a Conditional Restricted Boltzmann Machine (CRBM), implemented in R as a plain, step-by-step algorithm, for academic and educational purposes.
@authors Josep Ll. Berral, David Buchaca (Barcelona Supercomputing Center)
@date 1st December, 2016
We use the following helper functions to initialize matrices (the weight matrices are initialized with random values), to perform Bernoulli sampling when deciding the activation of the hidden units, and to pass values through a sigmoid function when producing the activation outputs.
In [1]:
## Function to produce Normal Samples
sample_normal <- function(dims, mean = 0, sd = 1)
{
array(rnorm(n = prod(dims), mean = mean, sd = sd), dims);
}
## Function to produce Bernoulli Samples
sample_bernoulli <- function(mat)
{
dims <- dim(mat);
array(rbinom(n = prod(dims), size = 1, prob = c(mat)), dims);
}
## Function to produce the Sigmoid
sigmoid_func <- function(mat)
{
1 / (1 + exp(-mat));
}
## Operator to add dimension-wise vectors to matrices
`%+%` <- function(mat, vec)
{
retval <- NULL;
tryCatch(
expr = { retval <- if (dim(mat)[2] == length(vec)) t(t(mat) + vec) else mat + vec; }, # a vec matching ncol(mat) is added to every row
warning = function(w) { print(paste("WARNING: ", w, sep = "")); },
error = function(e) { print(paste("ERROR: Cannot sum mat and vec", e, sep = "\n")); }
);
retval;
}
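As a quick illustration (a minimal sketch with made-up sizes, not part of the original code), the helpers can be exercised on a small random matrix; note how %+% adds a bias vector to every row:
## Quick check of the helper functions (illustrative only)
set.seed(1);
m <- sample_normal(c(3, 4));      # 3 x 4 matrix of N(0, 1) samples
p <- sigmoid_func(m);             # values squashed into (0, 1)
s <- sample_bernoulli(p);         # binary samples, P(s[i,j] = 1) = p[i,j]
b <- c(10, 20, 30, 40);           # a "bias" vector, one value per column
m %+% b;                          # b is added to every row of m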
Functions for creating and operating CRBMs:
This is the constructor of the CRBM. It initializes the CRBM either with random and blank values, or with parameters passed by the caller. To create a new CRBM, we skip the parameters and get a freshly initialized model. To reassemble a previously trained CRBM from its weight matrices and bias vectors, we pass them as parameters.
The principal elements of the CRBM are the weight matrices W, B and A, and the bias vectors hbias and vbias:
The CRBM produces Hidden Unit activations from Visible Inputs through $(vis \cdot W + vhist \cdot B + hbias)$, and reconstructs the Visible Units from Hidden activations through $(hid \cdot W^{T} + vhist \cdot A + vbias)$. This way, Visible Inputs go through the "network" producing activations, and activations can be passed back through the "network" to reconstruct the Inputs. The difference between the real Inputs and the reconstructed Inputs is the error, or loss.
The matrices W, B and A, and the vectors hbias and vbias, are initialized as random or blank, and modified using the Gradient Descent technique according to the loss obtained at each iteration.
Input comes from the Visible Units (the input has nvis features at time t), plus the previous history (the history has nvis features for each of t-1 to t-d, where d is the history window size). So Input is a $1 \times nvis$ matrix and History is a $nvis \times d$ matrix. Since we process data in batches, Input becomes $batch\_size \times nvis$ and History becomes $batch\_size \times nvis \times d$. In order to work with two-dimensional matrices only, we flatten History into $batch\_size \times nvis \cdot d$, so each row holds the history of the corresponding batch element (a vector of $nvis \cdot d$ elements).
The Output (the hidden units) will then be a matrix with one row per batch element and one column per hidden unit. So Activations will be a matrix $batch\_size \times nhid$.
Then, W transforms Input into Activations (partially), so W is a matrix of shape $nvis \times nhid$. Also, B transforms History into Activations (partially), so B is a matrix of shape $nvis \cdot d \times nhid$. Finally, A transforms History into Inputs (partially) for the reconstruction, so A is a matrix of shape $nvis \cdot d \times nvis$. Note that the transpose of W lets us transform Activations back into Inputs (partially).
Summary of Matrices:
Input : $batch\_size \times nvis$
History : $batch\_size \times nvis \cdot d$
W : $nvis \times nhid$
B : $nvis \cdot d \times nhid$
A : $nvis \cdot d \times nvis$
Activations : $batch\_size \times nhid$
Further, when computing the Activations we must add the bias to each row of the result, so hbias has to be a vector of size $nhid$. The %+% operator defined above adds such a vector to every row of a matrix; otherwise, hbias would have to be expanded into a matrix with the same vector repeated $batch\_size$ times.
The same happens when reconstructing the Input from the Activations and the History: a vbias vector of size $nvis$ must be added to the resulting reconstruction.
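To make the shapes concrete, here is a minimal sketch (using the same default sizes as the constructor below: nvis = 49, nhid = 100, d = 6, batch_size = 100; all values are random and purely illustrative):
## Illustrative shapes of the matrices involved
nvis <- 49; nhid <- 100; d <- 6; batch_size <- 100;
input   <- sample_normal(c(batch_size, nvis));        # batch_size x nvis
history <- sample_normal(c(batch_size, nvis * d));    # batch_size x (nvis * d), flattened
W <- 0.01 * sample_normal(c(nvis, nhid));             # nvis x nhid
B <- 0.01 * sample_normal(c(nvis * d, nhid));         # (nvis * d) x nhid
A <- 0.01 * sample_normal(c(nvis * d, nvis));         # (nvis * d) x nvis
hbias <- rep(0, nhid); vbias <- rep(0, nvis);
activations <- sigmoid_func((input %*% W + history %*% B) %+% hbias);
dim(activations);                                     # batch_size x nhid = 100 x 100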
In [2]:
## Conditional Restricted Boltzmann Machine (CRBM). Constructor
create_crbm <- function (n_visible = 49, n_hidden = 100, delay = 6, A = NULL, B = NULL, W = NULL,
hbias = NULL, vbias = NULL, batch_size = 100)
{
if (is.null(W)) W <- 0.01 * sample_normal(c(n_visible, n_hidden));
if (is.null(A)) A <- 0.01 * sample_normal(c(n_visible * delay, n_visible));
if (is.null(B)) B <- 0.01 * sample_normal(c(n_visible * delay, n_hidden));
if (is.null(hbias)) hbias <- rep(0, n_hidden);
if (is.null(vbias)) vbias <- rep(0, n_visible);
velocity <- list(W = array(0, dim(W)), A = array(0, dim(A)), B = array(0, dim(B)),
v = rep(0, length(vbias)), h = rep(0, length(hbias)));
list(n_visible = n_visible, n_hidden = n_hidden, delay = delay, W = W,
A = A, B = B, hbias = hbias, vbias = vbias, velocity = velocity,
batch_size = batch_size);
}
This is the free energy function. We will not use it for anything more than displaying its value. It implements the following:
$$wxb \leftarrow (visible\_state \cdot W + visible\_history \cdot B) + hidden\_bias$$
$$hidden\_term \leftarrow \sum_{j = 1}^{nhid} \log(1 + e^{wxb_j})$$
$$axb \leftarrow (visible\_history \cdot A) + visible\_bias$$
$$visible\_term \leftarrow \sum_{i = 1}^{nvis} \frac{1}{2} (visible\_state_i - axb_i)^2$$
$$free\_energy \leftarrow visible\_term - hidden\_term$$
In [3]:
## Function to compute the free energy of a sample conditional on the history
free_energy_crbm <- function(crbm, visible_state, v_history)
{
wx_b <- (visible_state %*% crbm$W + v_history %*% crbm$B) %+% crbm$hbias;
ax_b <- (v_history %*% crbm$A) %+% crbm$vbias;
visible_term <- rowSums(0.5 * `^`(visible_state - ax_b,2));
hidden_term <- rowSums(log(1 + exp(wx_b)));
(visible_term - hidden_term);
}
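For example (a minimal sketch on random data, with a freshly created CRBM), the free energy can be evaluated on a small batch; it returns one value per row:
## Illustrative use of the free energy on random data
crbm.tmp <- create_crbm(n_visible = 49, n_hidden = 100, delay = 6, batch_size = 10);
vis.tmp  <- sample_normal(c(10, 49));
hist.tmp <- sample_normal(c(10, 49 * 6));
free_energy_crbm(crbm.tmp, vis.tmp, hist.tmp);   # vector of 10 free energy values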
The next couple of functions implement the generation of hidden unit activations given visible values, and the reconstruction of visible values from hidden activations. The learning process will use these functions to pass visible values forward and backward, in order to compute the loss and the gradients of the activation-reconstruction process.
As said before, activation (the forward pass) is done by propagating the inputs and the history towards the hidden units. The transformation uses Input and History, through W and B (plus hbias), obtaining the Activations.
$$pre\_activations = input \cdot W + history \cdot B + hbias$$
In order to correctly produce the activations, we pass these values through a sigmoid function: the sigmoid squashes each pre-activation into the range $(0, 1)$, so each value can be read as the probability of the corresponding binary hidden unit being active given the visible input and its history.
$$mean\_activations = sigmoid(pre\_activations)$$
Finally, instead of keeping for each unit its mean activation, we can generate a sample of "fully activated" and "fully deactivated" hidden units by performing a Bernoulli sampling. Each $mean\_activation$ is used as the probability of the corresponding hidden unit being activated or not. The result is a fully binarized set of activations, conditioned on the probability of each unit becoming active (the "expected activation" obtained through the "mean activation").
$$sample\_activations = bernoulli(mean\_activations)$$
Once we have the activations for a given batch of input data, we can perform the reconstruction (backward) process by passing the activations and the history back towards the input data. The transformation uses Activations and History, through t(W) and A (plus vbias), obtaining the reconstructed Input.
$$mean\_reconstruction = activation \cdot W^{T} + history \cdot A + vbias$$
The visible units are real-valued, so no sampling is needed here; we simply take the mean as the sample:
$$sample\_reconstruction = mean\_reconstruction$$
In [4]:
## This function infers the state of the hidden units given the visible units
sample_h_given_v_crbm <- function(crbm, visible_state, v_history)
{
h.mean <- sigmoid_func((visible_state %*% crbm$W + v_history %*% crbm$B) %+% crbm$hbias);
h.sample <- sample_bernoulli(h.mean);
list(mean = h.mean, sample = h.sample);
}
## This function infers state of visible units given hidden units
sample_v_given_h_crbm <- function(crbm, hidden_state, v_history)
{
v.mean <- (hidden_state %*% t(crbm$W) + v_history %*% crbm$A) %+% crbm$vbias;
v.sample <- v.mean;
list(mean = v.mean, sample = v.sample);
}
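As a usage sketch (again with random data and hypothetical sizes), one forward pass followed by one backward pass looks like this, and the reconstruction error can be measured directly:
## Illustrative forward and backward pass on random data
crbm.tmp <- create_crbm(n_visible = 49, n_hidden = 100, delay = 6, batch_size = 10);
vis.tmp  <- sample_normal(c(10, 49));
hist.tmp <- sample_normal(c(10, 49 * 6));
h.tmp <- sample_h_given_v_crbm(crbm.tmp, vis.tmp, hist.tmp);              # activations: 10 x 100
v.tmp <- sample_v_given_h_crbm(crbm.tmp, h.tmp[["sample"]], hist.tmp);    # reconstruction: 10 x 49
mean(rowSums((vis.tmp - v.tmp[["mean"]])^2));                             # reconstruction error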
The "forward - backward" process must be done iteratively, in order to see how our matrices W, B and A (also bias vectors hbias and vbias) degrade our input after repeated "activation - reconstruction" processes. This is done using the Contrastive Divergence method, where we perform this process $k$ times, and then we compute the Gradient Descent.
The process performs a positive phase, where a forward is performed. Then we iterate "backward - forward" (negative phase) $k$ times.
The key elements at the end of the process are the positive-phase hidden activations (obtained from the real input), the negative-phase visible reconstruction after $k$ steps, and the negative-phase hidden activations (obtained from that reconstruction).
Gradients are computed for each matrix and bias vector as follows (matching the code below):
$$\Delta W \leftarrow input^{T} \cdot pos\_hidden\_mean - neg\_visible^{T} \cdot neg\_hidden\_mean$$
$$\Delta A \leftarrow history^{T} \cdot (input - neg\_visible)$$
$$\Delta B \leftarrow history^{T} \cdot (pos\_hidden\_mean - neg\_hidden\_mean)$$
$$\Delta vbias \leftarrow \sum_{rows} (input - neg\_visible)$$
$$\Delta hbias \leftarrow \sum_{rows} (pos\_hidden\_mean - neg\_hidden\_mean)$$
Remember that all the deltas are scaled by the learning rate and divided by the batch size. We can also add an inertia factor (momentum), by storing the previous update and reusing it in the next iteration with a certain weight. Finally, we add the resulting updates to our matrices and vectors.
Computing the reconstruction error lets us check how well the training is improving epoch after epoch. Here we compute the cost as
$$cost = \frac{1}{nrows} \sum_{i = 1}^{nrows} \sum_{j = 1}^{ncols} (input_{ij} - neg\_mean\_reconstruction_{ij})^2$$
that is, the squared error between the inputs and the last reconstruction, summed over the features and averaged over the batch.
In [5]:
## This function implements one step of CD-k
## param input: matrix input from batch data (n_seq x n_vis)
## param input_hist: matrix input_history from batch data (n_seq x (n_vis * delay))
## param lr: learning rate used to train the RBM
## param k: number of Gibbs steps to do in CD-k
## param momentum: value for momentum coefficient on learning
## We assume sigma = 1 when computing deltas
get_cost_updates_crbm <- function(crbm, input, input_history, lr, k = 1, momentum = 0.1)
{
# compute positive phase (awake)
ph <- sample_h_given_v_crbm(crbm, input, input_history);
# perform negative phase (asleep)
nh <- ph;
for (i in 1:k)
{
nv <- sample_v_given_h_crbm(crbm, nh[["sample"]], input_history);
nh <- sample_h_given_v_crbm(crbm, nv[["sample"]], input_history);
}
# cost <- mean(free_energy_crbm(crbm, input, input_history)) -
# mean(free_energy_crbm(crbm, nv[["sample"]], input_history));
# determine gradients on CRBM parameters
Delta_W <- t(input) %*% ph[["mean"]] - t(nv[["sample"]]) %*% nh[["mean"]];
Delta_v <- colSums(input - nv[["sample"]]);
Delta_h <- colSums(ph[["mean"]] - nh[["mean"]]);
Delta_A <- t(input_history) %*% input - t(input_history) %*% nv[["sample"]];
Delta_B <- t(input_history) %*% ph[["mean"]] - t(input_history) %*% nh[["mean"]];
crbm$velocity[["W"]] <- crbm$velocity[["W"]] * momentum + lr * Delta_W / crbm$batch_size;
crbm$velocity[["A"]] <- crbm$velocity[["A"]] * momentum + lr * Delta_A / crbm$batch_size;
crbm$velocity[["B"]] <- crbm$velocity[["B"]] * momentum + lr * Delta_B / crbm$batch_size;
crbm$velocity[["v"]] <- crbm$velocity[["v"]] * momentum + lr * Delta_v / crbm$batch_size;
crbm$velocity[["h"]] <- crbm$velocity[["h"]] * momentum + lr * Delta_h / crbm$batch_size;
# update weights
crbm$W <- crbm$W + crbm$velocity[["W"]];
crbm$A <- crbm$A + crbm$velocity[["A"]];
crbm$B <- crbm$B + crbm$velocity[["B"]];
crbm$vbias <- crbm$vbias + crbm$velocity[["v"]];
crbm$hbias <- crbm$hbias + crbm$velocity[["h"]];
# approximation to the reconstruction error: sum over dimensions, mean over cases
list(crbm = crbm, recon = mean(rowSums(`^`(input - nv[["mean"]],2))));
}
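As a usage sketch (random data, hypothetical sizes), a single CD-1 update can be applied and the reconstruction error inspected:
## Illustrative single CD-1 update on random data
crbm.tmp  <- create_crbm(n_visible = 49, n_hidden = 100, delay = 6, batch_size = 10);
input.tmp <- sample_normal(c(10, 49));
hist.tmp  <- sample_normal(c(10, 49 * 6));
aux.tmp <- get_cost_updates_crbm(crbm.tmp, input.tmp, hist.tmp, lr = 1e-3, k = 1, momentum = 0.1);
crbm.tmp <- aux.tmp$crbm;   # CRBM with updated parameters
aux.tmp$recon;              # reconstruction error for this batch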
Functions to train a CRBM from a loaded DataSet:
This process consists of taking batches of data and passing them through the Contrastive Divergence step seen before, until we reach a given number of epochs.
Notice that our input dataset can contain several sequences, so when we construct the history of a given sample we must take care not to mix frames belonging to a different sequence.
Also, feeding the data in the exact order we received it (usually ordered in time within each sequence) may be harmful for learning, so we shuffle the order in which samples are picked when building the batches (without shuffling the data itself, or we would lose the history information).
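For example (a minimal sketch with two made-up sequence lengths), the valid starting indices skip the first delay frames of each sequence, and only the order in which those indices are visited is shuffled:
## Illustrative construction of valid (shuffled) starting indices
seqlen.tmp <- c(10, 8);   # two sequences: frames 1-10 and 11-18
delay.tmp  <- 3;
idx.tmp <- NULL; last <- 1;
for (s in seqlen.tmp)
{
idx.tmp <- c(idx.tmp, (last + delay.tmp):(last + s - 1));   # only frames with a full history
last <- last + s;
}
idx.tmp;           # 4..10 and 14..18 (frames 1-3 and 11-13 lack a full history)
sample(idx.tmp);   # shuffled order used to build the batches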
The result of the training is the CRBM, that is, the collection of weight matrices and bias vectors W, B, A, hbias and vbias.
In [6]:
## Function to train the CRBM
## param dataset: loaded dataset <batchdata, seqlen, data_mean, data_std>, e.g. from load_data()
## param learning_rate: learning rate used for training the CRBM
## param momentum: momentum coefficient for the parameter updates
## param training_epochs: number of epochs used for training
## param batch_size: size of a batch used to train the CRBM
## param n_hidden: number of hidden units
## param delay: size d of the history window
## param rand_seed: random seed, for reproducibility
## param init_A, init_B, init_W, init_hbias, init_vbias: optional initial parameters of a former CRBM
train_crbm <- function (dataset, learning_rate = 1e-3, momentum = 0.5, training_epochs = 300,
batch_size = 100, n_hidden = 100, delay = 6, rand_seed = 1234,
init_A = NULL, init_B = NULL, init_W = NULL, init_hbias = NULL, init_vbias = NULL)
{
set.seed(rand_seed);
# prepare indexes for dataset
batchdata <- dataset$batchdata;
seqlen <- dataset$seqlen;
# compute number of minibatches for training, validation and testing
n_train_batches <- ceiling(nrow(batchdata) / batch_size);
n_dim <- ncol(batchdata);
# valid starting indices
batchdataindex <- NULL;
last <- 1;
for (s in seqlen)
{
batchdataindex <- c(batchdataindex, (last + delay):(last + s - 1));
last <- last + s;
}
permindex <- batchdataindex[sample(1:length(batchdataindex),length(batchdataindex))];
# construct the CRBM object
crbm <- create_crbm(n_visible = n_dim, n_hidden = n_hidden, delay = delay, batch_size = batch_size,
A = init_A, B = init_B, W = init_W, hbias = init_hbias, vbias = init_vbias);
start_time <- Sys.time();
# go through the training epochs
for (epoch in 1:training_epochs)
{
# go through the training set
mean_cost <- NULL;
for (batch_index in 1:n_train_batches)
{
# linear index to the starting frames for this batch
idx.aux.ini <- (((batch_index - 1) * batch_size) + 1);
idx.aux.fin <- (batch_index * batch_size);
if (idx.aux.fin > length(permindex)) break;
data_idx <- permindex[idx.aux.ini:idx.aux.fin];
# linear index to the frames at each delay tap
hist_idx <- c(t(sapply(1:delay, function(x) data_idx - x)));
# update the CRBM parameters
input <- batchdata[data_idx,];
input_history <- t(array(c(t(batchdata[hist_idx,])), c(delay * n_dim, batch_size)));
# get the cost and the gradient corresponding to one step of CD-k
aux <- get_cost_updates_crbm(crbm, input, input_history, lr = learning_rate, momentum = momentum, k = 1);
this_cost <- aux$recon;
crbm <- aux$crbm;
mean_cost <- c(mean_cost, this_cost);
}
if (epoch %% 50 == 1) print(paste('Training epoch ',epoch,', cost is ',mean(mean_cost, na.rm = TRUE),sep=""));
}
end_time <- Sys.time();
print(paste('Training took', (end_time - start_time), sep = " "));
class(crbm) <- c("crbm", class(crbm));
crbm;
}
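If the Motion file is not at hand, the trainer can be smoke-tested on synthetic data (a hypothetical toy dataset, not part of the original example): any list with a batchdata matrix and a seqlen vector will do.
## Hypothetical smoke test with synthetic data (a noisy sine wave with 3 features)
t.axis <- seq(0, 20 * pi, length.out = 600);
toy.data <- cbind(sin(t.axis), cos(t.axis), sin(2 * t.axis)) + 0.1 * sample_normal(c(600, 3));
toy.dataset <- list(batchdata = scale(toy.data), seqlen = c(300, 300),
data_mean = colMeans(toy.data), data_std = apply(toy.data, 2, sd));
toy.crbm <- train_crbm(toy.dataset, n_hidden = 20, delay = 6, training_epochs = 50, batch_size = 50);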
Functions to predict a sequence from a CRBM:
When we predict values, we just apply the "input to activation" process by performing a forward pass:
$$pre\_activations = input \cdot W + history \cdot B + hbias$$
$$mean\_activations = sigmoid(pre\_activations)$$
$$sample\_activations = bernoulli(mean\_activations)$$
Here too we can perform Gibbs sampling, that is, iterate the "forward - backward" process to see which activations result after $k$ iterations (as we did in the Contrastive Divergence method).
To predict several steps ahead of our current data, we predict the activations (and the reconstruction) at time $t + 1$, then push sample $t$ into the history and use sample $t + 1$ as the new input. We would expect the predictions to degrade in the long term, but experiments show that the trends of the time series are maintained well enough.
In [7]:
## Given initialization(s) of visibles and matching history, generate n_samples in future.
## orig_data : n_seq by n_visibles array, initialization for first frame
## orig_history : n_seq by delay * n_visibles array, delay-step history
## n_samples : int, number of samples to generate forward
## n_gibbs : int, number of alternating Gibbs steps per iteration
forecast_crbm <- function(crbm, orig_data, orig_history, n_samples, n_gibbs = 30)
{
n_seq <- nrow(orig_data);
persistent_vis_chain <<- orig_data;
persistent_history <<- orig_history;
# construct the function that implements our persistent chain.
sample_fn <- function(crbm, n_gibbs)
{
vis_sample <- persistent_vis_chain;
v_history <- persistent_history;
vis_mf <- NULL;
for (k in 1:n_gibbs)
{
hid <- sample_h_given_v_crbm(crbm, vis_sample, v_history);
vis <- sample_v_given_h_crbm(crbm, hid[["sample"]], v_history);
vis_mf <- vis[["mean"]];
vis_sample <- vis[["sample"]];
}
# add to updates the shared variable that takes care of our persistent chain
persistent_vis_chain <<- vis_sample;
persistent_history <<- cbind(vis_sample, persistent_history[,1:((crbm$delay - 1) * crbm$n_visible), drop = FALSE]);
vis_mf;
}
generated_series <- array(0,c(n_seq, n_samples, crbm$n_visible));
for (t in 1:n_samples)
{
#if (t %% 10 == 1) print(paste("Generating frame ", t, " to ", min(t+9, n_samples), sep = ""));
generated_series[,t,] <- sample_fn(crbm, n_gibbs);
}
generated_series;
}
We now test the CRBM with a fragment of the Motion dataset, from Eugene Hsu's "Styles of Human Motion": a motion-capture dataset of different styles of walking, with 108 features per frame. We do not use the full dataset, but the fragment that Graham Taylor used in his work when validating CRBMs.
As the source of the data is a MATLAB file, we previously converted it into an RDS file, so it can be read directly from R.
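The conversion can be done once with the R.matlab package (a hedged sketch: the package call, the paths and the internal structure of the .mat file are assumptions; load_data below expects the resulting list to contain a 'Motion' cell array):
## Hypothetical one-off conversion of motion.mat into motion.rds (assumes the R.matlab package)
library(R.matlab);
mat_dict <- readMat("../datasets/motion.mat");   # read the MATLAB file into an R list
saveRDS(mat_dict, "../datasets/motion.rds");     # store it in R Data Serialization format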
Functions to load data from a file name (for Motion example):
This function reads motion.rds and returns a list with: batchdata (the normalized data matrix, one row per frame), seqlen (the length of each sequence), and data_mean and data_std (the per-feature mean and standard deviation used for the normalization).
In [8]:
## Function to Load and Preprocess the Motion Example Data.
## :param filename: The URI of the motion.rds file with the matrix structure from Motion
## :return list: <batchdata, seqlen, data_mean, data_std>
## Source: TODO - motion.rds (version of motion.mat converted into R Data Storage format)
load_data <- function(filename)
{
mat_dict <- readRDS(filename);
Motion <- mat_dict[['Motion']];
n_seq <- length(Motion);
# assume data is MIT format for now
indx <- c(1:9, 14, 19:21, 26, 31:33, 38, 43:45, 50, 55:57, 61:63, 67:69, 73:75,
79:81, 85:87, 91:93, 97:99, 103:105);
##row1 <- Motion[[c(1,1)]][1,];
##offsets <- array(row1[c(10:12, 16:18, 22:24, 28:30, 34:36, 40:42, 46:48, 52:54,
## 58:60, 64:66, 70:72, 76:78, 82:84, 88:90, 94:96, 100:102,
## 106:108)], c(3, length(row1)/3) );
# collapse sequences
batchdata <- rbind(Motion[[c(1,1)]][,indx],
Motion[[c(2,1)]][,indx],
Motion[[c(3,1)]][,indx]);
data_mean <- colMeans(batchdata);
data_std <- apply(batchdata, 2, sd);
batchdata <- t((t(batchdata) - data_mean) / data_std);
# get sequence lengths
seqlen <- sapply(1:3, function(x) nrow(Motion[[c(x,1)]]));
list(batchdata = batchdata, seqlen = seqlen, data_mean = data_mean, data_std = data_std);
}
In [9]:
# Load data and create the CRBM
dataset <- load_data('../datasets/motion.rds'); # List <batchdata, seqlen, data_mean, data_std>
crbm <- train_crbm(dataset); # Trained CRBM
Once trained, we can feed the CRBM some data sequences to test how they are forecast.
In [10]:
# Generate some sequences (in parallel) from CRBM
# Using training data as initialization
batchdata <- dataset$batchdata; # DIMS - Motion.mat (3826, 49)
# pick some starting points for each sequence
data_idx <- c(100, 200, 400, 600);
orig_data <- batchdata[data_idx,]; # DIMS - (length(data_idx), 49)
hist_idx <- c(sapply(data_idx, function(x) x - 1:crbm$delay));
orig_history <- t(array(as.vector(t(batchdata[hist_idx,])), c(crbm$delay * crbm$n_visible, length(data_idx))));
In [11]:
generated_series.aux <- forecast_crbm(crbm, orig_data, orig_history, n_samples = 100, n_gibbs = 30);
# append initialization
library(abind)
oh.temp <- aperm(array(as.vector(orig_history), c(length(data_idx), crbm$n_visible, crbm$delay)),c(1,3,2));
generated_series <- abind(oh.temp[,crbm$delay:1,], generated_series.aux, along = 2);
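Since batchdata was normalized feature-wise in load_data, the generated frames can be mapped back to the original units with the stored mean and standard deviation (denorm is a small helper introduced here for illustration):
## Undo the feature-wise normalization of a generated sequence (illustrative)
denorm <- function(frames, data_mean, data_std) t(t(frames) * data_std + data_mean);
generated_series.orig <- denorm(generated_series[1,,], dataset$data_mean, dataset$data_std);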
We also plot the generated sequences (one dimension of each) against the true data.
In [12]:
# plot one dimension (column 2) of each sequence
par(mfrow = c(length(data_idx) ,1));
for (i in 1:length(data_idx))
{
start <- data_idx[i];
plot.true <- batchdata[(start - crbm$delay):(start + 100), 2];
plot.pred <- generated_series[i,, 2];
plot(plot.true, col = "blue", type = "l", lty = 2, xlab = "", ylab = data_idx[i], xlim = c(-10,100), ylim = c(min(plot.true, plot.pred),max(plot.true, plot.pred)));
lines(plot.pred, col = "green");
legend("topleft", legend = c("True", "Predicted"), col = c("blue","green"), lty = c(2,1), cex = 0.75, y.intersp = 1);
}