In [1]:
# Plotting configuration: inline figures with consistent fonts and sizes
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')
plt.rcParams['savefig.dpi'] = 75
plt.rcParams['figure.autolayout'] = False
plt.rcParams['figure.figsize'] = 10, 6
plt.rcParams['axes.labelsize'] = 18
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['font.size'] = 16
plt.rcParams['lines.linewidth'] = 2.0
plt.rcParams['lines.markersize'] = 8
plt.rcParams['legend.fontsize'] = 14
plt.rcParams['text.usetex'] = True
plt.rcParams['font.family'] = "serif"
plt.rcParams['font.serif'] = "cm"
Charles K. Fisher; drckf@unlearnai.com
Jonathan R. Walsh; drjrw@unlearnai.com
Austin Huang; austinh@alum.mit.edu
Aaron M. Smith; drams@unlearnai.com
Restricted Boltzmann Machines (RBMs) are a class of neural networks that can be used to approximate arbitrary probability distributions. Unlike feed-forward neural networks, RBMs are stochastic, which makes them well suited for use as trainable generative models. Training an RBM with stochastic gradient descent is difficult, however, because each gradient evaluation requires drawing a sample from the model distribution.
This notebook describes an approach to sampling from an RBM by coupling the distribution to a slowly varying auxiliary variable. The non-equilibrium sampler allows the system to move around energy barriers, which results in better mixing. This simple modification leads to large improvements in training -- especially for RBMs with Gaussian visible units -- with minimal computational cost compared to standard sampling methods.
This notebook is an experiment in open science. It will provide all of the background information, data sets, and code required for a traditional scientific publication -- but in an open-source format that anybody can access (or even contribute to).
This project is connected to paysage: a Python library for machine learning with Boltzmann machines. The non-equilibrium sampler discussed in this notebook is implemented in paysage as the class DrivenSequentialMC.
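As a rough illustration of the idea behind a driven sampler -- and not of paysage's actual DrivenSequentialMC implementation -- the cell below runs a Metropolis chain on a toy one-dimensional double-well energy while slowly varying the inverse temperature. The energy function, schedule, and parameter values are made up for illustration, and the real sampler uses a stochastic, autocorrelated temperature rather than a deterministic schedule.
In [ ]:
import numpy as np

# Toy double-well energy: two valleys separated by a barrier at x = 0.
def double_well(x):
    return (x**2 - 1.0)**2 / 0.1

def metropolis_step(x, beta, step_size=0.5, rng=np.random):
    """One Metropolis update of x at inverse temperature beta."""
    x_new = x + step_size * rng.randn()
    delta = double_well(x_new) - double_well(x)
    if delta <= 0 or rng.rand() < np.exp(-beta * delta):
        return x_new
    return x

rng = np.random.RandomState(0)
x, samples = 1.0, []
for t in range(10000):
    # Slowly varying inverse temperature (a deterministic stand-in for an
    # autocorrelated random process): when beta dips below 1 the barrier
    # is easier to cross, so the chain mixes between the two valleys.
    beta = 1.0 + 0.9 * np.sin(2 * np.pi * t / 500.0)
    x = metropolis_step(x, beta, rng=rng)
    samples.append(x)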
Recent progress in machine learning and artificial intelligence has been driven by a renaissance in research on artificial neural networks. Deep learning methods have been especially powerful tools for large-scale supervised learning problems. In fact, one could go so far as to claim that there is now a well-established recipe for success on supervised learning problems: step 1 -- collect a very large dataset with reliable labels (for example, a billion images labeled as "cat" or "not cat"), step 2 -- apply noise or nuisance transformations to the data that you want your model to be insensitive to, step 3 -- use stochastic gradient descent to train a very large multi-layer neural network on these data. Unfortunately, there are many problems that cannot be solved using this three-step recipe, usually because of a failure at step 1. If there is little or no data with reliable labels, then we have to throw away the cookbook and try something else.
Unsupervised learning is the task of learning from data without labels. In the unsupervised context, "learning" refers to developing an understanding of the process that generates the data, or to constructing a simpler (e.g., sparse or low-dimensional) representation that describes the data manifold. In the latter case, the resulting representations can be used as features in a supervised model. This work, however, will focus primarily on the former: how to train a model that is able to capture the important aspects of the data generating process so that it is possible to learn to simulate a system.
Constructing a generative model for a stochastic system requires a general mathematical model for probability distributions that can be trained from data. Systems that are at equilibrium can be modeled using methods inspired by statistical physics. Physical systems reach an equilibrium described by a Boltzmann distribution $P(\boldsymbol{x}) = Z^{-1} e^{-\beta H(\boldsymbol{x})}$, where $H(\boldsymbol{x})$ is the potential energy of the system, $Z = \text{Tr}[e^{-\beta H(\boldsymbol{x})}]$ is the partition function, and $\beta = 1 / T$ is the inverse temperature. Within this context, learning a probability distribution amounts to identifying an energy function that assigns a low energy to configurations that are highly probable and a high energy to configurations that are improbable. One can visualize this as an "energy landscape" with many hills and valleys, where the system spends most of its time in the valleys and occasionally climbs a mountain to reach another valley on the other side.
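As a quick numerical illustration (with made-up energy values), the cell below computes Boltzmann probabilities for a handful of discrete states at two different temperatures: low-energy states dominate at low temperature, while the distribution flattens out as the temperature increases.
In [ ]:
import numpy as np

# Toy energy landscape over five discrete states (arbitrary values).
energies = np.array([0.0, 1.0, 4.0, 1.5, 0.2])

def boltzmann(energies, beta):
    """Boltzmann probabilities P(x) = exp(-beta * H(x)) / Z."""
    weights = np.exp(-beta * energies)
    return weights / weights.sum()

print(boltzmann(energies, beta=1.0))   # low temperature: low-energy states dominate
print(boltzmann(energies, beta=0.1))   # high temperature: nearly uniform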
Boltzmann Machines are a general class of stochastic, undirected neural networks designed for learning probabilistic generative models. In this work, we will focus on Restricted Boltzmann Machines (or RBMs). An RBM is organized into two layers: a visible layer ($v$) and a hidden layer ($h$). There are interactions between the units in the visible layer and the units in the hidden layer, but there are no intra-layer interactions. This bipartite structure makes it easy to generate random samples from an RBM with Markov Chain Monte Carlo methods, because all of the units in one layer can be updated simultaneously using block Gibbs sampling.
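To make the bipartite structure concrete, here is a minimal sketch of block Gibbs sampling for a Bernoulli-Bernoulli RBM (defined below). The layer sizes, weights, and biases are arbitrary and chosen only for illustration.
In [ ]:
import numpy as np

rng = np.random.RandomState(0)

# Illustrative Bernoulli-Bernoulli RBM: N visible units, M hidden units.
N, M = 6, 4
W = 0.1 * rng.randn(N, M)   # weights
a = np.zeros(N)             # visible biases
b = np.zeros(M)             # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v):
    """One block Gibbs sweep: sample all hidden units given the visible
    layer, then all visible units given the hidden layer."""
    p_h = sigmoid(b + v @ W)                 # p(h_mu = 1 | v)
    h = (rng.rand(M) < p_h).astype(float)
    p_v = sigmoid(a + W @ h)                 # p(v_i = 1 | h)
    v = (rng.rand(N) < p_v).astype(float)
    return v, h

v = rng.randint(0, 2, size=N).astype(float)
for _ in range(100):
    v, h = gibbs_sweep(v)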
Fitting an RBM requires maximizing the likelihood of observing the data under the model. In principle, the likelihood can be maximized using stochastic gradient descent. In practice, however, computing the gradient requires calculating averages with respect to the model distribution. The averages cannot be computed analytically, which means that it is necessary to draw random samples from the distribution using Monte Carlo methods. This process can take quite a long time, especially if the energy landscape is very rugged. We describe a method that "moves around" energy barriers in order to speed up sampling from RBMs, leading to improved generative models.
An RBM is a probabilistic generative model defined through an energy function. Let $v_i$ for $i = 1, \ldots, N$ denote the units of the visible layer and $h_{\mu}$ for $\mu = 1, \ldots, M$ denote the units of the hidden layer. The most general form for the energy function of an RBM is:
\begin{equation} H(\boldsymbol{v}, \boldsymbol{h}) = -\sum_i f_i(v_i) - \sum_{\mu} f_{\mu}(h_{\mu}) - \sum_{i \mu} W_{i \mu} g_i(v_i) g_{\mu}(h_{\mu}) \end{equation}Here, $f_i(\cdot)$ and $g_i(\cdot)$ are functions defined for each visible unit, and $f_{\mu}(\cdot)$ and $g_{\mu}(\cdot)$ are functions defined for each hidden unit. The joint probability distribution of the visible and hidden units is obtained by analogy with Boltzmann's distribution in physics:
\begin{equation} p_{RBM}(\boldsymbol{v}, \boldsymbol{h}) = Z^{-1} e^{-H(\boldsymbol{v}, \boldsymbol{h}) } \end{equation}where $Z = \int d \boldsymbol{v} d \boldsymbol{h} e^{-H(\boldsymbol{v}, \boldsymbol{h}) }$ is the normalizing constant of the distribution (also called the partition function).
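For a very small RBM the partition function can be computed by brute force, which is a useful sanity check even though the sum grows exponentially with the number of units -- exactly why Monte Carlo sampling is needed for realistic models. The sketch below assumes the Bernoulli-Bernoulli energy function written out later in the notebook, with arbitrary random parameters.
In [ ]:
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """Bernoulli-Bernoulli RBM energy: H = -a.v - b.h - v.W.h"""
    return -(a @ v + b @ h + v @ W @ h)

def partition_function(W, a, b):
    """Brute-force Z by summing exp(-H) over all 2^N * 2^M joint states.
    Only feasible for very small models."""
    N, M = W.shape
    Z = 0.0
    for v in itertools.product([0.0, 1.0], repeat=N):
        for h in itertools.product([0.0, 1.0], repeat=M):
            Z += np.exp(-energy(np.array(v), np.array(h), W, a, b))
    return Z

rng = np.random.RandomState(0)
W = 0.1 * rng.randn(3, 2)
Z = partition_function(W, np.zeros(3), np.zeros(2))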
The hidden units act only to mediate interactions between the visible units, allowing the model to capture higher-order correlations. The marginal distribution of the visible units can be obtained by explicitly integrating over the hidden units:
\begin{equation} p_{RBM}(\boldsymbol{v}) = Z^{-1} \int d \boldsymbol{h} e^{-H(\boldsymbol{v}, \boldsymbol{h}) } \end{equation}We can define a marginal free energy for the visible units (up to an additive constant) by taking the negative logarithm of $p_{RBM}(\boldsymbol{v})$:
\begin{equation} F_{RBM}(\boldsymbol{v}) = -\sum_i f_i(v_i) - \sum_{\mu} \log \int d h \exp \left(f_{\mu}(h) + \sum_{i} W_{i \mu} g_i(v_i) g_{\mu}(h) \right) \end{equation}Training an RBM involves optimizing the parameters of the model so that the marginal distribution of the visible units is approximately equal to the data distribution: \begin{equation} p_{data}(\boldsymbol{v})\approx p_{RBM}(\boldsymbol{v}) = Z^{-1} \int d \boldsymbol{h} e^{-H(\boldsymbol{v}, \boldsymbol{h}) } \end{equation}
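When the hidden units are Bernoulli (as in the Bernoulli-Bernoulli model written out below), the integral over each hidden unit in the free energy reduces to a sum over $h \in \{0, 1\}$ and can be evaluated in closed form as a softplus, $\log(1 + e^{x})$. The cell below is a small sketch of that computation with arbitrary random parameters.
In [ ]:
import numpy as np

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def free_energy(v, W, a, b):
    """Marginal free energy of a Bernoulli-Bernoulli RBM:
    F(v) = -a.v - sum_mu log(1 + exp(b_mu + (v.W)_mu))"""
    return -(a @ v) - softplus(b + v @ W).sum()

rng = np.random.RandomState(0)
W = 0.1 * rng.randn(6, 4)
v = rng.randint(0, 2, size=6).astype(float)
F = free_energy(v, W, np.zeros(6), np.zeros(4))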
We formulate the training problem as an optimization problem aimed at minimizing a measure of difference between $p_{data}(\boldsymbol{v})$ and $p_{RBM}(\boldsymbol{v})$ called the Kullback-Leibler divergence:
\begin{equation} D_{KL}(p_{data}(\boldsymbol{v}) || p_{RBM}(\boldsymbol{v})) = -E_{data}[\log p_{RBM}(\boldsymbol{v}) ] + \text{constant} \end{equation}Bernoulli-Bernoulli Restricted Boltzmann Machine: \begin{equation} H(\boldsymbol{v}, \boldsymbol{h}) = -\sum_i a_i v_i - \sum_{\mu} b_{\mu} h_{\mu} - \sum_{i \mu} W_{i \mu} v_i h_{\mu} \end{equation}
Bernoulli-Gaussian Restricted Boltzmann Machine (Hopfield Model): \begin{equation} H(\boldsymbol{v}, \boldsymbol{h}) = -\sum_i a_i v_i + \frac{1}{2} \sum_{\mu} h_{\mu}^2- \sum_{i \mu} W_{i \mu} v_i h_{\mu} \end{equation}
Gaussian-Bernoulli Restricted Boltzmann Machine: \begin{equation} H(\boldsymbol{v}, \boldsymbol{h}) = \sum_i \frac{(v_i - \bar{v}_i)^2}{2 \sigma_i^2} - \sum_{\mu} b_{\mu} h_{\mu} - \sum_{i \mu} W_{i \mu} \frac{v_i}{\sigma_i} h_{\mu} \end{equation}
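For any of these models, the reason that sampling is unavoidable becomes clear from the gradient of the objective. Differentiating the Kullback-Leibler divergence with respect to the model parameters $\theta$ gives the standard two-phase form: \begin{equation} \nabla_{\theta} D_{KL}(p_{data}(\boldsymbol{v}) || p_{RBM}(\boldsymbol{v})) = E_{data}[\nabla_{\theta} F_{RBM}(\boldsymbol{v})] - E_{RBM}[\nabla_{\theta} F_{RBM}(\boldsymbol{v})] \end{equation}The first expectation is computed from minibatches of data, but the second is an average with respect to $p_{RBM}(\boldsymbol{v})$ itself, which is why every gradient step requires drawing samples from the model.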
In [2]:
# Examples
import paysage
In [ ]: