To learn more about using Bayesian methods to train deep learning models, please refer to this post by Thomas Wiecki: http://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/. This notebook follows a similar structure.
Author: shkr
Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small mini-batch of data-items for every sample we generate?
Stochastic Gradient Fisher Scoring (SGFS) is an algorithm which, at high mixing rates (i.e. large epsilon values), samples from a normal approximation of the posterior, while at low mixing rates it mimics the behaviour of SGLD with a preconditioner matrix given by the parameter B.
The mixing rate is controlled by epsilon, which decays with iteration $t$ as $\epsilon_t = 2^{-\frac{t}{\text{step\_size\_decay}}} \cdot \text{step\_size}$.
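As a quick illustration (not part of the original notebook), the snippet below evaluates this schedule for the step_size=0.1 and step_size_decay=100 settings used later; epsilon starts at 0.1 and halves every 100 iterations.

import numpy as np

step_size = 0.1          # epsilon at t = 0
step_size_decay = 100.0  # number of iterations after which epsilon is halved

t = np.arange(4500)                                   # same number of draws as used below
epsilon = step_size * 2.0 ** (-t / step_size_decay)   # epsilon_t for every iteration
# epsilon decays from 0.1 to 0.05 at t=100, 0.025 at t=200, and so on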
Here we run the sampling algorithm with mini-batches drawn from the training samples in X_train, so total_size is X_train.shape[0]. If the preconditioner matrix B is not specified, an identity matrix is used.
Reference: https://www.slideshare.net/hustwj/austerity-in-mcmc-land-cutting-the-computational-budget
In [1]:
%matplotlib inline
import theano
floatX = theano.config.floatX
import pymc3 as pm
import theano.tensor as T
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
In [2]:
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X = X.astype(floatX)
Y = Y.astype(floatX)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)
In [3]:
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');
Store training data in Theano shared variables
In [4]:
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)
A neural network is quite simple. The basic unit is a perceptron which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.
In [5]:
total_size = len(Y_train)
n_hidden = 5
# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden).astype(floatX)
init_2 = np.random.randn(n_hidden, n_hidden).astype(floatX)
init_out = np.random.randn(n_hidden).astype(floatX)
def build_network(ann_input, ann_output):
    with pm.Model() as model:
        # Weights from input to hidden layer
        weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                                 shape=(X.shape[1], n_hidden),
                                 testval=init_1)

        # Weights from 1st to 2nd layer
        weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
                                shape=(n_hidden, n_hidden),
                                testval=init_2)

        # Weights from hidden layer to output
        weights_2_out = pm.Normal('w_2_out', 0, sd=1,
                                  shape=(n_hidden,),
                                  testval=init_out)

        # Build neural-network using tanh activation function
        act_1 = pm.math.tanh(pm.math.dot(ann_input, weights_in_1))
        act_2 = pm.math.tanh(pm.math.dot(act_1, weights_1_2))
        act_out = pm.math.sigmoid(pm.math.dot(act_2, weights_2_out))

        # Binary classification -> Bernoulli likelihood
        out = pm.Bernoulli('out',
                           act_out,
                           observed=ann_output,
                           total_size=total_size)

    return model, out
In [6]:
from six.moves import zip
def train_sgfs(model, B=None):
    # Tensors and RV that will be using mini-batches
    minibatch_tensors = [ann_input, ann_output]
    minibatch_RVs = [out]

    # Generator that returns mini-batches in each iteration
    def create_minibatch(data):
        rng = np.random.RandomState(0)
        while True:
            # Return random data samples of set size 50 each iteration
            ixs = rng.randint(len(data), size=50)
            yield data[ixs]

    minibatches = zip(
        create_minibatch(X_train),
        create_minibatch(Y_train),
    )

    total_size = len(Y_train)

    with model:
        draws = 4500
        step_method = pm.SGFS(batch_size=50,
                              total_size=total_size,
                              B=B,
                              step_size=.1,
                              step_size_decay=100,
                              minibatches=minibatches,
                              minibatch_tensors=minibatch_tensors)
        trace = pm.sample(draws=draws, step=step_method)

    return step_method, trace
neural_network, out = build_network(ann_input, ann_output)
%time step_method, trace = train_sgfs(neural_network)
Note the smaller wall time compared to mini-batch ADVI.
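For reference, a mini-batch ADVI fit of the same network could look roughly like the sketch below. This is only a sketch: it assumes pm.Minibatch and pm.fit are available in your PyMC3 version, and the names minibatch_x, minibatch_y, advi_network and approx are introduced here for illustration, not taken from the original timing comparison.

# Hypothetical ADVI comparison sketch, not part of the original notebook
minibatch_x = pm.Minibatch(X_train, batch_size=50)
minibatch_y = pm.Minibatch(Y_train, batch_size=50)

# Reuse the same network builder, feeding it the mini-batch tensors instead of
# the shared variables; total_size in the likelihood handles the rescaling.
advi_network, _ = build_network(minibatch_x, minibatch_y)

with advi_network:
    approx = pm.fit(n=30000, method='advi')

# Draw samples from the fitted approximation for downstream use
advi_trace = approx.sample(500)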
In [7]:
pm.traceplot(trace);
In [8]:
# Replace shared variables with testing set
ann_input.set_value(X_test)
ann_output.set_value(Y_test)
# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500, random_seed=0)
# Use probability of > 0.5 to assume prediction of class 1
pred = ppc['out'].mean(axis=0) > 0.5
In [9]:
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
sns.despine()
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');
In [10]:
print('Accuracy = {}%'.format((Y_test == pred).mean() * 100))
In [11]:
grid = np.mgrid[-3:3:100j,-3:3:100j].astype(floatX)
grid_2d = grid.reshape(2, -1).T
# One dummy label per grid point (the label values are ignored by posterior predictive sampling)
dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)
In [12]:
ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)
# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500, random_seed=0)
In [13]:
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(grid[0, :], grid[1, :], ppc['out'].mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
ax.set_title(r'$B \propto I_t$')
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 0');
So far, everything we have shown could also have been done with a non-Bayesian neural network: the mean of the posterior predictive for each class label should be very close to the maximum likelihood predictions. However, we can also look at the standard deviation of the posterior predictive to get a sense of the uncertainty in our predictions. Here is what that looks like:
In [14]:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(grid[0, :], grid[1, :], ppc['out'].std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y')
ax.set_title(r'$B \propto I_t$')
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');
This section was added with the introduction of the first approximate MCMC sampling algorithm in PyMC3, Stochastic Gradient Fisher Scoring. The file sgmcmc.py defines a BaseStochasticGradient class that can be extended to other approximate sampling algorithms by implementing only the _initialize_values and mk_training_fn methods.
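As a rough, hypothetical sketch of such an extension (the class and method names come from the description above, but the module path and the method bodies are assumptions rather than working PyMC3 code):

from pymc3.step_methods.sgmcmc import BaseStochasticGradient

class MyApproximateSampler(BaseStochasticGradient):
    """Skeleton of a hypothetical stochastic-gradient step method."""

    def _initialize_values(self):
        # Set up algorithm-specific state here, e.g. a step-size schedule
        # or a preconditioner, before sampling starts.
        raise NotImplementedError

    def mk_training_fn(self):
        # Build the Theano function that consumes a mini-batch and the
        # current parameter values and produces updated parameter values.
        raise NotImplementedError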