TensorFlow and Edward

Packages

References

TensorFlow

A Python/C++/Go framework for compiling and executing mathematical expressions
First released by Google in 2015
Based on Data Flow Graphs
Widely used to implement Deep Neural Networks (DNN)
Edward uses TensorFlow to implement a Probabilistic Programming Language (PPL)
Can distribute computation to multiple computers, each of which potentially has multiple CPU, GPU or TPU devices.

Data flow graph

Operations
Tensors
Constants, variables, placeholders
Sessions

Execution model

Agents

Client
Master
Workers
Devices

Placement algorithm

Kernel on device?
Size of input and output tensors
Expected execution time
Heuristic for cross-device transmission time

Optimization 1: Common subgraph elimination

Before	After

Optimization 2: As late as possible (ALAP) scheduling

ASAP	ALAP

Lossy compression for cross-device transmission

Automatic differentiation

Symbol-to-symbol calculation of gradient
Used for back-propagation in neural networks
Used for gradient based optimization, HMC etc in Edward

Other features

Control flow (if and while) - enable recursion and cycles
Checkpoints
- save
- restore
TensorBoard visualization
- Graphs
- Scalar summaries (e.g. evaluation metrics)
- Histogram summaries (e.g. weight distribution)

Abstraction layers

Deep Neural Networks
- contrib.learn
- tflearn
- tf-slim
- keras
Probabilistic Programming Language
- edward

TensorFlow Examples

Hello world



In [1]:

    
import tensorflow as tf

h = tf.constant('Hello')
w = tf.constant(' world!')
hw = h + w

with tf.Session() as s:
    ans = s.run(hw)



In [2]:

    
hw









    Out[2]:





<tf.Tensor 'add:0' shape=() dtype=string>



In [3]:

    
ans









    Out[3]:





b'Hello world!'

Arithmetic on data flow graphs



In [4]:

    
a = tf.constant(5)
b = tf.constant(2)
c = tf.constant(3)
d = tf.multiply(a, b)
e = tf.add(b, c)
f = tf.subtract(d, e)

with tf.Session() as s:
    fetches = [a, b, c, d, e, f]
    ans = s.run(fetches)
    
ans









    Out[4]:





[5, 2, 3, 10, 5, 5]

Using operators



In [5]:

    
a = tf.constant(5)
b = tf.constant(2)
c = tf.constant(3)
d = a * b
e = b + c
f = d - e

with tf.Session() as s:
    fetrches = [a,b,c,d,e,f]
    ans = s.run(fetches)
    
ans









    Out[5]:





[5, 2, 3, 10, 5, 5]

Placeholders



In [6]:

    
import numpy as np



In [7]:

    
x_data = np.random.randn(5, 10)
w_data = np.random.randn(10, 1)

x = tf.placeholder('float32', (5, 10))
w = tf.placeholder('float32', (10, 1))
b = tf.fill((5,1), -1.0)

xwb = tf.matmul(x, w) + b
v = tf.reduce_max(xwb)

with tf.Session() as s:
    ans = s.run(v, feed_dict={x: x_data, w: w_data})
    
ans









    Out[7]:





9.055994

Linear regreession



In [8]:

    
n, p = 1000, 3

α = -1.0
β = np.reshape([0.5, 0.2, 1.0], (3,1))
x_data = np.random.randn(n, p)
y = α + x_data @ β + np.random.randn(n, 1)



In [ ]:

    
x = tf.placeholder('float32', [None, p])
y_true = tf.placeholder('float32', [None, 1])

a = tf.Variable(0.0, dtype='float32')
b = tf.Variable(np.zeros((3,1), dtype='float32'))

y_pred = a + tf.matmul(x, b)

ϵ = 0.5
loss = tf.reduce_mean(tf.square(y_true - y_pred))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=ϵ)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()



In [9]:

    
steps = 5
with tf.Session() as session:
    session.run(init)
    
    for i in range(1, steps):
        
        session.run(train, feed_dict={y_true: y, x: x_data})
        
        if i% 1 == 0:
            a_, b_ = session.run([a, b])
            print('iter={}'.format(i))
            print('a = {}'.format(a_))
            print('b = {}'.format(b_.ravel()))
            print()









    



iter=1
a = -0.9617397785186768
b = [ 0.58339775  0.21545935  0.97747314]

iter=2
a = -0.9741984009742737
b = [ 0.57611543  0.22001249  1.00742912]

iter=3
a = -0.9753177762031555
b = [ 0.57605499  0.2206755   1.00737011]

iter=4
a = -0.9753146171569824
b = [ 0.57600915  0.22066851  1.00741804]

MNIST digits classificaiton (canonical toy example)

Collection of $28 \times 28$ pixel images of hand-written digits. Objective is to classify image into one of ten possile classes. State of the art DNN methods can achieve accuracy of approxmately 99.8% accuracy.

Download the data using input_data from tutorials.mnist
Declare x, W, y_true and y_pred
Define loss function
Define minimization algorithm
Define evaluation metrics
Start a session to
1. Run loop for minimization of batches
2. Run evaluation metric on test data



In [ ]:

    
from tensorflow.examples.tutorials.mnist import input_data

n, p = 784, 10
steps = 1000
batch_size = 100
alpha = 0.5

data_dir = '/tmp/data'

data = input_data.read_data_sets(data_dir, one_hot=True)



In [ ]:

    
x = tf.placeholder(tf.float32, [None, n])
W = tf.Variable(tf.zeros([n, p]))

y_true = tf.placeholder(tf.float32, [None, 10])
y_pred = tf.matmul(x, W)

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=y_pred, labels=y_true))

gd = tf.train.GradientDescentOptimizer(alpha).minimize(loss)

correct_mask = tf.equal(tf.arg_max(y_pred, 1), tf.arg_max(y_true, 1))
accuracy = tf.reduce_mean(tf.cast(correct_mask, tf.float32))



In [10]:

    
with tf.Session() as s:
    s.run(tf.global_variables_initializer())

    # train
    for i in range(steps):
        batch_xs, batch_ys = data.train.next_batch(batch_size)
        s.run(gd, feed_dict={x: batch_xs, y_true: batch_ys})
        
    # test
    ans = s.run(accuracy, feed_dict={x: data.test.images, y_true: data.test.labels})



In [12]:

    
ans









    Out[12]:





0.91790003

Using the `tflearn` abstraction layer

This just implements logistic regresion. Note that we can get much better perofrmance using DNN, but that is not covered here.



In [13]:

    
! pip install --quiet tflearn



In [14]:

    
import tflearn
import tflearn.datasets.mnist as mnist

X, Y, validX, validY = mnist.load_data(one_hot=True)

# Building our neural network
input_layer = tflearn.input_data(shape=[None, 784])
output_layer = tflearn.fully_connected(input_layer, 10, activation='softmax')

# Optimization
sgd = tflearn.SGD(learning_rate=0.5)
net = tflearn.regression(output_layer, optimizer=sgd)

# Training
model = tflearn.DNN(net, tensorboard_verbose=3)
model.fit(X, Y, validation_set=(validX, validY), n_epoch=3)









    



Training Step: 2579  | total loss: 0.25304 | time: 28.264s
| SGD | epoch: 003 | loss: 0.25304 -- iter: 54976/55000
Training Step: 2580  | total loss: 0.24640 | time: 29.406s
| SGD | epoch: 003 | loss: 0.24640 | val_loss: 0.28411 -- iter: 55000/55000
--



In [15]:

    
model.evaluate(validX, validY)









    Out[15]:





[0.92030000000000001]

Edward

Overview

Named after George Edward Pelham Box

Data

numpy arrays or tensorflow tensors
tnesorflow placeholders
tensorflow data readers

Model

A model is a joint distribution $p(x, z)$ of data $x$ and latent variables $z$
A random variable has a distribution parametrized by a parameter tensor $\theta^*$
Each random variable is associated to a tenor $$ x^* \sim p(x \mid \theta^*) $$
Random variables can be combined with other TensorFlow operations

Models are built by composing random variables

Beta-Bernoulli Model

Types of models

Directed graphical models
Neural networks
Bayesian non-parametric models
Probabilistic programs (stochastic control flow with contingent dependencies)

Inference

Posterior inference $$ q(z, \beta; \lambda) \approx p(z, \beta | x) $$
Parameter estimation $$ \text{optimize} \; \hat{\theta} \leftarrow p(x; \theta) $$
Conditional inference $$ q(\beta)q(z) \approx p(z, \beta \mid x) $$

Methods for inference

Variational inference
- MAP is a special case with point mass RVs
Monte Carlo
Composition of inference
- Hybrid algorithms (e.g. EM variants)
- Message passing algorithms (e.g. expectation propagation)

Criticize

Point-based evaluations

Evaluation metrics
- Classification error
- Mean absolute error
- Log-likelihood

Posterior predictive checks (PPC)

Posterior predictive distribution $$ p(x_\text{new} \mid x) = \int{p(x_\text{new} \mid z) p(z \mid x) dz} $$
Procedure
- Draw sample from posterior predictive distribution
- Calculate test statistic on sample (e.g. mean, max)
- Repeat to get distribution of statistic
- Compare test statistic on original data to distribution

Edward examples

Linear Regreessiion



In [16]:

    
import edward as ed
from edward.models import Normal

Data



In [17]:

    
n, p = 1000, 3

α = -1.0
β = np.reshape([0.5, 0.2, 1.0], (3,1))

# data for training
x_train = np.random.randn(n, p)
y_train = α + x_train @ β + np.random.normal(0, 1, (n,1))
y_train = y_train.ravel()

# data for testing
x_test =  np.random.randn(n, p)
y_test = α + x_test @ β + np.random.normal(0, 1, (n,1))
y_test= y_test.ravel()

Model

Given data $(x, y)$,

$$ p(w) = \mathcal{N}(w, 0, 1) \\ p(b) = \mathcal{N}(b, 0, 1) \\ p(y \mid w, b, x) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid x_i^T w + b, 1) $$

Note that we label the intercept $\alpha$ as the bias $b$ and the coefficeints $\beta$ as weights $w$ following neural network conventions.



In [18]:

    
X = tf.placeholder(tf.float32, [n, p])
w = Normal(mu=tf.zeros(p), sigma=tf.ones(p))
b = Normal(mu=tf.zeros(1), sigma=tf.ones(1))
y = Normal(mu=ed.dot(X, w) + b, sigma=tf.ones(n))

Inference

We fit a fully factroized variational model by minimizing the Kullback-Leibler divergence.



In [19]:

    
qw = Normal(mu=tf.Variable(tf.random_normal([p])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([p]))))
qb = Normal(mu=tf.Variable(tf.random_normal([1])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))



In [20]:

    
inference = ed.KLqp({w: qw, b: qb}, data={X: x_train, y: y_train})
inference.run()









    



1000/1000 [100%] ██████████████████████████████ Elapsed: 7s | Loss: 1438.422

Criticism

Find the posterior predictive distrbution.



In [21]:

    
y_post = ed.copy(y, {w: qw, b: qb})
# This is equivalent to
# y_post = Normal(mu=ed.dot(X, qw) + qb, sigma=tf.ones(N))

Calculate evalution metrics.



In [22]:

    
print("Mean squared error on test data:")
print(ed.evaluate('mean_squared_error', data={X: x_test, y_post: y_test}))

print("Mean absolute error on test data:")
print(ed.evaluate('mean_absolute_error', data={X: x_test, y_post: y_test}))









    



Mean squared error on test data:
1.03445
Mean absolute error on test data:
0.813144

Check parameters (true, prior, posterior)



In [23]:

    
list(zip(β, w.eval(), qw.eval()))









    Out[23]:





[(array([ 0.5]), -0.39793062, 0.52038413),
 (array([ 0.2]), 0.39972538, 0.15367131),
 (array([ 1.]), -0.095560297, 0.97044563)]



In [24]:

    
α, b.eval(), qb.eval()









    Out[24]:





(-1.0, array([-0.54414594], dtype=float32), array([-0.9823342], dtype=float32))

More examples

See tutorials

TensorFlow and Edward

Packages

References

TensorFlow

Data flow graph

Execution model

Agents

Placement algorithm

Optimization 1: Common subgraph elimination

Optimization 2: As late as possible (ALAP) scheduling

Automatic differentiation

Other features

Abstraction layers

TensorFlow Examples

Hello world

Arithmetic on data flow graphs

Using operators

Placeholders

Linear regreession

MNIST digits classificaiton (canonical toy example)

Using the tflearn abstraction layer

Edward

Overview

Data

Model

Models are built by composing random variables

Types of models

Inference

Methods for inference

Criticize

Point-based evaluations

Posterior predictive checks (PPC)

Edward examples

Linear Regreessiion

Data

Model

Inference

Criticism

More examples

Using the `tflearn` abstraction layer