Probabilistic programming with PyMC


In [ ]:
import numpy as np
import sklearn.datasets, sklearn.linear_model, sklearn.neighbors
import sklearn.manifold, sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns
import sys, os, time
import scipy.io.wavfile, scipy.signal, scipy.stats
import pymc as mc
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (18.0, 10.0)

In [ ]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Topic purpose

This section will cover probabilistic inference. Rather than learning a single set of parameters by optimisation, we can model probability distributions over possible models that might be compatible with our data. We'll use Monte Carlo sampling to make it simple (if not especially efficient) to work with probabilistic models. We will use these approaches to model typing behaviour at the keystroke level, both to make predictions given some data ("how likely is it that this sequence was typed by user X?") and to quantify how much confidence we have in those models.

Graphical models

Transforming expressions into graphs is familiar to most computer scientists -- it is an essential part of most optimising compilers. For example, the equation of a straight line might be written as a graph (this is how a compiler would break down the expression):
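As a rough sketch in plain Python (an illustration only, nothing to do with PyMC), the expression $y = mx + c$ can be broken into a small tree of operator and variable nodes:

In [ ]:
## Toy illustration: the expression y = m*x + c represented as a tree of nodes,
## roughly the way a compiler front-end would break it down.
mul_node = ("mul", ("var", "m"), ("var", "x"))       # m * x
add_node = ("add", mul_node, ("var", "c"))           # (m * x) + c
expr_graph = ("assign", ("var", "y"), add_node)      # y = (m * x) + c
expr_graph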

Adding unknowns

If we have multiple dependent random variables whose distribution we want to infer, we can draw a graph of dependencies to form a graphical model. This explicitly models dependencies between random variables (i.e. ones whose values we don't know precisely), and inference can be performed on the entire graph.

In CS terms, we are writing expressions down without fixing the variables, and then allowing the distribution of their values to be inferred when we observe data. This inference process narrows down (hopefully!) the range of values a random variable is likely to take on.

In a probabilistic graphical model, some nodes in the graph are observed -- that is, we know their state because we have explicitly measured it -- and others are unobserved -- we know (or have guessed) the form of their distribution but not the parameters of that distribution. Some dependencies are deterministic (i.e. fully defined by the values of their parents), while others are stochastic. We can infer the posterior distribution of unobserved nodes by integrating over the possible values that could have occurred given the observed values.
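Concretely, writing $A$ for the unobserved variables and $B$ for the observed data, the posterior follows from Bayes' rule, with the evidence $P(B)$ obtained by integrating (marginalising) over $A$:

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) = \int P(B \mid A)\, P(A)\, \mathrm{d}A. $$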

We can modify our straight line equation to write a model for linear regression:

All we need to do is specify that we expected the output $y$ to be normally distributed around the equation of a line given by $m$ and $c$; we can now infer $\sigma, m, c$ from observed data. Or we can fix any of them, and infer the remainder (if, e.g. we knew in advance that $c=0$). Our assumption here is that we will observe data which has a latent structure modelled by a linear dependence on a variable $x$, plus some normally-distributed observation noise.

Note that we must put some prior distribution on every stochastic node.
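Written out as a generative model, using the broad priors chosen in the code below and parameterising the normals by precision $\tau$ (the inverse variance), this is:

$$ m \sim \mathcal{N}(0,\; 1/\tau_m), \qquad c \sim \mathcal{N}(0,\; 1/\tau_c), \qquad \tau \sim \mathrm{Uniform}(0.001,\; 10), \qquad y_i \mid x_i \sim \mathcal{N}(m x_i + c,\; 1/\tau), $$

where $\tau_m$ and $\tau_c$ are small fixed precisions giving broad priors on $m$ and $c$.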


Let's do it: PyMC

We'll use the excellent PyMC module to do the inference. If you have questions about this module, you can read this tutorial or the API docs. There's a newer version with very powerful capabilities, currently called PyMC3, but its dependencies are "hard" to install at the moment.

Let's put the linear regression model from the introduction into practice, using PyMC to build a graphical model and then running MCMC to sample from the posterior (i.e. to estimate the distribution of the random variables after seeing some evidence).


In [ ]:
### Bayesian Linear Regression with pymc
### We use Monte Carlo sampling to estimate the distribution of a linear function with a normally
### distributed error, given some observed data.
### Vaguely based on: http://matpalm.com/blog/2012/12/27/dead_simple_pymc/ and http://sabermetricinsights.blogspot.co.uk/2014/05/bayesian-linear-regression-with-pymc.html

## Utility function to plot the graph of a PyMC model
def show_dag(model):
    dag = mc.graph.dag(model)
    dag.write("graph.png",format="png")
    from IPython.display import Image
    i = Image(filename='graph.png')
    return i

## generate data with a known distribution
## this will be our "observed" data
x = np.sort(np.random.uniform(0,20, (50,)))
m = 2
c = 15

# Add on some measurement noise, with std. dev. 3.0
epsilon = np.random.normal(0, 3, x.shape)
y = m * x + c + epsilon

plt.plot(x,y, '.', label="Datapoints")
plt.plot(x, m*x+c, '--', lw=3, label="True")
plt.legend()
plt.xlabel("x")
plt.xlabel("y")

In [ ]:
## Now, set up the PyMC model
## specify the prior distribution of the unknown line function variables
## Here, we assume a normal distribution over m and c
m_unknown = mc.Normal('m', 0, 0.01)
c_unknown = mc.Normal('c', 0, 0.001)

## specify a prior over the precision (inverse variance) of the error term
# precision = 1/variance
## Here we specify a uniform distribution from 0.001 to 10.0
precision = mc.Uniform('precision', lower=0.001, upper=10.0)

# specify the observed input variable
# we use a normal distribution, but this has no effect --
# the values are fixed and the parameters are
# never updated; this is just a way of transforming x 
# into a variable pymc can work with
# (it's really a hack)
x_obs = mc.Normal("x_obs", 0, 1, value=x, observed=True)

@mc.deterministic(plot=False)
def line(m=m_unknown, c=c_unknown, x=x_obs):
    return x*m+c

# specify the observed output variable (note: using tau rather than sigma means we are using the precision parameterisation)
y_obs =  mc.Normal('y_obs', mu=line, tau=precision, value=y, observed=True)

model = mc.Model([m_unknown, c_unknown, precision, x_obs, y_obs])

# display the graphical model
show_dag(model)

In [ ]:
# sample from the distribution
mcmc = mc.MCMC(model)
mcmc.sample(iter=10000)

## plot histograms of possible parameter values
plt.figure()
plt.hist(mcmc.trace("m")[:], normed=True, bins=30)
plt.title("Estimate of m")
plt.figure()
plt.hist(mcmc.trace("c")[:], normed=True, bins=30)
plt.title("Estimate of c")
plt.figure()
plt.hist(np.sqrt(1.0/mcmc.trace("precision")[:]), normed=True, bins=30)
plt.title("Estimate of epsilon std.dev.")
plt.figure()

## now plot overlaid samples from the linear function
ms = mcmc.trace("m")[:]
cs = mcmc.trace("c")[:]

plt.title("Sampled fits")
plt.plot(x, y, '.', label="Observed")
plt.plot(x, x*m+c, '--', label="True")
xf = np.linspace(-20,40,200)
for m,c in zip(ms[::20], cs[::20]):    
    plt.plot(xf, xf*m+c, 'r-', alpha=0.005)
plt.legend()
plt.xlim(-20,40)
plt.ylim(-40,80)

A simple mixture model

We can include both discrete and continuous variables. A very important case is where we have a mixture model. That is, we believe our observations come from one of a number of distributions. For example, in modelling human heights, we might expect height to be normally distributed, but to have two different distributions for men and women.
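Written as equations (the standard mixture formulation, which is exactly the structure of the PyMC model below), each observation first picks a component and is then drawn from that component's normal distribution:

$$ \pi \sim \mathrm{Dirichlet}(\alpha), \qquad z_i \sim \mathrm{Categorical}(\pi), \qquad x_i \mid z_i \sim \mathcal{N}(\mu_{z_i},\; \sigma_{z_i}^2). $$

Marginalising out the discrete, unobserved component labels $z_i$ gives the familiar weighted-sum form $p(x_i) = \sum_k \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)$.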

It is very straightforward to add this to a PyMC graphical model; it is just another random variable to infer.


In [ ]:
## Adapted from the example given at 
## http://stackoverflow.com/questions/18987697/how-to-model-a-mixture-of-3-normals-in-pymc

n = 3
ndata = 500

## A Dirichlet distribution specifies the distribution over categories
## All 1 means that every category is equally likely
dd = mc.Dirichlet('dd', theta=(1,)*n)

## This variable "selects" the category (i.e. the normal distribution)
## to use. The Dirichlet distribution sets the prior over the categories.
category = mc.Categorical('category', p=dd, size=ndata)

## Now we set our priors for the precision and mean of each normal distribution
## Note the use of "size" to generate a **vector** of variables 
# (i.e. one for each category)

## We expect the precision of each normal to be Gamma distributed 
# (this mainly forces it to be positive!)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)

## And the means of the normals to be normally distributed, with a precision of 0.001 
# (i.e. a variance of 1000, std. dev. of about 32)
means = mc.Normal('means', 0, 0.001, size=n)

## These deterministic functions link the means of the observed distribution 
# to the categories
## They just select one of the elements of the mean/precision vector, 
# given the current value of category
## The input variables must be specified in the parameters, so that 
# PyMC knows which variables to pass to it
@mc.deterministic
def mean(category=category, means=means):
    return means[category]

@mc.deterministic
def prec(category=category, precs=precs):
    return precs[category]

## Generate synthetic mixture-of-normals data, 
# with means at -50, 0, +50, and std. dev. of 5
v = np.random.randint( 0, n, ndata)
data = (v==0)*(np.random.normal(50,5,ndata)) + (v==1)*(np.random.normal(-50,5,ndata)) + (v==2)*np.random.normal(0,5,ndata)


## Plot the original data
plt.hist(data, bins=50)  

## Now we specify the variable we observe -- which is normally distributed, *but*
## we don't know the mean or precision. 
# Instead, we pass the **deterministic nodes** mean and prec
## which will be used at each sampling step.
## We specify the observed values of this node, and tell PyMC these are observed 
## This is all that is needed to specify the model
obs = mc.Normal('obs', mean, prec, value=data, observed = True)

## Now we just bundle all the variables together for PyMC
model = mc.Model({'dd': dd,
              'category': category,
              'precs': precs,
              'means': means,
              'obs': obs})

In [ ]:
def show_dag(model):
    dag = mc.graph.dag(model)
    dag.write("graph.png",format="png")
    from IPython.display import Image
    i = Image(filename='graph.png')
    return i
    
show_dag(model)

In [ ]:
mcmc = mc.MCMC(model)

## Now we tell the sampler what method to use
## Metropolis works well, but we must tell PyMC to use a specific
## discrete sampler for the category variable to get good results in a reasonable time
mcmc.use_step_method(mc.AdaptiveMetropolis, model.means)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.precs)
mcmc.use_step_method(mc.DiscreteMetropolis, model.category) ## this step is key!
mcmc.use_step_method(mc.AdaptiveMetropolis, model.dd)

## Run the sampler
mcmc.sample(iter=525000, burn=2000)

In [ ]:
plt.figure()
plt.hist(mcmc.trace('means').gettrace()[:], normed=True)
plt.title("Estimated means")
plt.legend(['Component 1', 'Component 2', 'Component 3'])
plt.figure()
## show the result in terms of std. dev. (i.e. sqrt(1.0/precision))
plt.title("Estimated std. dev")
plt.hist(np.sqrt(1.0/mcmc.trace('precs').gettrace()[:]), normed=True)
plt.legend(['Component 1', 'Component 2', 'Component 3'])

MCMC in practice: sampling issues

The great thing about MCMC approaches is that you can basically write down your model and then run inference directly. There is no need to derive complex approximations, or to restrict ourselves to limited models for which we can compute answers analytically.

MCMC allows us to use distributions we can't even sample from directly. First we couldn't calculate the evidence P(B), so we integrated; but we couldn't solve the integral, so we sampled; but then we couldn't sample from the distribution so we used MCMC. It's a very general approach!

The bad thing about MCMC approaches is that, even though they will do the "right thing" asymptotically, the choice of sampling strategy has a very large influence on the results obtainable from the kinds of sampling runs that are practical to execute.

Bayesian inference should depend only on the priors and the evidence observed; but MCMC approaches also depend on the sampling strategy used to approximate the posterior.

Metropolis-Hastings

Metropolis-Hastings (or just plain Metropolis) takes a different approach, and is able to work in high-dimensional spaces. Metropolis also uses an auxiliary distribution $q(x)$, but it uses this to wander around in the distribution space, accepting jumps to new positions based on both $q(x)$ and $p(x)$. This random walk (a Markov chain, because we make a random jump conditioned only on where we currently are) is the "Markov Chain" part of "Markov Chain Monte Carlo".

We just take our current position $x$, and propose a new position $x^\prime = x + x_q$, where $x_q$ is a random sample drawn from $q(x)$. This makes local steps in the space of the probability density. If $q(x)$ has a simple, symmetric form (e.g. is Gaussian), there is a very simple formula to decide whether to accept or reject a step from the current position $x$ to a new candidate position $x^\prime$: $$ p(\text{accept}) = \begin{cases} p(x^\prime)/p(x), & p(x)\geq p(x^\prime) \\ 1, & p(x)<p(x^\prime) \end{cases} $$

The asymmetric case is only slightly more involved, but it is unusual to need to use it.
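For reference, the general Metropolis-Hastings rule for an asymmetric proposal $q(x^\prime \mid x)$ simply weights the ratio by the proposal densities; it reduces to the rule above when $q$ is symmetric:

$$ p(\text{accept}) = \min\!\left(1,\; \frac{p(x^\prime)\, q(x \mid x^\prime)}{p(x)\, q(x^\prime \mid x)}\right). $$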


In [ ]:
def metropolis(p, q, x_init, n):
    # Random-walk Metropolis: p is the (unnormalised) target density, q draws a random step
    x = x_init

    samples = []
    rejected = [] # we only keep the rejected proposals to plot them later
    for i in range(n):
        # find a new candidate spot to jump to
        x_prime = x + q()
        # if it's better, go right away
        if p(x_prime) > p(x):
            x = x_prime
        else:
            # if not, go with probability equal to the
            # ratio between the new point and the current one
            pa = p(x_prime) / p(x)
            if np.random.uniform(0, 1) < pa:
                x = x_prime
            else:
                rejected.append(x_prime)
        # record the current state on every iteration; on a rejection the
        # previous position is repeated, which is what makes the samples follow p(x)
        samples.append(x)

    return np.array(samples), np.array(rejected)


## target distribution: a correlated 2D Gaussian
## (the covariance matrix must be symmetric and positive-definite)
A = np.array([[1.0, 0.9], [0.9, 2.5]])
p = lambda x: scipy.stats.multivariate_normal(mean=[0,0], cov=A).pdf(x)
q = lambda: np.random.normal(0, 0.5, (2,))
accept, reject = metropolis(p,q,[0.1, 0.3], 200)
plt.figure(figsize=(10,10))
plt.plot(accept[:,0], accept[:,1])
plt.plot(accept[:,0], accept[:,1], 'b.')
plt.plot(reject[:,0], reject[:,1], 'r.')
x, y = np.meshgrid(np.linspace(-4,4,30), np.linspace(-4,4,30))
plt.imshow(p(np.dstack([x,y])), extent=[-4,4,-4,4], origin='lower', cmap='viridis')
plt.grid(False)
plt.title("MCMC sampling with Metropolis-Hastings")

In [ ]:
# run the chain for longer, and plot the trace and the histogram of the variable
accept, reject = metropolis(p,q,[0.1, 0.3], 2000)
plt.subplot(1,2,1)
plt.plot(accept[:,0])
plt.title("Trace for $x$")
plt.subplot(1,2,2)
plt.hist(accept[:,0], bins=20)
plt.title("Histogram of $x$")

plt.figure()

plt.subplot(1,2,1)
plt.plot(accept[:,1])
plt.title("Trace for $y$")
plt.subplot(1,2,2)
plt.hist(accept[:,1], bins=20)
plt.title("Histogram of $y$");
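As noted above, the choice of sampling strategy (here, the width of the proposal $q$) has a big effect in practice. Below is a minimal sketch, not part of the original demo, comparing acceptance rates for different proposal widths on a 1D standard normal target.

In [ ]:
## Illustrative sketch: how the proposal width changes the acceptance rate of a
## random-walk Metropolis sampler targeting a 1D standard normal distribution.
## Tiny steps are almost always accepted but explore slowly; huge steps are
## almost always rejected. Neither mixes well for a fixed number of iterations.
def acceptance_rate(proposal_sd, n=5000):
    x = 0.0
    accepted = 0
    log_p = lambda v: -0.5 * v ** 2    # log-density of N(0,1), up to a constant
    for i in range(n):
        x_prime = x + np.random.normal(0, proposal_sd)
        # standard Metropolis acceptance test, done in log space
        if np.log(np.random.uniform()) < log_p(x_prime) - log_p(x):
            x = x_prime
            accepted += 1
    return accepted / float(n)

for sd in [0.01, 1.0, 100.0]:
    print("proposal sd %g -> acceptance rate %.2f" % (sd, acceptance_rate(sd)))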

Others

There are many other MCMC samplers, such as:

  • Gibbs samplers, which are very efficient when we can sample from the conditional distributions (i.e. from one dimension of the distribution at a time) but not from the joint directly; see the short sketch after this list.
  • Hamiltonian samplers, which extend Metropolis-like steps with "virtual physics" that pushes the samples in sensible directions (i.e. not simply down the gradient of the function, which would just find a mode!)
  • Slice samplers, which are very clever and efficient, but in their basic form only work for 1-dimensional (univariate) distributions.
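To make the Gibbs idea concrete, here is a minimal sketch (an illustration only, not how PyMC implements it) for a standard bivariate Gaussian with correlation $\rho$, where each conditional is itself a Gaussian we can sample exactly:

In [ ]:
## Gibbs sampling sketch for a standard bivariate Gaussian with correlation rho.
## Each coordinate is resampled from its exact conditional given the other:
## x1 | x2 ~ N(rho*x2, 1-rho^2) and x2 | x1 ~ N(rho*x1, 1-rho^2).
def gibbs_bivariate_normal(rho, n_samples, x_init=(0.0, 0.0)):
    x1, x2 = x_init
    cond_sd = np.sqrt(1.0 - rho ** 2)   # std. dev. of each conditional
    samples = np.zeros((n_samples, 2))
    for i in range(n_samples):
        x1 = np.random.normal(rho * x2, cond_sd)
        x2 = np.random.normal(rho * x1, cond_sd)
        samples[i] = (x1, x2)
    return samples

gibbs_samples = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
plt.plot(gibbs_samples[:, 0], gibbs_samples[:, 1], '.', alpha=0.1)
plt.title("Gibbs samples from a correlated bivariate Gaussian")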

If you're interested in learning more about MCMC, David MacKay's book chapter is a good reference.

Imputation

In PyMC, variables can be observed (fixed) or unobserved (random). PyMC cycles through the array of known values for the observed variables and updates the rest of the graph.

But what if you want to ask "what if?"-style questions? For example, if you knew the last two key codes and timings, what is the distribution over the possible times for the next key?

PyMC implements this using imputation, where certain missing values in an observed variable can be inferred (imputed) from the rest of the model. Masked arrays are used to implement imputation; these allow arrays to have "blank" values, which PyMC can fill in automatically.
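As a tiny illustration of the masked-array mechanism itself (plain numpy, independent of PyMC):

In [ ]:
## numpy masked arrays: mask=True marks an entry as "missing";
## PyMC treats masked entries of an observed variable as values to be imputed
a = np.array([1.0, 2.0, 3.0, 4.0])
a_masked = np.ma.masked_array(a, mask=[False, False, True, True])
a_masked  # the last two entries display as --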

This approach creates one new random variable per missing data item; this can create very large models if you are not careful!


In [ ]:
## Example, using the linear regression model from the last section:
import numpy.ma as ma # masked array support

## generate the data for the regression
x = np.sort(np.random.uniform(0,20, (50,)))
m = 2
c = 15
# Add on some measurement noise, with std. dev. 3.0
epsilon = np.random.normal(0, 3, x.shape)
y = m * x + c + epsilon

## Now the imputation; we will try to infer some missing values of y (we still have the corresponding x)
## mark the last three values of y as invalid
y_impute = y.copy()  # use a copy, so we don't overwrite the original y
y_impute[-3:] = 0
y_impute = ma.masked_equal(y_impute, 0)
print "Y masked for imputation:", y_impute # we will see the last three entries with --

# create the model (exactly as before, except we switch "y_impute" for "y")
m_unknown = mc.Normal('m', 0, 0.01)
c_unknown = mc.Normal('c', 0, 0.001)
precision = mc.Uniform('precision', lower=0.001, upper=10.0)
x_obs = mc.Normal("x_obs", 0, 1, value=x, observed=True)
@mc.deterministic(plot=False)
def line(m=m_unknown, c=c_unknown, x=x_obs):
    return x*m+c
y_obs =  mc.Normal('y_obs', mu=line, tau=precision, value=y_impute, observed=True)
model = mc.Model([m_unknown, c_unknown, precision, x_obs, y_obs])

In [ ]:
# sample from the distribution
mcmc = mc.MCMC(model)
mcmc.sample(iter=10000, burn=2000, thin=5)

In [ ]:
## now we will have three entries in the y_obs trace from this run
y_trace = mcmc.trace('y_obs')[:]

## the original data
plt.plot(x[:-3],y[:-3], '.')
plt.plot(x[-3:],y[-3:], 'go')
plt.plot(x,x*m+c, '-')

# samples from posterior predicted for the missing values of y
plt.plot(np.tile(x[-3], (len(y_trace[:,0]), 1)), y_trace[:,0],  'r.', alpha=0.01)
plt.plot(np.tile(x[-2], (len(y_trace[:,1]), 1)), y_trace[:,1],  'r.', alpha=0.01)
plt.plot(np.tile(x[-1], (len(y_trace[:,2]), 1)), y_trace[:,2],  'r.', alpha=0.01)

Note that, while it would make sense to infer $x$ given $y$ (as well as $y$ given $x$, which is what we just did), PyMC cannot automatically infer variables in this manner: it can only infer along the "forward" path in the graph. In theory, if all the deterministic functions (like the line function) were invertible, this reverse inference could be performed automatically without changing the model.