```
In [ ]:
```# standard imports
import numpy as np
import matplotlib.pyplot as plt
import sys, os, time
import pandas as pd
import pymc as mc
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (12.0, 8.0)
import scipy.stats
# fix for pydot on windows machines
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

```
In [ ]:
```%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
OutputArea.prototype._should_scroll = function(){return false};

This section will cover probabilistic **inference**. Rather than learning a single set of parameters by optimisation, we can infer probability distributions over possible models that might be compatible with our data.

Concretely, we'll use Markov Chain Monte Carlo to make it simple and easy (if not very efficient) to work with probabilistic models. We will use these approaches to model **typing behaviour** at the keystroke level, and both make predictions given some data ("how likely is it that this sequence was typed by user X?") and quantify how much confidence we have in those models.

The key learning outcome is how to **think** about interaction in a probabilistic context. We will expand this in the second part of the course where we see how to apply probabilistic models to dynamic processes.

We will:

- Discuss keystroke modelling
- Look at graphical models of expressions
- Get to grips with the basics of Bayesian probabilistic inference
Have a quick introduction to probabilistic programming with PyMC

Specifically, we'll look at **keystroke timing** data. We will look at a simple dataset, which captures a sequence of key events and their timing; both the inter-key time and the key hold duration.

This is enough to answer some interesting questions; for example, can we model differences in the way keys are pressed based on the distribution of inter-key times? Or more precisely, can we infer a **probability distribution** over users that might be generating a timing sequence and update that as we see more keystrokes?

Alternatively, we could see if the type of key hit (e.g. space versus letter, or left side versus right side) influences the timing of the **next** keystroke, and build a predictive model to infer the likely next time to key impact. This could be used to optimise keyboard layouts (as you will see later in the week).

If we are very ambitiuous, we could try and recover the keystrokes themselves from timing alone. There probably isn't enough information to do this, at least without very large user-specific datasets and a robust **language model**, but we could probably identify unusual or special keys (e.g. identify `Space`

or `Enter`

or `Esc`

in the timing stream).

The "key" message here is that we are going to build probabilistic models -- we will maintain uncertainty over future possibilities.

We will use a computational representation where **random variables** are *first-class* values and we can write down our models without writing down the detailed mechanics of inference itself.

This is **probabilistic programming**. We will use PyMC to construct our probabilistic models.

We will build a **principed, statistical model** of user behavior that we can reason about, and estimate parameters of that model from quantitative observations of data.

This model is **robust** (it appropriately represents uncertainty) and **generative** (it can simulate behaviour compatible with observations).

The *generative* nature of the model is particularly important in powering optimisation approaches; being able to simulate potential human behaviours can save having to engage in very slow human data acquisition. If the equivalent function evaluation can be done from a realistic model, the optimisation can be radically sped up.

A *random variable* is a variable that can take on different values, but we do not know what value it has; i.e. one that is "unassigned". Probability theory allows us to manipulate random variables without having to assign them a specific value.

A *probability distribution* defines how likely different states of a random variable are. The probability distribution of a random variable $X$ is written:
$$P(X)$$
and is shorthand for the expression
$$Pr(X=x),$$ i.e. that the variable $X$ takes on the specific value $x$.

Random variables can be continuous (e.g. the height of a person) or discrete (the value showing on the face of a dice). The distribution of a discrete variable is described with a *probability mass function* (PMF) which gives each outcome a specific value. A continuous variable has a *probability density function* (PDF) which specifies the spread of the probability as a continuous function.

A probability distribution must assign probabilities in the range 0 (impossible) to 1 (definite) and the PMF or PDF **must** sum/integrate to exactly 1 as the random variable under consideration must take on *some* value.

A very simple discrete PMF is the expected value of the sum of two six-faced die. In this case, $P(X) = P(D_1+D_2)$ for two uniform discrete variables $D_1, D_2 \in \{1,2,3,4,5,6 \}$

```
In [ ]:
```# the PMF of the sum of two dice rolls
def two_dice():
# form the sum of the cross product of these possibilities
roll_two = [i+j for i in range(1,7) for j in range(1,7)]
# now plot the histogram
pmf, edges, patches = plt.hist(roll_two, normed=True, bins=range(1,14))
print("Sum of PMF %.2f" % np.sum(pmf)) # sum of probability should be *exactly* 1.0
plt.title("PMF of sum of 2d6 dice")
plt.xlabel("Sum of rolls x")
plt.ylabel("P(x)")

```
In [ ]:
```two_dice()

```
In [ ]:
```import scipy.stats as stats
# Plot the PDF of the normal distribution
def plot_normal():
# plot the normal (Gaussian distibution) along with a set of points drawn from that distribution
x = np.linspace(-4,4,100)
y = stats.norm.pdf(x,0,1) # mean 0, std. dev. 1
plt.plot(x,y, label="PDF")
plt.axhline(0, color='k', linewidth=0.2) # axis line
# mark the mean
plt.text(0, 0.51, '$\mu$')
plt.axvline(0, color='r')
# highlight one std. dev. to the right
plt.axvspan(0,1, facecolor='b', alpha=0.1, label="1 std. dev.")
plt.text(1.2, 0.3, '$\sigma$')
# take 1000 random samples and scatter plot them
samples = stats.norm.rvs(0,1,1000)
plt.scatter(samples, np.full(samples.shape, .2), s=448, c='b', alpha=0.1, marker='|', label="Samples")
plt.xlabel("$x$")
plt.ylabel("$P(x)$")
plt.legend()

```
In [ ]:
```plot_normal()

The *joint probability* of two random variables is written $$P(x,y)$$ and gives the probability that $x$ and $y$ take the a specific value *simultaneously* (i.e. $Pr(X=x) \land Pr(Y=y)$).

The *marginal probability* is the derivation of $P(x)$ from $P(x,y)$ by integrating (summing) over all the possible choices of $y$:
$$P(x) = \int_y P(x,y) dy$$.

**Marginalisation** just means integration over one or more variables from a joint distribution: it *removes* those variables from the distribution.

Two random variables are *independent* if the they do not have any dependence on each other. If this is the case then the joint distribution is just the product of the individual distributions:
$P(x,y) = P(x)P(y).$ This is not true in the general case where the variables have dependence.

The *conditional probability* of $x$ **given** $y$ is written as $$P(x|y)$$ and can be computed as $$P(x|y) = \frac{P(x,y)}{P(y)}.$$ This tells us how likely $x$ is to occur **if we already know** (or fix) the value of $y$.

```
In [ ]:
```def joint_marginal(cov):
# create an independent 2D normal distribution
x,y = np.meshgrid(np.linspace(-3,3,50), np.linspace(-3,3,50))
pos = np.empty(x.shape + (2,))
pos[:,:,0] = x
pos[:,:,1] = y
joint_pdf = scipy.stats.multivariate_normal.pdf(pos, [0,0], cov)
fig = plt.figure()
# plot the joint
ax = fig.add_subplot(2,2,1)
ax.axis('equal')
plt.title("Joint p(x,y)")
ax.pcolor(x,y,joint_pdf, cmap='viridis')
# plot the marginals
ax = fig.add_subplot(2,2,3)
ax.axis('equal')
plt.title("Marginal $P(x) = \int\ P(x,y) dy$")
ax.plot(x[0,:], np.sum(joint_pdf, axis=0))
ax = fig.add_subplot(2,2,2)
ax.axis('equal')
plt.title("Marginal $P(y) = \int\ P(x,y) dx$")
ax.plot(np.sum(joint_pdf, axis=1), x[0,:])
# plot p(x|y)
ax = fig.add_subplot(2,2,4)
ax.axis('equal')
plt.title("Conditional $P(x|y) = \\frac{P(x,y)}{P(y)}$")
marginal = np.tile(np.sum(joint_pdf, axis=0), (joint_pdf.shape[0],1))
ax.pcolor(x,y,joint_pdf/marginal, cmap='viridis')
joint_marginal([[1,0],[0.5,1]])

*Bayesians* treat probability as a **calculus of belief**; in this model of thought, probabilities are measures of degrees of belief. $P(A)=0$ means a belief that $A$ cannot be true and $P(A)=1$ is a belief that $A$ is absolutely certain.

Other representations of uncertainty are strictly inferior to probabilistic methods *in the sense that* a person, agent, computer placing "bets" on future events using probabilistic models has the best possible return out of all decision systems when there is uncertainty.

*Bayesians* allow for belief in states to be combined and manipulated via the rules of probability. The key process in Bayesian logic is *updating of beliefs*. Given some *prior* belief (it's Glasgow, it's not likely to be sunny) and some new evidence (there seems to be a bright reflection inside) we can update our belief to calculate the *posterior* -- our new probability that it is sunny outside. Bayesian inference requires that we accept priors over events, i.e. that we must explicitly quantify our assumptions with probability distributions.

There are only a few basic axioms of probability:

all possible events $A$ -- probabilities are 0, or positive and less than 1.

for the complete set of possible values $A \in \sigma$ in a set of possibilities $\sigma$ -- something always happens.

i.e. the probability of either of $A$ or $B$ happening is the sum of the independent probabilities minus the probability of both happening.

The conditional probability $P(A|B)$ means the probability that $B$ will happen *given that we already know $B$ to have happened*.
$$P(A|B) = \frac{P(A \land B)}{ P(B)}$$

Note that we can see an "event" as a random variable taking on a specific value i.e. $P(X=x)$

We often want to know the probability of a model $H$ (and its parameters $\theta$) given some data $p(H, \theta|D)$. But we are often in the situation that we can only compute the likelihood of the data being generated by the model. Bayes' rule gives a consistent way to invert the probability distribution, taking a prior into account: $$ P(A|B) = \frac{P(B|A) P(A)}{P(B)} $$

$$ P(H,\theta|D) = \frac{P(D|H,\theta) P(H,\theta)}{P(D)} $$$P(A|B)$ is called the *posterior*, $P(B|A)$ is called the *likelihood*, $P(A)$ is the *prior* and $P(B)$ is the *evidence*. Bayes' rule gives a consistent rule to take some prior belief and combine it with observed data to estimate a new distribution which combines them.

```
In [ ]:
```def prior_posterior(prior_mean=0, prior_std=1, n=10):
mean = prior_mean
std = prior_std
var = std*std
prior = scipy.stats.norm(mean,std)
evidence = scipy.stats.norm(1, 0.25)
xs = np.linspace(-5,5,200)
plt.fill_between(xs, prior.pdf(xs), label="Prior", alpha=0.1)
plt.fill_between(xs, evidence.pdf(xs), label="True", alpha=0.1)
sample_var = 1.0 # the *expected* variance of our observations
# note that changing this allows us to continously adjust our belief
# in our observations
for i in range(n):
sample = evidence.rvs()
# single step update for a normal distribution
mean = (var * sample + sample_var * mean) / (sample_var + var)
var = (var*sample_var) / (sample_var+var)
# plot the sample and the resulting pdf
plt.plot([sample,sample],[0,-0.1], 'c', alpha=0.7)
plt.plot(xs, scipy.stats.norm(mean,np.sqrt(var)).pdf(xs), 'k:', alpha=0.5)
plt.plot([sample,sample],[0,-0.1], 'c', alpha=0.7, label="Observations")
plt.fill_between(xs, scipy.stats.norm(mean,np.sqrt(var)).pdf(xs), color='g', label="Posterior", alpha=0.2)
plt.legend()

```
In [ ]:
```prior_posterior(0,0.75)

```
In [ ]:
```prior_posterior(0,3)

```
In [ ]:
```prior_posterior(-3,0.5)

```
In [ ]:
```prior_posterior(-3,0.5, n=100)

The probability of multiple **independent** random variables taking on a set of values can be computed from the product:
$$P(X,Y,Z) = P(X)P(Y)P(Z)$$
and in general
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} x_i$$

We often have to have to compute such products, but to multiply lots of values $<1$ leads to numerical issues. Instead, we often prefer to manipulate *log probabilities*, which can be summed instead of multiplied:
$$\log P(X_1, \dots, X_n) = \sum_{i=1}^{n} \log P(X_i)$$

This is simply a numerical convenience which avoids underflow. The **log-likelihood** is just $\log P(B|A)$, and is often more convenient to work with than the raw likelihood.

We can say that the posterior probability is *proportional* to the product of the prior and the likelihood. But to evaluate its value, we need to compute $P(B)$, the evidence.
This is tricky, but because probabilities must add up to 1, we can write $P(B)$ as:
$$ P(B) = \sum_{i} P(B|A_i) P(A_i) $$
for a set of discrete possibilities $A_i$ or
$$ P(B) = \int_{A} P(B|A) P(A) dA $$ for a continuous distribution over $A$.

This trick is essential in understanding Bayesian inference!

In general this is difficult to compute. For binary simple cases where there are only two possible outcomes ($A$ can only be true or false), Bayes' rule can be written as:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar A) P(\bar A)}, $$where $\bar A$ means "when A is false". In words:

- The probability that it is sunny given I can see a bright reflection is equal to:
- The probability that I would see a bright reflection if it
*were*sunny times the probability that it might be sunny- over

- The probability that I would see a bright reflection if it
*were*sunny times the probability that it might be sunny plus the probability that I would see a bright reflection if it were*not*sunny times the probability it might not be sunny.

- The probability that I would see a bright reflection if it

Transformations of expressions to graphs is familiar to most computer scientists -- it is an essential part of most optimising compilers. For example, the equation of a straight line might be written as a graph with dependencies as shown below (this is how a compiler would break down the expression):

If we have multiple dependent random variables whose distribution we want to infer, we can draw a graph of dependencies to form a *graphical model*. This explictly models dependencies between **random variables** (i.e. ones we don't know the value of precisely) and inference can be performed on the entire graph.

**In CS terms, we are writing expressions down without fixing the variables, and then allowing the distribution of the values to be inferred when we observe data.** This inference process narrows down the likely range a random variable could take on (hopefully!).

In a **probabilistic graphical model**, some nodes in the graph are **observed** -- that is we know their state because we have explicity measured it, and others are **unobserved** -- we know (or have guessed) the form of their distribution but not the parameters of that distribution. Some dependencies are deterministic (i.e. fully defined by the values of their parents), while others are stochastic (i.e. are characerised by distributions). We can infer the **posterior** distribution of unobserved nodes by integrating over the possible values that could have occured given the observed values.

We can modify our straight line equation to write a model for **linear regression**:

All we need to do is specify that we expected the output $y$ to be normally distributed around the equation of a line given by $m$ and $c$; we can now **infer** $\sigma, m, c$ from observed data.

Or we can fix any of them, and infer the remainder (if, e.g. we knew in advance that $c=0$, or that $m$ tended to be large, assuming we could express this prior belief as a valid distribution).

Our assumption here is that we will observe data which has a **latent structure** modelled by a linear dependence on a variable $x$, plus some normally-distributed observation noise.

**Note that we must put some prior distribution on every stochastic node; otherwise we cannot apply Bayes rule and therefore cannot perform updates.**

We can't have any "wild" stochastic nodes which do not eventually depend on deterministic nodes, via some chain of prior distributions.

We can draw samples from a distribution, which gives us a set of definite (non-random) values which are distributed according to the PDF or PMF.

The sample mean $\mu$ of a set of samples from a distribution is an estimate of the *expectation*, which improves as the number of samples $N$ increases.

We can write this as: $$ \mu = \frac{1}{N} \sum_{i=0}^{N} x_i \approx E[x] = \int_x x P(x) dx,$$ where $x_i$ are random samples from $P(x)$.

Furthermore, if we want to apply any function $f(x)$ to the distribution (e.g. to answer question like 'what is the expected value of a normal distribution whose "output" is squared'), we can estimate of $E[f(x)]$ very simply: $$ \frac{1}{N} \sum_{i=0}^{N} f(x_i) \approx E[f(x)] = \int_x f(x) P(x) dx,$$.

We often cannot compute the integral to normalise the probability distribution on the bottom of formula for Bayes Rule $P(B)$.

One common way to get round this is to use *Monte Carlo sampling*, where we approximate the integration over all the parameters by summing over random samples drawn from the distribution of parameters. There are many techinques to do this efficiently, as the naive sampling methods become very inefficient in high dimensional spaces (i.e. $N$ must be very large to reliably sample the space).

One of these techniques is *Markov chain Monte Carlo* (MCMC) constructs random walks which "wander" about in probablity distributions in a way that makes samples drawn from them represent the true distribution correctly. In other words, we can set up a model such that ergodic distribution of a a process (the MCMC process) is the posterior distribution.

The key idea is that if we can *evaluate* the likelihood $P(B|A)$ and prior $P(A)$ everywhere, we can find clever ways to sample from these distributions and thus the posterior, even if we can't draw samples from the posterior directly.
Details on how this can be done technically in are explained in these lecture notes by Chris Holmes. We'll be using the PyMC software to do the heavy lifting.

We'll use the excellent PyMC module to do the inference. If you have questions about this module, you can read this tutorial or the API docs. There's a new and even nicer version with very powerful capabilities currently called **PyMC3**, but the dependencies are "hard" to install at the moment.

Let's implement the linear regression model in the intro in practice, using PyMC to build a graphical model and then run MCMC to sample from the posterior (i.e. estimate the distribution of random variables after seeing some evidence).

```
In [ ]:
```### Bayesian Linear Regression with pymc
### We use Monte Carlo sampling to estimate the distribution of a linear function with a normally
### distributed error, given some observed data.
### Vaguely based on: http://matpalm.com/blog/2012/12/27/dead_simple_pymc/ and http://sabermetricinsights.blogspot.co.uk/2014/05/bayesian-linear-regression-with-pymc.html
## Utility function to plot the graph of a PyMC model
def show_dag(model):
dag = mc.graph.dag(model)
dag.write("graph.png",format="png")
from IPython.display import Image
i = Image(filename='graph.png')
return i
## generate data with a known distribution
## this will be our "observed" data
x = np.sort(np.random.uniform(0,20, (50,)))
m = 2
c = 15
epsilon = 3
# Add on some measurement noise, with std. dev. 3.0
noise = np.random.normal(0,epsilon, x.shape)
y = m * x + c + noise
plt.scatter(x,y, c='C1', label="Datapoints")
plt.plot(x, m*x+c, '--', lw=3, label="True")
plt.legend()
plt.xlabel("x")
plt.xlabel("y")
plt.title("Synthetic data, and true generating function")

```
In [ ]:
```print("True m: %.2f"%m)
print("True c: %.2f"%c)
print("True epsilon: %.2f"%epsilon)

```
In [ ]:
```## Now, set up the PyMC model
## Specify the prior distribution of the unknown line function variables
## Here, we assume a normal distribution over m and c
# NB: note that these are parameterised as having a mean and a *precision* (1/variance)
m_unknown = mc.Normal('m', 0, 0.01)
c_unknown = mc.Normal('c', 0, 0.001)
## specify a prior over the precision (inverse variance) of the error term
# precision = 1/variance
## Here we specify a uniform distribution from 0.001 to 10.0
precision = mc.Uniform('precision', lower=0.001, upper=10.0)
# specify the observed input variable
# we use a normal distribution, but this has no effect --
# the values are fixed and the parameters
# never updated; this is just a way of transforming x
# into a variable pymc can work with
# (it's really a hack)
x_obs = mc.Normal("x_obs", 0, 1, value=x, observed=True)
# our link function, which links our observations to the stochastic
# variables
@mc.deterministic(plot=False)
def line(m=m_unknown, c=c_unknown, x=x_obs):
return x*m+c
# specify the observed output variable
# (note we use tau instead of sigma, the precision paramterisation)
# note that it is critical to realise we are modelling a stochastic output variable
# whose *mean* and *variance* we are inferring
y_obs = mc.Normal('y_obs', mu=line, tau=precision, value=y, observed=True)
model = mc.Model([m_unknown, c_unknown, precision, x_obs, y_obs])
# display the graphical model
show_dag(model)

We can now tell PyMC to perform inference on this graph, given the data observed and the parameters we fixed. We have to specify how many samples to take from the *posterior distribution*. PyMC will then perform the sampling process, and return samples for all of the stochastic variables in the graph.

```
In [ ]:
```# sample from the distribution
mcmc = mc.MCMC(model)
mcmc.sample(iter=20000)

```
In [ ]:
```## plot histograms of possible parameter values
plt.figure()
plt.hist(mcmc.trace("m")[:], normed=True, bins=30)
plt.title("Estimate of m")
plt.figure()
plt.hist(mcmc.trace("c")[:], normed=True, bins=30)
plt.title("Estimate of c")
plt.figure()
plt.hist(np.sqrt(1.0/mcmc.trace("precision")[:]), normed=True, bins=30)
plt.title("Estimate of epsilon std.dev.")
plt.figure()

```
In [ ]:
```## now plot overlaid samples from the linear function
ms = mcmc.trace("m")[:]
cs = mcmc.trace("c")[:]
plt.title("Sampled fits")
plt.plot(x, y, '.', label="Observed")
plt.plot(x, x*m+c, '--C2', label="True")
xf = np.linspace(-20,40,200)
for m,c in zip(ms[::20], cs[::20]):
plt.plot(xf, xf*m+c, 'C1', alpha=0.005)
plt.plot(xf, xf*0, 'C1', alpha=0.00, label="Posterior model")
plt.legend()
plt.xlim(-20,40)
plt.ylim(-40,80)

We can include both **discrete** and **continuous** variables. A very important case is where we have a **mixture model**. That is, we believe our observations come from one of a number of distributions. For example, in modelling human heights, we might expect height to be normally distributed, but to have two different distributions for men and women.

It is very straightforward to add this to a PyMC graphical model; it is just another random variable to infer. This model is discrete, rather than continuous, and its value *selects* among possible parameter configurations for child nodes; this is a *hierarchical Bayesian model*.

```
In [ ]:
```## Adapted from the example given at
## http://stackoverflow.com/questions/18987697/how-to-model-a-mixture-of-3-normals-in-pymc
n = 3
ndata = 800
## Generate synthetic mixture-of-normals data, with means at -100,0,+50, and std. dev of 15,5,5
v = np.random.randint( 0, n, ndata)
data = ((v==0)*(np.random.normal(50,5,ndata)) +
(v==1)*(np.random.normal(-100,15,ndata)) +
(v==2)*np.random.normal(0,5,ndata))
## Plot the original data
plt.hist(data, bins=50);
plt.title("Histogram of mixture model synthetic data")

```
In [ ]:
```plt.plot(data, '.')
plt.title("Scatterplot of synthetic mixture model data")

We model this discrete component of our model as a **categorical variable**, which switches between multiple possible normal distributions.

We have to estimate:

- the distribution of the categorical variable (i.e. the weightings of the mixture)
- the parameters of each normal

This means we assume the data (for 3 mixture components) has three weights ($\alpha_1, \alpha_2, \alpha_3$) and two parameters for each distribution ($\mu_1, \sigma_1,\ \mu_2, \sigma_2,\ \mu_3, \sigma_3$)

In this case our weights are equal, but we can still infer the distribution and check whether the inference process recovers this uniform distribution.

We assume we know the *number* of mixtures here (it gets *much* harder to do the inference efficiently if we don't know this)

```
In [ ]:
```## A Dirichlet model specifies the prior distribution over categories
## All 1 means that every category is equally likely
dd = mc.Dirichlet('dd', theta=(1,)*n)
## This variable "selects" the category (i.e. the normal distribution)
## to use. The Dirichlet distribution sets the prior over the categories.
category = mc.Categorical('category', p=dd, size=ndata)
#category = mc.Container([mc.Categorical("category%i" % i, p=dd)
# for i in range(ndata)])
## Now we set our priors the precision and mean of each normal distribution
## Note the use of "size" to generate a **vector** of variables
## (i.e. one for each category)
## We expect the precision of each normal to be Gamma
## distributed (this mainly forces it to be positive!)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
## And the means of the normal to be normally distributed,
## with a precision of 0.01 (i.e. std. dev 100)
means = mc.Normal('means', 0, 0.01, size=n)
## These deterministic functions link the means of the
## observed distribution to the categories
## They just select one of the elements of the
## mean/precision vector, given the current value of category
## The input variables must be specified in the parameters,
## so that PyMC knows which variables to pass to it
@mc.deterministic
def mean(category=category, means=means):
return means[category]
@mc.deterministic
def prec(category=category, precs=precs):
return precs[category]
## Now we specify the variable we observe -- which is normally distributed, *but*
## we don't know the mean or precision.
## Instead, we pass the **functions** mean() and pred()
## which will be used at each sampling step.
## We specify the observed values of this node, and tell PyMC these are observed
## This is all that is needed to specify the model
obs = mc.Normal('obs', mean, prec, value=data, observed = True)
## Now we just bundle all the variables together for PyMC
model = mc.Model({'dd': dd,
'category': category,
'precs': precs,
'means': means,
'obs': obs})

```
In [ ]:
```def show_dag(model):
dag = mc.graph.dag(model)
dag.write("graph.png",format="png")
from IPython.display import Image
i = Image(filename='graph.png')
return i
show_dag(model)

```
In [ ]:
```mcmc = mc.MCMC(model)
## Now we tell the sampler what method to use
## Metropolis works well, but we must tell PyMC to use a specific
## discrete sampler for the category variable to get good results in a reasonable time
mcmc.use_step_method(mc.AdaptiveMetropolis, model.means)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.precs)
mcmc.use_step_method(mc.DiscreteMetropolis, model.category, proposal_distribution='Prior')
#mcmc.use_step_method(mc.DiscreteMetropolis, model.category) ## this step is key!
mcmc.use_step_method(mc.AdaptiveMetropolis, model.dd)
## Run the sampler (we'll discuss burn and thin later)
mcmc.sample(iter=525000, burn=10000)

```
In [ ]:
```plt.figure()
plt.hist(mcmc.trace('means').gettrace()[:], normed=True)
plt.title("Estimated means")
plt.legend(['Component 1', 'Component 2', 'Component 3'])
plt.figure()
## show the result in terms of std. dev. (i.e sqrt(1.0/precision))
plt.title("Estimated std. dev")
plt.hist(np.sqrt(1.0/mcmc.trace('precs').gettrace()[:]), normed=True)
plt.legend(['Component 1', 'Component 2', 'Component 3'])

```
In [ ]:
```pmf, bins, patches = plt.hist(mcmc.trace('category').gettrace()[:].ravel(), normed=True, bins=[0,1,2,3])
print(np.sum(pmf))

What we have plotted is the **posterior distribution of the model parameters**; i.e. the values we expect the model parameters to take on given the data we observed and our prior.

The **predictive posterior** is the *distribution over observations* we would expect to see. This means drawing samples from the model, while integrating over parameters from the posterior. By sampling from the predictive posterior, we are generating new synthetic data that should have the same statistical properties as the data (if our model is good).

One simple way of drawing samples from the predictive posterior is to iterate through the posterior samples, then sample from the model with the parameters fixed to the values for that posterior sample.

```
In [ ]:
```means = mcmc.trace('means').gettrace()[:]
precs = mcmc.trace('precs').gettrace()[:]
categories = mcmc.trace('category').gettrace()[:]
# iterate over all posterior samples
samples = []
for i in range(len(means)):
# sample from the predictive posterior
# we just draw five samples here; we could draw lots of them, giving better "resolution"
for j in range(5):
cat = categories[i][np.random.randint(0, ndata)]
samples.append(mc.rnormal(means[i][cat],precs[i][cat]))
plt.hist(samples, bins=np.linspace(-200, 200, 100));

As a further visualisation, we can visualise the *likelihood* of the data given the model, $P(D|H,\theta)$. All we have is samples from the posterior, so we can only compute the likelihood for each of those samples. This gives a trace of the likelihood. We can visualise this using a Box plot:

```
In [ ]:
```import mcmc_utils
plt.boxplot(mcmc_utils.logp_trace(mcmc, model, n_samples=20000));

```
In [ ]:
```# simple single Gaussian model
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
means = mc.Normal('means', 0, 0.001, size=n)
obs = mc.Normal('obs', mean, prec, value=data, observed = True)
model_simple = mc.Model({
'precs': precs,
'means': means,
'obs': obs})
mcmc_simple = mc.MCMC(model_simple)
mcmc_simple.sample(iter=525000, burn=10000)

```
In [ ]:
```import mcmc_utils
plt.boxplot(mcmc_utils.logp_trace(mcmc_simple, model_simple, n_samples=40000));

Back to the keyboard! We'd like to be able to produce probabilistic models of human behaviour; ideally *generative* models which could used to simulate plausible approximations of real behaviour.

*[Image credit: Mysid - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=364930]*

In this practical, we'll model key press behaviour; in particular
we will build a simple model which assumes *key hold durations* (how long each key is held down for) are assumed to be *LogNormal* distributed with distribution parameters that are different for each **type** of key; in this example:

- 0:
**navigation arrows** - 1:
**backspace** - 2:
**printable characters**numbers, letters, punctuation, whitespace

Your task is to estimate the parameters of the LogNormal distribution for each of these three possible key classes (i.e. a mixture model), and see if there are any systematic differences. This involves building a model much like the one above, but with a mixture component that is **observed** rather than inferred.

This will be based on a keylog of approximately 6000 key presses (my keypresses editing an old summerschool notebook).

The steps needed to complete this task are as follows:

- Load the data
- Eyeball the data
- Classify the keys into different types
- Construct the MCMC model
- Run it
- Display the results

Steps 1-3 are provided below. You will need to complete steps 4-6, by adapting the MCMC sampling code provided above.

For Step 4, You will need to construct the PyMC model specification, then run the MCMC sampler, much as in the example above. But the inter-key timings should be a Lognormal rather than normal (Gaussian). Key timings are (obviously!) positive and are right-skewed; log-normal distributions model this well. **Note that this is a mixture model, but unlike the above example, the mixture component is an observed variable (you know the key class)**

```
In [ ]:
```lognorm = scipy.stats.lognorm(s=1, loc=0)
x = np.linspace(-0.5, 5, 200)
plt.plot(x, lognorm.pdf(x))
plt.title("Log-normal PDF")

We can load the data from `keylogs.csv`

. This has four columns:

```
user_id delta_t duration_t keycode
```

`user_id`

is a unique integer per-user (there is only one user here).`delta_t`

is in milliseconds*.*`duration_t`

is the duration the key was held down, in milliseconds.`keycode`

is the keycode (result of`e.which`

in Javascript) of the keypress.

```
In [ ]:
```# Load the data, and give the columns sensible names
keylog = pd.read_csv('jupyter_keylog.csv',
names=['user_id', 'delta_t', 'duration_t', 'keycode'])

```
In [ ]:
```# summarize the data
print(keylog.describe())

```
In [ ]:
```plt.hist(keylog.delta_t, bins=np.linspace(0,500,50))
plt.title("Delta T")
plt.xlabel("Milliseconds")
plt.ylabel("Count")

```
In [ ]:
```plt.hist(keylog.duration_t, bins=np.linspace(0,500,50));
plt.title("Duration T")
plt.xlabel("Milliseconds")
plt.ylabel("Count")

```
In [ ]:
```plt.hist(keylog.keycode, bins=np.arange(1,100));
plt.title("Keycode")
plt.xlabel("Keycode")
plt.ylabel("Count")

We can classify keys using the `keys`

module, which lets us classify keys into semantic categories.

```
In [ ]:
```import keys
# these are all the key types known to the keys module
print(keys.get_key_types())

```
In [ ]:
```# convert a list of keycodes to their names, so we can see what kind of
# data we are dealing with
print(keys.key_to_name(keylog.keycode[0:100]))

The easiest way to encode information for the MCMC model is to use integer indices for each key "class", so that a standard `Categorical`

variable type can be used.

To classify keys into discrete classes, we can use `keys.classify_keys(k, classes)`

. This takes a list of possible classes, and for each key, it will output which classes it belongs to. This will be a list of indices (since in theory one key might be able to belong to multiple key classes). The list will be empty if the key does not fall into any of the specified classes.

```
In [ ]:
```print(keys.classify_keys(keylog.keycode[0:100], ['letters', 'backspace']))

```
In [ ]:
```class_names = ['arrows', 'backspace', 'character_keys']
classes = keys.classify_keys(keylog.keycode, class_names)
filtered_classes = [c[0] if len(c)>0 else None for c in classes]
print(filtered_classes[0:100])

Then we filter both the keycodes *and* the corresponding `duration_t`

, removing any `None`

keys, to create two new lists:

- a list of class codes
- a list of corresponding durations

```
In [ ]:
```class_codes = []
durations = []
for key_class, duration in zip(filtered_classes, keylog.duration_t):
if key_class is not None:
class_codes.append(key_class)
durations.append(duration)

```
In [ ]:
```# just print the first 100 elements of each of these vectors
print(class_codes[0:100])
print(durations[0:100])

- Use a prior of
`p=[1/3.0, 1/3.0, 1/3.0]`

for the Categorical variable (even though it is observed, a value for`p`

is required) - Make sure you flag the Categorical variable as being observed
- You can reuse the same priors for mean and prec as in the previous example (the parameters of the priors have little effect on the inference, as you can check yourself)
- Show your model with
`show_dag()`

before running it.

```
In [ ]:
```# Solution
# Put the MCMC model definition here

```
In [ ]:
```# Solution
show_dag(model)

```
In [ ]:
```## Solution
# Run your model

- You can use the traces above
- It is also informative to try drawing samples from the posterior distributions. PyMC2 has no convenient method to do this directly (PyMC3 does), but you can just iterate through a bunch of the mean/prec, and then sample from a corresponding distribution.

In PyMC, sampling from any distribution with known parameters can be done using mc.r[XXX], where [XXX] is the lowercase distribution name.

```
rnormal(0, 1) # draw a sample from Normal, mean=0, precision=1
rlognormal(0, 1) # draw a sample from the log Normal with u=0, precision=1
rgamma(0.1, 0.1) # draw a sample from the Gamma distribution with alpha=0.1, beta = 0.1
```

```
In [ ]:
```# 5 samples from lognormal, mu=0, prec=1
mc.rlognormal(0,1,size=5)

```
In [ ]:
```# Solution
# show visualisations like trace histograms, predicitive posterior

There are lots of interesting things we could also have done with just this subset of the data (class + duration):

- Explored different models for the time distribution (e.g. Gamma instead of lognormal)
- Predicted the key class
*given*the duration. - Try
**clustering**the durations (e.g. with a Dirichlet process) and see if we can infer groupings of key behaviour that correspond to key classes or keyboard locations.

The **great thing** about MCMC approaches is that you can basically write down your model and then run inference directly. There is no need to derive complex approximations, or to restrict ourselves to limited models for which we can compute answers analyitically.

MCMC allows us to use distributions *we can't even sample from directly*. First we couldn't calculate the evidence P(B), so we integrated; but we couldn't solve the integral, so we sampled; but then we couldn't directly sample from the distribution so we used MCMC. It's a very general approach! (approaches like approximate Bayesian computation (ABC) let us even do inference when we can't **evaluate** the likelihood, never mind sample from it)

The **bad thing** about MCMC approaches is that, even though it will do the "right thing" *asymptotically*, the choice of sampling strategy has a very large influence for the kind of sample runs that are practical to execute.

Bayesian inference should depend only on the priors and the evidence observed; but MCMC approaches also depend on the sampling strategy used to approximate the posterior.

We'll briefly describe how MCMC actually works, its limitations, and how we can detect MCMC problems and correct for them.

The simplest way to perform Monte Carlo sampling for a distribution we can't sample directly is to do **rejection sampling**. We have a distribution we want to sample given by a pdf $p(x)$, and instead sample from an easy distribution $q(x)$, (usually uniform, i.e. a box) where $q(x)>p(x) \forall x$. Then, we draw a new sample $x_q$ from $q(x)$ (horizontal sample) and then sample uniformly from $x_s = [0,x_q]$ (vertical sample) and see if $x_s<f(x_q)$. If we so we keep it as a draw from the distribution, otherwise we reject it.

```
In [ ]:
```def rejection_sample(p,interval_x, interval_y, n):
xs = np.random.uniform(interval_x[0], interval_x[1],(n,))
ys = np.random.uniform(interval_y[0], interval_y[1],(n,))
kept = p(xs)>ys
return kept, np.logical_not(kept), xs, ys
def odd_pdf(x):
return scipy.stats.halfcauchy.pdf(x, scale=0.1)
kept, rejected, xs, ys = rejection_sample(odd_pdf, [-1,1], [0,7], 2000)
plt.plot(xs[kept], ys[kept], 'r.')
plt.plot(xs[kept], np.zeros_like(xs[kept])+0.01, 'r|', markersize=20)
for x,y in zip(xs[kept], ys[kept]):
plt.plot([x,x], [0.01,y], 'r', alpha=0.1)
plt.plot(xs[rejected], ys[rejected], 'k.')
xf = np.linspace(-1,1,100)
plt.plot(xf,odd_pdf(xf), '--')
print "Fraction under the curve: %.2f" % (np.sum(kept) / float(len(xs)))
plt.title("Red points are distributed according to the blue PDF")

This is easy to implement, but works very poorly in *high dimensions* because the rejection rate increases exponentially with increasing dimension. We had about 5% acceptance in the above example: in 10D for a similar distribution, we would have an acceptance rate of $9\times 10^{-14}$: in practice we would *never* find a sample underneath the PDF by chance alone!

Metropolis-Hastings (or just plain Metropolis) takes a different approach, and is able to work in high-dimensional spaces. Metropolis samples also uses an auxiliary distribution $q(x)$, but it uses this to **wander around** in the distribution space, accepting jumps to new positions based on both $q(x)$ and $p(x)$. This random walk (a **Markov chain**, because we make a random jump conditioned only on where we currently are) is a the "Markov Chain" bit of "Markov Chain Monte Carlo".

We just take our current position $x$, and propose a new position $x^\prime = x + x_q$, where $x_q$ is a random sample drawn from $q(x)$. This makes local steps in the space of the probability density. If $q(x)$ has a simple, symmetric form (e.g. is Gaussian), there is a very simple formula to decide whether to accept or reject a step from $p(x)$ to a new candidate position $p(x^\prime)$: $$ p(\text{accept}) = \begin{cases} p(x^\prime)/p(x), & p(x)>=p(x^\prime) \\ 1, & p(x)<p(x^\prime) \end{cases} $$

The asymmetric case is only slightly more involved, but it is very unusual to need to use it.

Amazingly, this simple procedure will (in the limit!) take unbiased samples from the complete posterior distribution. In practice, convergence can be difficult to achieve, especially in higher dimensions, with multi-modal posteriors.

```
In [ ]:
```def metropolis(p,q,x_init,n):
# Perform Metropolis MCMC sampling.
# p(x): a function that can be evaluated anywhere. p(x) returns the value of p at x
# q(): a function q that draws a sample from a symmetric distribution and returns it
# x_init: a starting point
# n: number of samples
x = x_init
samples = []
rejected = [] # we only keep the rejected samples to plot them later
for i in range(n):
# find a new candidate spot to jump to
x_prime = x + q()
# if it's better, go right away
if p(x_prime)>p(x):
x = x_prime
samples.append(x_prime)
else:
# if not, go with probability proportional to the
# ratio between the new point and the current one
pa = p(x_prime)/p(x)
if np.random.uniform(0,1)<pa:
x = x_prime
samples.append(x_prime)
else:
rejected.append(x_prime)
return np.array(samples), np.array(rejected)

```
In [ ]:
```# test the sampling process
# create an interesting distribution p (just a mixture of two gaussians)
A = np.array([[0.15, 0.9], [-0.3, 2.5]])
p1 = lambda x:scipy.stats.multivariate_normal(mean=[0,0], cov=A).pdf(x)
p2 = lambda x:scipy.stats.multivariate_normal(mean=[3,0], cov=np.eye(2)).pdf(x)
p = lambda x:p1(x)*0.5+p2(x)*0.5
# create a proposal distribution, with std. dev. 0.25
q = lambda: np.random.normal(0,0.75,(2,))
# make 500 MCMC steps
accept, reject = metropolis(p,q,[0.1, 0.3], 500)
# plot a heatmap of the distribution, along with the
# accepted and rejected samples from that MCMC chain
plt.figure(figsize=(10,10))
plt.plot(accept[:,0], accept[:,1])
plt.plot(accept[:,0], accept[:,1], 'b.')
plt.plot(reject[:,0], reject[:,1], 'rx')
x,y = np.meshgrid(np.linspace(-5,5,30), np.linspace(-4,4,30))
plt.imshow(p(np.dstack([x,y])), extent=[-4,4,-4,4], cmap='viridis')
plt.grid("off")
plt.title("MCMC sampling with Metropolis-Hastings")

```
In [ ]:
```# run the chain for longer, and plot the trace and the histogram of the variable
accept, reject = metropolis(p,q,[0.1, 0.3], 5000)
plt.subplot(1,2,1)
plt.plot(accept[:,0])
plt.title("Trace for $x$")
plt.subplot(1,2,2)
plt.hist(accept[:,0], bins=20)
plt.title("Histogram of $x$")
plt.figure()
plt.plot(accept[:,1])
plt.subplot(1,2,1)
plt.title("Trace for $y$")
plt.plot(accept[:,1])
plt.subplot(1,2,2)
plt.title("Histogram of $y$")
plt.hist(accept[:,1], bins=20);
plt.figure()
xbins, ybins = np.linspace(-5,5,30), np.linspace(-4,4,30)
plt.hist2d(accept[:,0], accept[:,1], normed=True, bins=[xbins, ybins])
plt.title("2D histogram of MCMC points")

There are many other MCMC samplers, such as:

**Gibbs**samplers, which are very efficient when we can sample from the conditional distribution (i.e. from one dimension of a distribution at a time), but not from the joint directly.**Hamiltonian**samplers, which extend Metropolis-like steps with "virtual physics" which pushes the samples in sensible directions (i.e. not down the gradient of the function!). These are implemented in PyMC3, as well as in the MCMC system**STAN**, and can work very well in otherwise intractably high-dimensional spaces.**Slice**samplers, which are very clever and efficient, but only work for 1-dimensional (univariate) distributions.

MCMC tries to draw **independent, unbiased** samples from the posterior, but the sampling process (like Metropolis), is not inherently unbiased. For example, successive samples in a random walk are correlated and obviously not independent.

And although the Markov Chain approach (under fairly relaxed assumptions) will asympotically sample from all of the posterior, if the random walk starts off very far from the bulk of the distribution, it will "wander in the wilderness" for some time before reaching significant probability density. This means early samples from the distribution might be unreasonably dense in very low probability regions in the posterior. How "good" the Markov chain is at sampling from the posterior is called **mixing**; some MCMC setups may mix very badly until they get warmed up.

To mitigate these two common issues, there are a couple of standard tricks:

**Burn-in**, which ignores the first $n$ samples from an MCMC draw, to make sure the chain is "mixing" well. Typically, several thousand samples might be ignored.**Thinnning**, which takes one sample from every $k$ consecutive samples from the chain, to reduce correlation. Values of raound 5-50 are common.

Tuning these is a matter of art!

```
In [ ]:
```## Burn-in and thinning plot
# introduce correlations
y = accept[:,1]
x = np.arange(len(y))
# discard 400 samples, keep every 8th sample
burn = 400
thin = 8
plt.plot(x[0:burn], y[0:burn], 'r:')
plt.plot(x[burn::thin], y[burn::thin], 'go', markersize=8)
plt.plot(x[burn:], y[burn:], 'k:', alpha=0.1)
plt.plot(x[burn:], y[burn:], 'k.', alpha=0.1)
plt.axvline(burn, c='r')
plt.text(15,2.5,"Burn-in period")

**None of these are definitive**, but can give skilled MCMC practitioners insight into the operation of the sampling process.

If you're interested in learning more about MCMC, David Mackay's book chapter is a good reference.