# Inferring typing behaviour



In [ ]:

# standard imports
import numpy as np
import matplotlib.pyplot as plt
import sys, os, time
import pandas as pd
import pymc as mc
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (12.0, 8.0)
import scipy.stats

# fix for pydot on windows machines
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'




In [ ]:

%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
OutputArea.prototype._should_scroll = function(){return false};



## Topic purpose

This section will cover probabilistic inference. Rather than learning a single set of parameters by optimisation, we can infer probability distributions over possible models that might be compatible with our data.

Concretely, we'll use Markov Chain Monte Carlo to make it simple and easy (if not very efficient) to work with probabilistic models. We will use these approaches to model typing behaviour at the keystroke level, and both make predictions given some data ("how likely is it that this sequence was typed by user X?") and quantify how much confidence we have in those models.

The key learning outcome is how to think about interaction in a probabilistic context. We will expand this in the second part of the course where we see how to apply probabilistic models to dynamic processes.

We will:

## Part I

### Keystroke modelling

Specifically, we'll look at keystroke timing data. We will look at a simple dataset, which captures a sequence of key events and their timing; both the inter-key time and the key hold duration.

This is enough to answer some interesting questions; for example, can we model differences in the way keys are pressed based on the distribution of inter-key times? Or more precisely, can we infer a probability distribution over users that might be generating a timing sequence and update that as we see more keystrokes?

Alternatively, we could see if the type of key hit (e.g. space versus letter, or left side versus right side) influences the timing of the next keystroke, and build a predictive model to infer the likely next time to key impact. This could be used to optimise keyboard layouts (as you will see later in the week).

If we are very ambitiuous, we could try and recover the keystrokes themselves from timing alone. There probably isn't enough information to do this, at least without very large user-specific datasets and a robust language model, but we could probably identify unusual or special keys (e.g. identify Space or Enter or Esc in the timing stream).

The "key" message here is that we are going to build probabilistic models -- we will maintain uncertainty over future possibilities.

### Probabilistic programming

We will use a computational representation where random variables are first-class values and we can write down our models without writing down the detailed mechanics of inference itself.

This is probabilistic programming. We will use PyMC to construct our probabilistic models.

### Why is this computational HCI?

We will build a principed, statistical model of user behavior that we can reason about, and estimate parameters of that model from quantitative observations of data.

This model is robust (it appropriately represents uncertainty) and generative (it can simulate behaviour compatible with observations).

The generative nature of the model is particularly important in powering optimisation approaches; being able to simulate potential human behaviours can save having to engage in very slow human data acquisition. If the equivalent function evaluation can be done from a realistic model, the optimisation can be radically sped up.

# Background

## Probabilistic approaches

#### Random variables

A random variable is a variable that can take on different values, but we do not know what value it has; i.e. one that is "unassigned". Probability theory allows us to manipulate random variables without having to assign them a specific value.

#### Distributions

A probability distribution defines how likely different states of a random variable are. The probability distribution of a random variable $X$ is written: $$P(X)$$ and is shorthand for the expression $$Pr(X=x),$$ i.e. that the variable $X$ takes on the specific value $x$.

Random variables can be continuous (e.g. the height of a person) or discrete (the value showing on the face of a dice). The distribution of a discrete variable is described with a probability mass function (PMF) which gives each outcome a specific value. A continuous variable has a probability density function (PDF) which specifies the spread of the probability as a continuous function.

A probability distribution must assign probabilities in the range 0 (impossible) to 1 (definite) and the PMF or PDF must sum/integrate to exactly 1 as the random variable under consideration must take on some value.

## PMF example: sum of dice rolls

A very simple discrete PMF is the expected value of the sum of two six-faced die. In this case, $P(X) = P(D_1+D_2)$ for two uniform discrete variables $D_1, D_2 \in \{1,2,3,4,5,6 \}$



In [ ]:

# the PMF of the sum of two dice rolls
def two_dice():
# form the sum of the cross product of these possibilities
roll_two = [i+j for i in range(1,7) for j in range(1,7)]
# now plot the histogram
pmf, edges, patches = plt.hist(roll_two, normed=True, bins=range(1,14))
print("Sum of PMF %.2f" % np.sum(pmf)) # sum of probability should be *exactly* 1.0
plt.title("PMF of sum of 2d6 dice")
plt.xlabel("Sum of rolls x")
plt.ylabel("P(x)")




In [ ]:

two_dice()



## PDF example: the normal disribution

The most ubiquitous of all continuous PDFs is the normal or Gaussian distribution. It assigns probabilities to real values $x \in {\mathbb{R}}$. It has a density given by the PDF:

$$P(X) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$


In [ ]:

import scipy.stats as stats
# Plot the PDF of the normal distribution
def plot_normal():
# plot the normal (Gaussian distibution) along with a set of points drawn from that distribution
x = np.linspace(-4,4,100)
y = stats.norm.pdf(x,0,1) # mean 0, std. dev. 1
plt.plot(x,y, label="PDF")
plt.axhline(0, color='k', linewidth=0.2) # axis line

# mark the mean
plt.text(0, 0.51, '$\mu$')
plt.axvline(0, color='r')
# highlight one std. dev. to the right
plt.axvspan(0,1, facecolor='b', alpha=0.1, label="1 std. dev.")
plt.text(1.2, 0.3, '$\sigma$')
# take 1000 random samples and scatter plot them
samples = stats.norm.rvs(0,1,1000)
plt.scatter(samples, np.full(samples.shape, .2), s=448, c='b', alpha=0.1, marker='|', label="Samples")
plt.xlabel("$x$")
plt.ylabel("$P(x)$")
plt.legend()




In [ ]:

plot_normal()



### Joint, marginal, conditional

The joint probability of two random variables is written $$P(x,y)$$ and gives the probability that $x$ and $y$ take the a specific value simultaneously (i.e. $Pr(X=x) \land Pr(Y=y)$).

The marginal probability is the derivation of $P(x)$ from $P(x,y)$ by integrating (summing) over all the possible choices of $y$: $$P(x) = \int_y P(x,y) dy$$.

Marginalisation just means integration over one or more variables from a joint distribution: it removes those variables from the distribution.

Two random variables are independent if the they do not have any dependence on each other. If this is the case then the joint distribution is just the product of the individual distributions: $P(x,y) = P(x)P(y).$ This is not true in the general case where the variables have dependence.

The conditional probability of $x$ given $y$ is written as $$P(x|y)$$ and can be computed as $$P(x|y) = \frac{P(x,y)}{P(y)}.$$ This tells us how likely $x$ is to occur if we already know (or fix) the value of $y$.



In [ ]:

def joint_marginal(cov):
# create an independent 2D normal distribution
x,y = np.meshgrid(np.linspace(-3,3,50), np.linspace(-3,3,50))
pos = np.empty(x.shape + (2,))
pos[:,:,0] = x
pos[:,:,1] = y
joint_pdf = scipy.stats.multivariate_normal.pdf(pos, [0,0], cov)
fig = plt.figure()
# plot the joint
ax.axis('equal')
plt.title("Joint p(x,y)")
ax.pcolor(x,y,joint_pdf, cmap='viridis')
# plot the marginals
ax.axis('equal')
plt.title("Marginal $P(x) = \int\ P(x,y) dy$")
ax.plot(x[0,:], np.sum(joint_pdf, axis=0))
ax.axis('equal')
plt.title("Marginal $P(y) = \int\ P(x,y) dx$")
ax.plot(np.sum(joint_pdf, axis=1), x[0,:])
# plot p(x|y)
ax.axis('equal')
plt.title("Conditional $P(x|y) = \\frac{P(x,y)}{P(y)}$")
marginal = np.tile(np.sum(joint_pdf, axis=0), (joint_pdf.shape[0],1))
ax.pcolor(x,y,joint_pdf/marginal, cmap='viridis')
joint_marginal([[1,0],[0.5,1]])



## Probability theory and Bayesian inference

#### Probability as a calculus of belief

Bayesians treat probability as a calculus of belief; in this model of thought, probabilities are measures of degrees of belief. $P(A)=0$ means a belief that $A$ cannot be true and $P(A)=1$ is a belief that $A$ is absolutely certain.

#### Probability as the optimal way of representing uncertainty

Other representations of uncertainty are strictly inferior to probabilistic methods in the sense that a person, agent, computer placing "bets" on future events using probabilistic models has the best possible return out of all decision systems when there is uncertainty.

Bayesians allow for belief in states to be combined and manipulated via the rules of probability. The key process in Bayesian logic is updating of beliefs. Given some prior belief (it's Glasgow, it's not likely to be sunny) and some new evidence (there seems to be a bright reflection inside) we can update our belief to calculate the posterior -- our new probability that it is sunny outside. Bayesian inference requires that we accept priors over events, i.e. that we must explicitly quantify our assumptions with probability distributions.

### Axioms of probability

There are only a few basic axioms of probability:

#### Boundedness

$$0 \leq P(A) < 1$$

all possible events $A$ -- probabilities are 0, or positive and less than 1.

#### Sum to unity

$$\sum_A P(A)=1$$

for the complete set of possible values $A \in \sigma$ in a set of possibilities $\sigma$ -- something always happens.

#### Joint probability

$$P(A \lor B) = P(A) + P(B) - P(A \land B),$$

i.e. the probability of either of $A$ or $B$ happening is the sum of the independent probabilities minus the probability of both happening.

#### Conditional probability

The conditional probability $P(A|B)$ means the probability that $B$ will happen given that we already know $B$ to have happened. $$P(A|B) = \frac{P(A \land B)}{ P(B)}$$

### Events

Note that we can see an "event" as a random variable taking on a specific value i.e. $P(X=x)$

#### Prior, likelihood, posterior

We often want to know the probability of a model $H$ (and its parameters $\theta$) given some data $p(H, \theta|D)$. But we are often in the situation that we can only compute the likelihood of the data being generated by the model. Bayes' rule gives a consistent way to invert the probability distribution, taking a prior into account: $$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

$$P(H,\theta|D) = \frac{P(D|H,\theta) P(H,\theta)}{P(D)}$$

$P(A|B)$ is called the posterior, $P(B|A)$ is called the likelihood, $P(A)$ is the prior and $P(B)$ is the evidence. Bayes' rule gives a consistent rule to take some prior belief and combine it with observed data to estimate a new distribution which combines them.



In [ ]:

def prior_posterior(prior_mean=0, prior_std=1, n=10):
mean = prior_mean
std = prior_std
var = std*std
prior = scipy.stats.norm(mean,std)
evidence = scipy.stats.norm(1, 0.25)

xs = np.linspace(-5,5,200)
plt.fill_between(xs, prior.pdf(xs), label="Prior", alpha=0.1)
plt.fill_between(xs, evidence.pdf(xs), label="True", alpha=0.1)

sample_var = 1.0 # the *expected* variance of our observations
# note that changing this allows us to continously adjust our belief
# in our observations

for i in range(n):

sample = evidence.rvs()
# single step update for a normal distribution
mean = (var * sample + sample_var * mean) / (sample_var + var)
var = (var*sample_var) / (sample_var+var)
# plot the sample and the resulting pdf
plt.plot([sample,sample],[0,-0.1], 'c', alpha=0.7)
plt.plot(xs, scipy.stats.norm(mean,np.sqrt(var)).pdf(xs), 'k:', alpha=0.5)
plt.plot([sample,sample],[0,-0.1], 'c', alpha=0.7, label="Observations")

plt.fill_between(xs, scipy.stats.norm(mean,np.sqrt(var)).pdf(xs), color='g', label="Posterior", alpha=0.2)
plt.legend()




In [ ]:

prior_posterior(0,0.75)




In [ ]:

prior_posterior(0,3)




In [ ]:

prior_posterior(-3,0.5)




In [ ]:

prior_posterior(-3,0.5, n=100)



#### Log probabilities

The probability of multiple independent random variables taking on a set of values can be computed from the product: $$P(X,Y,Z) = P(X)P(Y)P(Z)$$ and in general $$P(X_1, \dots, X_n) = \prod_{i=1}^{n} x_i$$

We often have to have to compute such products, but to multiply lots of values $<1$ leads to numerical issues. Instead, we often prefer to manipulate log probabilities, which can be summed instead of multiplied: $$\log P(X_1, \dots, X_n) = \sum_{i=1}^{n} \log P(X_i)$$

This is simply a numerical convenience which avoids underflow. The log-likelihood is just $\log P(B|A)$, and is often more convenient to work with than the raw likelihood.

## Integration over the evidence

We can say that the posterior probability is proportional to the product of the prior and the likelihood. But to evaluate its value, we need to compute $P(B)$, the evidence. This is tricky, but because probabilities must add up to 1, we can write $P(B)$ as: $$P(B) = \sum_{i} P(B|A_i) P(A_i)$$ for a set of discrete possibilities $A_i$ or $$P(B) = \int_{A} P(B|A) P(A) dA$$ for a continuous distribution over $A$.

This trick is essential in understanding Bayesian inference!

In general this is difficult to compute. For binary simple cases where there are only two possible outcomes ($A$ can only be true or false), Bayes' rule can be written as:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar A) P(\bar A)},$$

where $\bar A$ means "when A is false". In words:

• The probability that it is sunny given I can see a bright reflection is equal to:
• The probability that I would see a bright reflection if it were sunny times the probability that it might be sunny
• over
• The probability that I would see a bright reflection if it were sunny times the probability that it might be sunny plus the probability that I would see a bright reflection if it were not sunny times the probability it might not be sunny.

# Representing probabilistic models for computation

#### Graphical models

Transformations of expressions to graphs is familiar to most computer scientists -- it is an essential part of most optimising compilers. For example, the equation of a straight line might be written as a graph with dependencies as shown below (this is how a compiler would break down the expression):

If we have multiple dependent random variables whose distribution we want to infer, we can draw a graph of dependencies to form a graphical model. This explictly models dependencies between random variables (i.e. ones we don't know the value of precisely) and inference can be performed on the entire graph.

In CS terms, we are writing expressions down without fixing the variables, and then allowing the distribution of the values to be inferred when we observe data. This inference process narrows down the likely range a random variable could take on (hopefully!).

In a probabilistic graphical model, some nodes in the graph are observed -- that is we know their state because we have explicity measured it, and others are unobserved -- we know (or have guessed) the form of their distribution but not the parameters of that distribution. Some dependencies are deterministic (i.e. fully defined by the values of their parents), while others are stochastic (i.e. are characerised by distributions). We can infer the posterior distribution of unobserved nodes by integrating over the possible values that could have occured given the observed values.

### Linear regression

We can modify our straight line equation to write a model for linear regression:

All we need to do is specify that we expected the output $y$ to be normally distributed around the equation of a line given by $m$ and $c$; we can now infer $\sigma, m, c$ from observed data.

Or we can fix any of them, and infer the remainder (if, e.g. we knew in advance that $c=0$, or that $m$ tended to be large, assuming we could express this prior belief as a valid distribution).

Our assumption here is that we will observe data which has a latent structure modelled by a linear dependence on a variable $x$, plus some normally-distributed observation noise.

Note that we must put some prior distribution on every stochastic node; otherwise we cannot apply Bayes rule and therefore cannot perform updates.

We can't have any "wild" stochastic nodes which do not eventually depend on deterministic nodes, via some chain of prior distributions.

### Sampling

We can draw samples from a distribution, which gives us a set of definite (non-random) values which are distributed according to the PDF or PMF.

The sample mean $\mu$ of a set of samples from a distribution is an estimate of the expectation, which improves as the number of samples $N$ increases.

We can write this as: $$\mu = \frac{1}{N} \sum_{i=0}^{N} x_i \approx E[x] = \int_x x P(x) dx,$$ where $x_i$ are random samples from $P(x)$.

Furthermore, if we want to apply any function $f(x)$ to the distribution (e.g. to answer question like 'what is the expected value of a normal distribution whose "output" is squared'), we can estimate of $E[f(x)]$ very simply: $$\frac{1}{N} \sum_{i=0}^{N} f(x_i) \approx E[f(x)] = \int_x f(x) P(x) dx,$$.

### Monte Carlo inference

We often cannot compute the integral to normalise the probability distribution on the bottom of formula for Bayes Rule $P(B)$.

One common way to get round this is to use Monte Carlo sampling, where we approximate the integration over all the parameters by summing over random samples drawn from the distribution of parameters. There are many techinques to do this efficiently, as the naive sampling methods become very inefficient in high dimensional spaces (i.e. $N$ must be very large to reliably sample the space).

One of these techniques is Markov chain Monte Carlo (MCMC) constructs random walks which "wander" about in probablity distributions in a way that makes samples drawn from them represent the true distribution correctly. In other words, we can set up a model such that ergodic distribution of a a process (the MCMC process) is the posterior distribution.

The key idea is that if we can evaluate the likelihood $P(B|A)$ and prior $P(A)$ everywhere, we can find clever ways to sample from these distributions and thus the posterior, even if we can't draw samples from the posterior directly. Details on how this can be done technically in are explained in these lecture notes by Chris Holmes. We'll be using the PyMC software to do the heavy lifting.

## Let's do it: PyMC

We'll use the excellent PyMC module to do the inference. If you have questions about this module, you can read this tutorial or the API docs. There's a new and even nicer version with very powerful capabilities currently called PyMC3, but the dependencies are "hard" to install at the moment.

Let's implement the linear regression model in the intro in practice, using PyMC to build a graphical model and then run MCMC to sample from the posterior (i.e. estimate the distribution of random variables after seeing some evidence).



In [ ]:

### Bayesian Linear Regression with pymc
### We use Monte Carlo sampling to estimate the distribution of a linear function with a normally
### distributed error, given some observed data.
### Vaguely based on: http://matpalm.com/blog/2012/12/27/dead_simple_pymc/ and http://sabermetricinsights.blogspot.co.uk/2014/05/bayesian-linear-regression-with-pymc.html

## Utility function to plot the graph of a PyMC model
def show_dag(model):
dag = mc.graph.dag(model)
dag.write("graph.png",format="png")
from IPython.display import Image
i = Image(filename='graph.png')
return i

## generate data with a known distribution
## this will be our "observed" data
x = np.sort(np.random.uniform(0,20, (50,)))
m = 2
c = 15
epsilon = 3
# Add on some measurement noise, with std. dev. 3.0
noise =   np.random.normal(0,epsilon, x.shape)
y = m * x + c + noise

plt.scatter(x,y, c='C1', label="Datapoints")
plt.plot(x, m*x+c, '--', lw=3, label="True")
plt.legend()
plt.xlabel("x")
plt.xlabel("y")
plt.title("Synthetic data, and true generating function")




In [ ]:

print("True m: %.2f"%m)
print("True c: %.2f"%c)
print("True epsilon: %.2f"%epsilon)




In [ ]:

## Now, set up the PyMC model
## Specify the prior distribution of the unknown line function variables
## Here, we assume a normal distribution over m and c
# NB: note that these are parameterised as having a mean and a *precision* (1/variance)
m_unknown = mc.Normal('m', 0, 0.01)
c_unknown = mc.Normal('c', 0, 0.001)

## specify a prior over the precision (inverse variance) of the error term
# precision = 1/variance
## Here we specify a uniform distribution from 0.001 to 10.0
precision = mc.Uniform('precision', lower=0.001, upper=10.0)

# specify the observed input variable
# we use a normal distribution, but this has no effect --
# the values are fixed and the parameters
# never updated; this is just a way of transforming x
# into a variable pymc can work with
# (it's really a hack)
x_obs = mc.Normal("x_obs", 0, 1, value=x, observed=True)

# variables
@mc.deterministic(plot=False)
def line(m=m_unknown, c=c_unknown, x=x_obs):
return x*m+c

# specify the observed output variable
# (note we use tau instead of sigma, the precision paramterisation)
# note that it is critical to realise we are modelling a stochastic output variable
# whose *mean* and *variance* we are inferring
y_obs =  mc.Normal('y_obs', mu=line, tau=precision, value=y, observed=True)

model = mc.Model([m_unknown, c_unknown, precision, x_obs, y_obs])

# display the graphical model
show_dag(model)



## Running the model

We can now tell PyMC to perform inference on this graph, given the data observed and the parameters we fixed. We have to specify how many samples to take from the posterior distribution. PyMC will then perform the sampling process, and return samples for all of the stochastic variables in the graph.



In [ ]:

# sample from the distribution
mcmc = mc.MCMC(model)
mcmc.sample(iter=20000)



## Plotting the results

We can view the histograms of the posterior samples (called the trace). This lets us see the full distribution of possible values these parameters could take on. True Bayesian inference always provides distributions of parameter values rather than point parameter estimates.



In [ ]:

## plot histograms of possible parameter values
plt.figure()
plt.hist(mcmc.trace("m")[:], normed=True, bins=30)
plt.title("Estimate of m")
plt.figure()
plt.hist(mcmc.trace("c")[:], normed=True, bins=30)
plt.title("Estimate of c")
plt.figure()
plt.hist(np.sqrt(1.0/mcmc.trace("precision")[:]), normed=True, bins=30)
plt.title("Estimate of epsilon std.dev.")
plt.figure()



## Plotting the model predictions

We can also plot the predictions in the form of the regression lines each parameter set would correspond to. By overlaying one regression line for each sample, with suitably low opacity, we can build up an impression of the models that are compatible with the data.



In [ ]:

## now plot overlaid samples from the linear function
ms = mcmc.trace("m")[:]
cs = mcmc.trace("c")[:]

plt.title("Sampled fits")
plt.plot(x, y, '.', label="Observed")
plt.plot(x, x*m+c, '--C2', label="True")
xf = np.linspace(-20,40,200)
for m,c in zip(ms[::20], cs[::20]):
plt.plot(xf, xf*m+c, 'C1', alpha=0.005)

plt.plot(xf, xf*0, 'C1', alpha=0.00, label="Posterior model")
plt.legend()
plt.xlim(-20,40)
plt.ylim(-40,80)



## A simple mixture model

We can include both discrete and continuous variables. A very important case is where we have a mixture model. That is, we believe our observations come from one of a number of distributions. For example, in modelling human heights, we might expect height to be normally distributed, but to have two different distributions for men and women.

It is very straightforward to add this to a PyMC graphical model; it is just another random variable to infer. This model is discrete, rather than continuous, and its value selects among possible parameter configurations for child nodes; this is a hierarchical Bayesian model.



In [ ]:

## Adapted from the example given at
## http://stackoverflow.com/questions/18987697/how-to-model-a-mixture-of-3-normals-in-pymc

n = 3
ndata = 800
## Generate synthetic mixture-of-normals data, with means at -100,0,+50, and std. dev of 15,5,5
v = np.random.randint( 0, n, ndata)
data = ((v==0)*(np.random.normal(50,5,ndata)) +
(v==1)*(np.random.normal(-100,15,ndata)) +
(v==2)*np.random.normal(0,5,ndata))
## Plot the original data
plt.hist(data, bins=50);
plt.title("Histogram of mixture model synthetic data")




In [ ]:

plt.plot(data, '.')
plt.title("Scatterplot of synthetic mixture model data")



# Graphical models with discrete variables

We model this discrete component of our model as a categorical variable, which switches between multiple possible normal distributions.

We have to estimate:

• the distribution of the categorical variable (i.e. the weightings of the mixture)
• the parameters of each normal

This means we assume the data (for 3 mixture components) has three weights ($\alpha_1, \alpha_2, \alpha_3$) and two parameters for each distribution ($\mu_1, \sigma_1,\ \mu_2, \sigma_2,\ \mu_3, \sigma_3$)

In this case our weights are equal, but we can still infer the distribution and check whether the inference process recovers this uniform distribution.

We assume we know the number of mixtures here (it gets much harder to do the inference efficiently if we don't know this)



In [ ]:

## A Dirichlet model specifies the prior distribution over categories
## All 1 means that every category is equally likely
dd = mc.Dirichlet('dd', theta=(1,)*n)

## This variable "selects" the category (i.e. the normal distribution)
## to use. The Dirichlet distribution sets the prior over the categories.
category = mc.Categorical('category', p=dd, size=ndata)
#category = mc.Container([mc.Categorical("category%i" % i, p=dd)
#                         for i in range(ndata)])

## Now we set our priors the precision and mean of each normal distribution
## Note the use of "size" to generate a **vector** of variables
## (i.e. one for each category)

## We expect the precision of each normal to be Gamma
## distributed (this mainly forces it to be positive!)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)

## And the means of the normal to be normally distributed,
## with a precision of 0.01 (i.e. std. dev 100)
means = mc.Normal('means', 0, 0.01, size=n)

## These deterministic functions link the means of the
## observed distribution to the categories
## They just select one of the elements of the
## mean/precision vector, given the current value of category
## The input variables must be specified in the parameters,
## so that PyMC knows which variables to pass to it
@mc.deterministic
def mean(category=category, means=means):
return means[category]

@mc.deterministic
def prec(category=category, precs=precs):
return precs[category]

## Now we specify the variable we observe -- which is normally distributed, *but*
## we don't know the mean or precision.
## Instead, we pass the **functions** mean() and pred()
## which will be used at each sampling step.
## We specify the observed values of this node, and tell PyMC these are observed
## This is all that is needed to specify the model
obs = mc.Normal('obs', mean, prec, value=data, observed = True)

## Now we just bundle all the variables together for PyMC
model = mc.Model({'dd': dd,
'category': category,
'precs': precs,
'means': means,
'obs': obs})




In [ ]:

def show_dag(model):
dag = mc.graph.dag(model)
dag.write("graph.png",format="png")
from IPython.display import Image
i = Image(filename='graph.png')
return i

show_dag(model)




In [ ]:

mcmc = mc.MCMC(model)

## Now we tell the sampler what method to use
## Metropolis works well, but we must tell PyMC to use a specific
## discrete sampler for the category variable to get good results in a reasonable time
mcmc.use_step_method(mc.DiscreteMetropolis, model.category, proposal_distribution='Prior')

#mcmc.use_step_method(mc.DiscreteMetropolis, model.category) ## this step is key!

## Run the sampler (we'll discuss burn and thin later)
mcmc.sample(iter=525000, burn=10000)



# Results

Again, we can plot the histograms of the "trace" of the MCMC sampling process. This gives us samples from the posterior distribution.



In [ ]:

plt.figure()
plt.hist(mcmc.trace('means').gettrace()[:], normed=True)
plt.title("Estimated means")
plt.legend(['Component 1', 'Component 2', 'Component 3'])
plt.figure()
## show the result in terms of std. dev. (i.e sqrt(1.0/precision))
plt.title("Estimated std. dev")
plt.hist(np.sqrt(1.0/mcmc.trace('precs').gettrace()[:]), normed=True)
plt.legend(['Component 1', 'Component 2', 'Component 3'])




In [ ]:

pmf, bins, patches = plt.hist(mcmc.trace('category').gettrace()[:].ravel(), normed=True, bins=[0,1,2,3])
print(np.sum(pmf))



## Predictive posterior: sampling from the model

What we have plotted is the posterior distribution of the model parameters; i.e. the values we expect the model parameters to take on given the data we observed and our prior.

The predictive posterior is the distribution over observations we would expect to see. This means drawing samples from the model, while integrating over parameters from the posterior. By sampling from the predictive posterior, we are generating new synthetic data that should have the same statistical properties as the data (if our model is good).

One simple way of drawing samples from the predictive posterior is to iterate through the posterior samples, then sample from the model with the parameters fixed to the values for that posterior sample.



In [ ]:

means = mcmc.trace('means').gettrace()[:]
precs = mcmc.trace('precs').gettrace()[:]
categories = mcmc.trace('category').gettrace()[:]
# iterate over all posterior samples
samples = []
for i in range(len(means)):
# sample from the predictive posterior
# we just draw five samples here; we could draw lots of them, giving better "resolution"
for j in range(5):
cat = categories[i][np.random.randint(0, ndata)]
samples.append(mc.rnormal(means[i][cat],precs[i][cat]))

plt.hist(samples, bins=np.linspace(-200, 200, 100));



## Model likelihood

As a further visualisation, we can visualise the likelihood of the data given the model, $P(D|H,\theta)$. All we have is samples from the posterior, so we can only compute the likelihood for each of those samples. This gives a trace of the likelihood. We can visualise this using a Box plot:



In [ ]:

import mcmc_utils
plt.boxplot(mcmc_utils.logp_trace(mcmc, model, n_samples=20000));



We can compare this likelihood with that of a simpler model (here, a single Gaussian, instead of a mixture)



In [ ]:

# simple single Gaussian model
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
means = mc.Normal('means', 0, 0.001, size=n)
obs = mc.Normal('obs', mean, prec, value=data, observed = True)

model_simple = mc.Model({
'precs': precs,
'means': means,
'obs': obs})
mcmc_simple = mc.MCMC(model_simple)
mcmc_simple.sample(iter=525000, burn=10000)




In [ ]:

import mcmc_utils
plt.boxplot(mcmc_utils.logp_trace(mcmc_simple, model_simple, n_samples=40000));



## Practical: Modelling typing with a MCMC model

Back to the keyboard! We'd like to be able to produce probabilistic models of human behaviour; ideally generative models which could used to simulate plausible approximations of real behaviour.

[Image credit: Mysid - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=364930]

In this practical, we'll model key press behaviour; in particular we will build a simple model which assumes key hold durations (how long each key is held down for) are assumed to be LogNormal distributed with distribution parameters that are different for each type of key; in this example:

• 1: backspace
• 2: printable characters numbers, letters, punctuation, whitespace

Your task is to estimate the parameters of the LogNormal distribution for each of these three possible key classes (i.e. a mixture model), and see if there are any systematic differences. This involves building a model much like the one above, but with a mixture component that is observed rather than inferred.

This will be based on a keylog of approximately 6000 key presses (my keypresses editing an old summerschool notebook).

## Steps

The steps needed to complete this task are as follows:

2. Eyeball the data
3. Classify the keys into different types
4. Construct the MCMC model
5. Run it
6. Display the results

Steps 1-3 are provided below. You will need to complete steps 4-6, by adapting the MCMC sampling code provided above.

For Step 4, You will need to construct the PyMC model specification, then run the MCMC sampler, much as in the example above. But the inter-key timings should be a Lognormal rather than normal (Gaussian). Key timings are (obviously!) positive and are right-skewed; log-normal distributions model this well. Note that this is a mixture model, but unlike the above example, the mixture component is an observed variable (you know the key class)



In [ ]:

lognorm = scipy.stats.lognorm(s=1, loc=0)
x = np.linspace(-0.5, 5, 200)

plt.plot(x, lognorm.pdf(x))
plt.title("Log-normal PDF")



We can load the data from keylogs.csv. This has four columns:

    user_id  delta_t  duration_t  keycode


• user_id is a unique integer per-user (there is only one user here). delta_t is in milliseconds since the previous keydown event. duration_t is the duration the key was held down, in milliseconds.
• keycode is the keycode (result of e.which in Javascript) of the keypress.


In [ ]:

# Load the data, and give the columns sensible names
names=['user_id', 'delta_t', 'duration_t', 'keycode'])




In [ ]:

# summarize the data
print(keylog.describe())



## 2. Eyeballing data

Histograms are a quick way to visualise the distributions. You can try alternative visualisations if you are interested (e.g. a scatterplot of duration_t versus delta_t).



In [ ]:

plt.hist(keylog.delta_t, bins=np.linspace(0,500,50))
plt.title("Delta T")
plt.xlabel("Milliseconds")
plt.ylabel("Count")




In [ ]:

plt.hist(keylog.duration_t, bins=np.linspace(0,500,50));
plt.title("Duration T")
plt.xlabel("Milliseconds")
plt.ylabel("Count")




In [ ]:

plt.hist(keylog.keycode, bins=np.arange(1,100));
plt.title("Keycode")
plt.xlabel("Keycode")
plt.ylabel("Count")



## 3. Classifying keys

We can classify keys using the keys module, which lets us classify keys into semantic categories.



In [ ]:

import keys
# these are all the key types known to the keys module
print(keys.get_key_types())




In [ ]:

# convert a list of keycodes to their names, so we can see what kind of
# data we are dealing with
print(keys.key_to_name(keylog.keycode[0:100]))



### Getting class indices

The easiest way to encode information for the MCMC model is to use integer indices for each key "class", so that a standard Categorical variable type can be used.

To classify keys into discrete classes, we can use keys.classify_keys(k, classes). This takes a list of possible classes, and for each key, it will output which classes it belongs to. This will be a list of indices (since in theory one key might be able to belong to multiple key classes). The list will be empty if the key does not fall into any of the specified classes.



In [ ]:

print(keys.classify_keys(keylog.keycode[0:100], ['letters', 'backspace']))



## Creating the final dataset for inference

We classify the keys into arrow, backspace and character_keys as the three classes. Since each key can only belong to one of these classes, we filter this into a list of integers, one per key, with None for keys not having any of these classes.



In [ ]:

class_names = ['arrows', 'backspace', 'character_keys']
classes = keys.classify_keys(keylog.keycode, class_names)
filtered_classes = [c[0] if len(c)>0 else None for c in classes]
print(filtered_classes[0:100])



Then we filter both the keycodes and the corresponding duration_t, removing any None keys, to create two new lists:

• a list of class codes
• a list of corresponding durations


In [ ]:

class_codes = []
durations = []
for key_class, duration in zip(filtered_classes, keylog.duration_t):
if key_class is not None:
class_codes.append(key_class)
durations.append(duration)



## Dataset complete

This is our final dataset for the MCMC model.



In [ ]:

# just print the first 100 elements of each of these vectors
print(class_codes[0:100])
print(durations[0:100])



## Hints

• Use a prior of p=[1/3.0, 1/3.0, 1/3.0] for the Categorical variable (even though it is observed, a value for p is required)
• Make sure you flag the Categorical variable as being observed
• You can reuse the same priors for mean and prec as in the previous example (the parameters of the priors have little effect on the inference, as you can check yourself)
• Show your model with show_dag() before running it.


In [ ]:

# Solution
# Put the MCMC model definition here




In [ ]:

# Solution
show_dag(model)



## 5. Run the model



In [ ]:

## Solution



## 6. Report the results

• You can use the traces above
• It is also informative to try drawing samples from the posterior distributions. PyMC2 has no convenient method to do this directly (PyMC3 does), but you can just iterate through a bunch of the mean/prec, and then sample from a corresponding distribution.

In PyMC, sampling from any distribution with known parameters can be done using mc.r[XXX], where [XXX] is the lowercase distribution name.

rnormal(0, 1) # draw a sample from Normal, mean=0, precision=1
rlognormal(0, 1) # draw a sample from the log Normal with u=0, precision=1
rgamma(0.1, 0.1) # draw a sample from the Gamma distribution with  alpha=0.1, beta = 0.1


In [ ]:

# 5 samples from lognormal, mu=0, prec=1
mc.rlognormal(0,1,size=5)




In [ ]:

# Solution
# show visualisations like trace histograms, predicitive posterior



# Further thoughts

There are lots of interesting things we could also have done with just this subset of the data (class + duration):

• Explored different models for the time distribution (e.g. Gamma instead of lognormal)
• Predicted the key class given the duration.
• Try clustering the durations (e.g. with a Dirichlet process) and see if we can infer groupings of key behaviour that correspond to key classes or keyboard locations.

## MCMC in practice: sampling issues

The great thing about MCMC approaches is that you can basically write down your model and then run inference directly. There is no need to derive complex approximations, or to restrict ourselves to limited models for which we can compute answers analyitically.

MCMC allows us to use distributions we can't even sample from directly. First we couldn't calculate the evidence P(B), so we integrated; but we couldn't solve the integral, so we sampled; but then we couldn't directly sample from the distribution so we used MCMC. It's a very general approach! (approaches like approximate Bayesian computation (ABC) let us even do inference when we can't evaluate the likelihood, never mind sample from it)

The bad thing about MCMC approaches is that, even though it will do the "right thing" asymptotically, the choice of sampling strategy has a very large influence for the kind of sample runs that are practical to execute.

Bayesian inference should depend only on the priors and the evidence observed; but MCMC approaches also depend on the sampling strategy used to approximate the posterior.

## How MCMC works

We'll briefly describe how MCMC actually works, its limitations, and how we can detect MCMC problems and correct for them.

### Rejection sampling

The simplest way to perform Monte Carlo sampling for a distribution we can't sample directly is to do rejection sampling. We have a distribution we want to sample given by a pdf $p(x)$, and instead sample from an easy distribution $q(x)$, (usually uniform, i.e. a box) where $q(x)>p(x) \forall x$. Then, we draw a new sample $x_q$ from $q(x)$ (horizontal sample) and then sample uniformly from $x_s = [0,x_q]$ (vertical sample) and see if $x_s<f(x_q)$. If we so we keep it as a draw from the distribution, otherwise we reject it.



In [ ]:

def rejection_sample(p,interval_x, interval_y, n):
xs = np.random.uniform(interval_x[0], interval_x[1],(n,))
ys = np.random.uniform(interval_y[0], interval_y[1],(n,))
kept  = p(xs)>ys
return kept, np.logical_not(kept), xs, ys

def odd_pdf(x):
return scipy.stats.halfcauchy.pdf(x, scale=0.1)

kept, rejected, xs, ys = rejection_sample(odd_pdf, [-1,1], [0,7], 2000)
plt.plot(xs[kept], ys[kept], 'r.')

plt.plot(xs[kept], np.zeros_like(xs[kept])+0.01, 'r|', markersize=20)
for x,y in zip(xs[kept], ys[kept]):
plt.plot([x,x], [0.01,y], 'r', alpha=0.1)

plt.plot(xs[rejected], ys[rejected], 'k.')

xf = np.linspace(-1,1,100)
plt.plot(xf,odd_pdf(xf), '--')

print "Fraction under the curve: %.2f" % (np.sum(kept) / float(len(xs)))
plt.title("Red points are distributed according to the blue PDF")



### Rejection sampling in high-dimension

This is easy to implement, but works very poorly in high dimensions because the rejection rate increases exponentially with increasing dimension. We had about 5% acceptance in the above example: in 10D for a similar distribution, we would have an acceptance rate of $9\times 10^{-14}$: in practice we would never find a sample underneath the PDF by chance alone!

### Metropolis-Hastings

Metropolis-Hastings (or just plain Metropolis) takes a different approach, and is able to work in high-dimensional spaces. Metropolis samples also uses an auxiliary distribution $q(x)$, but it uses this to wander around in the distribution space, accepting jumps to new positions based on both $q(x)$ and $p(x)$. This random walk (a Markov chain, because we make a random jump conditioned only on where we currently are) is a the "Markov Chain" bit of "Markov Chain Monte Carlo".

We just take our current position $x$, and propose a new position $x^\prime = x + x_q$, where $x_q$ is a random sample drawn from $q(x)$. This makes local steps in the space of the probability density. If $q(x)$ has a simple, symmetric form (e.g. is Gaussian), there is a very simple formula to decide whether to accept or reject a step from $p(x)$ to a new candidate position $p(x^\prime)$: $$p(\text{accept}) = \begin{cases} p(x^\prime)/p(x), & p(x)>=p(x^\prime) \\ 1, & p(x)<p(x^\prime) \end{cases}$$

The asymmetric case is only slightly more involved, but it is very unusual to need to use it.

Amazingly, this simple procedure will (in the limit!) take unbiased samples from the complete posterior distribution. In practice, convergence can be difficult to achieve, especially in higher dimensions, with multi-modal posteriors.



In [ ]:

def metropolis(p,q,x_init,n):
# Perform Metropolis MCMC sampling.
# p(x): a function that can be evaluated anywhere. p(x) returns the value of p at x
# q(): a function q that draws a sample from a symmetric distribution and returns it
# x_init: a starting point
# n: number of samples
x = x_init

samples = []
rejected = [] # we only keep the rejected samples to plot them later
for i in range(n):
x_prime = x + q()
# if it's better, go right away
if p(x_prime)>p(x):
x = x_prime
samples.append(x_prime)
else:
# if not, go with probability proportional to the
# ratio between the new point and the current one
pa = p(x_prime)/p(x)
if np.random.uniform(0,1)<pa:
x = x_prime
samples.append(x_prime)
else:
rejected.append(x_prime)

return np.array(samples), np.array(rejected)




In [ ]:

# test the sampling process
# create an interesting distribution p (just a mixture of two gaussians)
A = np.array([[0.15, 0.9], [-0.3, 2.5]])
p1 = lambda x:scipy.stats.multivariate_normal(mean=[0,0], cov=A).pdf(x)
p2 = lambda x:scipy.stats.multivariate_normal(mean=[3,0], cov=np.eye(2)).pdf(x)
p = lambda x:p1(x)*0.5+p2(x)*0.5
# create a proposal distribution, with std. dev. 0.25
q = lambda: np.random.normal(0,0.75,(2,))

# make 500 MCMC steps
accept, reject = metropolis(p,q,[0.1, 0.3], 500)

# plot a heatmap of the distribution, along with the
# accepted and rejected samples from that MCMC chain
plt.figure(figsize=(10,10))
plt.plot(accept[:,0], accept[:,1])
plt.plot(accept[:,0], accept[:,1], 'b.')
plt.plot(reject[:,0], reject[:,1], 'rx')
x,y = np.meshgrid(np.linspace(-5,5,30), np.linspace(-4,4,30))
plt.imshow(p(np.dstack([x,y])), extent=[-4,4,-4,4], cmap='viridis')
plt.grid("off")
plt.title("MCMC sampling with Metropolis-Hastings")




In [ ]:

# run the chain for longer, and plot the trace and the histogram of the variable
accept, reject = metropolis(p,q,[0.1, 0.3], 5000)
plt.subplot(1,2,1)
plt.plot(accept[:,0])
plt.title("Trace for $x$")
plt.subplot(1,2,2)
plt.hist(accept[:,0], bins=20)
plt.title("Histogram of $x$")

plt.figure()

plt.plot(accept[:,1])
plt.subplot(1,2,1)
plt.title("Trace for $y$")
plt.plot(accept[:,1])
plt.subplot(1,2,2)
plt.title("Histogram of $y$")
plt.hist(accept[:,1], bins=20);

plt.figure()
xbins, ybins = np.linspace(-5,5,30), np.linspace(-4,4,30)
plt.hist2d(accept[:,0], accept[:,1], normed=True, bins=[xbins, ybins])
plt.title("2D histogram of MCMC points")



### Others

There are many other MCMC samplers, such as:

• Gibbs samplers, which are very efficient when we can sample from the conditional distribution (i.e. from one dimension of a distribution at a time), but not from the joint directly.
• Hamiltonian samplers, which extend Metropolis-like steps with "virtual physics" which pushes the samples in sensible directions (i.e. not down the gradient of the function!). These are implemented in PyMC3, as well as in the MCMC system STAN, and can work very well in otherwise intractably high-dimensional spaces.
• Slice samplers, which are very clever and efficient, but only work for 1-dimensional (univariate) distributions.

## Burn-in and thinning

MCMC tries to draw independent, unbiased samples from the posterior, but the sampling process (like Metropolis), is not inherently unbiased. For example, successive samples in a random walk are correlated and obviously not independent.

And although the Markov Chain approach (under fairly relaxed assumptions) will asympotically sample from all of the posterior, if the random walk starts off very far from the bulk of the distribution, it will "wander in the wilderness" for some time before reaching significant probability density. This means early samples from the distribution might be unreasonably dense in very low probability regions in the posterior. How "good" the Markov chain is at sampling from the posterior is called mixing; some MCMC setups may mix very badly until they get warmed up.

To mitigate these two common issues, there are a couple of standard tricks:

• Burn-in, which ignores the first $n$ samples from an MCMC draw, to make sure the chain is "mixing" well. Typically, several thousand samples might be ignored.
• Thinnning, which takes one sample from every $k$ consecutive samples from the chain, to reduce correlation. Values of raound 5-50 are common.

Tuning these is a matter of art!



In [ ]:

## Burn-in and thinning plot

# introduce correlations
y = accept[:,1]
x = np.arange(len(y))

# discard 400 samples, keep every 8th sample
burn = 400
thin = 8
plt.plot(x[0:burn], y[0:burn], 'r:')
plt.plot(x[burn::thin], y[burn::thin], 'go', markersize=8)
plt.plot(x[burn:], y[burn:], 'k:', alpha=0.1)
plt.plot(x[burn:], y[burn:], 'k.', alpha=0.1)

plt.axvline(burn, c='r')
plt.text(15,2.5,"Burn-in period")



The PyMC manual explains a number of other diagnostic statistics and plots. None of these are definitive, but can give skilled MCMC practitioners insight into the operation of the sampling process.

If you're interested in learning more about MCMC, David Mackay's book chapter is a good reference.