Inferring typing behaviour: I


In [ ]:
import numpy as np
import sklearn.datasets, sklearn.linear_model, sklearn.neighbors
import sklearn.manifold, sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns
import sys, os, time
import scipy.io.wavfile, scipy.signal
import pymc as mc
import cv2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (18.0, 10.0)

In [ ]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Topic purpose

This section will cover probabilistic inference. Rather than learning a single set of parameters by optimisation, we can model probability distributions over possible models that might be compatible with our data. We'll use Monte Carlo sampling to make it simple and easy (if not very efficient) to work with probabilistic models. We will use these approaches to model typing behaviour at the keystroke level, and both make predictions given some data ("how likely is it that this sequence was typed by user X?") and quantify how much confidence we have in those models.

Outline

In the next two hours, we will:

[Part I]

[Part II]


Keystroke modelling

Specifically, we'll look at inter-key timing data, that you have all been collecting from your notebooks today. This dataset is very simple. It captures a sequence of key events and their timing.

This is enough to answer some interesting questions; for example, can we detect which user is typing based on the distribution of inter-key times? Or more precisely, can we infer a probability distribution over users that might be generating a timing sequence and update that as we see more keystrokes?

Alternatively, we could see if the type of key hit (e.g. space versus letter, or left side versus right side) influences the timing of the next keystroke, and build a predictive model to infer the likely next time to key impact.

If we are very ambitiuous, we could try and recover the keystrokes themselves from timing alone. There probably isn't enough information to do this, at least without very large user-specific datasets and a robust language model, but we could probably identify unusual keys (e.g. identify space or enter "punctuation" in the timing stream).

The "key" message here is that we are going to build probabilistic models -- we will maintain uncertainty over future possibilities.

This is probabilistic programming.


Graphical models

Transformations of expressions to graphs is familiar to most computer scientists -- it is an essential part of most optimising compilers. For example, the equation of a straight line might be written as a graph (this is how a compiler would break down the expression):

Adding unknowns

If we have multiple dependent random variables whose distribution we want to infer, we can draw a graph of dependencies to form a graphical model. This explictly models dependencies between random variables (i.e. ones we don't know the value of precisely) and inference can be performed on the entire graph.

In CS terms, we are writing expressions down without fixing the variables, and then allowing the distribution of the values to be inferred when we observe data. This inference process narrows down the likely range a random variable could take on (hopefully!).

In a probabilistic graphical model, some nodes in the graph are observed -- that is we know their state because we have explicity measured it, and others are unobserved -- we know (or have guessed) the form of their distribution but not the parameters of that distribution. Some dependencies are deterministic (i.e. fully defined by the values of their parents), while others are stochastic. We can infer the posterior distribution of unobserved nodes by integrating over the possible values that could have occured given the observed values.

We can modify our straight line equation to write a model for linear regression:

All we need to do is specify that we expected the output $y$ to be normally distributed around the equation of a line given by $m$ and $c$; we can now infer $\sigma, m, c$ from observed data. Or we can fix any of them, and infer the remainder (if, e.g. we knew in advance that $c=0$). Our assumption here is that we will observe data which has a latent structure modelled by a linear dependence on a variable $x$, plus some normally-distributed observation noise.

Note that we must put some prior distribution on every stochastic node.

Probabilistic approaches

Random variables

A random variable is a variable that can (at some point in the future) take on different values; i.e. one that is "unassigned". Proability theory allows us to manipulate random variables without having to assign them a specific value.

Distributions

A probability distribution defines how likely different states of a random variable are. The probability distribution of a random variable $x$ is written: $$P(x)$$ Random variables can be continuous (e.g. the height of a person) or discrete (the value showing on the face of a dice). The distribution of a discrete variable is described with a probability mass function (PMF) which gives each outcome a specific value. A continuous variable has a probability density function (PDF) which specifies the spread of the probability as a continuous function.

A probability distribution must assign probabilities in the range 0 (impossible) to 1 (definite) and the PMF or PDF must integrate to exactly 1 as the random variable under consideration must take on some value.

Joint, marginal, conditional

The joint probability of two random variables is written $$P(x,y)$$ and gives the probability that $x$ and $y$ take the same value simultaneously.

The marginal probability is the derivation of $P(x)$ from $P(x,y)$ by integrating (summing) over all the possible choices of $y$: $$P(x) = \int P(x,y) dy$$

Two random variables are independent if the they do not have any dependence on each other. If this is the case then the joint distribution is just the product of the individual distributions: $P(x,y) = P(x)P(y)$

The conditional probability of $x$ given $y$ is written as $$P(x|y)$$ and can be computed as $$P(x|y) = \frac{P(x,y)}{P(x)}.$$ This tells us how likely $x$ is to occur if we already know (or fix) the value of $y$.


In [ ]:
def joint_marginal(cov):
    # create an independent 2D normal distribution
    x,y = np.meshgrid(np.linspace(-3,3,50), np.linspace(-3,3,50))
    pos = np.empty(x.shape + (2,))
    pos[:,:,0] = x
    pos[:,:,1] = y
    joint_pdf = scipy.stats.multivariate_normal.pdf(pos, [0,0], cov)
    fig = plt.figure()
    # plot the joint
    ax = fig.add_subplot(2,2,1)
    ax.axis('equal')
    plt.title("Joint p(x,y)")
    ax.pcolor(x,y,joint_pdf, cmap='viridis')
    # plot the marginals
    ax = fig.add_subplot(2,2,3)
    ax.axis('equal')
    plt.title("Marginal $p(x) = \int\  p(x,y) dy$")
    ax.plot(x[0,:], np.sum(joint_pdf, axis=0))
    ax = fig.add_subplot(2,2,2)
    ax.axis('equal')
    plt.title("Marginal $p(y) = \int\  p(x,y) dx$")
    ax.plot(np.sum(joint_pdf, axis=1), x[0,:])
    # plot p(x|y)
    ax = fig.add_subplot(2,2,4)
    ax.axis('equal')
    plt.title("Conditional $p(x|y) = \\frac{p(x,y)}{p(x)}$")
    marginal = np.tile(np.sum(joint_pdf, axis=0), (joint_pdf.shape[0],1))
    ax.pcolor(x,y,joint_pdf/marginal, cmap='viridis')
joint_marginal([[1,0],[0.5,1]])

Probability theory and Bayesian inference

Probability as a calculus of belief

Bayesians treat probability as a calculus of belief; in this model of thought, probabilities are measures of degrees of belief. $P(X)=0$ means a belief that $X$ cannot be true and $P(X)=1$ is a belief that $X$ is certainly 1.

Probability as the optimal way of representing uncertainty

Other representations of uncertainty are strictly inferior to probabilistic methods in the sense that a person, agent, computer placing "bets" on future events using probabilistic models has the best possible "return" out of all decision systems when there is uncertainty.

Bayesians allow for belief in states to be combined and manipulated via the rules of probability. The key process in Bayesian logic is updating of beliefs. Given some prior belief (it's Glasgow, it's not likely to be sunny) and some new evidence (there seems to be a bright reflection inside) we can update our belief to calculate the posterior -- our new probability that it is sunny outside. Bayesian inference requires that we accept priors over events, i.e. that we must explicitly quantify our assumptions with probability distributions.

Prior, likelihood, posterior

We often want to know the probability of a model (and its parameters) given some data $p(H, \theta|D)$. But we can only compute the likelihood of the data being generated by the model. Bayes' rule gives a consistent model for inverting the probability distribution: $$ p(A|B) = \frac{p(B|A) P(A)}{P(B)} $$

$P(A|B)$ is called the posterior, $P(B|A)$ is called the likelihood, $P(A)$ is the prior and $P(B)$ is the evidence. Bayes' rule gives a consistent rule to take some prior belief and combine it with observed data to estimate a new distribution which combines them.


In [ ]:
def prior_posterior():
    mean = 0
    std = 1
    prior = scipy.stats.norm(mean,std)
    evidence = scipy.stats.norm(1, 0.1)
    xs = np.linspace(-5,5,200)
    plt.plot(xs, prior.pdf(xs), label="Prior")
    for i in range(10):
        # this is **not** inference! just a visual example!
        mean = 0.8*mean + 0.2*1
        std = 0.8*std + 0.2*0.05
        v = evidence.rvs()
        plt.plot([v,v],[0,1], 'c', alpha=0.7)
        plt.plot(xs, scipy.stats.norm(mean,std).pdf(xs), 'k:', alpha=0.5)
    plt.plot([v,v],[0,1], 'c', alpha=0.7, label="Observations")
        
    plt.plot(xs, scipy.stats.norm(mean,std).pdf(xs), 'g', label="Posterior")
    plt.legend()
    
prior_posterior()

Integration over the evidence

We can say that the posterior probability is proportional to the product of the prior and the likelihood. But to evaluate its value, we need to compute $P(B)$, the evidence. This is tricky, but because probabilities must add up to 1, we can write $P(B)$ as: $$ P(B) = \sum_{i} P(B|A_i) P(A_i) $$ for a set of discrete possibilities $A_i$ or $$ P(B) = \int_{A} P(B|A) P(A) dA $$ for a continuous distribution over $A$.

This trick is essential in understanding Bayesian inference!

In general this is difficult to compute. For binary simple cases where there are only two possible outcomes ($A$ can only be true or false), Bayes' rule can be written as:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\bar A) P(\bar A)}, $$

where $\bar A$ means "when A is false". In words:

The probability that it is sunny given I can see a bright reflection is equal to:
    The probability that I would see a bright reflection if it *were* sunny times the probability that it might be sunny 
    over
    The probability that I would see a bright reflection if it *were* sunny times the probability that it might be sunny plus the probability that I would see a bright reflection if it were *not* sunny times the probability it might not be sunny.

Sampling

We can draw samples from a distribution, which gives us a set of definite (non-random) variables which are distributed according to the PDF or PMF. The mean $\mu$ of a set of samples from a distribution is an estimate of the expectation, which improves as the number of samples $N$ increases.

We can write this as: $$ \frac{1}{N} \sum_{i=0}^{N} x_i \approx \int p(x) dx,$$ where $x_i$ are random samples from $p(x)$.

Furthermore, if we want to apply any function $f(x)$ to the distribution (e.g. to answer question like 'what is the expected value of a normal distribution whose "output" is squared'), we can estimate of $E[f(x)]$ very simply: $$ \frac{1}{N} \sum_{i=0}^{N} f(x_i) \approx \int f(x) p(x) dx,$$.

Monte Carlo inference

We often cannot compute the integral to normalise the probability distribution. One common way to get round this is to use Monte Carlo sampling, where we approximate the integration over all the parameters by summing over random samples drawn from the distribution of parameters. There are many techinques to do this efficiently, as the naive sampling methods become very inefficient in high dimensional spaces (i.e. $N$ must be very large to reliably sample the space).

This allows us to solve the tricky denominator in Bayes Rule:

$$ p(A|B) = \frac{p(B|A) P(A)}{P(B)} $$$$ = \frac{p(B|A) P(A)}{\int_{A} P(B|A) P(A) dA} $$

$$ \approx \frac{p(B|A) P(A)}{\frac{1}{N} \sum_{i=0}^{N} P(B|x) P(x)} $$

Since we can usually sample from $P(A)$ easily, this becomes very straightforward to compute (as long as we can sample from P(B|A), but there are tricks to do this even if it has an unpleasant form).

One of these techniques is Markov chain Monte Carlo (MCMC) constructs random walks which "wander" about in probablity distributions in a way that makes samples drawn from them represent the true distribution correctly.



Let's do it: PyMC

We'll use the excellent PyMC module to do the inference. If you have questions about this module, you can read this tutorial or the API docs. There's a new and even nicer version with very powerful capabilities currently called PyMC3, but the dependencies are "hard" to install at the moment.

Let's implement the linear regression model in the intro in practice, using PyMC to build a graphical model and then run MCMC to sample from the posterior (i.e. estimate the distribution of random variables after seeing some evidence).


In [ ]:
### Bayesian Linear Regression with pymc
### We use Monte Carlo sampling to estimate the distribution of a linear function with a normally
### distributed error, given some observed data.
### Vaguely based on: http://matpalm.com/blog/2012/12/27/dead_simple_pymc/ and http://sabermetricinsights.blogspot.co.uk/2014/05/bayesian-linear-regression-with-pymc.html

## Utility function to plot the graph of a PyMC model
def show_dag(model):
    dag = mc.graph.dag(model)
    dag.write("graph.png",format="png")
    from IPython.display import Image
    i = Image(filename='graph.png')
    return i

## generate data with a known distribution
## this will be our "observed" data
x = np.sort(np.random.uniform(0,20, (50,)))
m = 2
c = 15

# Add on some measurement noise, with std. dev. 3.0
epsilon = data = np.random.normal(0,3, x.shape)
y = m * x + c + epsilon

plt.plot(x,y, '.', label="Datapoints")
plt.plot(x, m*x+c, '--', lw=3, label="True")
plt.legend()
plt.xlabel("x")
plt.xlabel("y")

In [ ]:
## Now, set up the PyMC model
## specify the prior distribution of the unknown line function variables
## Here, we assume a normal distribution over m and c
m_unknown = mc.Normal('m', 0, 0.01)
c_unknown = mc.Normal('c', 0, 0.001)

## specify a prior over the precision (inverse variance) of the error term
# precision = 1/variance
## Here we specify a uniform distribution from 0.001 to 10.0
precision = mc.Uniform('precision', lower=0.001, upper=10.0)

# specify the observed input variable
# we use a normal distribution, but this has no effect -- the values are fixed and the paramters
# never updated; this is just a way of transforming x into a variable pymc can work with
# (it's really a hack)
x_obs = mc.Normal("x_obs", 0, 1, value=x, observed=True)

@mc.deterministic(plot=False)
def line(m=m_unknown, c=c_unknown, x=x_obs):
    return x*m+c

# specify the observed output variable (note if use tau instead of sigma, we use the precision paramterisation)
y_obs =  mc.Normal('y_obs', mu=line, tau=precision, value=y, observed=True)

model = mc.Model([m_unknown, c_unknown, precision, x_obs, y_obs])

# display the graphical model
show_dag(model)

In [ ]:
# sample from the distribution
mcmc = mc.MCMC(model)
mcmc.sample(iter=10000)

## plot histograms of possible parameter values
plt.figure()
plt.hist(mcmc.trace("m")[:], normed=True, bins=30)
plt.title("Estimate of m")
plt.figure()
plt.hist(mcmc.trace("c")[:], normed=True, bins=30)
plt.title("Estimate of c")
plt.figure()
plt.hist(np.sqrt(1.0/mcmc.trace("precision")[:]), normed=True, bins=30)
plt.title("Estimate of epsilon std.dev.")
plt.figure()

## now plot overlaid samples from the linear function
ms = mcmc.trace("m")[:]
cs = mcmc.trace("c")[:]

plt.title("Sampled fits")
plt.plot(x, y, '.', label="Observed")
plt.plot(x, x*m+c, '--', label="True")
xf = np.linspace(-20,40,200)
for m,c in zip(ms[::20], cs[::20]):    
    plt.plot(xf, xf*m+c, 'r-', alpha=0.005)
plt.legend()
plt.xlim(-20,40)
plt.ylim(-40,80)

A simple mixture model

We can include both discrete and continuous variables. A very important case is where we have a mixture model. That is, we believe our observations come from one of a number of distributions. For example, in modelling human heights, we might expect height to be normally distributed, but to have two different distributions for men and women.

It is very straightforward to add this to a PyMC graphical model; it is just another random variable to infer.


In [ ]:
## Adapted from the example given at 
## http://stackoverflow.com/questions/18987697/how-to-model-a-mixture-of-3-normals-in-pymc

n = 3
ndata = 500

## A Dirichlet model specifies the distribution over categories
## All 1 means that every category is equally likely
dd = mc.Dirichlet('dd', theta=(1,)*n)

## This variable "selects" the category (i.e. the normal distribution)
## to use. The Dirichlet distribution sets the prior over the categories.
category = mc.Categorical('category', p=dd, size=ndata)

## Now we set our priors the precision and mean of each normal distribution
## Note the use of "size" to generate a **vector** of variables (i.e. one for each category)

## We expect the precision of each normal to be Gamma distributed (this mainly forces it to be positive!)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)

## And the means of the normal to be normally distributed, with a precision of 0.001 (i.e. std. dev 1000)
means = mc.Normal('means', 0, 0.001, size=n)

## These deterministic functions link the means of the observed distribution to the categories
## They just select one of the elements of the mean/precision vector, given the current value of category
## The input variables must be specified in the parameters, so that PyMC knows which variables to pass to it
@mc.deterministic
def mean(category=category, means=means):
    return means[category]

@mc.deterministic
def prec(category=category, precs=precs):
    return precs[category]

## Generate synthetic mixture-of-normals data, with means at -50,0,+50, and std. dev of 1
v = np.random.randint( 0, n, ndata)
data = (v==0)*(np.random.normal(50,5,ndata)) + (v==1)*(np.random.normal(-50,5,ndata)) + (v==2)*np.random.normal(0,5,ndata)


## Plot the original data
plt.hist(data, bins=50)  

## Now we specify the variable we observe -- which is normally distributed, *but*
## we don't know the mean or precision. Instead, we pass the **functions** mean() and pred()
## which will be used at each sampling step.
## We specify the observed values of this node, and tell PyMC these are observed 
## This is all that is needed to specify the model
obs = mc.Normal('obs', mean, prec, value=data, observed = True)

## Now we just bundle all the variables together for PyMC
model = mc.Model({'dd': dd,
              'category': category,
              'precs': precs,
              'means': means,
              'obs': obs})

In [ ]:
def show_dag(model):
    dag = mc.graph.dag(model)
    dag.write("graph.png",format="png")
    from IPython.display import Image
    i = Image(filename='graph.png')
    return i
    
show_dag(model)

In [ ]:
mcmc = mc.MCMC(model)

## Now we tell the sampler what method to use
## Metropolis works well, but we must tell PyMC to use a specific
## discrete sampler for the category variable to get good results in a reasonable time
mcmc.use_step_method(mc.AdaptiveMetropolis, model.means)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.precs)
mcmc.use_step_method(mc.DiscreteMetropolis, model.category) ## this step is key!
mcmc.use_step_method(mc.AdaptiveMetropolis, model.dd)

## Run the sampler
mcmc.sample(iter=125000, burn=2000)

In [ ]:
plt.figure()
plt.hist(mcmc.trace('means').gettrace()[:], normed=True)
plt.title("Estimated means")
plt.legend(['Component 1', 'Component 2', 'Component 3'])
plt.figure()
## show the result in terms of std. dev. (i.e sqrt(1.0/precision))
plt.title("Estimated std. dev")
plt.hist(np.sqrt(1.0/mcmc.trace('precs').gettrace()[:]), normed=True)
plt.legend(['Component 1', 'Component 2', 'Component 3'])

Practical: Modelling typing with a MCMC model

You'll build a simple model which assumes inter-key timings are Gamma distributed, with distribution parameters that are different for each person.

Steps

You will need to construct the PyMC model specification, then run the MCMC sampler. The structure should be a mixture model, like the one above. But the inter-key timings should be a Gamma rather than normal (Gaussian). Note that this is a mixture model, but unlike the above example, the mixture component is an observed variable (you know the user id)

Load the data from keylogs.csv. This has four columns:

    user_id  delta_t  duration_t  keycode

user_id is a unique integer per-user. Time (delta_t) is in milliseconds since the previous keydown event. duration_t is the duration the key was held down, in milliseconds. keycode is the keycode (result of e.which in Javascript) of the keypress.


In [ ]:
## the variables in the model should go in the list passed to Model
model = mc.Model([])

## see the graphical representation of the model
show_dag(model)

## Construct a sampler
mcmc = mc.MCMC(model)

## Sample from the result; you should try changing the number of iterations
mcmc.sample(iter=10000)

## Use the trace methods from pymc to explore the distribution of values