Inference with GPs

The dataset needed for this worksheet can be downloaded. Once you have downloaded s9_gp_dat.tar.gz and moved it to this folder, execute the following cell to unpack it:


In [ ]:
!tar -zxvf s9_gp_dat.tar.gz
!mv *.txt data/

Here are the functions we wrote in the previous tutorial to compute and draw from a GP:


In [ ]:
import numpy as np
from scipy.linalg import cho_factor


def ExpSquaredKernel(t1, t2=None, A=1.0, l=1.0):
    """
    Return the ``N x M`` exponential squared
    covariance matrix between time vectors `t1`
    and `t2`. The kernel has amplitude `A` and
    lengthscale `l`.
    
    """
    if t2 is None:
        t2 = t1
    T2, T1 = np.meshgrid(t2, t1)
    return A ** 2 * np.exp(-0.5 * (T1 - T2) ** 2 / l ** 2)


def draw_from_gaussian(mu, S, ndraws=1, eps=1e-12):
    """
    Generate samples from a multivariate Gaussian
    specified by covariance ``S`` and mean ``mu``.
    
    (We derived these equations in Day 1, Notebook 01, Exercise 7.)
    """
    npts = S.shape[0]
    # Add a small diagonal "jitter" term for numerical stability, then
    # Cholesky-factorize. Note that `cho_factor` leaves arbitrary values
    # in the upper triangle, so we explicitly zero them out with `np.tril`.
    L, _ = cho_factor(S + eps * np.eye(npts), lower=True)
    L = np.tril(L)
    # Transform standard normal draws: x = L u + mu
    u = np.random.randn(npts, ndraws)
    x = np.dot(L, u) + mu[:, None]
    return x.T


def compute_gp(t_train, y_train, t_test, sigma=0, A=1.0, l=1.0):
    """
    Compute the mean vector and covariance matrix of a GP
    at times `t_test` given training points `y_train(t_train)`.
    The training points have uncertainty `sigma` and the
    kernel is assumed to be an Exponential Squared Kernel
    with amplitude `A` and lengthscale `l`.
    
    """
    # Compute the required covariance matrices
    kernel = ExpSquaredKernel
    Stt = kernel(t_train, A=A, l=l)
    Stt += sigma ** 2 * np.eye(Stt.shape[0])
    Spp = kernel(t_test, A=A, l=l)
    Spt = kernel(t_test, t_train, A=A, l=l)

    # Compute the mean and covariance of the GP
    mu = np.dot(Spt, np.linalg.solve(Stt, y_train))
    S = Spp - np.dot(Spt, np.linalg.solve(Stt, Spt.T))
    
    return mu, S
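
For reference, here's a quick sketch of how these two functions fit together; the handful of training points below are made up purely for illustration:

import matplotlib.pyplot as plt

# Made-up training data and a dense grid of test times
t_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
y_train = np.array([0.5, -0.3, 0.8, 0.1, -0.6])
t_test = np.linspace(-1, 9, 200)

# Condition the GP on the training points and draw a few samples
mu, S = compute_gp(t_train, y_train, t_test, sigma=0.1, A=1.0, l=1.0)
samples = draw_from_gaussian(mu, S, ndraws=5)    # shape (5, 200)

plt.plot(t_test, samples.T, color="C0", lw=1, alpha=0.5)
plt.plot(t_train, y_train, "k.", ms=8)
plt.xlabel("time")
plt.ylabel("data");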

The Marginal Likelihood

In the previous notebook, we learned how to construct and sample from a simple GP. This is useful for making predictions, i.e., interpolating or extrapolating based on the data you measured. But the true power of GPs comes from their application to regression and inference: given a dataset $D$ and a model $M(\theta)$, what are the values of the model parameters $\theta$ that are consistent with $D$? The parameters $\theta$ can be the hyperparameters of the GP (the amplitude and time scale), the parameters of some parametric model, or all of the above.

A very common use of GPs is to model things you don't have an explicit physical model for, so quite often they are used to model "nuisances" in the dataset. But just because you don't care about these nuisances doesn't mean they don't affect your inference: in fact, unmodelled correlated noise can often lead to strong biases in the parameter values you infer. In this notebook, we'll learn how to compute likelihoods of Gaussian Processes so that we can marginalize over the nuisance parameters (given suitable priors) and obtain unbiased estimates for the physical parameters we care about.

Given a set of measurements $y$ distributed according to $$ \begin{align} y \sim \mathcal{N}(\mathbf{\mu}(\theta), \mathbf{\Sigma}(\alpha)) \end{align} $$ where $\theta$ are the parameters of the mean model $\mu$ and $\alpha$ are the hyperparameters of the covariance model $\mathbf{\Sigma}$, the marginal likelihood of $y$ is $$ \begin{align} \ln P(y | \theta, \alpha) = -\frac{1}{2}(y-\mu)^\top \mathbf{\Sigma}^{-1} (y-\mu) - \frac{1}{2}\ln |\mathbf{\Sigma}| - \frac{N}{2} \ln 2\pi \end{align} $$

where $|\mathbf{\Sigma}|$ denotes the determinant of the covariance matrix and $N$ is the number of measurements. The term marginal refers to the fact that this expression implicitly integrates over all possible values of the Gaussian Process; this is not the likelihood of the data given one particular draw from the GP, but given the ensemble of all possible draws from $\mathbf{\Sigma}$.
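
As a quick concrete check (on an entirely made-up three-point dataset), note that scipy.stats.multivariate_normal evaluates exactly this log-density, so you can use it to verify the function you'll write in Exercise 1 below:

from scipy.stats import multivariate_normal

# Toy example: three measurements, zero mean, covariance from the kernel above
t_toy = np.array([0.0, 1.0, 2.0])
y_toy = np.array([0.3, -0.1, 0.4])
Sigma = ExpSquaredKernel(t_toy, A=1.0, l=1.0) + 0.1 ** 2 * np.eye(3)

# ln P(y | mu = 0, Sigma), evaluated by scipy
print(multivariate_normal(mean=np.zeros(3), cov=Sigma).logpdf(y_toy))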

Exercise 1

Define a function ln_gp_likelihood(t, y, sigma, A=1, l=1) that returns the log-likelihood defined above for a vector of measurements y at a set of times t with uncertainty sigma. As before, A and l should get passed directly to the kernel function. Note that you're going to want to use np.linalg.slogdet to compute the log-determinant of the covariance instead of np.log(np.linalg.det). (Why?)


In [ ]:
def ln_gp_likelihood(t, y, sigma=0, A=1.0, l=1.0):
    """
    Return the log marginal likelihood of the data `y(t)`
    with uncertainty `sigma`, modeled as a zero-mean GP
    with an Exponential Squared Kernel of amplitude `A`
    and lengthscale `l`.
    
    """
    # do stuff in here
    pass
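
A hint for the "(Why?)" above: for large covariance matrices the determinant itself easily underflows (or overflows) a double-precision float, whereas its logarithm remains perfectly representable. A quick demonstration with arbitrary, made-up hyperparameters:

# A 500 x 500 covariance matrix with a short lengthscale and a small jitter term
t_demo = np.linspace(0, 10, 500)
K = ExpSquaredKernel(t_demo, A=1.0, l=0.1) + 1e-4 * np.eye(500)

print(np.log(np.linalg.det(K)))    # det likely underflows to 0, so this is -inf
print(np.linalg.slogdet(K)[1])     # the log-determinant itself is finite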

Exercise 2

The following dataset was generated from a zero-mean Gaussian Process with an Exponential Squared Kernel of unit amplitude and unknown timescale. Compute the marginal log likelihood of the data over a range of reasonable values of $l$ and find the maximum. Plot the likelihood (not the log likelihood) versus $l$; it should look pretty Gaussian. How well are you able to constrain the timescale of the GP?


In [9]:
import matplotlib.pyplot as plt
t, y, sigma = np.loadtxt("data/sample_data.txt", unpack=True)
plt.plot(t, y, "k.", alpha=0.5, ms=3)
plt.xlabel("time")
plt.ylabel("data");


Exercise 3a

The timeseries below was generated by a linear function of time, $y(t)= mt + b$. In addition to observational uncertainty $\sigma$ (white noise), there is a fair bit of correlated (red) noise, which we will assume is well described by the squared exponential covariance with a certain (unknown) amplitude $A$ and timescale $l$.

Your task is to estimate the values of $m$ and $b$, the slope and intercept of the line, respectively. In this part of the exercise, assume there is no correlated noise. Your model for the $n^\mathrm{th}$ datapoint is thus

$$ \begin{align} y_n \sim \mathcal{N}(m t_n + b, \sigma_n^2) \end{align} $$

and the probability of the data given the model can be computed by calling your GP likelihood function:

def lnprob(params):
    m, b = params
    model = m * t + b
    return ln_gp_likelihood(t, y - model, sigma, A=0, l=1)

Note, importantly, that we are passing the residual vector, $y - (mt + b)$, to the GP, since above we coded up a zero-mean Gaussian process. We are therefore using the GP to model the residuals of the data after applying our physical model (the equation of the line).

To estimate the values of $m$ and $b$ we could generate a fine grid in those two parameters and compute the likelihood at every point. But since we'll soon be fitting for four parameters (in the next part), we might as well upgrade our inference scheme and use the emcee package to do Markov Chain Monte Carlo (MCMC). If you haven't used emcee before, check out the first few tutorials on the documentation page. The basic setup for the problem is this:

import emcee

nwalkers, ndim = 30, 2                   # number of walkers, number of parameters
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)

initial = [4.0, 15.0]
p0 = initial + 1e-3 * np.random.randn(nwalkers, ndim)

print("Running burn-in...")
nburn = 500                              # 500 burn-in steps should do
p0, _, _ = sampler.run_mcmc(p0, nburn)
sampler.reset()

print("Running production...")
nsteps = 1000                            # 1000 production steps should do
sampler.run_mcmc(p0, nsteps);

where nwalkers is the number of walkers (something like 20 or 30 is fine), ndim is the number of dimensions (2 in this case), and lnprob is the log-probability function for the data given the model. Finally, p0 is the array of starting positions for the walkers. Above we picked fiducial (eyeballed) values for $m$ and $b$, then added a small random number to each to generate a different initial position for each walker. This initializes all walkers in a small ball centered on that point; as the chain progresses they'll diffuse out and begin to explore the posterior.

Once you have sampled the posterior, plot several draws from it on top of the data. You can access a random draw from the posterior by doing

m, b = sampler.flatchain[np.random.randint(len(sampler.flatchain))]

Also plot the true line that generated the dataset (given by the variables m_true and b_true below). Do they agree, or is there bias in your inferred values? Use the corner package to plot the joint posterior. How many standard deviations away from the truth are your inferred values?
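
If you haven't used corner before, the joint posterior plot is essentially a one-liner; run it once the chain has finished (m_true and b_true are loaded in the cell below):

import corner
corner.corner(sampler.flatchain, labels=["m", "b"], truths=[m_true, b_true]);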


In [13]:
t, y, sigma = np.loadtxt("data/sample_data_line.txt", unpack=True)
m_true, b_true, A_true, l_true = np.loadtxt("data/sample_data_line_truths.txt", unpack=True)
plt.errorbar(t, y, yerr=sigma, fmt="k.", label="observed")
plt.plot(t, m_true * t + b_true, color="C0", label="truth")
plt.legend(fontsize=12)
plt.xlabel("time")
plt.ylabel("data");


Exercise 3b

This time, let's actually model the correlated noise. Re-define your lnprob function to accept four parameters (slope, intercept, amplitude, and timescale). If you didn't before, it's a good idea to enforce some priors to keep the parameters within reasonable (and physical) ranges. If any parameter falls outside this range, have lnprob return negative infinity (i.e., zero probability).
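
A minimal sketch of what this could look like; the prior bounds below are arbitrary placeholders, so pick ranges appropriate for your data:

def lnprob(params):
    m, b, A, l = params
    # Uniform priors: reject unphysical or unreasonable values
    if not (0 < A < 10) or not (0 < l < 10):
        return -np.inf
    model = m * t + b
    return ln_gp_likelihood(t, y - model, sigma, A=A, l=l)

Remember to update ndim (and re-initialize p0) to account for the two extra parameters.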

You'll probably want to run your chains for a bit longer this time, too. As before, plot some posterior samples for the line, as well as the corner plot. How did you do this time? Is there any bias in your inferred values? How does the variance compare to the previous estimate?

Exercise 3c

If you didn't do this already, re-plot the posterior samples on top of the data, but this time draw them from the GP, conditioned on the data. How good is the fit?
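
Concretely, "conditioned on the data" means computing the GP predictive distribution for the residuals and adding the line back in. A rough sketch, assuming the four-parameter chain from Exercise 3b is stored in sampler.flatchain:

t_test = np.linspace(t.min(), t.max(), 300)
for _ in range(10):
    # Pick a random posterior sample
    m, b, A, l = sampler.flatchain[np.random.randint(len(sampler.flatchain))]
    # Condition the GP on the residuals, then add the line back in
    mu, S = compute_gp(t, y - (m * t + b), t_test, sigma=sigma, A=A, l=l)
    trend = draw_from_gaussian(mu, S, ndraws=1)[0] + m * t_test + b
    plt.plot(t_test, trend, color="C0", lw=1, alpha=0.3)
plt.errorbar(t, y, yerr=sigma, fmt="k.")
plt.xlabel("time")
plt.ylabel("data");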