In [ ]:
from __future__ import print_function, division, absolute_import
In this tutorial, we will explore some slightly more realistic applications of GPs to astrophysical (or at least, astronomy-like) datasets. We will do this using the popular george package by Daniel Foreman-Mackey. Why george? It doesn't have all the functionality of more general packages such as GPy and scikit-learn, but it has a nice modelling interface, is easy to use, and is faster than either of the other two (and I'm more familiar with it). We will also use another of Dan's packages, emcee, to explore posterior probabilities using MCMC, and his corner.py module to plot the resulting parameter samples.
Before you start, make sure you have the latest stable version of these packages installed. If you used george before, note the API has changed significantly between versions 0.2.x and 0.3.0.
The easiest way to install all three packages is with pip
:
pip install emcee
pip install george
pip install corner
Full documentation for each package is available online: https://george.readthedocs.io, https://emcee.readthedocs.io, and https://corner.readthedocs.io.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import george, emcee, corner
from scipy.optimize import minimize
george introductory tutorials

The george documentation includes some nice tutorials, which you'll need to run through before being able to tackle the problems below. Download and run the notebooks, making sure you understand what's going on at each step, and don't hesitate to ask questions!

Now you should have an idea of how to set up a basic GP model using george, how to make predictions, and how to evaluate the likelihood, optimize it, and explore the posterior using MCMC. I would also encourage you to try out the other tutorials, but they are not pre-requisites for this one.
In [ ]:
N = 100
xobs = np.random.uniform(-5,5,N)
yobs = np.random.uniform(-5,5,N)
zobs = - 0.05 * xobs**2 + 0.03 * yobs**2 - 0.02 * xobs * yobs
eobs = 0.01
zobs += np.random.normal(0,eobs,len(xobs))
plt.scatter(xobs, yobs, c=zobs, s=20, marker='.')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
cb = plt.colorbar()
cb.set_label(r'$z$');
Now we will construct the GP model using george. We will use a squared exponential kernel with a different length scale in each of the two dimensions. To set this up in george, you have to multiply two individual kernels together, like this:

k = a * KernelName(b, ndim = 2, axes = 0) * KernelName(c, ndim = 2, axes = 1)

Here KernelName stands for the name of the kernel used (in george, the squared exponential kernel is called ExpSquaredKernel), a is the output variance, b is the metric, or length scale, applied to the first input dimension, and c to the second.

Note this is equivalent to the parametrisation used in the lectures: $$ k(x,x') = A \exp \left[ - \Gamma (x-x')^2\right] = A \exp \left[ - (x-x')^2/m^2\right] $$ with $\Gamma=1/m^2$.
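For illustration, a product kernel of this form could look like the following (the numerical values here are arbitrary placeholders, not tuned to this dataset):

k_example = 1.0 * george.kernels.ExpSquaredKernel(2.0, ndim=2, axes=0) \
                * george.kernels.ExpSquaredKernel(2.0, ndim=2, axes=1)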
Go ahead and define the kernel in the cell below, with some ball-park values for the hyper-parameters (by ball-park, I mean not too many orders of magnitude off). Then create a GP object using that kernel.
In [ ]:
k = 1.0 * george.kernels. # ...complete
gp = george.GP(# complete
Now you will need to tell the GP object what inputs the covariance matrix is to be evaluated at. This is done using the compute method. 2-D inputs need to be passed as an $N \times 2$ array, which you will need to construct from the two 1-D arrays of $x$- and $y$-values we generated earlier. The second argument of compute should be the white noise standard deviation.
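For reference, the call would look something like this (a sketch; np.column_stack gives the same result as concatenating the two 1-D arrays and transposing):

Xobs = np.column_stack([xobs, yobs])  # shape (N, 2)
gp.compute(Xobs, eobs)                # second argument: white noise standard deviation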
In [ ]:
Xobs = np.concatenate([[xobs],[yobs]]).T
gp.compute(# complete
Following the example in the first george tutorial, define a simple neg log likelihood function, and a function to evaluate its gradient.
In [ ]:
def neg_ln_like(p):
# complete
return # complete
def grad_neg_ln_like(p):
# complete
return # complete
Note that the parameters which are accessed through the set_parameter_vector method are the logarithms of the values used in building the kernel. The optimization is thus done in terms of the log parameters.
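You can check this by printing the parameter names and exponentiating the parameter vector (a quick sanity check, not required for the exercise):

print(gp.get_parameter_names())           # parameters are stored as logs
print(np.exp(gp.get_parameter_vector()))  # the same values back in linear space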
Again following the same example, find the hyper-parameters that maximise the likelihood, using scipy.optimize's minimize function, and print the results.
In [ ]:
result = minimize(# complete
print(result)
Now assign those best-fit values to the parameter vector
In [ ]:
gp.set_parameter_vector(# complete
Generate a grid of regularly spaced $x$ and $y$ locations, spanning the range of the observations, where we will evaluate the predictive distribution. Store these in 2-D arrays called X2D and Y2D. Then convert them into a single 2-D array of shape $N_{\mathrm{pred}} \times 2$, which will be passed to the GP's predict method.

Hint: use numpy's mgrid function.
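For example (a sketch with an arbitrarily chosen grid step; pick whatever resolution you like):

X2D, Y2D = np.mgrid[-5:5:0.25, -5:5:0.25]
Xpred = np.column_stack([X2D.flatten(), Y2D.flatten()])  # shape (Npred, 2)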
In [ ]:
X2D,Y2D = np.mgrid[# complete
Xpred = np.concatenate(# complete
Using the best-fit hyper-parameters, evaluate the mean of the predictive distribution at the grid locations. The output will be a 1-D array, which you will need to reshape so it has the same shape as X2D and Y2D for plotting.
In [ ]:
zpred = gp.predict(# complete
Z2D = zpred.reshape(# complete
Execute the cell below to plot contours of the predictive mean alongside the data.
In [ ]:
plt.scatter(xobs, yobs, c=zobs, s=20, marker='.')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
cb = plt.colorbar()
cb.set_label(r'$z$');
plt.contour(X2D,Y2D,Z2D);
Visualising the confidence intervals is a bit tricky in 3-D, so we'll skip that. We could use emcee to explore the posterior distribution of the hyper-parameters, but we will leave that for a more realistic example.
In the above problem we were modelling a non-separable function of $x$ and $y$ (because of the cross-term in the polynomial). Now we will model a separable function, and use a GP with a sum rather than a product of kernels to separate the dependence on each of the input variables.
This exploits the fact that GPs preserve additivity. In other words, a GP with a sum of kernels, each depending on a disjoint subset of the inputs, sets up a probability distribution over functions that are sums of functions of the individual subsets of inputs. This is how the K2SC pipeline (for removing pointing systematics in K2 data) discussed in the lectures works.
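Schematically, the kernel we are about to build therefore looks like this (a sketch with placeholder values; each term acts on one input dimension only):

k1_example = 1.0 * george.kernels.ExpSquaredKernel(2.0, ndim=2, axes=0)  # depends on x only
k2_example = 1.0 * george.kernels.ExpSquaredKernel(2.0, ndim=2, axes=1)  # depends on y only
k_example = k1_example + k2_example   # models z(x, y) = f_1(x) + f_2(y)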
As ever, we start by simulating a dataset. Execute the cell below.
In [ ]:
N = 100
xobs = np.random.uniform(-5,5,N)
yobs = np.random.uniform(-5,5,N)
zobs = -0.05 * xobs**2 + np.sin(yobs)
eobs = 0.01
zobs += np.random.normal(0,eobs,len(xobs))
plt.scatter(xobs, yobs, c=zobs, s=20, marker='.')
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
cb = plt.colorbar()
cb.set_label(r'$z$');
We start, once again, by defining the GP object. The kernel will consist of a sum of two squared exponentials, one applied to each dimension. It will be useful to be able to access each of the kernel objects separately later, so start by defining each of the component kernels, assigning them to variables k1 and k2, and then define the overall kernel k as the sum of the two. Then define the GP object itself.
In [ ]:
k1 = # complete
k2 = # complete
k = # complete
gp = # complete
Xobs = # complete
Next we want to optimize the likelihood. Luckily, we can re-use the neg log likelihood and gradient functions from the previous problem. Start by packaging up the two inputs into a single 2-D array, as in Problem 1, then use the minimize function to evaluate the maximum likelihood hyper-parameters.
In [ ]:
gp.compute(# complete
result = minimize(# complete
print(result)
Now let's plot the predictive distribution to check it worked ok. You can just copy and paste code from Problem 1.
In [ ]:
# copy plotting commands from problem 1c
We now come to evaluating the predictive means for the individual components. The standard expression for the predictive mean is: $$ \overline{\boldsymbol{y}}_* = K(\boldsymbol{x}_*,\boldsymbol{x}) K(\boldsymbol{x},\boldsymbol{x})^{-1} \boldsymbol{y} $$ The predictive mean for a given component of the kernel is obtained simply by replacing the first instance of the covariance matrix between test and training points, $K(\boldsymbol{x}_*,\boldsymbol{x})$, by the corresponding matrix for the component in question only: $$ \overline{\boldsymbol{y}}_{1,*} = K_1(\boldsymbol{x}_*,\boldsymbol{x}) K(\boldsymbol{x},\boldsymbol{x})^{-1} \boldsymbol{y}. $$
george doesn't provide a built-in function to do this, but it does provide two methods that make it straightforward:

- apply_inverse, which evaluates and returns the product $K(\boldsymbol{x},\boldsymbol{x})^{-1} \boldsymbol{y}$ for a given vector of training set outputs $\boldsymbol{y}$;
- get_value, which evaluates the covariance matrix for a given set of inputs.

Use these two functions to evaluate the two components of the best-fit GP model in our problem. Store the $x$- and $y$-components in variables fx and fy, respectively.
Hint: the apply_inverse method does what it says in the name, i.e. it modifies its argument by pre-multiplying it by the inverse of the covariance matrix. Therefore, you need to pass it a copy of the vector of observed outputs, not the original.
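Putting the two together, one possible approach is the following (a sketch, assuming k1 and k2 were defined on axes 0 and 1 respectively, and that Xobs is the $N \times 2$ array of inputs the GP was computed on):

alpha = np.copy(zobs)
gp.apply_inverse(alpha)                   # alpha now holds K(x,x)^-1 y
fx = np.dot(k1.get_value(Xobs), alpha)    # K_1(x*,x) K(x,x)^-1 y, with x* = x here
fy = np.dot(k2.get_value(Xobs), alpha)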
In [ ]:
b = np.copy(zobs)
gp.apply_inverse(#
K1 = # complete
fx = np.dot(# complete
K2 = # complete
fy = np.dot(# complete
Now execute the cell below to plot the results.
In [ ]:
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.plot(xobs,zobs,'.',c='grey')
plt.plot(xobs,zobs-fy,'k.')
s = np.argsort(xobs)
plt.plot(xobs[s],fx[s],'r-')
plt.subplot(122)
plt.plot(yobs,zobs,'.',c='grey')
plt.plot(yobs,zobs-fx,'k.')
s = np.argsort(yobs)
plt.plot(yobs[s],fy[s],'r-');
Consider a situation where we have several time-series, which we expect to display the same behaviour (up to observational noise), except for a time-delay. We don't know the form of the behaviour, but we want to measure the time-delay between each pair of time-series. Something like this might arise in AGN reverberation mapping, for example.
We can do this by modelling the time-series as observations of the same GP, with shifted inputs, and marginalising over the GP hyper-parameters to obtain posterior distribution over the time shifts.
First, let's simulate some data. We will cheat by doing this using a GP, so we know it will work. Execute the cell below.
In [ ]:
N = 50
M = 3
t2d = np.tile(np.linspace(0,10,N),(M,1))
for i in range(M):
t2d[i,:] += np.random.uniform(-5./N,5./N,N)
delays_true = [-1.5,3]
t_delayed = np.copy(t2d)
for i in range(M-1):
t_delayed[i+1,:] = t2d[i,:] + delays_true[i]
gp = george.GP(1.0 * george.kernels.Matern52Kernel(3.0))
gppar_true = gp.get_parameter_vector()
y2d = gp.sample(t_delayed.flatten()).reshape((M,N))
wn = 0.1
y2d += np.random.normal(0,wn,(M,N))
for i in range(M):
plt.errorbar(t2d[i,:],y2d[i,:].flatten(),yerr=wn,capsize=0,fmt='.')
plt.xlabel('t')
plt.ylabel('y');
Because the function goes up and down, you can probably guess that the likelihood surface is going to be highly multi-modal, so it's important to have a decent initial guess for the time delays.
A simple way to obtain one would be by cross-correlation, but since the time-series are not regularly sampled (because of the small random term we added to each of the time arrays), we need to interpolate them onto a regular grid first. What better way to do this than with a GP? This will have the added advantage of giving us an initial estimate of the GP hyper-parameters too (we're assuming we don't know them either, though we will assume we know the white noise standard deviation).
First we need to define a GP object, based on a Matern 5/2 kernel (the same family used to generate the data) with variable input scale and variance. Do this in the cell below.
In [ ]:
k = # complete
gp = # complete
Now we need to fit each time-series in turn, and compute the mean of the predictive distribution over a tightly sampled, regular grid of time values. If you take care to name your variables right, you can reuse the neg log likelihood and associated gradient functions from Problem 1. Complete the code below and run it.
In [ ]:
p0 = gp.get_parameter_vector()
# 2-D array to hold the best-fit GP HPs for each time-series
p1 = np.tile(# complete
# Regularly sampled time array
treg = # complete
# 2-D array to hold the interpolated time-series
yreg = # complete
c = ['r','g','b']
for i in range(M):
# Compute the gp on the relevant subset of the 2-D time array t2d
# complete
# Assign the corresponding y values to the variable zobs
# (this is the one that neg_ln_like uses to condition the GP)
zobs = # complete
# Optimize the likelihood using minimize
result = minimize(# complete
# Save the best-fit GP HPs in p1
p1[i,:] = # complete
# update the GP parameter vector with the best fit values
gp.# complete
# evaluate the predictive mean conditioned on zobs at locations treg and save in yreg
yreg[i,:] = # complete
# you might want to plot the results to check it worked
plt.plot(t2d[i,:],y2d[i,:],'.',c=c[i])
plt.plot(treg,yreg[i,:],'-',c=c[i])
# And let's print the GP HPs to see if they were sensible.
print('Individual GP fits: best-fit HPs')
print(p1)
Now we are ready to cross-correlate the interpolated time-series. The easiest way to do this is using the function xcorr from matplotlib.pyplot. This function returns a tuple of four variables, the first two of which are the lags and the corresponding cross-correlation values.
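For reference, a single call might look something like this (a sketch; note that maxlags defaults to 10, so you will need to increase it, and that xcorr works best on mean-subtracted series):

lags, corr, _, _ = plt.xcorr(yreg[1] - yreg[1].mean(), yreg[0] - yreg[0].mean(),
                             maxlags=len(treg) - 1)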
In [ ]:
dt = treg[1] - treg[0]
# Array to hold estimates of the time-delays
delays_0 = np.zeros(M-1)
for i in range(M-1):
# use pyplot's xcorr function to cross-correlate yreg[i+1] with yreg[0]
lags, corr, _, _ = # complete
# find the lag that maximises the CCF, convert it to time delay, save in delays_0 array
lmax = # complete
plt.axvline(lmax,color=c[i+1])
delays_0[i] = # complete
plt.xlabel('lag')
plt.ylabel('x-correlation');
# Compare estimated to true delays
print('Estimated time delays from cross-correlation')
print(delays_0)
print('True delays')
print(delays_true)
As you can see, the delays estimated in this way aren't too far off.
To get initial guesses for the GP hyper-parameters, we can take the mean of the best-fit values from the three individual time-series. Do this in the cell below.
In [ ]:
gppar_0 = # complete
print('Estimated GP HPs')
print(gppar_0)
print('True GP HPs')
print(gppar_true)
The GP HPs aren't too far off either.
Now we have some initial guesses for the time-delays and the GP hyper-parameters, we're ready to model the time-series simultaneously, using a single GP. We need to write a new likelihood function to do this. The function will need to apply the delays to the times, before passing these times to george to evaluate the likelihood itself.
First let's define a function apply_delays, which will take the delays and the time array t2d as inputs, and return an $M \times N$ array of delayed times. This function will be called by the likelihood function, but it might be useful later for plotting the results too. It would also be useful for this function to warn us if the time-delays are such that one of the time-series no longer overlaps with the others at all, for example by returning a boolean variable that is True if all is well, and False if not.
Complete the definition below.
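One possible overlap test, written here as a stand-alone helper for clarity (a sketch; the exact criterion is up to you), checks that each delayed time-series still overlaps the first one:

def overlaps(t_delayed):
    # True if every delayed time-series still overlaps the first (undelayed) one
    t0 = t_delayed[0]
    return all((t.min() <= t0.max()) and (t.max() >= t0.min()) for t in t_delayed[1:])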
In [ ]:
def apply_delays(delays,t2d):
t_delayed = np.copy(t2d)
for i, delay in enumerate(delays):
t_delayed[i+1,:] # complete
ok = True
M = len(delays) + 1
for i in range(M):
# complete
return t_delayed, ok
Now we are ready to define the likelihood function itself. The likelihood function should accept a parameter array consisting of the time shifts first, followed by the GP hyper-parameters, and make use of the output of apply_delays to return a very high number if the time delays are unreasonable. Complete the definition below.
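As a rough sketch of the structure (assuming the GP object gp from above, the 2-D time array t2d, the fluxes y2d, and the white noise level wn; the penalty value is arbitrary):

def neg_ln_like_delays_sketch(p):
    delays = p[:M-1]                         # the first M-1 parameters are the time delays
    t_delayed, ok = apply_delays(delays, t2d)
    if not ok:
        return 1e25                          # unreasonable delays: effectively zero likelihood
    gp.set_parameter_vector(p[M-1:])         # the remaining parameters are the GP HPs
    gp.compute(t_delayed.flatten(), yerr=wn)
    return -gp.log_likelihood(y2d.flatten())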
In [ ]:
def neg_ln_like_delays(p):
delays = p[# complete
t_delayed, ok = # complete
if not ok:
# complete
gp.set_parameter_vector(# complete
gp.compute(# complete
return # complete
There is no simple analytical way to evaluate the gradient of the log likelihood with respect to the time delays, so we will not define a grad_neg_ln_like function for this problem. The optimizer will be slower, since it will have to evaluate the gradients numerically, but for such a small dataset it doesn't matter.
OK, now we are ready to run the optimizer. Like before, we can use the minimize function from scipy.optimize.
In [ ]:
# concatenated array of true time delays and GP HPs
ptrue = # complete
# same for the initial guesses
p0 = # complete
print('Initial guesses')
print(p0)
result = minimize(# complete
# save ML values in p1
p1 = # complete
print('ML parameters')
print(p1)
print('True parameters')
print(ptrue)
As you can see, the optimization further improved our estimates of the time delays and the GP HPs. But how much can we trust these? Let's evaluate posterior uncertainties using MCMC.
We now use MCMC to obtain uncertainty estimates, or confidence intervals, for the model hyper-parameters.
First we need to define the posterior function to pass to the emcee sampler. We will use improper, flat priors over all the parameters, so the posterior probability is just a trivial wrapper around our neg_ln_like_delays function. Complete the definition below:
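With flat priors this is just a sign flip (a sketch):

def lnprob_sketch(p):
    # flat, improper priors: ln posterior = ln likelihood + constant
    return -neg_ln_like_delays(p)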
In [ ]:
def lnprob(p):
# complete
Next, we set up the sampler. We will use 32 walkers, and initialise each walker at the maximum likelihood estimate of the parameters plus a small random offset. Complete the code below, using the second george tutorial as an example.
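Following the pattern of the george/emcee tutorials, the setup might look something like this (a sketch; p1 here is the maximum-likelihood parameter vector found above):

ndim, nwalkers = len(p1), 32
p2 = p1 + 1e-4 * np.random.randn(nwalkers, ndim)   # a small ball around the ML solution
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob)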
In [ ]:
ndim, nwalkers = # complete
p2 = # complete
sampler = # complete
Now we are ready to run the MCMC, starting with a burn-in chain of 500 steps, after which we reset the sampler, and run the sampler again for 100 iterations. Complete the code below.
In [ ]:
print("Running burn-in...")
p2, _, _ = # complete
sampler.reset()
print("Running production...")
sampler.# complete
Next we use the corner function from the corner module to plot the posterior distributions over the parameters. Complete the code below.
In [ ]:
labels = [r"$\Delta_1$", r"$\Delta_2$", r"$\ln A$", r"$\ln l$"]
truths = ptrue
corner.corner(# complete
Hopefully the distributions should look reasonable and be consistent with the true values.
We need to extract confidence intervals for the parameters from the MCMC chain, which we can access through sampler.flatchain
In [ ]:
samples = # complete
# The GP parameters were explored in log space, return them to linear space
#samples[:, -2:] = np.exp(samples[:, -2:])
# This handy bit of code will extract median and +/- 1 sigma intervals for each parameter
pv = list(map(lambda v: (v[1], v[2]-v[1], v[1]-v[0]), zip(*np.percentile(samples, [16, 50, 84], axis=0))))  # list() so we can index into it under Python 3
# Print the results
for i in range(ndim):
pval = # complete
print("Param {}: {:5.2f} +{:4.2f} -{:4.2f} (true: {:5.2f})".format(i+1,pval[0], pval[1], pval[2], ptrue[i]))
Hopefully, the MCMC estimates should be consistent with the true values...
Imagine you are monitoring a particular variable: you want to know its value to a given precision at any one time, but each observation is costly, so you don't want to take any more than you have to. You can train a GP on the first few observations, then use the predictive distribution to work out when your uncertainty about the current value of the variable becomes so large that you need to take a new observation. Use the new observation to update the GP hyper-parameters and the predictive distribution, and repeat the process...
First we generate a tightly sampled time series over 100 days. This will represent the "true" value of the variable. We will include some periodic behaviour as that makes the problem more interesting. Then we will "observe" 1 point per day for the first 20 days.
In [ ]:
xtrue = np.linspace(0,100,1000)
k = george.kernels.CosineKernel(np.log(12.3)) * george.kernels.ExpSquaredKernel(1000.0)
ytrue = george.GP(k).sample(xtrue)
xobs = xtrue[:200:10]
eobs = 10.0**(np.random.uniform(-1.5,-1,20))
yobs = ytrue[:200:10] + np.random.normal(0,1,20) * eobs
plt.plot(xtrue,ytrue)
plt.errorbar(xobs,yobs,yerr=eobs,fmt='.',capsize=0);
Your task is to devise and implement an algorithm that will schedule observations, based on the data to date, so as to ensure the uncertainty on the value of the function at any one time never exceeds 0.1. At each step, the algorithm should:

- condition the GP on the observations taken so far, re-fitting the hyper-parameters if need be;
- evaluate the predictive distribution over the full time range;
- find when the predictive uncertainty first exceeds 0.1, and schedule the next observation accordingly;
- take that observation (here, by sampling the true curve plus noise), then repeat.
Of course you will need to test your algorithm by comparing the predictions to the true values.
In [ ]:
gp = george.GP(k)
gp.set_parameter_vector([np.log(10),np.log(1000)])
gp.compute(xobs,yerr=eobs)
def nll(p):
gp.set_parameter_vector(p)
return -gp.log_likelihood(yobs)
def gnll(p):
gp.set_parameter_vector(p)
return -gp.grad_log_likelihood(yobs)
result = minimize(nll, gp.get_parameter_vector(), jac=gnll)
print(result)
gp.set_parameter_vector(result.x)
ypred, vpred = gp.predict(yobs, xtrue, return_var=True)
epred = np.sqrt(vpred)  # predict returns the variance; convert to a standard deviation
plt.plot(xtrue,ytrue)
plt.errorbar(xobs,yobs,yerr=eobs,fmt='.',capsize=0);
plt.fill_between(xtrue,ypred + epred, ypred-epred,alpha=0.2,edgecolor='none')
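Below is one possible implementation of the scheduling loop (a sketch, not a definitive solution): for simplicity it keeps the hyper-parameters fixed at the values fitted above, "takes" a new observation by sampling the true curve plus noise drawn in the same way as eobs, and stops once the predictive uncertainty stays below the threshold everywhere.

threshold = 0.1
xs, ys, es = list(xobs), list(yobs), list(eobs)
while True:
    gp.compute(np.array(xs), yerr=np.array(es))
    mu, var = gp.predict(np.array(ys), xtrue, return_var=True)
    bad = np.sqrt(var) > threshold
    if not bad.any():
        break
    # schedule the next observation at the first time the uncertainty gets too large
    i = np.argmax(bad)
    enew = 10.0**np.random.uniform(-1.5, -1)
    xs.append(xtrue[i])
    ys.append(ytrue[i] + np.random.normal(0, enew))
    es.append(enew)
print('Total number of observations:', len(xs))
plt.plot(xtrue, ytrue)
plt.errorbar(xs, ys, yerr=es, fmt='.', capsize=0)
plt.fill_between(xtrue, mu + np.sqrt(var), mu - np.sqrt(var), alpha=0.2, edgecolor='none');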