Fast GP implementations


In [1]:
%matplotlib inline

In [2]:
%config InlineBackend.figure_format = 'retina'

In [3]:
from matplotlib import rcParams
rcParams["savefig.dpi"] = 100
rcParams["figure.dpi"] = 100
rcParams["figure.figsize"] = 12, 4
rcParams["font.size"] = 16
rcParams["text.usetex"] = False
rcParams["font.family"] = ["sans-serif"]
rcParams["font.sans-serif"] = ["cmss10"]
rcParams["axes.unicode_minus"] = False

In [4]:
# https://github.com/matplotlib/matplotlib/issues/12039
try:
    old_get_unicode_index
except NameError:
    print('Patching matplotlib.mathtext.get_unicode_index')
    import matplotlib.mathtext as mathtext
    old_get_unicode_index = mathtext.get_unicode_index
    mathtext.get_unicode_index = lambda symbol, math=True:\
        ord('-') if symbol == '-' else old_get_unicode_index(symbol, math)


Patching matplotlib.mathtext.get_unicode_index

Benchmarking our implementation

Let's do some timing tests and compare them to what we get with two handy GP packages: george and celerite. We'll learn how to use both along the way.

Exercise 1a

Let's time how long our custom implementation of a GP takes for a rather long dataset. Create a time array of 10,000 points between 0 and 10 and time how long it takes to sample the prior of the GP for the default kernel parameters (unit amplitude and timescale). Add a bit of noise to the sample and then time how long it takes to evaluate the log likelihood for the dataset. Make sure to store the value of the log likelihood for later.


In [5]:
import numpy as np
from scipy.linalg import cho_factor


def ExpSquaredKernel(t1, t2=None, A=1.0, l=1.0):
    """
    Return the ``N x M`` exponential squared
    covariance matrix between time vectors `t1`
    and `t2`. The kernel has amplitude `A` and
    lengthscale `l`.
    
    """
    if t2 is None:
        t2 = t1
    T2, T1 = np.meshgrid(t2, t1)
    return A ** 2 * np.exp(-0.5 * (T1 - T2) ** 2 / l ** 2)


def ln_gp_likelihood(t, y, sigma=0, A=1.0, l=1.0):
    """
    Return the log of the GP likelihood of the
    data `y(t)` given uncertainty `sigma` and
    an Exponential Squared Kernel with amplitude `A`
    and length scale `l`.
    
    """
    # The covariance and its determinant
    npts = len(t)
    kernel = ExpSquaredKernel
    K = kernel(t, A=A, l=l) + sigma ** 2 * np.eye(npts)
    
    # The marginal log likelihood
    log_like = -0.5 * np.dot(y.T, np.linalg.solve(K, y))
    log_like -= 0.5 * np.linalg.slogdet(K)[1]
    log_like -= 0.5 * npts * np.log(2 * np.pi)
    
    return log_like


def draw_from_gaussian(mu, S, ndraws=1, eps=1e-12):
    """
    Generate samples from a multivariate gaussian
    specified by covariance ``S`` and mean ``mu``.
    
    (We derived these equations in Day 1, Notebook 01, Exercise 7.)
    """
    npts = S.shape[0]
    L, _ = cho_factor(S + eps * np.eye(npts), lower=True)
    L = np.tril(L)
    u = np.random.randn(npts, ndraws)
    x = np.dot(L, u) + mu[:, None]
    return x.T


def compute_gp(t_train, y_train, t_test, sigma=0, A=1.0, l=1.0):
    """
    Compute the mean vector and covariance matrix of a GP
    at times `t_test` given training points `y_train(t_train)`.
    The training points have uncertainty `sigma` and the
    kernel is assumed to be an Exponential Squared Kernel
    with amplitude `A` and lengthscale `l`.
    
    """
    # Compute the required matrices
    kernel = ExpSquaredKernel
    Stt = kernel(t_train, A=A, l=l)
    Stt += sigma ** 2 * np.eye(Stt.shape[0])
    Spp = kernel(t_test, A=A, l=l)
    Spt = kernel(t_test, t_train, A=A, l=l)

    # Compute the mean and covariance of the GP
    mu = np.dot(Spt, np.linalg.solve(Stt, y_train))
    S = Spp - np.dot(Spt, np.linalg.solve(Stt, Spt.T))
    
    return mu, S

In [6]:
%%time
np.random.seed(3)
t = np.linspace(0, 10, 10000)
sigma = np.ones_like(t) * 0.05
gp_mu, gp_S = compute_gp([], [], t, A=1.0, l=1.0)
y = draw_from_gaussian(gp_mu, gp_S)[0] + sigma * np.random.randn(len(t))


CPU times: user 19.6 s, sys: 2.64 s, total: 22.2 s
Wall time: 6.52 s

In [7]:
%%time
ln_like_custom = ln_gp_likelihood(t, y, sigma)
ln_like_custom


CPU times: user 43.2 s, sys: 2.86 s, total: 46.1 s
Wall time: 8.74 s
Out[7]:
15734.86961740783

Exercise 1b

Let's time how long it takes to do the same operations using the george package (pip install george).

The kernel we'll use is

kernel = amp ** 2 * george.kernels.ExpSquaredKernel(tau ** 2)

where amp = 1 and tau = 1 in this case.

To instantiate a GP using george, simply run

gp = george.GP(kernel)

The george package pre-computes a lot of matrices that are re-used in different operations, so before anything else, ask it to compute the GP model for your timeseries:

gp.compute(t, sigma)

Note that we've only given it the time array and the uncertainties, so as long as those remain the same, you don't have to re-compute anything. This will save you a lot of time in the long run!

Finally, the log likelihood is given by gp.log_likelihood(y) and a sample can be drawn by calling gp.sample().

How do the speeds compare? Did you get the same value of the likelihood (assuming you computed it for the same sample in both cases)?


In [8]:
import george

In [9]:
%%time
kernel = george.kernels.ExpSquaredKernel(1.0)
gp = george.GP(kernel)
gp.compute(t, sigma)


CPU times: user 9.78 s, sys: 448 ms, total: 10.2 s
Wall time: 3.52 s

In [10]:
%%time
print(gp.log_likelihood(y))


15734.869617378674
CPU times: user 630 ms, sys: 9.2 ms, total: 639 ms
Wall time: 109 ms

In [11]:
%%time
gp.sample()


CPU times: user 193 ms, sys: 2.45 ms, total: 196 ms
Wall time: 34.6 ms
Out[11]:
array([-1.35470891, -1.31074677, -1.32633582, ...,  0.85461801,
        0.92451666,  0.96498234])
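
Did the likelihoods match? Here's a quick check (a sketch, assuming the cells above have been run so that `ln_like_custom` from Exercise 1a and the george `gp` are still in scope):

# Compare our custom (exact) log likelihood to george's basic-solver value
ln_like_george = gp.log_likelihood(y)
print(ln_like_custom, ln_like_george, ln_like_custom - ln_like_george)

The two values should differ only at the level of numerical round-off, since both compute the exact GP likelihood.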

Exercise 1c

george offers a fancy GP solver called the HODLR solver, which makes some approximations that dramatically speed up the matrix algebra. Instantiate the GP object again by passing the keyword solver=george.HODLRSolver and re-compute the log likelihood. How long did that take?

(I wasn't able to draw samples using the HODLR solver; unfortunately this may not be implemented.)


In [12]:
%%time
gp = george.GP(kernel, solver=george.HODLRSolver)
gp.compute(t, sigma)


CPU times: user 563 ms, sys: 62.9 ms, total: 625 ms
Wall time: 103 ms

In [13]:
%%time
gp.log_likelihood(y)


CPU times: user 13 ms, sys: 536 µs, total: 13.5 ms
Wall time: 2.81 ms
Out[13]:
15647.419630922675
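
Note that the HODLR log likelihood above differs noticeably from the exact value: that's the price of the approximation. A quick comparison (again a sketch, assuming `ln_like_custom` from Exercise 1a is still defined):

# The HODLR solver is approximate, so its log likelihood deviates from the exact value
ln_like_hodlr = gp.log_likelihood(y)
print("exact: {:.3f}   HODLR: {:.3f}   difference: {:.3f}".format(
    ln_like_custom, ln_like_hodlr, ln_like_custom - ln_like_hodlr))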

Exercise 2

The george package is super useful for GP modeling, and I recommend you read over the docs and examples. It implements several different kernels that come in handy in different situations, and it has support for multi-dimensional GPs. But if all you care about are GPs in one dimension (in this case, we're only doing GPs in the time domain, so we're good), then celerite is the package for you:

pip install celerite

Check out the docs here, as well as several tutorials. There is also a paper that discusses the math behind celerite. The basic idea is that for certain families of kernels, there exist extremely efficient methods of factorizing the covariance matrices. Whereas GP fitting typically scales with the number of datapoints $N$ as $N^3$, celerite is able to do everything in order $N$ (!!!). This is a huge advantage, especially for datasets with tens or hundreds of thousands of data points. Using george (or any homebuilt GP model) quickly becomes intractable for datasets much larger than about 10,000 points, but with celerite it's a breeze.

Repeat the timing tests, but this time using celerite. Note that the Exponential Squared Kernel is not available in celerite, because it doesn't have the special form needed to make its factorization fast. Instead, use the Matern 3/2 kernel, which is qualitatively similar, and which can be approximated quite well in terms of the celerite basis functions:

kernel = celerite.terms.Matern32Term(np.log(1), np.log(1))

Note that celerite accepts the log of the amplitude and the log of the timescale. Other than this, you should be able to compute the likelihood and draw a sample with the same syntax as george.

How much faster did it run?


In [14]:
import celerite
from celerite import terms

In [15]:
%%time
kernel = terms.Matern32Term(np.log(1), np.log(1))
gp = celerite.GP(kernel)
gp.compute(t, sigma)


CPU times: user 39.2 ms, sys: 6.42 ms, total: 45.6 ms
Wall time: 7.05 ms

In [16]:
%%time
gp.log_likelihood(y)


CPU times: user 7.5 ms, sys: 456 µs, total: 7.96 ms
Wall time: 493 µs
Out[16]:
15552.578916096716

In [17]:
%%time
gp.sample()


CPU times: user 3.21 ms, sys: 839 µs, total: 4.05 ms
Wall time: 1.06 ms
Out[17]:
array([0.27368915, 0.39666543, 0.32132302, ..., 1.14988871, 1.10852573,
       1.25196767])
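
To see the scaling in action, here's a rough timing sketch (not part of the original exercise; exact numbers will depend on your machine). The cost of celerite's log likelihood should grow roughly linearly with the number of data points:

# Time celerite's log likelihood for increasing dataset sizes
import time

for n in [1000, 10000, 100000]:
    t_n = np.linspace(0, 10, n)
    sigma_n = 0.05 * np.ones(n)
    gp_n = celerite.GP(terms.Matern32Term(np.log(1), np.log(1)))
    gp_n.compute(t_n, sigma_n)
    y_n = gp_n.sample()
    start = time.time()
    gp_n.log_likelihood(y_n)
    print("N = {:6d}: {:.4f} s".format(n, time.time() - start))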

Exercise 3

Let's use celerite for a real application: fitting an exoplanet transit model in the presence of correlated noise.

Below is a (fictitious) light curve for a star with a transiting planet. There is a transit visible to the eye at $t = 0$, which (say) is when you'd expect the planet to transit if its orbit were perfectly periodic. However, a recent paper claims that the planet shows transit timing variations, which are indicative of a second, perturbing planet in the system, and that a transit at $t = 0$ can be ruled out at 3 $\sigma$. Your task is to verify this claim.

Assume you have no prior information on the planet other than the transit occurs in the observation window, the depth of the transit is somewhere in the range $(0, 1)$, and the transit duration is somewhere between $0.1$ and $1$ day. You don't know the exact process generating the noise, but you are certain that there's correlated noise in the dataset, so you'll have to pick a reasonable kernel and estimate its hyperparameters.

Fit the transit with a simple inverted Gaussian with three free parameters:

def transit_shape(depth, t0, dur):
    return -depth * np.exp(-0.5 * (t - t0) ** 2 / (0.2 * dur) ** 2)

Read the celerite docs to figure out how to solve this problem efficiently.

HINT: I borrowed heavily from this tutorial, so you might want to take a look at it...


In [18]:
import matplotlib.pyplot as plt
from celerite.modeling import Model
import os

# Define the model
class MeanModel(Model):
    parameter_names = ("depth", "t0", "dur")
    def get_value(self, t):
        return -self.depth * np.exp(-0.5 * (t - self.t0) ** 2 / (0.2 * self.dur) ** 2)

mean_model = MeanModel(depth=0.5, t0=0.05, dur=0.7)
mean_model.parameter_bounds = [(0, 1.0), (-0.1, 0.4), (0.1, 1.0)]
true_params = mean_model.get_parameter_vector()

# Simulate the data
np.random.seed(71)
x = np.sort(np.random.uniform(-1, 1, 70))
yerr = np.random.uniform(0.075, 0.1, len(x))
K = 0.2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 10.5)
K[np.diag_indices(len(x))] += yerr ** 2
y = np.random.multivariate_normal(mean_model.get_value(x), K)
y -= np.nanmedian(y)

# Plot the data
plt.errorbar(x, y, yerr=yerr, fmt=".k", capsize=0)
t = np.linspace(-1, 1, 1000)
plt.plot(t, mean_model.get_value(t))
plt.ylabel(r"$y$")
plt.xlabel(r"$t$")
plt.xlim(-1, 1)
plt.gca().yaxis.set_major_locator(plt.MaxNLocator(5))
plt.title("simulated data");

# Save it
X = np.hstack((x.reshape(-1, 1), y.reshape(-1, 1), yerr.reshape(-1, 1)))
if not (os.path.exists("data")):
    os.mkdir("data")
np.savetxt("data/sample_transit.txt", X)



In [19]:
import matplotlib.pyplot as plt
x, y, yerr = np.loadtxt("data/sample_transit.txt", unpack=True)
plt.errorbar(x, y, yerr=yerr, fmt=".k", capsize=0)
plt.xlabel("time")
plt.ylabel("relative flux");



In [20]:
from celerite.modeling import Model
from scipy.optimize import minimize


# Define the transit model as a celerite `Model`
class MeanModel(Model):
    parameter_names = ("depth", "t0", "dur")
    def get_value(self, t):
        return -self.depth * np.exp(-0.5 * (t - self.t0) ** 2 / (0.2 * self.dur) ** 2)

# Instantiate it with some guesses (which are actually the true values in this case!)
mean_model = MeanModel(depth=0.5, t0=0.05, dur=0.7)
mean_model.parameter_bounds = [(0, 1.0), (-0.1, 0.4), (0.1, 1.0)]
true_params = mean_model.get_parameter_vector()

# Set up the GP model
kernel = terms.RealTerm(log_a=np.log(np.var(y)), log_c=0)
gp = celerite.GP(kernel, mean=mean_model, fit_mean=True)
gp.compute(x, yerr)
print("Initial log-likelihood: {0}".format(gp.log_likelihood(y)))

# Define a cost function
def neg_log_like(params, y, gp):
    gp.set_parameter_vector(params)
    return -gp.log_likelihood(y)

def grad_neg_log_like(params, y, gp):
    gp.set_parameter_vector(params)
    return -gp.grad_log_likelihood(y)[1]

# Fit for the maximum likelihood parameters
initial_params = gp.get_parameter_vector()
bounds = gp.get_parameter_bounds()
soln = minimize(neg_log_like, initial_params,
                method="L-BFGS-B", bounds=bounds, args=(y, gp))
gp.set_parameter_vector(soln.x)
print("Final log-likelihood: {0}".format(-soln.fun))

# Make the maximum likelihood prediction
t = np.linspace(-1, 1, 500)
mu, var = gp.predict(y, t, return_var=True)
std = np.sqrt(var)

# Plot the data
color = "#ff7f0e"
plt.errorbar(x, y, yerr=yerr, fmt=".k", capsize=0)
plt.plot(t, mu, color=color)
plt.fill_between(t, mu+std, mu-std, color=color, alpha=0.3, edgecolor="none")
plt.ylabel(r"$y$")
plt.xlabel(r"$t$")
plt.xlim(-1, 1)
plt.gca().yaxis.set_major_locator(plt.MaxNLocator(5))
plt.title("maximum likelihood prediction");


Initial log-likelihood: 61.43379965918291
Final log-likelihood: 66.6772739848445

In [21]:
def log_probability(params):
    gp.set_parameter_vector(params)
    lp = gp.log_prior()
    if not np.isfinite(lp):
        return -np.inf
    try:
        return gp.log_likelihood(y) + lp
    except celerite.solver.LinAlgError:
        return -np.inf

In [22]:
import emcee

initial = np.array(soln.x)
ndim, nwalkers = len(initial), 32
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability)

print("Running burn-in...")
p0 = initial + 1e-8 * np.random.randn(nwalkers, ndim)
p0, lp, _ = sampler.run_mcmc(p0, 1000)

print("Running production...")
sampler.reset()
sampler.run_mcmc(p0, 2000);


Running burn-in...
Running production...

In [23]:
# Plot the data.
plt.errorbar(x, y, yerr=yerr, fmt=".k", capsize=0)

# Plot 24 posterior samples.
samples = sampler.flatchain
for s in samples[np.random.randint(len(samples), size=24)]:
    gp.set_parameter_vector(s)
    mu = gp.predict(y, t, return_cov=False)
    plt.plot(t, mu, color=color, alpha=0.3)

plt.ylabel(r"$y$")
plt.xlabel(r"$t$")
plt.xlim(-1, 1)
plt.gca().yaxis.set_major_locator(plt.MaxNLocator(5))
plt.title("posterior predictions");



In [24]:
import corner
names = gp.get_parameter_names()
cols = mean_model.get_parameter_names()
inds = np.array([names.index("mean:"+k) for k in cols])
corner.corner(sampler.flatchain[:, inds], truths=true_params,
              labels=[r"depth", r"$t_0$", r"dur"]);


The transit time is inconsistent with $t = 0$ at about 3 $\sigma$, so the claim that TTVs are present is probably true. But remember that we're making strong assumptions about the nature of the correlated noise (we assumed a specific kernel), so in reality our uncertainty should be a bit higher.
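
To put a number on that statement, here is a quick sketch (assuming `sampler` and `inds` from the cells above are still in scope) that summarizes the posterior for $t_0$ and its offset from zero in units of its standard deviation:

# Posterior mean and standard deviation of t0 (inds[1] indexes "mean:t0" in the flattened chain)
t0_samples = sampler.flatchain[:, inds[1]]
t0_mean, t0_std = np.mean(t0_samples), np.std(t0_samples)
print("t0 = {:.4f} +/- {:.4f}  ({:.1f} sigma from zero)".format(
    t0_mean, t0_std, abs(t0_mean) / t0_std))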