In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as pl

Gaussian process regression

Lecture 2 - GPs in action

Suzanne Aigrain, University of Oxford

LSST DSFP Session 4, Seattle, Sept 2017

When should I use a GP?

  • to model a signal (of known form) in the presence of correlated noise
    • c.f. transit / eclipse example from Lecture 1
  • to obtain noise-free estimates of (i.e. emulate) a function without knowing its form
    • c.f. photo-$z$ example from Lecture 1
  • to learn some generic properties of a function without specifying its detailed form
    • e.g. measuring length scale, or period using a periodic GP
  • to model a stochastic process whose dynamics we know in detail
  • ...

How should I use a GP?

  • choose your kernel with care
    • even the most sophisticated Bayesian analysis will only answer the question(s) you ask!
  • check its predictions are sensible
    • cross validation, simulations
  • marginalise over the hyper-parameters
    • HPs can be very correlated; need flavours of MCMC that can cope with that
  • use as much information as you can
    • multiple input and outputs
  • for large datasets, things will grind to a halt
    • use clever tricks and optimized packages: sparse GPs (GPy), HODLR (george), celerite

Exoplanet transmission spectra

<img src="images/transitschematic.png" width="1000">

Systematics due to telescope pointing and temperature variations dominate the planet's atmospheric signal.

We know the measured flux depends on telescope pointing, detector temperature, etc., but we don't know how.

Simple linear basis models lead to over-fitting and/or over-confidence. Much controversy over the years...

GPs offer a principled approach to this problem.

See series of papers by Neale Gibson et al.:

  • 2011: identifying the problem
  • 2012a: introducing the GP method
  • 2012b: application to HST/WFC3
  • 2013a, b: application to Gemini/GMOS
  • 2014: testing reliability of parametric and GP models

Cross-validation

Split data into test and training sets.

Use test set to assess predictive performance of model.
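A minimal sketch of this check, using george on synthetic data (all names and numbers here are hypothetical):

In [ ]:
import numpy as np
import george
from george import kernels

# hypothetical synthetic dataset: inputs t, observations y, uncertainties yerr
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0, 10, 100))
y = np.sin(t) + 0.1 * rng.standard_normal(100)
yerr = 0.1 * np.ones_like(t)

# hold out a random test set; condition the GP on the training set only
test = rng.choice(len(t), size=20, replace=False)
train = np.setdiff1d(np.arange(len(t)), test)
gp = george.GP(1.0 * kernels.ExpSquaredKernel(1.0))
gp.compute(t[train], yerr[train])

# predictive mean and variance at the held-out inputs
mu, var = gp.predict(y[train], t[test], return_var=True)

# standardised residuals should be ~ N(0, 1) if the model is adequate
z = (y[test] - mu) / np.sqrt(var + yerr[test] ** 2)
print(z.mean(), z.std())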

Marginalising over hyper-parameters

In astrophysical applications, we are rarely interested in interpolation per se. More frequently, we are interested in (some of) the hyper-parameters, and not in others.

For example, in transit spectra, we care only about $R_p/R_*$.

We need to marginalise over the other hyper-parameters to obtain posterior distributions for the parameters of interest.

GPs are readily incorporated into sampling frameworks such as Markov Chain Monte Carlo (MCMC).
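A sketch of this with emcee and george, reusing the hypothetical t, y, yerr arrays from the sketch above (the priors and chain settings are illustrative only):

In [ ]:
import numpy as np
import emcee
import george
from george import kernels

def log_prob(theta):
    # theta = (log amplitude, log length scale); crude flat prior
    if np.any(np.abs(theta) > 10):
        return -np.inf
    gp = george.GP(np.exp(theta[0]) * kernels.ExpSquaredKernel(np.exp(theta[1])))
    gp.compute(t, yerr)
    return gp.log_likelihood(y)

nwalkers, ndim = 32, 2
p0 = 1e-3 * np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000)

# marginal posterior of one hyper-parameter = histogram of one chain column
samples = sampler.get_chain(discard=500, flat=True)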

Multiple inputs and automatic relevance determination

Kernel: $$ k_{\rm ARD}(\mathbf{x},\mathbf{x'}) = A \exp \left[ - \sum_{j=1}^{D} \eta_j (x_j-x'_j)^2 \right]. $$

Shrinkage (hyper-) priors (Gamma distribution with shape parameter of unity): $$ \mathrm{p}(\eta) = \frac{1}{l} \exp\left(-\frac{\eta}{l}\right) $$ with large $l$ so the prior is uninformative.
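In george, this ARD form corresponds to giving the squared exponential kernel one metric entry per input dimension; the metric entries $m_j$ relate to the $\eta_j$ above via $\eta_j = 1/(2 m_j)$. A sketch with hypothetical values:

In [ ]:
import numpy as np
import george
from george import kernels

X = np.random.randn(50, 2)   # hypothetical 2-D inputs
# one (squared) length scale per input dimension
kernel = 1.0 * kernels.ExpSquaredKernel(metric=[1.0, 4.0], ndim=2)
gp = george.GP(kernel)
gp.compute(X, yerr=0.1)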

The colour of an exoplanet

<img src="images/transitschematic.png" width="1000">

Varying airmass: $\sigma = \sigma_0 + a \exp\left[-(t-t_0)/b \right]$

Heteroskedastic white noise

Clouds: step-change in $\sigma$ at fixed point in time
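A sketch of how this kind of heteroskedastic white noise can be included: per-point uncertainties simply go on the diagonal of the covariance matrix, e.g. via george's yerr argument (all numbers hypothetical):

In [ ]:
import numpy as np
import george
from george import kernels

t = np.linspace(0, 1, 200)               # hypothetical time array
sigma0, a, b, t0 = 1e-3, 5e-4, 0.2, 0.0

# airmass-like trend in the white-noise level (expression above)
sigma = sigma0 + a * np.exp(-(t - t0) / b)

# clouds: step change in sigma at a fixed (hypothetical) time
sigma[t > 0.7] *= 3.0

gp = george.GP(1.0 * kernels.Matern32Kernel(0.1))
gp.compute(t, yerr=sigma)   # heteroskedastic diagonal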

Quasi-periodic GP for spotted star light curve

$$ k_{\mathrm{QP}}(x,x') = A \exp \left[ -\Gamma_1 \sin^2\left(\frac{\pi r}{P}\right) -\Gamma_2 r^2 \right]. $$
  • $k_{\mathrm{QP}} > 0$ for $r=(n+0.5) P$, allowing non-harmonic behaviour
  • only 4 parameters.

C.f. Tutorial 1 challenge problem.
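In george, a kernel of this form can be built as a product of existing kernels (a sketch; the parameter values are hypothetical). Here gamma plays the role of $\Gamma_1$, and the metric $m$ sets the evolutionary term via $\Gamma_2 = 1/(2m)$:

In [ ]:
import numpy as np
from george import kernels

P = 3.5   # hypothetical rotation period
k_qp = 1.0 * kernels.ExpSine2Kernel(gamma=5.0, log_period=np.log(P)) \
           * kernels.ExpSquaredKernel(metric=100.0)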

Used for:

Interpolating a ground-based light curve

Helps estimate "spottiness" at time of transit observations

Cautionary tale 1

Cautionary tale 2

Additive GPs: systematics in K2 data

Aigrain et al. (2015, 2016)

Two-component kernel, to represent intrinsic variability and pointing-related systematics: $$ K_{ij} = K_{t,ij} + K_{p,ij} = k_t(t_i,t_j) + k_p((x_i,y_i),(x_j,y_j)) $$

Can separate the two components: $$ \begin{array}{rcl} \overline{f}_{t,*} & = & \boldsymbol{k}^{\mathrm{T}}_{t,*} (K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{f} \\ \overline{f}_{p,*} & = & \boldsymbol{k}^{\mathrm{T}}_{p,*} (K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{f} \end{array} $$
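A sketch of such an additive kernel in george, using the axes argument to assign each component its own input dimensions (all values hypothetical):

In [ ]:
from george import kernels

# 3-D inputs: column 0 is time, columns 1-2 are detector position (x, y)
# K_ij = k_t(t_i, t_j) + k_p((x_i, y_i), (x_j, y_j))
k_t = 1.0 * kernels.ExpSquaredKernel(metric=1.0, ndim=3, axes=0)
k_p = 1.0 * kernels.ExpSquaredKernel(metric=[0.5, 0.5], ndim=3, axes=[1, 2])
kernel = k_t + k_p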

Multiple outputs $\rightarrow$ 2-D inputs

Consider two time series $\boldsymbol{y_1}$ and $\boldsymbol{y_2}$.

We want to model both jointly as realisations of the same process, but with (say) different output scaling and white noise variances.

Can simply concatenate the two output time-series, and define 2-D input $\mathbf{x}_i = (t_i,l_i)$, where $t$ is time and $l$ is a label which identifies which subset the observation belongs to.
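A minimal numpy sketch of this construction (all data hypothetical):

In [ ]:
import numpy as np

# two hypothetical time series, possibly with different sampling
t1, y1 = np.linspace(0, 10, 40), np.random.randn(40)
t2, y2 = np.linspace(0, 10, 60), np.random.randn(60)

# concatenate the outputs, and add a label column to the inputs:
# x_i = (t_i, l_i), with l = 0 for series 1 and l = 1 for series 2
t = np.concatenate([t1, t2])
l = np.concatenate([np.zeros_like(t1), np.ones_like(t2)])
X = np.column_stack([t, l])
y = np.concatenate([y1, y2])
# a kernel over X can then give each series its own output scaling
# and white-noise variance via the label dimension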

Example: stellar activity in radial velocities

Starspots also induce noise in RV searches for planets

The RV curve of a spot looks like the time-derivative of its light curve (Aigrain et al. 2012)

Rajpaul et al. (2015): Extend the concept to other (spectral) indicators of activity

Derivative observations

$$ \begin{array}{rcl} y_1(t) & = & A \, G(t) + B \, \dot{G}(t) \\ y_2(t) & = & C \, \dot{G}(t) \\ y_3(t) & = & D \, G(t) + E \, \dot{G}(t) \end{array} $$

Covariance between observations of $G$ and $\dot{G}$: $$ \begin{array}{rcl} \gamma^{(G,dG)}(t_i,t_j) & = & \frac{\partial}{\partial t} \left. \gamma^{(G,G)}(t,t_j) \right|_{(t=t_i)}\\ \gamma^{(dG,dG)}(t_i,t_j) & = & \frac{\partial}{\partial t'} \, \frac{\partial}{\partial t} \left. \left. \gamma^{(G,G)}(t,t') \right|_{(t=t_i)} \right|_{(t'=t_j)} \end{array} $$ (Similar expressions exist for integral observations)
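For example, assuming a squared exponential covariance $\gamma^{(G,G)}(t,t') = A \exp\left[-\frac{(t-t')^2}{2\ell^2}\right]$ (an illustrative choice, not the only option), these expressions evaluate to: $$ \begin{array}{rcl} \gamma^{(G,dG)}(t_i,t_j) & = & -\frac{A\,(t_i-t_j)}{\ell^2} \, \exp\left[-\frac{(t_i-t_j)^2}{2\ell^2}\right] \\ \gamma^{(dG,dG)}(t_i,t_j) & = & \frac{A}{\ell^2}\left[1 - \frac{(t_i-t_j)^2}{\ell^2}\right] \exp\left[-\frac{(t_i-t_j)^2}{2\ell^2}\right] \end{array} $$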

Speeding up GPs

GP regression involves inverting the covariance matrix and computing its determinant.

This requires $\mathcal{O}(N^3)$ operations. Impractical for $N \gtrsim 1000$.

For very large $N$, even evaluating $K$ is prohibitive in terms of memory.

To get around this:

  • Sub-sample or bin
  • Ensure covariance matrix is sparse by construction
  • Make a low-rank approximation (sparse GP)
  • Factorize the matrix in a clever way

Binning: Spitzer data

Also illustrates using logarithmic time $\tau = \log(t + h)$ as an input.

Toeplitz covariance matrices

If inputs are regularly sampled and kernel depends on input distances only, then $K$ is Toeplitz (constant along each diagonal).

Toeplitz matrices can be inverted, and their determinant evaluated in $\mathcal{O}(N \log N)$ operations.

However, real astronomical datasets rarely have strictly regular sampling (missing data).
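When the sampling is regular, a sketch using scipy's Toeplitz solver (note scipy's implementation uses Levinson recursion rather than the asymptotically fastest algorithms; kernel and numbers hypothetical):

In [ ]:
import numpy as np
from scipy.linalg import solve_toeplitz

# regular sampling: K depends only on the lag, so its first column
# specifies the whole matrix
N, dt = 1000, 0.02
lags = dt * np.arange(N)
col = np.exp(-0.5 * (lags / 0.3) ** 2)   # hypothetical SE kernel values
col[0] += 1e-4                           # white-noise variance on the diagonal

y = np.random.randn(N)                   # hypothetical data vector
alpha = solve_toeplitz(col, y)           # K^{-1} y, without forming the N x N matrix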

Compact support

Can define the kernel so it is zero beyond some input distance, leading to a sparse (banded or block-diagonal) matrix. The challenge is ensuring positive semi-definiteness.

Sparse GPs

$\mathcal{O}(N^3) \rightarrow \mathcal{O}(NM^2)$, where user defines $M \le N$.

Low-rank approximation of $K$ $$ K_{NN} \approx Q_{NN} = K_{NM} K^{-1}_{MM} K_{MN} $$

For full description see e.g. these slides by James Hensman.

Variational sparse GPs

The trick is to "imagine" we had an extra set of $M$ observations $\boldsymbol{u}$ at inputs $\boldsymbol{Z}$:


$$ \mathrm{p}(\boldsymbol{y}, \boldsymbol{f},\boldsymbol{u}) = \mathrm{p}(\boldsymbol{y}\,|\,\boldsymbol{f}) \, \mathrm{p}(\boldsymbol{f}\,|\,\boldsymbol{u}) \, \mathrm{p}(\boldsymbol{u}) $$

where: $$ \mathrm{p}(\boldsymbol{y}\,|\,\boldsymbol{f}) = \mathcal{N}(\boldsymbol{y}\,|\,\boldsymbol{f},\sigma^2\mathbb{I}), $$

$$ \mathrm{p}(\boldsymbol{f}\,|\,\boldsymbol{u}) = \mathcal{N}(\boldsymbol{f}\,|\,K_{NM} \, K_{MM}^{-1} \boldsymbol{u}, K_{NN} - K_{NM} \, K_{MM}^{-1} \, K_{MN}), $$
$$ \mathrm{p}(\boldsymbol{u}) = \mathcal{N}(\boldsymbol{u}\,|\,\boldsymbol{0},K_{MM}). $$


Then use a variational approach to approximate $\mathrm{p}(\boldsymbol{y} \,|\, \boldsymbol{u})$, and marginalise over $\boldsymbol{u}$, giving the approximate likelihood:

$$ \tilde{\mathrm{p}}(\boldsymbol{y}) = \mathcal{N}(\boldsymbol{y}\,|\,\boldsymbol{0},\,K_{NM} \, K_{MM}^{-1} \, K_{MN} + \sigma^2\mathbb{I}) \, \exp \left\{ - \frac{1}{2 \sigma^2} \sum_{i=1}^{N} \left( K_{ii} - \boldsymbol{k}_{Mi}^\mathrm{T} \, K_{MM}^{-1} \boldsymbol{k}_{Mi}\right) \right\}. $$

Never need to invert $K_{NN}$.

Depends only on the covariance function parameters $\boldsymbol{\theta}$, and the inducing points $Z$: jointly optimize $\boldsymbol{\theta}$ and $Z$.

This form of sparse GP is implemented in the GPy package.
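A minimal sketch of such a fit in GPy (data and number of inducing points hypothetical):

In [ ]:
import numpy as np
import GPy

X = np.random.uniform(0, 10, (2000, 1))          # hypothetical inputs
Y = np.sin(X) + 0.1 * np.random.randn(2000, 1)   # hypothetical outputs

# variational sparse GP with M = 20 inducing points; optimisation moves
# the inducing inputs Z jointly with the kernel hyper-parameters
kern = GPy.kern.RBF(input_dim=1)
m = GPy.models.SparseGPRegression(X, Y, kernel=kern, num_inducing=20)
m.optimize()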

Hierarchical Off-Diagonal Low-Rank (HODLR) method

Write $K = K_{k} \, K_{k-1} \, K_{k-2} \, \ldots \, K_1 \, K_0$, where $K_{k} \in \mathbb{R}^{N \times N}$ is a block-diagonal matrix with $2^k$ diagonal blocks, each of size $N/2^k \times N/2^k$.


Furthermore, each diagonal block is a low-rank update to the identity matrix.

This allows evaluating $K^{-1}$ and $|K|$ in $\mathcal{O}(N \log^2 N)$, regardless of the number of input dimensions.
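In george, switching to the HODLR solver is a one-line change (a sketch with hypothetical data):

In [ ]:
import numpy as np
import george
from george import kernels

t = np.sort(np.random.uniform(0, 100, 5000))   # hypothetical inputs
y = np.random.randn(len(t))                    # hypothetical data

# swap the default (Cholesky) solver for the HODLR solver;
# the rest of the interface is unchanged
gp = george.GP(1.0 * kernels.Matern32Kernel(10.0), solver=george.HODLRSolver)
gp.compute(t, yerr=0.1)
print(gp.log_likelihood(y))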

Python packages

  • GPy: general-purpose Python GP package from Sheffield ML.
    • Includes sparse GPs using the variational approximation.
    • High-level modelling and plotting tools.
  • george: basic (Cholesky) or HODLR solver, from Dan Foreman-Mackey & co.
  • celerite: GP regression in $\mathcal{O}(N)$ for a specific family of (1-D) kernels

Both come with a basic modelling framework (less sophisticated than GPy's).

Easy to combine with MCMC using emcee, or nested sampling using PyMultiNest.
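A minimal celerite sketch (hypothetical data; the Matern-3/2 term is one member of celerite's allowed kernel family):

In [ ]:
import numpy as np
import celerite
from celerite import terms

t = np.sort(np.random.uniform(0, 10, 10000))   # hypothetical large dataset
yerr = 0.1 * np.ones_like(t)
y = np.sin(t) + yerr * np.random.randn(len(t))

kernel = terms.Matern32Term(log_sigma=0.0, log_rho=0.0)
gp = celerite.GP(kernel)
gp.compute(t, yerr)            # O(N) factorisation
print(gp.log_likelihood(y))    # O(N) likelihood evaluation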