In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as pl

Gaussian process regression

Lecture 3: Bayesian Quadrature

Suzanne Aigrain, University of Oxford

LSST DSFP Session 4, Seattle, Sept 2017

Bayesian quadrature

... is a model-based approach to numerical integration, where a GP prior is placed on the integrand.

Setting up a GP model over the function to be integrated enables analytic estimation of the posterior mean and variance of the integral (as well as of the integrand itself).

Consider the integral

$$ Z = \int f(x) \, p(x) \, \mathrm{d}x $$

where $p(x)$ is a probability density, and $f(x)$ is the function we wish to integrate. For example

  • given a model with parameters $x$, $p(x)$ could be the posterior over the parameters, and $f(x)$ the predictions of the model, so that $Z$ is the posterior mean of model predictions, or
  • $p(x)$ could be a prior and $f(x)=\mathrm{p}(y \, | \, x)$ the likelihood, so that $Z$ is the marginal likelihood, or evidence.

Classical Monte Carlo

approximates $Z$ as

$$ Z \approx \frac{1}{T} \sum_{t=1}^T f(x^{(t)}) $$

where $x^{(t)}$ are random (not necessarily independent) draws from $p(x)$.
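
As a concrete illustration (not part of the original slides), here is a minimal classical MC estimate for a toy case: the integrand $f(x)=x^2$, the prior $p(x)=\mathcal{N}(0,1)$ and the sample size are all arbitrary assumptions, chosen so that the true value is $Z=1$.

In [ ]:
# Classical Monte Carlo estimate of Z for a toy case (f, p and the sample
# size are illustrative assumptions): p(x) = N(0,1), f(x) = x^2, so the
# true value is Z = E[x^2] = 1.
np.random.seed(42)
def f(x): return x**2
T = 10000
x_t = np.random.randn(T)      # draws from p(x)
Z_mc = np.mean(f(x_t))        # classical MC estimate of Z
print(Z_mc)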

Importance sampling

Usually, sampling directly from the true $p(x)$ is difficult, so instead we draw the samples $x^{(t)}$ from a more tractable importance sampling distribution $q(x)$; then

$$ Z = \int \frac{f(x) \, p(x)}{q(x)} \, q(x) \, \mathrm{d}x \approx \frac{1}{T} \sum_{t=1}^T \frac{f(x^{(t)}) \, p(x^{(t)})}{q(x^{(t)})}. $$
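
A minimal sketch of the same toy integral estimated by importance sampling, assuming an arbitrary Gaussian proposal $q(x)=\mathcal{N}(0,2^2)$ (the choices of $f$, $p$ and $q$ are illustrative, not from the slides):

In [ ]:
# Importance sampling estimate of the toy integral, drawing from a broader
# proposal q(x) = N(0, 2^2) instead of p(x) = N(0, 1).
def gauss_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig)**2) / (sig * np.sqrt(2 * np.pi))

def f(x): return x**2
T = 10000
x_t = 2.0 * np.random.randn(T)                      # draws from q(x) = N(0, 4)
w = gauss_pdf(x_t, 0, 1) / gauss_pdf(x_t, 0, 2.0)   # weights p(x)/q(x)
Z_is = np.mean(f(x_t) * w)                          # importance sampling estimate
print(Z_is)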

Monte Carlo is fundamentally unsound

(O'Hagan 1989)

Problem 1: depends on irrelevant information

The estimate of $Z$ from importance sampling depends on the values of $f(x^{(t)})\,p(x^{(t)})$, but also on the entirely arbitrary choice of the sampling distribution $q(x)$.

If the same set of samples $\{x^{(1)} , \ldots , x^{(T)} \}$, conveying exactly the same information about $Z$, were obtained from two different sampling distributions, two different estimates of $Z$ would be obtained.

Monte Carlo is fundamentally unsound

(O'Hagan 1989)

Problem 2: ignores relevant information

Classical Monte Carlo procedures ignore the values of the $x^{(t)}$.

Imagine three samples $\{ x^{(1)}, x^{(2)}, x^{(3)} \}$, where it just happens by chance that $x^{(1)}=x^{(3)}$. Classical MC simply averages over the 3 samples, which is clearly inappropriate. It would be much better to use only $x^{(1)}$ and $x^{(2)}$, or only $x^{(2)}$ and $x^{(3)}$.

Of course such a situation is unlikely in real life, but the values of the $\{ x^{(t)}\}$ clearly contain relevant information that shouldn't be ignored.

Integration as a Bayesian inference problem

Treat $Z$ as a random variable.

Since $Z$ is a function of $f(x)$, which is unknown until we evaluate it, proceed by putting a prior on $f$.

A Gaussian Process is a convenient way of putting a prior on a function. Also, the integral becomes analytic.

There is no reason to expect $f$ to be Gaussian distributed in general, so this prior is not particularly appropriate, but we don't know of a more general family of distributions over functions that would be as tractable.

Bayesian Monte Carlo

(O'Hagan 1991, Rasmussen & Ghahramani 2003)

Start with a zero-mean GP prior over $f$, typically with a squared exponential covariance function.

Given a set of samples $\mathcal{D} = \{(x^{(i)},f(x^{(i)}))\,|\,i=1,\ldots,N\}$, the posterior over $f$, $\mathrm{p}(f\,|\,\mathcal{D})$, is Gaussian (and is given by the standard GP predictive equations).

As the integral $Z = \int f(x) \, p(x) \, \mathrm{d}x$ is just a linear projection of $f$ (along the direction defined by $p(x)$), the posterior over $Z$ is also Gaussian.

Expectation of $Z$

$$ \begin{array}{rcl} \mathbb{E}_{[f|\mathcal{D}]}[Z] & = & \int \int f(x) \, p(x) \, \mathrm{d}x \, \mathrm{p}(f\,|\,\mathcal{D}) \, \mathrm{d}f \\ & = & \int \left[ \int f(x) \, \mathrm{p}(f\,|\,\mathcal{D}) \, \mathrm{d}f \right] \, p(x) \, \mathrm{d}x \\ & = & \int \overline{f}_{\mathcal{D}}(x) \, p(x) \, \mathrm{d}x \\ \end{array} $$

where $\overline{f}_{\mathcal{D}}$ is the posterior mean, or expectation, of $f(x)$.

Variance of $Z$

$$ \begin{array}{rcl} \mathbb{V}_{[f|\mathcal{D}]}[Z] & = & \int \left[ \int f(x) \, p(x) \, \mathrm{d}x - \int \overline{f}(x') \, p(x') \, \mathrm{d}x' \right]^2 \, \mathrm{p}(f\,|\,\mathcal{D}) \, \mathrm{d}f \\ & = & \int \int \int \left[ f(x) - \overline{f}(x) \right] \, \left[ f(x') - \overline{f}(x') \right] \, \mathrm{p}(f\,|\,\mathcal{D}) \, \mathrm{d}f \, p(x) \, p(x') \, \mathrm{d}x \, \mathrm{d}x' \\ & = & \int \int \mathrm{Cov}_{\mathcal{D}} (f(x), f(x')) \, p(x) \, p(x') \, \mathrm{d}x \, \mathrm{d}x' \\ \end{array} $$

where $\mathrm{Cov}_{\mathcal{D}}$ is the posterior covariance.

Recall the standard GP equations. The posterior mean is given by:

$$ \overline{f}(x) = k(x, \mathbf{x}) \, K^{-1} \, \mathbf{f} $$

and the posterior covariance by

$$ \mathrm{Cov}(f(x),f(x')) = k(x, x') - k(x,\mathbf{x}) \, K^{-1} \, k(\mathbf{x},x') $$

where $\mathbf{x}$ and $\mathbf{f}$ are the observed inputs and outputs, respectively.
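
For concreteness, here is a minimal numpy implementation of these noise-free predictive equations with a squared exponential kernel. The hyper-parameter values, the toy inputs, and the use of a plain matrix inverse with a small jitter term (rather than a Cholesky factorisation) are simplifications for illustration.

In [ ]:
# Noise-free GP posterior mean and covariance at test points xs, given
# training inputs x and exact function values fvals. Squared exponential
# kernel; amplitude h and length scale l are arbitrary illustrative values.
def se_kernel(x1, x2, h=1.0, l=1.0):
    return h**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

def gp_posterior(x, fvals, xs, h=1.0, l=1.0):
    K = se_kernel(x, x, h, l) + 1e-10 * np.eye(len(x))   # jitter for stability
    ks = se_kernel(xs, x, h, l)                          # k(x*, x)
    Kinv = np.linalg.inv(K)
    mean = ks @ Kinv @ fvals                             # k(x*,x) K^-1 f
    cov = se_kernel(xs, xs, h, l) - ks @ Kinv @ ks.T     # posterior covariance
    return mean, cov

x = np.linspace(-3, 3, 9)
mean, cov = gp_posterior(x, np.sin(x), np.linspace(-3, 3, 100))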

Therefore, defining $\mathbf{z}$ such that $z_i = \int k(x, x^{(i)}) \,p(x) \, \mathrm{d}x$,

$$ \begin{array}{rcl} \mathbb{E}_{[f|\mathcal{D}]}[Z] & = & \mathbf{z}^T \, K^{-1} \, \mathbf{f}, \quad \mathrm{and} \\ \mathbb{V}_{[f|\mathcal{D}]}[Z] & = & \int \int k(x,x') \, p(x) \, p(x') \, \mathrm{d}x \, \mathrm{d}x' - \mathbf{z}^T K^{-1} \mathbf{z}. \end{array} $$

Some things to note:

  • The variance of $Z$ is a natural convergence diagnostic.
  • It doesn't depend on $\mathbf{f}$ at all, so an optimal sampling scheme can be devised in advance.
  • These expressions ignore the uncertainty on the GP hyper-parameters.

In general, the integrals in $\mathbb{E}_{[f|\mathcal{D}]}[Z]$ and $\mathbb{V}_{[f|\mathcal{D}]}[Z]$ are not tractable, but they are if the prior $p(x)$ is conjugate to the kernel $k(x,x')$.
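
Putting the expressions above together, and assuming the kernel integrals $z_i$ and $\int\!\!\int k(x,x')\,p(x)\,p(x')\,\mathrm{d}x\,\mathrm{d}x'$ are already available in closed form (e.g. the conjugate Gaussian case on the next slide), the moments of $Z$ reduce to a few lines of linear algebra. A sketch under that assumption:

In [ ]:
# Bayesian Monte Carlo moments of Z, given:
#   z     -- vector of kernel integrals z_i = int k(x, x_i) p(x) dx
#   kpp   -- double integral int int k(x, x') p(x) p(x') dx dx'
#   K     -- kernel matrix at the sample locations
#   fvals -- exact function values at the sample locations
# (Sketch only: z and kpp must be computed analytically or otherwise.)
def bmc_moments(z, kpp, K, fvals):
    E_Z = z @ np.linalg.solve(K, fvals)    # E[Z] = z^T K^-1 f
    V_Z = kpp - z @ np.linalg.solve(K, z)  # V[Z] = kpp - z^T K^-1 z
    return E_Z, V_Z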

Performance

Bayes-Hermite Quadrature

(O'Hagan 1991)

In particular, if we have a Gaussian prior $p(x) = \mathcal{N}(\mathbf{b},B)$ with mean $\mathbf{b}$ and covariance $B$, and the covariance function of the GP over $f$ is squared exponential, so that the Gaussian kernels on the data points are $\mathcal{N} (\mathbf{a}_i = x^{(i)} , A = \mathrm{diag}(w_1^2 , \ldots, w_D^2 ))$, then the expectation evaluates to:

$$ \mathbb{E}_{[f|\mathcal{D}]}[Z] = \mathbf{z}^T \, K^{-1} \, \mathbf{f}, $$

where

$$ z_i = w_0 \, |A^{-1} B + I|^{-1/2} \, \exp \left[ -\frac{1}{2} (\mathbf{a}_i-\mathbf{b})^T (A+B)^{-1} (\mathbf{a}_i-\mathbf{b}) \right] $$

and the variance to $$ \mathbb{V}_{[f|\mathcal{D}]}[Z] = w_0 \, |2 A^{-1} B + I|^{-1/2} - \mathbf{z}^T \, K^{-1} \, \mathbf{z}. $$
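
A minimal one-dimensional Bayes-Hermite demo, assuming a toy integrand $f(x)=x^2$, a prior $p(x)=\mathcal{N}(0,1)$ (so the true value is $Z=1$), hand-picked sample locations, and arbitrary kernel hyper-parameters; in practice the hyper-parameters would be set by maximising the GP marginal likelihood.

In [ ]:
# 1D Bayes-Hermite quadrature: Gaussian prior p(x) = N(b, B) and squared
# exponential kernel k(x,x') = w0 exp[-(x-x')^2 / (2A)], so that the kernel
# integrals z and kpp are analytic. All numbers below are illustrative.
b, B = 0.0, 1.0                 # prior mean and variance
w0, A = 3.0**2, 1.0**2          # kernel amplitude h^2 and squared length scale l^2

x = np.linspace(-3, 3, 9)       # sample locations (could be optimised in advance)
fvals = x**2                    # exact evaluations of the toy integrand

K = w0 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / A)
K += 1e-10 * np.eye(len(x))     # jitter for numerical stability

z = w0 * (1 + B / A)**-0.5 * np.exp(-0.5 * (x - b)**2 / (A + B))
kpp = w0 * (1 + 2 * B / A)**-0.5       # int int k(x,x') p(x) p(x') dx dx'

E_Z = z @ np.linalg.solve(K, fvals)    # expectation of Z (true value: 1)
V_Z = kpp - z @ np.linalg.solve(K, z)  # variance of Z
print(E_Z, V_Z)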

Other forms of prior

Similarly, polynomial kernels and mixtures of Gaussians for $p(x)$ lead to analytical expressions for $\mathbb{E}_{[f|\mathcal{D}]}[Z]$ and $\mathbb{V}_{[f|\mathcal{D}]}[Z]$.

To use arbitrary priors, one needs to resort to the importance reweighting trick (Kennedy 1998):

$$ Z = \int \frac{f(x) \, p(x)}{q(x)} \, q(x) \, \mathrm{d}x $$

where the Gaussian process models $f(x) \, p(x)/q(x)$ and $q(x)$ is Gaussian.

To understand what is meant by "the Gaussian kernels on the data points", recall that the predictive mean for $f$ can be expressed as a linear combination of kernel functions centred on the observed inputs: $$ \overline{f}(x) = \sum_{i=1}^N \alpha_i \, k(x^{(i)},x), $$ where $\boldsymbol{\alpha} = (K + \sigma^2 I)^{-1} \, \mathbf{f}$.

In our case, we can evaluate $f(x)$ exactly, so $\sigma=0$ and $\boldsymbol{\alpha} = K^{-1} \, \mathbf{f}$.

If $k(x,x')= h^2 \exp \left[ -(x-x')^2/(2l^2) \right]$, then each kernel function centred on a sample, viewed as a function of $x$, is proportional to a Gaussian density with mean $x^{(i)}$ and variance $l^2$ (i.e. $w_0 = h^2$ and $A = l^2$ in the notation above). This is what makes the integrals against a Gaussian $p(x)$ analytic.
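
A quick illustrative check of this kernel-expansion view of the predictive mean, using the same toy samples and (arbitrary) squared exponential hyper-parameters as in the demo above:

In [ ]:
# The noiseless predictive mean as a weighted sum of squared exponential
# kernels centred on the observed inputs (toy samples and hyper-parameters
# are arbitrary illustrative choices).
h, l = 3.0, 1.0
x = np.linspace(-3, 3, 9)
fvals = x**2
K = h**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / l**2)
alpha = np.linalg.solve(K + 1e-10 * np.eye(len(x)), fvals)           # alpha = K^-1 f
xs = np.linspace(-4, 4, 200)
k_star = h**2 * np.exp(-0.5 * (xs[:, None] - x[None, :])**2 / l**2)  # k(x*, x_i)
f_bar = k_star @ alpha                                               # sum_i alpha_i k(x_i, x*)
pl.plot(xs, f_bar, label=r'$\overline{f}(x)$')
pl.plot(x, fvals, 'k.', label='samples')
pl.legend();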

GP prior on log likelihood

Osborne et al. (2012)

When using Bayesian quadrature to evaluate evidence, $f$ represents a likelihood. Applying the GP to $\log f$ rather than $f$ ensures it is always positive, and helps match the typically large dynamic range.

This severely affects the results if the likelihood peak is sharp.

GP prior on log likelihood

Osborne et al. (2012)

When using Bayesian quadrature to evaluate evidence, $f$ represents a likelihood. Applying the GP to $\log f$ rather than $f$ ensures it is always positive, and helps match the typically large dynamic range.

Drawback: integration no longer analytic. Approximate using Taylor expansion.

Marginalising over the GP hyper-parameters

Osborne et al. (2012)

This matters a lot for the accuracy of the variance estimate.

Drawback: requires multiple layers of inference, so the computational cost per sample is high.

BBQ ("doubly Bayesian quadrature")

Osborne et al. (2012)

When marginalising over the hyper-parameters, the variance of $Z$ does depend on the sampled function values. This opens up the possibility of active sampling.

Further improvements: WSABI

Gunter et al. (2014)

Model $\sqrt{\mathcal{L}}$ instead of $\log \mathcal{L}$. This goes some way towards matching the dynamic range, while simplifying inference and making it more stable.

RV and astrometric searches for planets

Highly multimodal posteriors, and the likelihood can be expensive to evaluate (e.g. if modelling stellar activity).

Need to compare the evidence for 0, 1, 2, ... planets; cf. the "evidence challenge" at EPRV3.


In [ ]: