In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as pl
Bayesian Monte Carlo (also known as Bayesian quadrature) is a model-based approach to numerical integration, where a GP prior is placed on the integrand.
Setting up a GP model over the function to be integrated enables analytic computation of the posterior mean and variance of the integral (as well as a posterior over the integrand itself).
We want to evaluate integrals of the form $Z = \int f(x) \, p(x) \, \mathrm{d}x$. Usually, sampling from the true $p(x)$ directly is difficult, so we draw the samples $x^{(t)}$ from a more tractable importance-sampling distribution $q(x)$ instead, then
$$ Z = \int \frac{f(x) \, p(x)}{q(x)} \, q(x) \, \mathrm{d}x \approx \frac{1}{T} \sum_{t=1}^T \frac{f(x^{(t)}) \, p(x^{(t)})}{q(x^{(t)})}. $$
If the same set of samples $\{x^{(1)} , \ldots , x^{(T)} \}$, conveying exactly the same information about $Z$, were obtained from two different sampling distributions, two different estimates of $Z$ would be obtained.
Imagine three samples $\{ x^{(1)}, x^{(2)}, x^{(3)} \}$, where it just happens by chance that $x^{(1)}=x^{(3)}$. Classical MC simply averages over the 3 samples, which is clearly inappropriate. It would be much better to use only $x^{(1)}$ and $x^{(2)}$, or only $x^{(2)}$ and $x^{(3)}$.
Of course such a situation is unlikely in real life, but the values of the $\{ x^{(t)}\}$ clearly contain relevant information that shouldn't be ignored.
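As a quick illustration of the classical importance-sampling estimator above, here is a minimal sketch; the choices of $f$, $p$ and $q$ below are arbitrary assumptions for demonstration only.

In [ ]:
import numpy as np
from scipy import stats

# Classical Monte Carlo estimate of Z = int f(x) p(x) dx using
# samples drawn from an importance distribution q(x).
f = lambda x: np.sin(x)**2            # integrand (illustrative choice)
p = stats.norm(0.0, 1.0)              # "true" distribution p(x)
q = stats.norm(0.0, 2.0)              # broader importance distribution q(x)

T = 10000
x = q.rvs(size=T, random_state=0)     # samples x^(t) ~ q(x)
weights = p.pdf(x) / q.pdf(x)         # importance weights p(x^(t)) / q(x^(t))
Z_hat = np.mean(f(x) * weights)       # (1/T) * sum_t f(x^(t)) p(x^(t)) / q(x^(t))
print(Z_hat)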
Since $Z$ is a functional of $f(x)$, which is unknown until we evaluate it, we proceed by putting a prior on $f$.
A Gaussian Process is a convenient way of putting a prior on a function. Also, the integral becomes analytic.
There is no reason to expect $f$ to be Gaussian distributed in general, so this prior is not particularly appropriate, but we don't know of a more general family of distributions over functions that would be as tractable.
Given a set of samples $\mathcal{D} = \{(x^{(i)},f(x^{(i)}))\,|\,i=1,\ldots,N\}$, the posterior over $f$, $\mathrm{p}(f\,|\,\mathcal{D})$, is Gaussian (and is given by the standard GP predictive equations).
As the integral $Z = \int f(x) \, p(x) \, \mathrm{d}x$ is just a linear projection of $f$ (along the direction defined by $p$), the posterior over $Z$ is also Gaussian:
$$ \mathbb{E}_{[f|\mathcal{D}]}[Z] = \int \overline{f}_{\mathcal{D}}(x) \, p(x) \, \mathrm{d}x, $$
where $\overline{f}_{\mathcal{D}}$ is the posterior mean, or expectation, of $f(x)$, and
$$ \mathbb{V}_{[f|\mathcal{D}]}[Z] = \int\!\!\int \mathrm{Cov}_{\mathcal{D}}\big(f(x),f(x')\big) \, p(x) \, p(x') \, \mathrm{d}x \, \mathrm{d}x', $$
where $\mathrm{Cov}_{\mathcal{D}}$ is the posterior covariance.
Recall the standard GP equations. The posterior mean is given by:
$$ \overline{f}(x) = k(x, \mathbf{x}) \, K^{-1} \, \mathbf{f} $$
and the posterior covariance by
$$ \mathrm{Cov}(f(x),f(x')) = k(x, x') - k(x,\mathbf{x}) \, K^{-1} \, k(\mathbf{x},x') $$
where $\mathbf{x}$ and $\mathbf{f}$ are the observed inputs and outputs, respectively.
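As a minimal sketch of these two equations (with an assumed squared-exponential kernel and toy, noise-free data):

In [ ]:
import numpy as np

# Squared-exponential kernel (illustrative hyperparameters h = l = 1)
def kern(a, b, h=1.0, l=1.0):
    return h**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

# Toy noise-free observations (x, f)
x_obs = np.array([-1.0, 0.0, 1.5])
f_obs = np.array([0.2, 1.0, -0.5])

x_star = np.linspace(-3, 3, 200)            # test inputs
K = kern(x_obs, x_obs)                      # K
k_star = kern(x_star, x_obs)                # k(x, x_obs)
post_mean = k_star @ np.linalg.solve(K, f_obs)                            # k(x, X) K^{-1} f
post_cov = kern(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T)  # posterior covariance
print(post_mean[:3], np.diag(post_cov)[:3])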
Therefore, defining $\mathbf{z}$ such that $z_t = \int k(x, x^{(t)}) \,p(x) \, \mathrm{d}x$,
$$ \begin{array}{rcl} \mathbb{E}_{[f|\mathcal{D}]}[Z] & = & \mathbf{z}^T \, K^{-1} \, \mathbf{f}\,~\mathrm{and} \\ \mathbb{V}_{[f|\mathcal{D}]}[Z] & = & \int \int k(x,x') \, p(x) \, p(x') \, \mathrm{d}x \, \mathrm{d}x' - \mathbf{z}^T K^{-1} \mathbf{z} \end{array} $$
Some things to note:
In general the integrals in $\mathbb{E}_{[f|\mathcal{D}]}[Z]$ and $\mathbb{V}_{[f|\mathcal{D}]}[Z]$ are not tractable, but if the prior $p(x)$ is conjugate with the kernel $k(x,x')$, they are.
(O'Hagan 1991)
In particular, if we have a Gaussian prior $p(x) = \mathcal{N}(\mathbf{b},B)$ with mean $\mathbf{b}$ and covariance $B$, and the covariance function of the GP over $f$ is Gaussian, so that the kernels on the data points are $\mathcal{N} (a_i = x^{(i)} , A = \mathrm{diag}(w_1^2 , \ldots, w_D^2 ))$, then the expectation evaluates to:
$$ \mathbb{E}_{[f|\mathcal{D}]}[Z] = \mathbf{z}^T \, K^{-1} \, \mathbf{f}, $$
where
$$ z_t = w_0 \, |A^{-1} B + I|^{-1/2} \exp \left[ -\tfrac{1}{2} (a_t-\mathbf{b})^T (A+B)^{-1} (a_t-\mathbf{b}) \right] $$
and the variance to
$$ \mathbb{V}_{[f|\mathcal{D}]}[Z] = w_0 \, |2 A^{-1} B + I|^{-1/2} - \mathbf{z}^T \, K^{-1} \, \mathbf{z} $$
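To make this concrete, here is a minimal 1-D sketch of the closed-form expressions; the integrand, prior parameters and kernel hyperparameters below are illustrative assumptions, not values taken from the text.

In [ ]:
import numpy as np

# 1-D Bayesian Monte Carlo with Gaussian prior p(x) = N(b, B) and
# Gaussian kernel k(x, x') = w0 exp(-(x - x')^2 / (2 A)).
f = lambda x: np.sin(3 * x) + 1.5          # integrand (arbitrary illustrative choice)
b, B = 0.0, 1.0                            # prior mean and variance
w0, A = 1.0, 0.5**2                        # kernel amplitude w_0 and width A = w^2

x = np.linspace(-2.5, 2.5, 12)             # sample locations a_t = x^(t)
fx = f(x)                                  # function evaluations

K = w0 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / A)
K += 1e-9 * np.eye(len(x))                 # jitter for numerical stability

# z_t = w0 |A^{-1} B + 1|^{-1/2} exp(-(a_t - b)^2 / (2 (A + B)))
z = w0 * (B / A + 1.0)**-0.5 * np.exp(-0.5 * (x - b)**2 / (A + B))

E_Z = z @ np.linalg.solve(K, fx)                                 # posterior mean of Z
V_Z = w0 * (2 * B / A + 1.0)**-0.5 - z @ np.linalg.solve(K, z)   # posterior variance of Z
print(E_Z, V_Z)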
Similarly, polynomials and mixtures of Gaussians for $p(x)$ lead to analytical expressions for $\mathbb{E}_{[f|\mathcal{D}]}[Z]$ and $\mathbb{V}_{[f|\mathcal{D}]}[Z]$.
To be able to use arbitrary priors $p(x)$, we need to resort to the importance-reweighting trick (Kennedy 1998):
$$ Z = \int \frac{f(x) \, p(x)}{q(x)} \, q(x) \, \mathrm{d}x $$
where the Gaussian process models $f(x) \, p(x)/q(x)$ and $q(x)$ is Gaussian.
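A sketch of this trick, under assumed illustrative choices for $f$, a non-Gaussian $p$ and a Gaussian $q$: the GP now models $g(x) = f(x)\,p(x)/q(x)$, and the closed-form expressions above apply with $q$ in place of $p$.

In [ ]:
import numpy as np
from scipy import stats

f = lambda x: np.sin(3 * x) + 1.5                          # integrand (arbitrary)
p = lambda x: 0.5 * stats.norm.pdf(x, -1.0, 0.4) + 0.5 * stats.norm.pdf(x, 1.0, 0.4)  # non-Gaussian prior
b, B = 0.0, 1.5                                            # Gaussian q(x) = N(b, B)
q = lambda x: stats.norm.pdf(x, b, np.sqrt(B))

x = np.linspace(-3, 3, 15)                                 # sample locations
g = f(x) * p(x) / q(x)                                     # the GP models g = f p / q

w0, A = 1.0, 0.4**2                                        # Gaussian kernel hyperparameters
K = w0 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / A) + 1e-9 * np.eye(len(x))
z = w0 * (B / A + 1.0)**-0.5 * np.exp(-0.5 * (x - b)**2 / (A + B))  # kernel-q integrals
E_Z = z @ np.linalg.solve(K, g)                            # posterior mean of Z
print(E_Z)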
To understand what is meant by "the Gaussian kernels on the data points", recall that the predictive mean for $f$ can be expressed as a linear combination of kernel functions centred on the observed inputs: $$ \overline{f}(x) = \sum_{i=1}^N \alpha_i k(x_i,x), $$ where $\boldsymbol{\alpha} = (K + \sigma^2 I)^{-1} \, \mathbf{f}$.
In our case, we can evaluate $f(x)$ exactly, so $\sigma=0$ and $\boldsymbol{\alpha} = K^{-1} \, \mathbf{f}$.
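A small sketch confirming this representation on toy, noise-free data (an assumed setup, reusing the kernel form from the example above):

In [ ]:
import numpy as np

def kern(a, b, h=1.0, l=1.0):
    # Squared-exponential kernel (illustrative hyperparameters)
    return h**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

x_obs = np.array([-1.0, 0.0, 1.5])
f_obs = np.array([0.2, 1.0, -0.5])

alpha = np.linalg.solve(kern(x_obs, x_obs), f_obs)    # alpha = K^{-1} f  (sigma = 0)

# Predictive mean at a test point as a weighted sum of kernels on the data points
x_star = np.array([0.7])
mean_as_sum = np.sum(alpha * kern(x_obs, x_star).ravel())
mean_direct = (kern(x_star, x_obs) @ alpha).item()
print(mean_as_sum, mean_direct)                       # the two agree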
If $k(x,x')= h^2 \exp \left[ -(x-x')^2/2l^2 \right]$, then ...
MORE WORK NEEDED
This severely affects the results if the likelihood peak is sharp.
Drawback: the integration is no longer analytic; approximate it using a Taylor expansion.
Code available in Python:
In [ ]: