In [1]:
import json
import urllib2
import matplotlib
s = json.load ( urllib2.urlopen("https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/raw/master/styles/bmh_matplotlibrc.json"))
matplotlib.rcParams.update(s)
%pylab inline
%config InlineBackend.figure_format = 'svg'
figsize( 7, 7)
from IPython.display import HTML
def css_styling():
    styles = "https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/raw/master/styles/custom.css"
    return HTML(styles)
css_styling()
Out[1]:
eoldas_ng implements a variational data assimilation scheme. This first notebook is meant to give you an indication of the theory behind the chosen approach. It is important to be exposed to this theory, as it will allow you to identify when the problem you are solving breaks some of the assumptions made. This notebook has some Python code to demonstrate the general concepts, but does not use eoldas_ng at all to introduce them. At a later stage, we shall see how the current example can be reworked using eoldas_ng.
The fundamental problem we are trying to solve with EO-LDAS (and indeed most other inversion approaches) is to infer the state of the land surface, given all available observations, as well as any other relevant information. This leads quite naturally to Bayes' Rule, which states
$$ p(\theta|y) \propto p(\theta)\cdot p(y|\theta) $$In other words, the distribution of $\theta$ (the variable of interest) conditioned on $y$ (the observations or evidence) is proportional to the product of whatever information we might have had on $\theta$ to start with (from previous experiments, expert knowledge, etc.) times the probability of the observations given the true parameters (the likelihood function).
The prior distribution is an assessment of what we know before we even start the experiments. Typically, it can include realistic parameter boundaries (for example, a weight cannot be less than 0!), typical values, etc. A common approach is to use a Gaussian distribution, controlled by two parameters: the mean ($\mu_p$) and the standard deviation ($\sigma_p$). $\mu_p$ controls where the distribution is centred, whereas $\sigma_p$ informs us of how certain/uncertain we are about it. A Gaussian pdf is $$ p(x) = \frac{1}{\sqrt{2\pi}\sigma_{p}}\exp\left[-\frac{1}{2}\frac{(x-\mu_p)^2}{\sigma_p^2}\right] $$
In [2]:
def gauss ( x, mu, sigma ):
    """Gaussian pdf with mean mu and standard deviation sigma."""
    c1 = 1./(np.sqrt ( 2.*np.pi)*sigma)
    c2 = -0.5*( x - mu )**2/sigma**2
    return c1*np.exp(c2)
x = np.arange(-10,10,0.02)
plt.plot ( x, gauss(x,0, 1), '-r', label=r'$\sigma_{p}=1$')
plt.plot ( x, gauss(x,0, 2), '-g', label=r'$\sigma_{p}=2$')
plt.plot ( x, gauss(x,0, 10), '-b', label=r'$\sigma_{p}=10$')
plt.plot ( x, gauss(x,0, 0.5), '-k', label=r'$\sigma_{p}=0.5$')
plt.plot ( x, gauss(x,0, 0.025), '-y', label=r'$\sigma_{p}=0.025$')
plt.ylim(0,2)
plt.xlabel (r'$x$')
plt.ylabel (r'$p(x)$')
plt.legend(loc='best', fancybox=True, shadow=True)
Out[2]:
In [3]:
x = np.arange(-10,10,0.02)
plt.plot ( x, gauss(x,-5, 1), '-r', label=r'$\sigma_{p}=1\;\mu_{p}=-5$')
plt.plot ( x, gauss(x,0, 2), '-g', label=r'$\sigma_{p}=2\;\mu_{p}=0$')
plt.plot ( x, gauss(x,3, 10), '-b', label=r'$\sigma_{p}=10\;\mu_{p}=3$')
plt.ylim(0,0.5)
plt.xlabel (r'$x$')
plt.ylabel (r'$p(x)$')
plt.legend(loc='best', fancybox=True, shadow=True)
Out[3]:
The previous snippets show the effect of changing the mean (which shifts the location of the maximum of $p(x)$) and the standard deviation (which broadens or narrows the distribution). For the example above, the red pdf is to be interpreted as saying that most of the values will be located around -5, with 95% of them within $\left[-5 - 1.96\cdot 1, -5 + 1.96\cdot 1\right]$. The green pdf indicates that most of the values will be located around 0, with a 95% interval extending approximately between -4 and 4. The blue pdf indicates a very broad distribution: although the maximum is at 3, the distribution is so broad that it effectively shows no particular preference for any given value.
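If you want to check these intervals numerically, scipy.stats can compute them directly; a minimal sketch, assuming scipy is available:
import scipy.stats
# Central 95% interval of the red prior, N(-5, 1): roughly (-6.96, -3.04)
print scipy.stats.norm.interval(0.95, loc=-5, scale=1)
# Central 95% interval of the green prior, N(0, 2): roughly (-3.92, 3.92)
print scipy.stats.norm.interval(0.95, loc=0, scale=2)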
In addition to the prior, Bayes' Rule requires a likelihood function. The likelihood function links the parameter to the actual observations: it states the probability of a measurement $y$ given a value of the parameter $x$. If the mapping between $x$ and $y$ is the identity (i.e., there are no biases), the only difference between them is additive noise, which we take to be Gaussian: $$ y = x + \epsilon $$
$$ \epsilon \sim \mathcal{N}(0, \sigma_{obs}) $$Again we will use Gaussian distributions for this. Although in reality many processes are not Gaussian, the normal distribution is a good starting point if you only know a mean and a variance of the observation noise, for example. There is also an important practical benefit: we can write analytical expressions for the posterior. To wrap this up, let's write the expression for the likelihood, $p(y|x)$: $$ p(y|x)=\frac{1}{\sqrt{2\pi}\sigma_{obs}}\exp\left[-\frac{1}{2}\frac{(x-\mu_{obs})^2}{\sigma_{obs}^2}\right] $$
We have now defined the two ingredients of Bayes' Rule, and we can invoke it to say that the posterior distribution of $x$ is the product of prior and likelihood:
$$ p(x|y) = \frac{1}{\sqrt{2\pi}\sigma_P}\exp\left[-\frac{1}{2}\frac{(x-\mu_p)^2}{\sigma_p^2}\right]\cdot \frac{1}{\sqrt{2\pi}\sigma_{obs}}\exp\left[-\frac{1}{2}\frac{(x-\mu_{obs})^2}{\sigma_{obs}^2}\right] $$Ignoring the constant terms... $$ p(x|y) \propto \exp\left[-\frac{1}{2}\frac{(x-\mu_p)^2}{\sigma_p^2}\right]\cdot \exp\left[-\frac{1}{2}\frac{(x-\mu_{obs})^2}{\sigma_{obs}^2}\right] $$
It is useful to work in terms of logarithms so as to get rid of the exponentials... $$ \log(p(x|y)) \propto \left[-\frac{1}{2}\frac{(x-\mu_p)^2}{\sigma_p^2}\right] + \left[-\frac{1}{2}\frac{(x-\mu_{obs})^2}{\sigma_{obs}^2}\right] $$
From the above expressions, we can see that the posterior will also be Gaussian, with analytic expressions for both mean and variance. You should try to work them out yourself, but they are $$ \mu_{post} = \frac{\mu_{p}\cdot\sigma^{2}_{obs} +\mu_{obs}\cdot\sigma^{2}_{p}}{\sigma_{obs}^{2} + \sigma_{p}^{2}} $$ $$ \frac{1}{\sigma_{post}^{2}} = \frac{1}{\sigma_{p}^{2}} + \frac{1}{\sigma_{obs}^{2}} $$
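If you want to verify these expressions symbolically rather than by hand, sympy can do the completion of squares for you; a minimal sketch, assuming sympy is available:
import sympy as sp
x, mu_p, mu_o = sp.symbols('x mu_p mu_o', real=True)
s_p, s_o = sp.symbols('sigma_p sigma_o', positive=True)
# Log of prior times likelihood, dropping the normalisation constants
logpost = -(x - mu_p)**2/(2*s_p**2) - (x - mu_o)**2/(2*s_o**2)
# The posterior mean is the stationary point of the log posterior...
print sp.simplify(sp.solve(sp.diff(logpost, x), x)[0])
# ...and minus its second derivative is the posterior precision, 1/sigma_p**2 + 1/sigma_o**2
print sp.simplify(-sp.diff(logpost, x, 2))
Let's see an example of how this works. Assume that the prior distribution is centred at -5 with a standard deviation of 1, and that the likelihood is centred at 0 with a standard deviation of 2. What does this mean? It basically means that before we even do anything, we are quite sure that the value will be quite close to -5. It also means that the evidence (suggesting a value of 0 with a standard deviation of 2) carries little information compared to the prior: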
In [4]:
prior = gauss(x,-5, 1)
likelihood = gauss(x,0, 2)
posterior = prior * likelihood
posterior_mean_theory = (-5*4. + 0.*1)/(5.)
posterior_std_theory = 1./np.sqrt ( 1./1. + 1./4)
print "posterior mean & std (theory): %.2f\t%.2f" % \
( posterior_mean_theory, posterior_std_theory)
print "Posterior 5-95%%CI: %.2f - %.2f" % ( \
posterior_mean_theory - 1.96*posterior_std_theory, \
posterior_mean_theory + 1.96*posterior_std_theory )
plt.subplot(2,1,1)
plt.plot ( x, prior, '-r', label="Prior")
plt.plot ( x, likelihood, '-g', label="Likelihood")
plt.plot ( x, posterior, '-b', label="Posterior")
plt.axvline ( x[posterior.argmax()], c='k', lw=2.5)
plt.legend(loc='best', fancybox=True, shadow=True)
plt.ylabel(r'$p(x)$')
plt.subplot(2,1,2)
plt.plot ( x, np.log(prior), '-r', label="Prior")
plt.plot ( x, np.log(likelihood), '-g', label="Likelihood")
plt.plot ( x,np.log(likelihood) + np.log(prior) , '-b', label="Posterior")
plt.axvline ( x[posterior.argmax()], c='k', lw=2.5)
plt.legend(loc='best', fancybox=True, shadow=True)
cdf = posterior.cumsum()/posterior.sum()
ci_0025 = x[np.argmin(np.abs(cdf - 0.025))]
ci_0975 = x[np.argmin(np.abs(cdf - 0.975))]
print "Maximum a posteriori: %.2f" % ( x[posterior.argmax()] )
print "Posterior 5-95%%CI: %.2f - %.2f" % ( ci_0025, ci_0975)
plt.xlabel(r'$x$')
plt.ylabel(r'$\log p(x)$')
Out[4]:
In [5]:
prior = gauss(x,-5, 5)
likelihood = gauss(x,0, 1)
posterior = prior * likelihood
posterior_mean_theory = (-5*1. + 0.*25)/(26.)
posterior_std_theory = 1./np.sqrt ( 1./25. + 1./1)
print "posterior mean & std (theory): %.2G\t%.2G" % \
( posterior_mean_theory, posterior_std_theory)
print "Posterior 5-95%%CI: %.2f - %.2f" % ( \
posterior_mean_theory - 1.96*posterior_std_theory, \
posterior_mean_theory + 1.96*posterior_std_theory )
plt.subplot(2,1,1)
plt.plot ( x, prior, '-r', label="Prior")
plt.plot ( x, likelihood, '-g', label="Likelihood")
plt.plot ( x, posterior, '-b', label="Posterior")
plt.axvline ( x[posterior.argmax()], c="k", lw=2.5)
plt.legend(loc='best', fancybox=True, shadow=True)
plt.ylabel(r'$p(x)$')
plt.subplot(2,1,2)
plt.plot ( x, np.log(prior), '-r', label="Prior")
plt.plot ( x, np.log(likelihood), '-g', label="Likelihood")
plt.plot ( x,np.log(likelihood) + np.log(prior) , '-b', label="Posterior")
plt.axvline ( x[posterior.argmax()], c="k", lw=2.5)
plt.legend(loc='best', fancybox=True, shadow=True)
plt.xlabel(r'$x$')
plt.ylabel(r'$\log p(x)$')
print "Maximum a posteriori: ", x[posterior.argmax()]
cdf = posterior.cumsum()/posterior.sum()
ci_0025 = x[np.argmin(np.abs(cdf - 0.025))]
ci_0975 = x[np.argmin(np.abs(cdf - 0.975))]
print "Maximum a posteriori: %.2f" % ( x[posterior.argmax()] )
print "Posterior 5-95%%CI: %.2f - %.2f" % ( ci_0025, ci_0975)
The previous two examples show the prior, likelihood and posterior, as well as their logarithms. In the first example we have a fairly informative prior and observations, so the posterior is somewhere between the prior and likelihood means: since we have a strong belief in the parameter value before we even start, we do not move very far from it, unless the evidence is overwhelming (which is not the case here). In the second case, the prior is much weaker (less informative) than the likelihood, which means that the likelihood will be dominant. We see that the MAP point now shifts markedly closer to the likelihood mean as a result.
What these simple examples show is that having prior information is important. Usually, many inverse problems will be carried out with observations that have little sensitivity to the parameter of interest (in other words, the likelihood is not very sensitive). Priors help to constrain the problem significantly.
Another important aspect is that we work with distributions. That means that uncertainty needs to be reported. In the cases above, the posterior standard deviation for the first example is around 0.9, whereas that for the second example is around 1.0. That means that the first estimate lies between (-5.7, -2.2), whereas the second one lies within (-2.1, 1.7), using a 95% interval.
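These numbers follow directly from the analytic expressions derived above; a minimal check (the helper function is just for illustration):
def posterior_stats(mu_p, sigma_p, mu_obs, sigma_obs):
    # Analytic Gaussian posterior mean and standard deviation
    mu = (mu_p*sigma_obs**2 + mu_obs*sigma_p**2)/(sigma_obs**2 + sigma_p**2)
    sigma = 1./np.sqrt(1./sigma_p**2 + 1./sigma_obs**2)
    return mu, sigma

for (mu_p, sigma_p, mu_obs, sigma_obs) in [(-5., 1., 0., 2.), (-5., 5., 0., 1.)]:
    mu, sigma = posterior_stats(mu_p, sigma_p, mu_obs, sigma_obs)
    print "mean %.2f, std %.2f, 95%% interval (%.2f, %.2f)" % \
        (mu, sigma, mu - 1.96*sigma, mu + 1.96*sigma)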
The previous examples show that for a trivial case, we can solve for the posterior easily with an analytic expression. Unfortunately, in most real life problems, the observations only have limited information content, and it is not possible to retrieve all the parameters that we want to find out about purely from analysing the observations. We tend to call these problems ill-posed (although this is a more general term). A way around this situation is to add extra constraints from prior information, see e.g. [Combal et al, 2003]. Apart from setting expected distributions of parameters, as we've seen above, we usually also know a lot about, for example, the temporal evolution of things: leaf area usually evolves smoothly in time, so a solution that is very jagged is less likely than one that is smooth. This is why one of the first processing steps applied to satellite LAI data is smoothing. A better way is to enforce smoothness as we solve the problem: since smoothness is something that we expect, it is best to impose it at the retrieval stage, not post facto. In this way, uncertainty in the parameter (LAI in this example) will come both from the observations and from how much we trust the trajectory to be smooth.
A smooth discrete time series is characterised by small discrete differences (in other words, there is a small change from one time step to the next). So one constraint could be that the discrete differences of the parameter as it evolves in time are as small as possible, while still fitting the observations and any other prior constraints. Equivalently, we are effectively stating that the change in the value of the parameter from one time step to the next should be as small as possible or, in other words, that the value tomorrow should be the same as the value today, but with some added uncertainty. We can see this as a particularly simple model of the evolution of the parameter: $$ x_{t+1} = x_{t} + \epsilon_{t} $$ If $\epsilon_t=0$, then $\mathbf{x}$ is a constant, so we know that $\left|\epsilon_{t}\right| > 0$. We can assume that $\epsilon_{t} \sim \mathcal{N}(0, \sigma_{m})$, so we are saying that the change in $x$ is expected to be normally distributed with a mean of zero and a standard deviation of $\sigma_{m}$. At this stage, it is useful to introduce a matrix differential operator, $\Delta$: $$ \begin{split}\mathbf{\Delta} = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0\\ 0 & 1 & -1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & 0\\ 0 & 0 & 0 & \cdots & -1\\ \end{pmatrix}\end{split} $$
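One way to build such an operator with numpy and check what it does; a small sketch (the cells below use a slightly different, wrap-around construction based on np.roll, but the idea is the same):
n = 10
x_test = np.random.randn(n)
# Row i of Delta has a 1 in column i and a -1 in column i+1
Delta = np.eye(n-1, n) - np.eye(n-1, n, k=1)
# Delta.dot(x) gives x_i - x_{i+1}, i.e. (minus) the first order differences
print np.allclose(Delta.dot(x_test), -np.diff(x_test))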
The product $\Delta\cdot \mathbf{x}$ is just a vector with the first order differences. As before, if we take the Gaussian probability for the model fit, and take logs, we have that
$$ \log(p(\mathbf{x}))\propto -\frac{\gamma}{2} \left(\Delta\cdot\mathbf{x}\right)^{T}\mathbf{I}\left(\Delta\cdot\mathbf{x}\right) $$Let's first demonstrate, however, that a smooth function has smaller differences than a noisy one:
In [6]:
x = np.arange(1,366 )
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(365)*0.15
plt.plot(x,y_noisy, '-g', lw=0.5)
plt.plot(x,y_smooth, '-r',lw=1.4)
deriv_smooth = np.diff ( y_smooth )
deriv_noisy = np.diff ( y_noisy )
print "RMS of derivative of smooth function:", np.sqrt(np.mean(deriv_smooth**2))
print "RMS of derivative of noisy function:", np.sqrt(np.mean(deriv_noisy**2))
As mentioned above, the previous example actually encodes a simple model of parameter evolution. Clearly, more complicated models might exist for certain parameters. In this case, this part of the prior can be seen as describing that the difference between the inferred state and the model-predicted state is normally distributed. So we could use this formalism to add a model of the evolution of LAI as a prior (for example, a double logistic curve). This, however, raises the question of how to choose the uncertainty of the fit to the model: if it is set to 0, imperfections in the model with respect to reality might bias the inference, whereas a very lax uncertainty might make the model irrelevant to the inference. Clearly, we need to address the problem of selecting these quantities.
We have seen above that the MAP (maximum a posteriori) estimate of the parameters is obtained by the maximum value of the posterior or, equivalently, the minimum value of minus the log posterior. We have also seen what the likelihood and prior terms look like. Let's put them all together for a univariate time series, $\mathbf{x}$:
$$ J_{obs} = -\frac{1}{2}\left(\mathbf{x}-\mathbf{x}_o\right)^{T} \mathbf{C_o}^{-1} \left( \mathbf{x} - \mathbf{x}_o\right) $$$$ J_{prior} = -\frac{1}{2}\left(\mathbf{x} - \mathbf{x}_{p}\right)^{T}\mathbf{C_p}^{-1}\left(\mathbf{x} - \mathbf{x}_p\right) $$$$ J_{model} = -\frac{1}{2}\left(\Delta\mathbf{x}\right)^{T}\mathbf{C_m}^{-1}\left(\Delta\mathbf{x}\right) $$The log posterior is made up of the sum of the previous three terms: $$ J = J_{obs} + J_{prior} + J_{model}, $$ and the MAP estimate is then just the minimum value of $-J$. We can solve for this efficiently using gradient descent methods, which use the derivative to locate the minimum of the function and are implemented in a number of computer languages and software packages. For that we can either calculate the derivatives analytically, or approximate them numerically. The former is a better approach, and can either be done "by hand" or by using automatic differentiation tools. We shall use numerical approximations in this example.
In [7]:
import scipy.optimize
def cost_function ( x, y, x_prior, sigma_obs, sigma_prior, gamma, obs_mask ):
    """
    Variational cost function with a weak constraint

    Parameters
    -----------
    x: The current estimate of the state
    y: The observations vector
    x_prior: The prior estimate of the state
    sigma_obs: Observational uncertainty (standard deviation)
    sigma_prior: Prior uncertainty (standard deviation)
    gamma: Inverse of the model term variance
    obs_mask: Boolean mask selecting the state elements that have observations
    """
    I = np.identity(x.shape[0])
    D1 = np.matrix(I - np.roll(I,1))   # First order difference operator
    j_obs = 0.5*np.sum((x[obs_mask]-y)**2/sigma_obs**2)
    j_prior = 0.5*np.sum((x-x_prior)**2/sigma_prior**2)
    j_model = 0.5*gamma*np.sum(np.array(D1.dot(x))**2)
    return j_obs + j_prior + j_model
x = np.arange(1,366, 5 )
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(x.shape[0])*0.15
obs_mask = np.random.rand(x.shape[0])
obs_mask = np.where ( obs_mask>0.7, True, False )
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
sigma_obs = 0.15
Now, we want to solve the problem. We use scipy.optimize.fmin_l_bfgs_b to minimise cost_function, starting from the prior estimate. The loop sweeps over the value of gamma, to demonstrate its effect:
In [8]:
plt.plot(x[obs_mask],y_noisy[obs_mask], '-sg', lw=0.5)
for gamma in np.logspace(0,4,5):
    retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, \
        args=(y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask), \
        disp=1, approx_grad=True)
    plt.plot(x,retval[0], '-',lw=1.4, label="%4g"%gamma)
plt.plot ( x, x_prior, '-k', label="Prior")
plt.legend(loc='best')
Out[8]:
In the previous plot, look at the effect of $\gamma$: for low $\gamma$, we have a very good fit to the observations (green squares), but the estimates then drift to the original prior, as (i) the observations don't constrain them, and (ii) the smoothness term is very small. As $\gamma$ becomes more important, we get a gradually smoother solution that starts departing from the observed points, as the optimiser balances the fit to the observations against the fit to the "process model".
We can also see how noise in the observations (i.e., their information content) affects the inference problem:
In [9]:
x = np.arange(1,366, 5 )
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy1 = y_smooth + np.random.randn(x.shape[0])*0.15
y_noisy2 = y_smooth + np.random.randn(x.shape[0])*0.5
obs_mask1 = np.random.rand(x.shape[0])
obs_mask1 = np.where ( obs_mask1>0.7, True, False )
obs_mask2 = np.random.rand(x.shape[0])
obs_mask2 = np.where ( obs_mask2>0.7, True, False )
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
plt.subplot(2,1,1)
sigma_obs = 0.15
plt.plot(x[obs_mask1],y_noisy1[obs_mask1], '-sg', lw=0.5)
for gamma in np.logspace(0,4,5):
    retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, \
        args=(y_noisy1[obs_mask1], x_prior, sigma_obs, sigma_prior, gamma, obs_mask1), \
        disp=1, approx_grad = True)
    plt.plot(x,retval[0], '-',lw=1.4, label="%4g"%gamma)
plt.plot ( x, x_prior, '-k', label="Prior")
plt.subplot(2,1,2)
sigma_obs = 0.5
plt.plot(x[obs_mask2],y_noisy2[obs_mask2], '-sg', lw=0.5)
for gamma in np.logspace(0,4,5):
    retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, \
        args=(y_noisy2[obs_mask2], x_prior, sigma_obs, sigma_prior, gamma, obs_mask2), \
        disp=1, approx_grad = True )
    plt.plot(x,retval[0], '-',lw=1.4, label="%4g"%gamma)
plt.plot ( x, x_prior, '-k', label="Prior")
plt.legend(loc='best', fancybox=True, shadow=True)
Out[9]:
We can see how the high uncertainty in the observations makes them irrelevant to the inference problem at hand: the system ignores them, and rather goes towards the prior (which in itself is a smooth solution).
We can also see the effect of having an informative prior, perhaps derived from some climatology. We will assume that the prior mean is a noisy version of y_smooth, with a prior standard deviation of 0.4, and we will use the same noisy observations as in the bottom panel of the previous plot:
In [10]:
sigma_prior = 0.4
x_prior = y_smooth + np.random.randn ( y_smooth.shape[0] ) *0.1
sigma_obs = 0.5
plt.plot(x[obs_mask2],y_noisy2[obs_mask2], '-sg', lw=0.5)
for gamma in np.logspace(0,4,5):
    retval = scipy.optimize.fmin_bfgs ( cost_function, x_prior*0 + 0.5, \
        args=(y_noisy2[obs_mask2], x_prior, sigma_obs, sigma_prior, gamma, obs_mask2), \
        disp=1, retall=1, fprime=None)
    plt.plot(x,retval[0], '-',lw=1.4, label="%4g"%gamma)
plt.plot ( x, x_prior, '-k', label="Prior")
plt.legend(loc='best')
Out[10]:
The optimisation routine works by minimising cost_function, but it approximates the partial derivatives of cost_function with respect to x numerically. These derivatives are easy to calculate analytically, so the exercise is to write some code that calculates them.
Hint: you can find the expressions in the EO-LDAS paper!
In [11]:
def der_cost_function ( x, y, x_prior, sigma_obs, sigma_prior, gamma, obs_mask ):
    """This function calculates the partial derivatives of the cost function"""
    I = np.identity(x.shape[0])
    D1 = np.matrix(I - np.roll(I,1))
    der_j_obs = x*0
    der_j_obs[obs_mask] = (x[obs_mask] - y)/sigma_obs**2
    der_j_prior = (x - x_prior ) /sigma_prior**2
    der_j_model = np.array(gamma*np.dot((D1).T, D1*np.matrix(x).T)).squeeze()
    return der_j_obs + der_j_prior + der_j_model

def cost_function ( x, y, x_prior, sigma_obs, sigma_prior, gamma, obs_mask ):
    """
    Variational cost function with a weak constraint

    Parameters
    -----------
    x: The current estimate of the state
    y: The observations vector
    x_prior: The prior estimate of the state
    sigma_obs: Observational uncertainty (standard deviation)
    sigma_prior: Prior uncertainty (standard deviation)
    gamma: Inverse of the model term variance
    obs_mask: Boolean mask selecting the state elements that have observations
    """
    I = np.identity(x.shape[0])
    D1 = np.matrix(I - np.roll(I,1))
    xa = np.matrix ( x )
    j_obs = 0.5*np.sum((x[obs_mask]-y)**2/sigma_obs**2)
    j_prior = 0.5*np.sum((x-x_prior)**2/sigma_prior**2)
    j_model = 0.5*gamma*float(np.dot((D1*(xa.T)).T, D1*xa.T))
    return j_obs + j_prior + j_model
x = np.arange(1,366)
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(x.shape[0])*0.15
obs_mask = np.random.rand(x.shape[0])
obs_mask = np.where ( obs_mask>0.7, True, False )
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
sigma_obs = 0.15
plt.plot(x[obs_mask],y_noisy[obs_mask], '-sg', lw=0.5)
for gamma in np.logspace(0,4,5):
    retval = scipy.optimize.fmin_l_bfgs_b(cost_function, x_prior, fprime=der_cost_function,\
        args=(y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask), \
        disp=1, factr=1e5)
    plt.plot(x,retval[0], '-',lw=1.4, label="%4g"%gamma)
plt.plot ( x, x_prior, '-k', label="Prior")
plt.legend(loc='best')
Out[11]:
We now see that the number of cost function evaluations and the number of gradient evaluations are similar. While evaluating the function and its gradient is fairly fast for this problem, and even the numerical approximations we used before are reasonable, in more complex and higher dimensional problems each numerically approximated gradient needs many evaluations of the original cost function! Having access to an analytical function that calculates the partial derivatives of the cost function is therefore of great practical importance. If complex models are used, providing a simple expression for this derivative is quite challenging, and the use of specialised automatic differentiation software is required to produce what is called an "adjoint" of the model. This process is challenging, as the model might need to be modified to cope with discontinuities.
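Before moving on, one quick way to gain confidence in an analytical gradient is to compare it against a finite-difference approximation at some point; scipy.optimize.check_grad does exactly this. A minimal sketch, reusing the functions and data defined above:
# check_grad returns the norm of the difference between the analytical gradient
# and a finite difference approximation at x_prior; it should be small compared
# with the norm of the gradient itself
fd_args = (y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask)
print "Gradient check:", scipy.optimize.check_grad(cost_function, der_cost_function, \
    x_prior, *fd_args)
print "Gradient norm :", np.linalg.norm(der_cost_function(x_prior, *fd_args))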
In the previous discussion, the estimation of the full uncertainty associated with our inference on the state has been missing: we have just been concerned with the calculation of the posterior mode, and have paid no attention to the full posterior distribution. It turns out that if everything is Gaussian and linear (like the problems above), then the posterior distribution is also Gaussian, with a mean given by the posterior mode we have calculated above, and a covariance matrix given by the inverse of the Hessian. Given that we have already calculated the derivatives of the cost function, it's easy to calculate the Hessian (do it!).
In [12]:
I = np.identity(x.shape[0])
D1 = np.matrix(I - np.roll(I,1))
hess_obs = np.zeros( x.shape[0] )
hess_obs[obs_mask] = (1./sigma_obs**2)
hess_obs = np.diag( hess_obs )
hess_prior = (1./sigma_prior**2)*np.diag(np.ones_like(x))
hess_model = gamma*np.dot ( D1,np.eye(x.shape[0])).dot( D1.T)
hess = hess_obs + hess_prior + hess_model
post_covar = np.linalg.inv ( hess )
plt.figure(figsize=(3,3))
plt.imshow ( hess_prior, interpolation='nearest', cmap=plt.cm.gray)
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.title("Prior contribution to the Hessian")
plt.figure(figsize=(3,3))
plt.imshow ( hess_obs, interpolation='nearest', cmap=plt.cm.gray)
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.title("Obs contribution to the Hessian")
plt.figure(figsize=(3,3))
plt.imshow ( hess_model, interpolation='nearest', cmap=plt.cm.gray)
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.title("Model contribution to the Hessian")
plt.figure(figsize=(5,5))
plt.imshow ( hess_prior + hess_obs + hess_model, interpolation='nearest', cmap=plt.cm.gray)
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.figure(figsize=(5,5))
plt.imshow ( post_covar, interpolation='nearest', cmap=plt.cm.spectral)
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.title("Posterior covariance Matrix")
plt.colorbar()
D = np.sqrt(np.diagonal(post_covar))
Dinv = np.linalg.inv( np.diag(D) ) # Unnecessary, it's diagonal...
post_covar = np.matrix(post_covar)
Dinv = np.matrix(Dinv)
corr = Dinv * post_covar * Dinv
plt.figure(figsize=(5,5))
plt.imshow ( corr, interpolation='nearest', cmap=plt.cm.RdBu_r, vmin=-1, vmax=1)
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.colorbar()
plt.title("Posterior Correlation matrix")
Out[12]:
The previous plots show that the inferred state has a certain temporal correlation (that's why the last plot of the correlation matrix isn't purely diagonal, but has some sort of thickening around the diagonal). You will note that this thickening is consistent with areas that had no observations: what is happening is that the system is bringing information from areas that are well defined (with observations) into areas that are evidence-poor. This form of information transfer is controlled by the choice of $\gamma$ (as well as by the choice of regularisation or model, obviously).
The actual parameter uncertainties are given by the main diagonal elements of the posterior covariance matrix, and since we have expressions for this, we can now put error bars around our MAP estimate of the state. Let's see how this works by looping over $\gamma$ as before:
In [13]:
def posterior_covariance (x, sigma_prior, sigma_obs, gamma):
    """Posterior standard deviation from the inverse Hessian.
    Note that obs_mask is taken from the enclosing scope."""
    I = np.identity(x.shape[0])
    D1 = np.matrix(I - np.roll(I,1))
    hess_obs = np.zeros( x.shape[0] )
    hess_obs[obs_mask] = (1./sigma_obs**2)
    hess_obs = np.diag( hess_obs )
    hess_prior = (1./sigma_prior**2)*np.diag(np.ones_like(x))
    hess_model = gamma*np.dot ( D1,np.eye(x.shape[0])).dot( D1.T)
    hess = hess_obs + hess_prior + hess_model
    post_covar = np.linalg.inv ( hess )
    return np.sqrt(np.diagonal(post_covar))
x = np.arange(1,366)
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(x.shape[0])*0.15
obs_mask = np.random.rand(x.shape[0])
obs_mask = np.where ( obs_mask>0.7, True, False )
obs1 = np.zeros_like(x, dtype=np.bool)
obs1[::5] = obs_mask[::5]
obs_mask = obs1
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
sigma_obs = 0.15
iplot=1
for gamma in np.logspace(0,6,10):
    plt.subplot(5,2,iplot)
    iplot = iplot+1
    plt.plot(x[obs_mask],y_noisy[obs_mask], '-+', ms=5)
    retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, fprime=der_cost_function,\
        args=(y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask), \
        iprint=1 )
    post_sd = posterior_covariance ( retval[0], sigma_prior, sigma_obs, gamma )
    plt.fill_between(x, retval[0] - 1.96*post_sd, retval[0] + 1.96*post_sd, color='0.8' )
    plt.plot(x,retval[0], '-r',lw=1.4 )
    plt.yticks(visible=False)
    plt.xticks(visible=False)
    plt.ylim(-0.3, 1.2)
    plt.xlim(0, 400)
    plt.text(275,0.92,r'$\gamma=%4G$'%gamma, size="smaller")
fig = plt.gcf()
fig.suptitle('\#Obs/year: %d\nNoise: %4.2G'% ( obs_mask.sum(), sigma_obs ))
Out[13]:
We see that, as $\gamma$ increases, the uncertainty diminishes. This is unsurprising: it means that we place more and more certainty on the smoothness prior, so parameter trajectories that are less smooth become less likely.
In [14]:
gamma = 1500
plt.plot(x[obs_mask],y_noisy[obs_mask], '-+', ms=5)
retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, fprime=der_cost_function,\
args=(y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask), \
iprint=1 )
post_sd = posterior_covariance ( retval[0], sigma_prior, sigma_obs, gamma )
plt.fill_between(x, retval[0] - 1.96*post_sd, retval[0] + 1.96*post_sd, color='0.8' )
plt.plot(x,retval[0], '-r',lw=1.4 )
plt.yticks(visible=False)
plt.xticks(visible=False)
plt.ylim(-0.3, 1.2)
plt.xlim(0, 400)
plt.text(275,0.92,r'$\gamma=%4G$'%gamma, size="x-large")
Out[14]:
We have seen that the result of the inference is critically dependent on the value of $\gamma$. The way we went about choosing its value has been by trial and error. For example, in the last plot, $\gamma$ was chosen so as to make the grey area cover most of the observations, but this was done "by eye". Given how critical the choice of the regularisation parameter/model uncertainty is, we need a better way.
A formal way to deal with $\gamma$ is to make this parameter part of the inference. This requires (i) modifying the likelihood function to deal with $\gamma$, and (ii) providing a prior for it. The advantage of this approach is that $\gamma$ is estimated from the data, and its variability is then transferred to the posterior uncertainty. The downside is that this cannot easily be done in a variational scheme like the one we have been using, so it is not practical here.
A better and more practical way is to implement some sort of cross-validation: some observations are left out of the problem, and the problem is solved. By assessing how well the left out observations are predicted by the inference, we get some indication of the usefulness of $\gamma$. To this end, we need a simple function that splits a dataset into training and validation datasets:
In [15]:
def k_fold_cross_validation(X, K, randomise = False):
    """
    Generates K (training, validation) pairs from the items in X.
    Each pair is a partition of X, where validation is an iterable
    of length len(X)/K. So each training iterable is of length (K-1)*len(X)/K.
    If randomise is true, a copy of X is shuffled before partitioning,
    otherwise its order is preserved in training and validation.
    ## {{{ http://code.activestate.com/recipes/521906/ (r3)
    """
    import random
    if randomise:
        X = list(X)
        random.shuffle(X)
    for k in xrange(K):
        training = [x for i, x in enumerate(X) if i % K != k]
        validation = [x for i, x in enumerate(X) if i % K == k]
        yield training, validation
In [16]:
x = np.arange(1,366)
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(x.shape[0])*0.15
obs_mask = np.random.rand(x.shape[0])
obs_mask = np.where ( obs_mask>0.7, True, False )
obs1 = np.zeros_like(x, dtype=np.bool)
obs1[::5] = obs_mask[::5]
obs_mask = obs1
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
sigma_obs = 0.15
available_obs = obs_mask.nonzero()[0]
rmse_xval = {}
for gamma in np.logspace(1,6,30):
    rmse_xval[gamma] = []
    print "Xval for gamma=%g" % gamma
    for train, validation in k_fold_cross_validation( \
            available_obs, 5, randomise=True):
        obs_mask1 = (0*obs_mask).astype(np.bool)
        obs_mask1[train] = True
        # Select the training observations with the boolean mask, so that they are
        # in the same (ascending time) order as x[obs_mask1] inside cost_function
        retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, fprime=der_cost_function,\
            args=(y_noisy[obs_mask1], x_prior, sigma_obs, sigma_prior, gamma, obs_mask1), \
            iprint=50)
        # Root mean squared prediction error on the held-out observations
        rmse_xval[gamma].append ( np.sqrt ( np.mean ( \
            (retval[0][validation] - y_noisy[validation])**2 ) ) )
In [20]:
gamma_vals = sort(rmse_xval.keys())
mean_rmse = [ np.mean(rmse_xval[x]) for x in gamma_vals ]
max_rmse = [ np.max(rmse_xval[x]) for x in gamma_vals ]
plt.semilogx ( gamma_vals, mean_rmse, 'o-')
plt.semilogx ( gamma_vals, max_rmse, '+-')
Out[20]:
The above cross-validation plot shows that for low values of $\gamma$ we have a large predictive error. For very large values of $\gamma$ the error is also large (as the inversion effectively results in a flat line). The sweet spot in the centre suggests that values of $\gamma$ between 1000 and 20000 might be appropriate, with the best result around $\gamma\sim 8000$. Note that this result suggests it might be worth sampling $\gamma$ more densely between 1000 and 20000, and perhaps running a more comprehensive cross-validation.
Finally, note that although the cross-validation loop took quite a long time to run, it is easy to parallelise: each value of $\gamma$ (and indeed each fold) can be processed independently.
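As an illustration, here is a minimal sketch of how the loop over $\gamma$ could be spread over several worker processes with the standard multiprocessing module. It assumes a fork-based start method, so that the functions and arrays defined in the cells above are visible to the workers; the number of processes is an arbitrary choice, and the result dictionary name is just for illustration.
from multiprocessing import Pool

def xval_single_gamma(gamma):
    # Hypothetical worker: run the full K-fold loop for a single value of gamma and
    # return (gamma, per-fold prediction errors). It relies on cost_function,
    # der_cost_function, k_fold_cross_validation and the data arrays defined in the
    # cells above being visible in the worker processes (true when the processes are
    # forked from the notebook session).
    errors = []
    for train, validation in k_fold_cross_validation(available_obs, 5, randomise=True):
        obs_mask1 = (0*obs_mask).astype(np.bool)
        obs_mask1[train] = True
        retval = scipy.optimize.fmin_l_bfgs_b(cost_function, x_prior, fprime=der_cost_function, \
            args=(y_noisy[obs_mask1], x_prior, sigma_obs, sigma_prior, gamma, obs_mask1))
        errors.append(np.sqrt(np.mean((retval[0][validation] - y_noisy[validation])**2)))
    return gamma, errors

pool = Pool(4)   # the number of worker processes is an arbitrary choice
rmse_xval_parallel = dict(pool.map(xval_single_gamma, np.logspace(1, 6, 30)))
pool.close()
pool.join()
Let's now look at what our cross-validation results suggest is the best reconstruction: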
In [21]:
x = np.arange(1,366)
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(x.shape[0])*0.15
obs_mask = np.random.rand(x.shape[0])
obs_mask = np.where ( obs_mask>0.7, True, False )
obs1 = np.zeros_like(x, dtype=np.bool)
obs1[::5] = obs_mask[::5]
obs_mask = obs1
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
sigma_obs = 0.15
gamma = 8531
plt.plot(x[obs_mask],y_noisy[obs_mask], '-+', ms=5)
retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, fprime=der_cost_function,\
args=(y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask), \
iprint=1 )
post_sd = posterior_covariance ( retval[0], sigma_prior, sigma_obs, gamma )
plt.fill_between(x, retval[0] - 1.96*post_sd, retval[0] + 1.96*post_sd, color='0.8' )
plt.plot(x,retval[0], '-r',lw=1.4 )
plt.plot ( x, y_smooth, '-', lw=1.5)
plt.ylim(-0.3, 1.2)
plt.xlim(0, 400)
plt.text(275,0.92,r'$\gamma=%4G$'%gamma, size="x-large")
Out[21]:
It would appear from the previous plot that the data have been oversmoothed, suggesting that cross-validation in this particular case has not been completely successful. A better cross-validation metric might be required for this case.
In [22]:
x = np.arange(1,366)
y_smooth = (1-np.cos(np.pi*x/366.)**2)
y_noisy = y_smooth + np.random.randn(x.shape[0])*0.15
obs_mask = np.random.rand(x.shape[0])
obs_mask = np.where ( obs_mask>0.7, True, False )
obs1 = np.zeros_like(x, dtype=np.bool)
obs1[::5] = obs_mask[::5]
obs_mask = obs1
x_prior = np.ones_like(x)*0.5
sigma_prior = 1.0
gamma = 100
sigma_obs = 0.15
gamma = 3856
plt.plot(x[obs_mask],y_noisy[obs_mask], '-+', ms=5)
retval = scipy.optimize.fmin_l_bfgs_b ( cost_function, x_prior, fprime=der_cost_function,\
args=(y_noisy[obs_mask], x_prior, sigma_obs, sigma_prior, gamma, obs_mask), \
iprint=1 )
post_sd = posterior_covariance ( retval[0], sigma_prior, sigma_obs, gamma )
plt.fill_between(x, retval[0] - 1.96*post_sd, retval[0] + 1.96*post_sd, color='0.8' )
plt.plot(x,retval[0], '-r',lw=1.4 )
plt.plot ( x, y_smooth, '-', lw=1.5)
plt.ylim(-0.3, 1.2)
plt.xlim(0, 400)
plt.text(275,0.92,r'$\gamma=%4G$'%gamma, size="x-large")
Out[22]: