In [1]:
%matplotlib inline
import matplotlib.pyplot as plt, seaborn as sn, numpy as np
sn.set_context('notebook')

Model calibration

In contrast to the fairly abstract consideration of distributions in the previous notebook, this one takes a more practical approach to the problem of model calibration. It also introduces the very important concept of the likelihood function.

1.1. Types of model

There are many different types of model. At one end of the spectrum are process- or physically-based models, which use coupled (often differential) equations underpinned by the laws of physics to represent key processes of interest. In principle, if we know all the equations governing the behaviour of a system (and assuming we can measure the parameter values associated with those equations) we should be able to construct an accurate model to predict the state of the system at some point in the future.

At the other end of the spectrum are empirical models. Just like their process-based counterparts, empirical models use equations to represent relationships between variables, but these equations need not have any physical basis and the parameter values are usually chosen so as to maximise "goodness-of-fit", rather than being independently measured.

Empirical models are often simpler to set up and run much faster than their process-based equivalents, but their predictions are only as good as the data used to train them. They may therefore perform poorly if used to make predictions under conditions that differ significantly from those encountered in the training dataset (for example by trying to predict river flows under some future climate).

In theory, a well-designed physically-based model will make better "out-of-sample" predictions than an empirical model, because the process knowledge incorporated into the model's structure will constrain the model to a physically reasonable response, even under conditions different to those experienced during calibration. In reality, natural systems are often extraordinarily complex and outside of the lab it is rare to have enough process understanding to build genuinely physically-based models. Instead, we are forced to create conceptual models, which still use physical relationships and differential equations, but which also make dramatic simplifications by abstracting the complexity of the real system into some idealised conceptual framework. As an example, hydrological models often use a conceptual framework consisting of systems of connected "bucket reservoirs", where one bucket represents e.g. the soil water store, another the groundwater store and so on. These abstract conceptualisations are useful – especially if we can write down the physically-based equations that would control our conceptual system (e.g. the water flows between our idealised system of buckets). However, it is important not to confuse a physically-based model of a conceptual system with a physically-based model of the real world.

One of the difficulties associated with conceptual models is that, although the equations appear to be physically-based, the parameters in the equations will often have no concrete meaning in the real world, making them impossible to measure directly. For example, in equations commonly used in hydrological modelling, the time constant (or residence time), $\tau$, of a groundwater reservoir is the length of time, on average, that a molecule of water will spend in that reservoir between flowing in and flowing out. In reality, the true groundwater aquifer is much more complicated than the model's representation of a bucket with a couple of holes in it. This means that values of $\tau$ measured in the field (using e.g. isotopic tracer techniques) will not necessarily be compatible with the $\tau$ parameter as represented in the model.

The upshot of all this is that, in practice, virtually all environmental models - whether empirical or supposedly physically based - will have parameter values that are not physically meaningful, are too difficult/expensive to measure, or cannot be measured at a temporal/spatial scale which is compatible with the model conceptualisation. In order to get our models to give meaningful output, we therefore need to calibrate them by adjusting the poorly constrained parameter values until the output looks sensible.

1.2. Over-parameterisation

Attempts to make physically-based models of complex environmental systems have led to increasingly complex conceptual frameworks. Within the field of hydrology and water quality, some of the most popular models applied today (e.g. SWAT and HYPE) include tens or even hundreds of parameters, many of which have no direct physical meaning and are therefore usually poorly constrained. Although technically impressive, it can be very difficult to apply such models effectively even in data-rich environments. This is because a process-based model with a very large number of unconstrained parameters will behave very much like an overly complex empirical model, simply because the freedom afforded by the unknown parameters will completely swamp any limitations on the model's behaviour imposed by the process knowledge.

In empirical modelling it is usual to choose the simplest possible model that still explains the data. However, in the case of many conceptual and physically-based environmental models, it is often neither possible nor meaningful to "turn off" parameters to test whether a simpler model would suffice. Furthermore, in many cases the amount of calibration data available is limited: in hydrological applications, for example, a model will typically be calibrated against a single streamflow dataset measured at the catchment outflow. There is simply not enough information contained in such a dataset to meaningfully constrain dozens or even hundreds of model parameters, some of which might represent e.g. soil properties or transport coefficients in the upper part of the catchment (which have only a limited influence on streamflow).

These issues mean that highly parameterised conceptual and process-based models can produce output which may appear to reproduce observations well, but which has little predictive value. With so much parameter-related flexibility (i.e. many "degrees of freedom"), models can generally do a good job of matching the calibration data, regardless of whether the process representation is correct. In a worst-case scenario such models exhibit the "worst of both worlds", in the sense that they have the long runtimes and complexity of process-based models, but with the same limitations and poor out-of-sample predictive power as empirical models.

This does not necessarily mean that complex conceptual models cannot be used effectively, but it does mean they must be used with caution.

1.3. Manual calibration

If we have an environmental model of a real-world system, it is more than likely it will include some poorly constrained parameters that will need calibrating before the model can be used effectively. Calibration usually requires having observed input datasets (e.g. rainfall and evapotranspiration for a hydrological model) together with observed data from the same time period for the variable you're trying to simulate (e.g. streamflow). The observed input data is used to drive the model, and the parameters are adjusted until the simulated output matches the observed data as closely as possible.

As an example, suppose we have a deterministic "black box" model such as the one shown below. Deterministic means that if we run the model with the same inputs and parameter values we will always get the same output, because the model has no stochastic components. The model could represent anything at all, but we'll stick with a hydrological theme for the moment.

We have no knowledge about how the model works internally - all we can do is set values for the two parameters, $\alpha$ and $\beta$, and then press the Run button. The model produces an output time series, $S_i = \{S_1, .., S_n\}$ for time points $t_i = \{t_1, .., t_n\}$. We also have a measured dataset, $M_i = \{M_1, .., M_n\}$, which we'd like to reproduce.

For manual calibration, we start off by choosing some sensible values for $\alpha$ and $\beta$, then we run the model and compare $S_i$ to $M_i$, then we change $\alpha$ and $\beta$ and repeat until the $S_i$ and $M_i$ are as similar as possible.

Manual calibration is clearly a laborious process, but because humans are remarkably good at picking up patterns it's often surprising how quickly experienced modellers can achieve reasonable results. If you're just starting out with a new model (especially one you didn't create yourself), I'd strongly recommend putting some time aside for manual calibration: you'll learn a lot about which parameters the model is sensitive to as well as which ones control different aspects of the output. It also forces you to think about which parameter values are sensible versus which ones give the best calibration (not necessarily the same!). If nothing else, manual calibration gives you an initial benchmark that you can refer to later, once you start applying more sophisticated "auto-calibration" techniques.
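
To make this concrete, here is a minimal sketch of a single iteration of the manual loop. Both the run_model function and the obs series below are hypothetical stand-ins (we have no idea what is inside the real black box); the point is simply the guess-run-compare workflow.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def run_model(alpha, beta):
    """ Hypothetical stand-in for the black box model (not the real thing). """
    t = np.arange(100)
    return alpha*np.sin(t/10.) + beta

# Hypothetical 'observed' series we are trying to reproduce
obs = run_model(4.5, 8.) + np.random.normal(loc=0, scale=0.5, size=100)

# One manual iteration: guess alpha and beta, run the model and compare by eye
sim = run_model(5., 7.)

plt.plot(obs, label='Observed')
plt.plot(sim, label='Simulated')
plt.legend(loc='best')
plt.show()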

1.4. Which parameters to calibrate?

In an ideal world, any parameter whose value is uncertain would be included in the calibration process. You might even decide to include the uncertainty in your input data (because our measurements are never perfect). However, in practice, if you try to do this with a complex conceptual model you might end up with far too many parameters (hundreds?) to stand any chance of achieving a successful calibration. Instead, it is necessary to choose a subset of parameters that (i) are poorly constrained (i.e. you don't already know what the value should be) and (ii) actually have an effect on the model's behaviour/output. After a bit of experimenting with manual calibration (or, more formally, using sensitivity analysis, which I won't cover here), you should be able to get a reasonable idea of which parameters might be suitable.

You will also need to choose fixed values for any parameters you choose not to calibrate. This is best done using system knowledge (e.g. literature values) where possible, although this is often difficult. Beware of studies presenting complex conceptual models where only a few calibrated parameters have been reported. In such cases it is likely that large numbers of other parameters have been fixed arbitrarily in order to avoid over-parameterisation. This may be acceptable, but it should be done transparently and with some discussion of the implications.

1.5. Auto-calibration

Computers are ideally suited to performing laborious, repetitive tasks like the steps involved in model calibration. Based on the "black box" model illustrated above, we need an algorithm that can:

  1. Choose values for $\alpha$ and $\beta$. Based on the image above, it is obvious that whoever created the black box model is pretty certain that $\alpha$ and $\beta$ must lie between 1 and 12. In general, if we can narrow the parameter choices to within a particular range, the calibration process is likely to be more efficient than if the algorithm has to search the entire real number line.

    It is also important to consider how we sample from the possible parameter values: are all the numbers (including fractions) between 1 and 12 equally likely? Are $\alpha$ and $\beta$ integers? Do we have reason to believe that e.g. numbers towards the middle of the range are more likely than those at the extremes? If all values are equally plausible, we might sample randomly from a uniform distribution between 1 and 12, whereas if central values seem more likely we might use something Gaussian-like to assign greater weight to the values around 6.

    This kind of reasoning leads to the concept of a prior distribution for each of our parameters. Defining priors is a fundamental part of Bayesian inference and it's something we'll return to later.

  2. Run the model with the chosen parameter values. This step is usually pretty straightforward - it's just a question of telling your computer how to feed input data to your model and press the "Run" button. If your model is written in Python it's just a matter of linking your calibration code to your model code. Alternatively, if your model is available as a command line executable you should be able to call it from your Python calibration code using e.g. subprocess.call().

  3. Evaluate "goodness-of-fit". The simplest form of manual calibration involves visually comparing the model output to the observed data to determine the performance of each parameter set. In most cases it is also useful to calculate some simple summary statistics, such as simple least squares (SLS) or the Nash-Sutcliffe efficiency (NS) (the latter being especially common in hydrology). A short code sketch covering all three steps is given just after this list.

    The range of different summary statistics (sometimes called skill scores) used by the modelling community is huge. Some have useful properties in specific circumstances (this is a whole topic in itself), but it is important to understand that all skill scores involve making assumptions about your data (e.g. many assume independent, identically distributed Gaussian errors). Often the assumptions are transparent, but in some cases authors seem unaware of the implicit assumptions made by their chosen metric.

    Rather than discussing the pros and cons of a whole range of different skill scores, we will initially take a more formal statistical approach by explicitly setting out our assumptions and formulating an appropriate "goodness-of-fit" metric. This is called a likelihood function.
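
To make these three steps concrete before formalising anything, here is a minimal sketch of a crude random-search calibration. It reuses the hypothetical run_model and obs from the sketch in section 1.3 (stand-ins, not the real black box), samples $\alpha$ and $\beta$ from uniform priors between 1 and 12, and scores each run using the Nash-Sutcliffe efficiency.

In [ ]:
best_ns, best_params = -np.inf, None

for it in range(1000):
    # 1. Choose parameter values by sampling from uniform priors on [1, 12]
    alpha, beta = np.random.uniform(1, 12, size=2)

    # 2. Run the model (the hypothetical stand-in defined in section 1.3)
    sim = run_model(alpha, beta)

    # 3. Evaluate goodness-of-fit using the Nash-Sutcliffe efficiency
    ns = 1 - np.sum((obs - sim)**2)/np.sum((obs - obs.mean())**2)

    if ns > best_ns:
        best_ns, best_params = ns, (alpha, beta)

print('Best NS: %.3f for alpha = %.2f, beta = %.2f' % ((best_ns,) + best_params))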

2.1. The likelihood function

Suppose we run the model illustrated above with a particular set of parameters and generate the data shown in red on the image below. The blue curve shows the observations we're trying to simulate.

We want to define a metric that awards higher scores when the simulated (red) points are closer to the observed (blue) ones. However, we know that our model will never be perfect and we also know our observations have error associated with them too, so we don't expect the two curves to coincide exactly. How close we can reasonably expect them to be depends on the quality of our model and the accuracy of our measurements. If we expect both to be very good, we might decide to heavily penalise even small discrepancies between the model results and the observations; on the other hand, we might decide to be more lenient by penalising only very large errors.

The simplest and most common way to formulate an error structure is to assume our model results should differ from the observed series by errors that are normally distributed with a mean, $\mu_\epsilon$, of 0 and some (unknown) standard deviation, $\sigma_\epsilon$. We can write this error structure as a stochastic component added to our deterministic black box model:

$$y = f(x, \theta) + \mathcal{N}(0, \sigma_\epsilon)$$

where $y$ is the observed data, $f$ is a (possibly very complex) function representing the deterministic part of our model, run using input data, $x$, and parameters, $\theta$, and $\mathcal{N}(0, \sigma_\epsilon)$ is the stochastic error term drawn from a normal distribution.

Note that by setting the mean of the error distribution to zero we are assuming our model is unbiased. This is a sensible choice, because if you suspect your model to be biased you'd be better off working out why and fixing the problem (or using a different model), rather than building the bias into the error term by changing $\mu_\epsilon$.

We can visualise this error structure by plotting a small Gaussian, $\mathcal{N}(f(x, \theta), \sigma_\epsilon)$ at each simulated point, as on the image below.

For each pair of points, $S_i$, $M_i$, we can evaluate the probability density of the measured data, $M_i$, being drawn from a Gaussian centred on the simulated data, $S_i$, with standard deviation $\sigma_\epsilon$.

Looking at the above images, you can hopefully see that if $\sigma_\epsilon$ is small, even small differences between simulated and observed values are heavily penalised. This is because the Gaussian error distribution is narrow and pointed, meaning that the probability density falls away quickly, and so very low likelihood values are assigned when the $S_i$ and $M_i$ are far apart. A larger value of $\sigma_\epsilon$ gives a broader error distribution which penalises errors less severely.

So far we have assumed that our model differs from the observed data by errors that are described by a Gaussian distribution with mean 0 and standard deviation $\sigma_\epsilon$. If we assume that this distribution stays the same for every time point, $t_i$, where $i = \{1, .., n\}$, we can calculate a probability density, $P(M_i)$, for each time step. This is simply the density associated with drawing $M_i$ from a Gaussian with mean $S_i$ and standard deviation $\sigma_\epsilon$, as illustrated on the plot above.
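
As a quick illustration of how $\sigma_\epsilon$ controls the penalty (a sketch using scipy.stats.norm with made-up numbers for a single time point), a small $\sigma_\epsilon$ assigns a far lower density to the same mismatch than a large one:

In [ ]:
from scipy.stats import norm

s_i, m_i = 10., 11.5   # simulated and measured values at a single time point

# Density of drawing m_i from a Gaussian centred on s_i, for two error scales
print(norm(loc=s_i, scale=0.5).pdf(m_i))   # narrow error distribution: ~0.009
print(norm(loc=s_i, scale=3.).pdf(m_i))    # broad error distribution: ~0.12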

If we further assume that each point in the time series is independent of the others, we can calculate the overall likelihood for the full dataset as the product of the densities for each individual point:

$$L(M|\theta) = \prod_{i=1}^{n} P(M_i)$$

where $L(M|\theta)$ is the likelihood of the observations given the model parameters, i.e. a measure of how probable it is that the model, run with that particular set of parameters, would produce the observed dataset.

If the parameters produce output that is similar to the observed data, the $S_i$ will be similar to the $M_i$ and so the probability densities, $P(M_i)$, will be large and the likelihood will be high. On the other hand, if the parameters produce poor output, the $P(M_i)$ will be small and the likelihood will be low. Higher values of the likelihood therefore correspond to "better" (more likely) parameter sets, as long as the assumptions for the error structure are met. As a recap, these assumptions are:

  1. The errors, $\epsilon_i = (S_i - M_i)$, are normally distributed with mean zero and standard deviation $\sigma_\epsilon$.

  2. The errors are independent of one another, i.e. successive values of $\epsilon_i$ are not autocorrelated. (Note that assumption 1 already implies the errors are homoscedastic, i.e. their variance does not change through time.)

In a later notebook we will look at some simple diagnostic plots to test these assumptions, and we'll also consider how to generalise the likelihood function to make it more widely applicable.

As an aside, it's worth noting that the assumptions described above are identical to those for the simple least squares (SLS) skill score, so using SLS to assess goodness-of-fit is functionally identical to using the simple independent and identically distributed (iid) Gaussian likelihood function described above.
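
As a quick numerical check of this equivalence (a sketch with made-up simulated and observed values, not output from the black box model), note that for a fixed $\sigma_\epsilon$ the parameter set with the smaller sum of squared errors always has the larger log likelihood:

In [ ]:
import numpy as np
from scipy.stats import norm

obs = np.array([1.0, 2.0, 3.0, 4.0])     # made-up observations
sim_a = np.array([1.1, 2.2, 2.9, 4.3])   # output from candidate parameter set A
sim_b = np.array([0.5, 2.6, 3.8, 3.2])   # output from candidate parameter set B
sigma = 1.

for name, sim in [('A', sim_a), ('B', sim_b)]:
    sse = np.sum((obs - sim)**2)                 # simple least squares
    ll = np.sum(norm(sim, sigma).logpdf(obs))    # iid Gaussian log likelihood
    print('%s: SSE = %.2f, log likelihood = %.2f' % (name, sse, ll))

# The parameter set with the smaller SSE always has the larger log likelihood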

2.2. Log likelihood

The probability densities in the formula above are typically small numbers (each less than 1 unless $\sigma_\epsilon$ is very small), and the likelihood involves multiplying lots of them together. Likelihoods therefore become very tiny and it is possible for computers to run into numerical problems ("arithmetic underflow") when calculating them. For this reason, it is usually better to work with the log likelihood, which converts the product in the formula above into a sum of logs:

$$LL(M|\theta) = \sum_{i=1}^{n} \ln(P(M_i))$$

where $LL$ is the log likelihood.

Recall from the previous notebook that the equation for a Gaussian is:

$$P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / 2\sigma^2}$$

We can re-write this for our error distribution at a single time point as:

$$P(M_i) = \frac{1}{\sigma_\epsilon \sqrt{2\pi}} e^{-(M_i - S_i)^2 / 2\sigma_\epsilon^2}$$

Taking natural logs and re-arranging, this can be written:

$$\ln(P(M_i)) = -\frac{\ln(2\pi\sigma_\epsilon^2)}{2} - \frac{(M_i - S_i)^2}{2\sigma_\epsilon^2}$$

which we can sum over $n$ time points to give the overall log likelihood (assuming iid Gaussian errors):

$$LL(M|\theta) = -\frac{n\ln(2\pi\sigma_\epsilon^2)}{2} - \sum_{i=1}^{n} \frac{(M_i - S_i)^2}{2\sigma_\epsilon^2}$$
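
To see why working in log space matters numerically, here is a small sketch using arbitrary made-up data (not the black box model): for a moderately long series the raw product of densities underflows to zero, while the sum of log densities remains perfectly usable.

In [ ]:
import numpy as np
from scipy.stats import norm

sigma = 2.
sim = np.random.uniform(0, 50, size=5000)           # arbitrary 'simulated' series
obs = sim + np.random.normal(0, sigma, size=5000)   # 'observations' with iid Gaussian errors

dens = norm(sim, sigma).pdf(obs)   # per-point probability densities

print(np.prod(dens))                          # product of densities: underflows to 0.0
print(np.sum(np.log(dens)))                   # sum of log densities: finite and usable
print(np.sum(norm(sim, sigma).logpdf(obs)))   # same result via logpdf (safer still)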

2.3. Maximum likelihood estimation

Before going any further, I think it's worth stressing that likelihoods are not an exclusively Bayesian concept - they are relevant in both Bayesian and Frequentist statistics. In many cases, Bayesians and Frequentists will use the same likelihood functions and get the same answers. If you're interested in the differences between Bayesian and Frequentist paradigms, I thoroughly recommend reading this blog post (and the follow-ups) by Jake Vanderplas, as well as his excellent article on arXiv.

Now that we have a likelihood function, we can develop an automated calibration procedure to identify the "best" parameter set, just like we were trying to do with the manual calibration procedure described in section 1.3. Note that as well as wanting to calibrate our black-box model parameters, $\alpha$ and $\beta$, constructing our error model has introduced one additional parameter: $\sigma_\epsilon$. Because we don't know the value of this, we'll simply include it as an additional variable in our optimisation routine.

We want to find values for $\alpha$, $\beta$ and $\sigma_\epsilon$ that maximise the likelihood function. As an illustrative example, we'll assume a particular form for the "true" model, generate some synthetic data from it, and then use maximum likelihood estimation to try to identify the true parameters. If the method works here, perhaps it will also work in the real world where we never get to know the "true" parameter values. Let's suppose our black box model from above is actually just a simple linear model

$$y = \alpha x + \beta$$

In [2]:
# Generate some fake data, incorporating Gaussian noise
alpha_true = 3
beta_true = 7
sigma_true = 2

x = np.arange(0, 10, 0.1)
y = alpha_true*x + beta_true + np.random.normal(loc=0, scale=sigma_true, size=len(x)) # The observed data

Next we'll define our log likelihood function. This function takes a vector of estimated values for $\alpha$, $\beta$ and $\sigma_\epsilon$ and calculates the log likelihood of the data given those parameters, assuming that:

$$y = \alpha x + \beta + \mathcal{N}(0, \sigma_\epsilon)$$

We want to maximise this function, but Scipy's optimisation tools are designed for minimising. Therefore we'll also define a function for the negative log likelihood; minimising this is the same as maximising the log likelihood.


In [3]:
def log_likelihood(params, obs):
    """ Returns log likelihood assuming iid Gaussian errors.
        params is a vector of parameter estimates [alpha, beta, sigma]
        obs is the observed dataset we're trying to match
    """
    # Get number of value pairs
    n = len(obs)
    
    # Extract parameter values
    alpha, beta, sigma = params
    
    # Calculate model results with these parameters
    # (note that x here is the input data defined in the cell above)
    sim = alpha*x + beta
    
    # Calculate log likelihood (see equations above)
    ll = -n*np.log(2*np.pi*sigma**2)/2 - np.sum(((obs - sim)**2)/(2*sigma**2))
    
    return ll

def neg_log_likelihood(params, obs):
    """ Maximising the log likelihood is the same as minimising the negative log
        likelihood.
    """
    return -log_likelihood(params, obs)

Finally, we import the optimiser from Scipy and make some starting guesses for $\alpha$, $\beta$ and $\sigma_\epsilon$. The optimiser does a pretty good job of recovering the "true" values for $\alpha$ and $\beta$, which are what we wanted to find.


In [4]:
from scipy import optimize

# Guess some starting values for [alpha, beta, sigma]
param_guess = [6., 6., 1.]

# Run optimiser
param_est = optimize.fmin(neg_log_likelihood, param_guess, args=(y,))

# Print results
print('\n')
print('Estimated alpha: %.2f. True value %.2f' % (param_est[0], alpha_true))
print('Estimated beta: %.2f. True value %.2f' % (param_est[1], beta_true))
print('Estimated sigma: %.2f. True value %.2f' % (param_est[2], sigma_true))


Optimization terminated successfully.
         Current function value: 219.518861
         Iterations: 130
         Function evaluations: 227


Estimated alpha: 2.98. True value 3.00
Estimated beta: 7.36. True value 7.00
Estimated sigma: 2.17. True value 2.00

So far so good, but although we've estimated the "best" parameter set by maximising the likelihood, we have no indication of how much confidence we should have in this result. If the likelihood function consists of a sharp, well-defined peak, the values for $\alpha$, $\beta$ and $\sigma_\epsilon$ may be tightly constrained (i.e. have narrow confidence intervals). On the other hand, the likelihood function may describe a broad, flat plateau with no clear maximum, or a complex hilly landscape with several widely separated maxima. In such cases a single "point estimate" for each parameter value may obscure the fact that a range of different parameter sets could produce essentially the same answer. The "best" parameter set is therefore not much use without some additional information describing the confidence interval (or credible interval to the Bayesians) around each estimated value.
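
To get a rough feel for the shape of the likelihood surface in this example, we can evaluate the log likelihood over a grid of $\alpha$ and $\beta$ values, holding $\sigma_\epsilon$ fixed at its estimated value. This is just an illustrative sketch reusing the log_likelihood function, the observations y and the fitted values param_est from above.

In [ ]:
# Evaluate the log likelihood over a grid of alpha and beta values,
# holding sigma at its maximum likelihood estimate
alphas = np.linspace(2, 4, 100)
betas = np.linspace(5, 9, 100)

ll_grid = np.array([[log_likelihood([a, b, param_est[2]], y) for a in alphas]
                    for b in betas])

plt.contourf(alphas, betas, ll_grid, 50, cmap='viridis')
plt.colorbar(label='Log likelihood')
plt.xlabel('alpha')
plt.ylabel('beta')
plt.show()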

Aside: A simpler way to calculate the log likelihood

The log_likelihood function above explicitly calculates the result using the formula for a Gaussian. However, scipy has some convenience functions to make coding this kind of calculation easier. The following code does exactly the same thing, and is much less prone to typos.


In [5]:
from scipy.stats import norm

def log_likelihood2(params, obs):
    """ An alternative way of coding the log likelihood.
        Returns log likelihood assuming iid Gaussian errors.
        params is a vector of parameter estimates [alpha, beta, sigma]
        obs is the observed dataset we're trying to match.
    """
    # Get number of value pairs
    n = len(obs)
    
    # Extract parameter values
    alpha, beta, sigma = params
    
    # Calculate model results with these parameters
    sim = alpha*x + beta
    
    # Calculate log likelihood
    ll = np.sum(norm(sim, sigma).logpdf(obs))
        
    return ll

# Quick check that results from both functions are the same
# Generate fake obs assuming alpha=6 and beta=3
x = np.arange(0, 10, 0.1)
obs = 6*x+3

# Get log likelihood for alpha=3 and beta=4, if sigma=2
print(log_likelihood([3, 4, 2], obs))
print(log_likelihood2([3, 4, 2], obs))


-3496.39607138
-3496.39607138

3. Summary

  • Most models of real world environmental systems are complex enough to need calibrating, because we rarely have sufficiently detailed information to constrain all the parameters.

  • Calibration can be performed manually, but this is time consuming (although useful!) and may be impossible for models with lots of parameters.

  • Auto-calibration procedures require us to:

    1. Define rules for how to pick parameter values (based on prior knowledge?), and

    2. Devise a method for evaluating model performance.

  • A variety of summary statistics and skill scores are commonly used, but the underlying assumptions for these may not be obvious.

  • Formal likelihoods involve describing the difference between simulated and observed model output in terms of probabilities. To do this, we need to devise an appropriate error structure which is used as the basis for assessing model performance. This forces us to think about the assumptions being made, but we need to remember to actually go back and check them (more on this in a later notebook).

  • Log likelihoods are used to avoid the numerical problems (arithmetic underflow) that arise when multiplying many small probability densities together.

  • Once we have a likelihood function, we can use an optimiser to identify the most likely parameter set (although this can be difficult in high-dimensional parameter spaces). Note also that this method only finds the "best" parameter set - it gives no indication of how much confidence we should have in the values identified. This is a major limitation and one of the main motivations for everything that follows.