Uncertainty in Environmental Modelling

May 7 2013 - UNESCO IHE - Delft

1. Python-preparation

We will do the exercises in Python, using a few packages:


In [34]:
#This is only relevant for the course environment
%pylab inline


Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [32]:
#How to work with distributions in numpy
import numpy as np #http://docs.scipy.org/doc/numpy/reference/
import matplotlib.pyplot as plt
from scipy.stats import norm
import random

In [35]:
#some things to recognize in the code:
x=np.linspace(0.,10,500)    #create an array with 500 values between [0,10]

y=np.random.normal(0.0,10.0,x.size)   #take elements from a normal distribution

#plt.plot(x,y,'ro')  #plot the numbers (line or markers)
plt.hist(y, edgecolor= 'white', facecolor='gray', normed=True)  #make a histogram 
#(frequency of values in certain intervals)
#calculations
w=x**2
z=np.sqrt(x) #square root
ze = np.ones(11)  #make an array filled with ones; cf. np.zeros() for zeros
print ze  #print this output


[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]

In [33]:
ze


Out[33]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [36]:
#getting information by using tab-completion!!
#np.arange(

#writing definitions (functions) in order to run code multiple times with different inputs:
def example_def(input1,input2):
    '''
    This docstring describes what the definition does,
    e.g. multiply the two input values
    '''
    def_output = input1*input2
    return def_output

#Let's use the function:
print example_def(2.,8.)


16.0

2. Error/Uncertainty Propagation

We know the uncertain inputs and their characteristics

If we know the uncertainty of the inputs of the model, we can propagate these errors towards the model output. In this way, the amount of 'variability' contributed by the different sources is propagated towards the model output.

We will focus on three methods:

  • Linear Propagation with 'classic' error-propagation models
  • Differential error analysis (generalisation of the first)
  • Monte Carlo based uncertainty/error propagation

Remark: We also assume that the model structure we're using is 'correct'

2.1 Classic error-propagation linear models

In the case of linear models, where the errors have a normal distribution and the uncertainties have no significant covariance (i.e. are not dependent on each other), classical error-propagation rules can be defined (no correlation between input factors considered):

  • For addition and subtraction: $$z = x \pm y\pm\ldots \rightarrow \sigma_z = \sqrt{\sigma_x^2 +\sigma_y^2 + \ldots}$$
  • Multiplication by an exact number: $$z = c \cdot x \rightarrow \sigma_z = c \cdot \sigma_x$$
  • For multiplication and division: $$z = x\cdot y\; \text{or} \; z=x/y \rightarrow \frac{\sigma_z}{z} = \sqrt{(\frac{\sigma_x}{x})^2 + (\frac{\sigma_y}{y})^2 + \ldots }$$
  • Rules are also available for other functions, like trigonometric functions

EXAMPLE - linear model

We define the linear 'model', describing the separation between inorganic and organic nitrogen in a mixed solution: $$ \quad \quad \quad N_{tot} = N_{NH_4} + N_{org}$$

Let's say $N_{NH_4}$ and $N_{org}$ have a known uncertainty, expressed as variances $\sigma_{NH_4}^2$ and $\sigma_{org}^2$. If we then want to propagate the uncertainty through this linear model, we can calculate the variance of $N_{tot}$ with the following calculation:

$$\sigma_{tot}^2 = \sigma_{NH_4}^2 + \sigma_{org}^2 $$

If the uncertainties of the two measurements were dependent on each other, the relation should be incorporated in the uncertainty: $$\sigma_{tot}^2 = \sigma_{NH_4}^2 + \sigma_{org}^2 + 2\cdot \rho \cdot \sigma_{NH_4}\sigma_{org}$$ with $\rho$ the correlation between both.
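A quick numeric check of the difference between the independent and the correlated case (a sketch; the $\sigma$ values are those of the example below, the $\rho$ value is only illustrative):


In [ ]:
#Sketch: effect of correlation on the propagated uncertainty
import numpy as np

stdNH4, stdorg, rho = 0.6, 0.2, 0.7
std_indep = np.sqrt(stdNH4**2 + stdorg**2)                        #independent errors
std_corr = np.sqrt(stdNH4**2 + stdorg**2 + 2.*rho*stdNH4*stdorg)  #correlated errors
print 'independent:', std_indep   #~0.632
print 'correlated :', std_corr    #~0.754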


In [47]:
import numpy as np
import matplotlib.pyplot as plt
import random
#NH4
meanNH4 = 3.
stdNH4 = 0.6
NH4 = np.random.normal(loc=meanNH4, scale=stdNH4, size=2000)
#Norg
meanorg = 5.
stdorg = 0.2
Norg = np.random.normal(loc=meanorg, scale=stdorg, size=2000)

#plot NH4
plt.hist(NH4, bins = 20, color='k')
plt.vlines(meanNH4,0.,300.,color='white') #the mean of the uncertainty of NH4
plt.vlines(meanNH4+1.96*stdNH4,0.,300.,color='white',linestyle='--') # mu + 1.96*std
plt.vlines(meanNH4-1.96*stdNH4,0.,300.,color='white',linestyle='--') # mu - 1.96*std
#95% of the curve lies in between these two boundaries

#plot the Norg
plt.hist(Norg, bins = 20, color='gray', edgecolor='None')
plt.vlines(meanorg,0.,300.,color='white') #the mean of the uncertainty of Norg
plt.vlines(meanorg+1.96*stdorg,0.,300.,color='white',linestyle='--') # mu + 1.96*std
plt.vlines(meanorg-1.96*stdorg,0.,300.,color='white',linestyle='--') # mu - 1.96*std


Out[47]:
<matplotlib.collections.LineCollection at 0x6bc6f90>

For this simple model, the only information we need is the standard deviation of both model inputs.

So if we get measurement values of $N_{org}$ (5 mg/l) and $N_{NH_4}$ (3 mg/l), we can give an estimate of the model output together with its uncertainty:

  • We calculate the standard deviation of the output for this model based on the formula, $$z = x \pm y\pm\ldots \rightarrow \sigma_z = \sqrt{\sigma_x^2 +\sigma_y^2 + \ldots}$$

In [48]:
#Calculation of the Ntot uncertainty, assuming independent errors
stdNH4 = 0.6 #uncertainty of the NH4 input
stdorg = 0.2 #uncertainty of the Norg input
std_tot = np.sqrt(stdNH4**2 + stdorg**2)
print 'The standard deviation of the linear model output is ', std_tot


The standard deviation of the linear model output is  0.632455532034
  • We get the model output for this model combined with its uncertainty

In [49]:
N_tot = 3. + 5.
print 'Estimated total Nitrogen is', N_tot,'+/-', std_tot ,'mg/l'


Estimated total Nitrogen is 8.0 +/- 0.632455532034 mg/l

Let's visualize this information (we work with normal distributions!)


In [50]:
#plot the linear model output distribution by doing this for many values
Ntot = np.random.normal(loc=meanNH4+meanorg, scale=std_tot, size=2000)
plt.hist(Ntot, bins = 20, color='k')
plt.vlines(meanNH4+meanorg,0.,300., color='white') #the mean of Ntot
plt.vlines(meanNH4+meanorg+1.96*std_tot,0.,300.,color='white',linestyle='--') # mu + 1.96*std
plt.vlines(meanNH4+meanorg-1.96*std_tot,0.,300.,color='white',linestyle='--') # mu - 1.96*std


plt.hist(NH4, bins = 15, facecolor='None', edgecolor = 'k')
plt.hist(Norg, bins = 20, facecolor='None', edgecolor='gray')


Out[50]:
(array([  2,   5,   8,  20,  33,  88, 131, 195, 242, 255, 265, 227, 215,
       125,  97,  46,  25,  12,   7,   2]),
 array([ 4.32900408,  4.39595956,  4.46291504,  4.52987052,  4.596826  ,
        4.66378148,  4.73073696,  4.79769244,  4.86464791,  4.93160339,
        4.99855887,  5.06551435,  5.13246983,  5.19942531,  5.26638079,
        5.33333627,  5.40029175,  5.46724723,  5.53420271,  5.60115819,
        5.66811367]),
 <a list of 20 Patch objects>)

------------------------------------------------------------------------------------------

EXERCISE: ERROR PROPAGATION FOR A LINEAR MODEL

The python file ex1_linprop.py is prepared for this exercise; based on the instructions below, adapt the script in the #FILL IN places:

We define the linear 'model', describing the potential damage $P_d$ on a municipality by flooding as the combined effect of the flooding risk $R_f$ and the inhabitant concentration $H_c$: $$ \quad \quad \quad P_{d} = R_{f} \cdot H_{c}$$

Suppose we know the uncertainty/variability of the flooding risk and inhabitant concentration, both having a normal distribution and being independent of each other:

  • $\sigma_{R_f}$ = 0.3
  • $\sigma_{H_c}$ = 0.05

Calculate the uncertainty of the model output and visualize the uncertainty for $R_f = 1.5$ and $H_c = 0.5$ in a normal distribution around your model output value.

Rule for multiplication: $$z = x\cdot y\; \rightarrow \frac{\sigma_z}{z} = \sqrt{(\frac{\sigma_x}{x})^2 + (\frac{\sigma_y}{y})^2}$$

if the inputs are correlated with correlation factor $\rho_{x,y}$: $$z = x\cdot y\; \rightarrow \frac{\sigma_z}{z} = \sqrt{\left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2 + 2\frac{\sigma_x \sigma_y}{x\cdot y}\rho_{x,y}}$$
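To check your result afterwards, a minimal sketch of the propagation calculation for the independent case (the variable names here are illustrative, not necessarily those used in ex1_linprop.py):


In [ ]:
#Sketch of the exercise calculation (independent inputs)
import numpy as np
import matplotlib.pyplot as plt

Rf, Hc = 1.5, 0.5
stdRf, stdHc = 0.3, 0.05

Pd = Rf*Hc
#multiplication rule: relative uncertainties add in quadrature
stdPd = Pd*np.sqrt((stdRf/Rf)**2 + (stdHc/Hc)**2)
print 'Estimated potential damage is', Pd, '+/-', stdPd

#visualize as a normal distribution around the model output
Pd_mc = np.random.normal(loc=Pd, scale=stdPd, size=2000)
plt.hist(Pd_mc, bins=20, color='k')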

------------------------------------------------------------------------------------------

Problems!

  • The assumption of normally distributed errors is not always valid
  • Most environmental models are non-linear
  • Not feasible for complex models with many calculations

As a result, we need more advanced and flexible methods (but basically, the idea remains the same)!!

Intermezzo: linear/non-linear models

A model is linear if $f(aA + bB) = af(A) + bf(B)$, but this is not always easy to check. It is therefore easier to evaluate the linearity by checking whether the partial derivative of the function (model) $f$ with respect to the variable $A$ still depends on $A$ (if it does not depend on $A$, the model is linear in $A$): $$\frac{\partial f}{\partial A}\neq g(A)$$

For example the model $\mu = \frac{\mu_{max}S}{K_s+S}$ (growth rate of micro-organisms)

  • The linearity of the model towards variable $S$ can be checked by $\frac{\partial \mu}{\partial S}$

(calculation rule for the derivative of $f(x)=g(x)/h(x) \rightarrow \frac{g'(x)h(x)-h'(x)g(x)}{h(x)^2}$)

$$\frac{\partial \mu}{\partial S} = \frac{\mu_{max}(K_s+S)-\mu_{max}S}{(K_s +S)^2} = \frac{\mu_{max}K_s}{(K_s+S)^2} = g(S)$$

  • The linearity of the model towards parameter $\mu_{max}$ can be checked by $\frac{\partial \mu}{\partial \mu_{max}}$
$$\frac{\partial \mu}{\partial \mu_{max}} = \frac{S}{K_s + S} \neq g(\mu_{max})$$
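A quick symbolic check of both derivatives (a sketch, assuming the sympy package is available in the course environment):


In [ ]:
#Sketch: check the linearity of the Monod model symbolically (sympy assumed available)
import sympy as sp

mu_max, Ks, S = sp.symbols('mu_max K_s S', positive=True)
mu = mu_max*S/(Ks + S)

print sp.simplify(sp.diff(mu, S))       #K_s*mu_max/(K_s + S)**2 -> depends on S: non-linear in S
print sp.simplify(sp.diff(mu, mu_max))  #S/(K_s + S) -> free of mu_max: linear in mu_max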

2.2 Differential Error Analysis

There are situations where the computations are more complex, so we need a more general way of deriving expressions for the propagation of experimental errors. In linear error propagation, a linear approximation of the model is used to calculate the model output uncertainty, making use of the first order term of the Taylor series.

The uncertainty on a model output $y$ as a consequence of the uncertainties on the model inputs ($n$ uncertain inputs in total) is given by:

$$y = f(x_1,x_2,\ldots,x_n) \rightarrow \sigma_y^2 = \sum_{i=1}^{n} \sigma_{x_i}^2 \left(\frac{\partial y}{\partial x_i}\right)^2$$

with $\frac{\partial y}{\partial x_i}$ the partial derivative of the output $y$ with respect to the input $x_i$ (given without proof).

As such, this is actually a generalisation of the 'classic' error propagation, but its use is still limited:

  • Only valid if the assumption of linearity of the model is valid
  • No covariance (dependence between the uncertain inputs) taken into account

(Taylor series approach, see also http://en.wikipedia.org/wiki/Taylor_series, of a function $f(x)$: $f(a)+\frac {f'(a)}{1!} (x-a)+ \frac{f''(a)}{2!} (x-a)^2+\frac{f^{(3)}(a)}{3!}(x-a)^3+ \cdots$; only taking the first order means: $f(a)+\frac {f'(a)}{1!} (x-a)$)

EXAMPLE - non-linear model (illustrative)

The following model describes the nitrification rate (in the soil or in water) $r_{NH}$, considering the autotrophic biomass ($X_{B,A}$) constant. The conversion of ammonia in this model depends on the concentration of ammonia ($S_{NH}$) and the oxygen concentration ($S_O$):

$$r_{NH} = \frac{\mu_A}{Y_A}\frac{S_{NH}}{S_{NH}+K_{NH}}\frac{S_O}{S_O + K_{O,A}}X_{B,A}$$

The first order Taylor approximation (not taking into account the higher order terms) of the model would be (with four uncertain parameters $Y_A, K_{NH}, K_{O,A}$ and $\mu_{A}$):

$$r_{nit}(Y_A,\mu_A,K_{NH},K_O)\approx\, r_{nit}(Y_{A,0},\mu_{A,0},K_{NH,0},K_{O,0})\\ +\frac{\partial r_{nit}}{\partial Y_{A,0}}(Y_A-Y_{A,0})+\frac{\partial r_{nit}}{\partial \mu_{A,0}}(\mu_{A}-\mu_{A,0})\\ +\frac{\partial r_{nit}}{\partial K_{NH,0}}(K_{NH}-K_{NH,0})+\frac{\partial r_{nit}}{\partial K_{O,0}}(K_O-K_{O,0})$$

Applying the linear uncertainty propagation (part 1) results in:

$$\sigma^{2}_{nit}\approx\, \left(\frac{\partial r_{nit}}{\partial Y_{A,0}}\right)^2\sigma^{2}_{Y_A}+\left(\frac{\partial r_{nit}}{\partial \mu_{A,0}}\right)^2\sigma^{2}_{\mu_A}\\ +\left(\frac{\partial r_{nit}}{\partial K_{NH,0}}\right)^2\sigma^{2}_{K_{NH}}+\left(\frac{\partial r_{nit}}{\partial K_{O,0}}\right)^2\sigma^{2}_{K_O}$$

With each partial derivative evaluated at the nominal values (resulting in a number) and the $\sigma$ values the known uncertainties of the inputs (purely illustrative).
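As a small numeric illustration of this recipe, applied to the simpler Monod model from the intermezzo (all numbers are illustrative assumptions, not course data):


In [ ]:
#Sketch: differential error analysis for mu = mu_max*S/(Ks + S)
import numpy as np

mumax0, Ks0, S = 4.0, 0.4, 2.0   #nominal values (illustrative)
std_mumax, std_Ks = 0.5, 0.1     #assumed input uncertainties

#partial derivatives evaluated at the nominal point
dmu_dmumax = S/(Ks0 + S)
dmu_dKs = -mumax0*S/(Ks0 + S)**2

#first order propagation: sigma_y**2 = sum of sigma_xi**2 * (dy/dxi)**2
var_mu = (dmu_dmumax**2)*std_mumax**2 + (dmu_dKs**2)*std_Ks**2
print 'mu =', mumax0*S/(Ks0 + S), '+/-', np.sqrt(var_mu)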

2.3 Monte Carlo based propagation

The uncertain inputs (parameters, initial conditions,...) are given a stochastic distribution (expert knowledge or specific characteristics). The 'deterministic' model is calculated for a set of N Monte Carlo runs, each with the inputs sampled from the defined distributions.

All uncertainty sources are lumped together, so the Monte Carlo analysis itself offers no knowledge about the importance of the different uncertainty sources to the model output $\rightarrow$ see later in sensitivity analysis!

Starting from a deterministic model, we have uncertain inputs with a probability distribution; we do multiple shots and get an output distribution:


In [1]:
from IPython.display import Image
Image(filename='monte_carlo2.png', width=500)


Out[1]:

Caption: Random samples are taken from the 'known' input distributions, and for each 'shot', the model gives an output. These output distributions are analysed to derive uncertainty information.

First, let's use the example of the linear model to see the similarity.

EXAMPLE - the linear model

We take the linear 'model', describing the separation between inorganic and organic nitrogen in a mixed solution: $$ \quad \quad \quad N_{tot} = N_{NH_4} + N_{org}$$


In [43]:
#NH4
meanNH4 = 3.
stdNH4 = 0.4
#Norg
meanorg = 5.
stdorg = 0.2

#the model
def my_N_model(NH4,Norg):
    #This could be any very complex model
    return (NH4 + Norg)

#Basic Monte-Carlo setup:
number_of_montecarlo = 2000  #play with this value to see the importance of sufficient samples
Ntot = np.zeros(number_of_montecarlo) #we prepare the model output array to save results in

for i in range(number_of_montecarlo):
    
    #Only a little trick to follow the model running:
    if i%100 == 0:
        print 'Monte Carlo Run ',i,' is running' 
        
    #we sample from the input distributions of the model inputs
    Norg = np.random.normal(loc=meanorg, scale=stdorg, size=1)    
    #Norg = np.random.uniform(2.,6.0, size=1)     #-> see question later
    NH4 = np.random.normal(loc=meanNH4, scale=stdNH4, size=1)
    
    #we calculate the model for these sampled values and save them in our prepared array
    Ntot[i] = my_N_model(NH4,Norg)

#Let's make a histogram of the result
plt.hist(Ntot, bins = 20, color='k', )    

#We don't know the standard deviation of this 'unknown' distribution, 
#but can easily calculate percentile values:
p1 = np.percentile(Ntot, 2.5) #2.5 percentile
p2 = np.percentile(Ntot, 97.5) #97.5 percentile
plt.vlines(p1,0.,100.,color='white',linestyle='--')
plt.vlines(p2,0.,100.,color='white',linestyle='--')


Monte Carlo Run  0  is running
Monte Carlo Run  100  is running
Monte Carlo Run  200  is running
Monte Carlo Run  300  is running
Monte Carlo Run  400  is running
Monte Carlo Run  500  is running
Monte Carlo Run  600  is running
Monte Carlo Run  700  is running
Monte Carlo Run  800  is running
Monte Carlo Run  900  is running
Monte Carlo Run  1000  is running
Monte Carlo Run  1100  is running
Monte Carlo Run  1200  is running
Monte Carlo Run  1300  is running
Monte Carlo Run  1400  is running
Monte Carlo Run  1500  is running
Monte Carlo Run  1600  is running
Monte Carlo Run  1700  is running
Monte Carlo Run  1800  is running
Monte Carlo Run  1900  is running
Out[43]:
<matplotlib.collections.LineCollection at 0x64eedb0>

A basic Monte Carlo simulation consists of the following steps:

  1. Define which model attributes have a stochastic nature and define the stochastic domain for each of them
  2. Randomly generate a set of input values ('shots') from the defined probability density functions (pdf) of the different uncertain inputs.
  3. Run the model for each of the sampled input values.
  4. Characterise the output uncertainty of the model, e.g. derive percentile values,...
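These four steps map onto a small generic skeleton (a sketch; monte_carlo and samplers are illustrative names, not part of the course files):


In [ ]:
#Sketch: generic Monte Carlo propagation
import numpy as np

def monte_carlo(model, samplers, nruns):
    '''samplers is a list of functions, each returning one random
    'shot' of an uncertain model input (step 1 defines them)'''
    outputs = np.zeros(nruns)
    for i in range(nruns):
        shot = [sample() for sample in samplers]  #step 2: sample the inputs
        outputs[i] = model(*shot)                 #step 3: run the model
    return outputs                                #step 4: analyse, e.g. np.percentile

#usage with the linear N model from above:
Ntot = monte_carlo(my_N_model,
                   [lambda: np.random.normal(3., 0.4),   #NH4
                    lambda: np.random.normal(5., 0.2)],  #Norg
                   2000)
print np.percentile(Ntot, 2.5), np.percentile(Ntot, 97.5)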

In essence, this is the same as the error propagation, but we can use the Monte Carlo method for:

  • Other input distributions (uniform, lognormal,...)
  • More complex, non-linear models

QUESTION: What would we change if we assumed a uniform distribution for Norg?
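One possible adaptation (as hinted by the commented line in the code above):


In [ ]:
#replace the normal 'shot' for Norg inside the loop by a uniform one:
Norg = np.random.uniform(2., 6.0, size=1)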

EXAMPLE - Respirometric batch reactor model

The following respirometric model for aerobic degradation of acetate without biomass storage will be used (Gernaey, 2002). It predicts the exogenous oxygen uptake rate of the bacteria, $OUR_{ex}$ ($mg\; L^{-1}\; d^{-1}$), as model output variable.

$$\frac{dS}{dt} = -(1-e^{-\frac{t}{\tau}})\mu_{max}\frac{1}{Y}\frac{S}{K_S + S}X$$

$$\frac{dX}{dt} = (1-e^{-\frac{t}{\tau}})\mu_{max}\frac{S}{K_S + S}X-bX$$

$$OUR_{ex} = (1-e^{-\frac{t}{\tau}})\mu_{max}\frac{1-Y}{Y}\frac{S}{S+K_S}X$$

with

$S$: substrate concentration ($mgCOD\; L^{-1}$ ),
$X$: biomass concentration ($mg COD\; L^{-1}$),
$\tau$: retardation on biomass activity ($d^{-1}$ ),
$Y$: yield of the biomass (-),
$\mu_{max}$: maximum growth rate ($d^{-1}$),
$K_s$: half-saturation Monod constant ($mg COD\; L^{-1}$)
$\quad$(if $\mu$ equals half of $\mu_{max}$, the substrate $S$ equals $K_s$).


In [2]:
from IPython.display import Image
Image(filename='batcheactor.png', width=300)


Out[2]:

Caption: What the reactor with sludge and bacteria looks like. We add a small pulse of substrate (acetate) to the reactor; the bacteria consume the substrate and use oxygen; we measure the Dissolved Oxygen and (indirectly) the OUR

The parameters in the model ($\tau, Y, K_s$ and $\mu_{max}$) are uncertain and we characterise their uncertainty with a uniform distribution. Moreover, since we cannot easily measure the initial biomass when starting the experiment, we also consider the initial biomass $X_0$ as an uncertain input of our model.

Reference:

Gernaey, K., Petersen, B., Nopens, I., Comeau Y. and P.A. Vanrolleghem, Modeling aerobic carbon source degradation processes using titrimetric data and combined respirometric-titrimetric data: experimental data and model structure, Biotechnology and bioengineering, 79(7), 741-753, 2002

Implementation of the model

To implement this model, we bring the differential equations into code. Based on two definitions, we will be able to run the model with different parameter sets.


In [14]:
#The model consists of differential equations, which need integration (solver-based)
from scipy.integrate import odeint

def deriv_works(u,t,Pars,Const): #Derivative used in the general model function
    '''
    Differential equations of the respirometric model in code
    '''
    #Define the parameters
    mumax = np.float64(Pars[0])
    Y = np.float64(Pars[1])
    Ks = np.float64(Pars[2])
    tau = np.float64(Pars[3])
    
    b = np.float64(Const[0])
    kla = np.float64(Const[1])
    SOeq = np.float64(Const[2])
    
    Monod=mumax*(u[1])/(u[1]+Ks)    #Monod kinetics
    Expo=1.0-np.exp(-t/tau)         #helper term for a clearer implementation

    #The model equations
    dXdt = (Expo*Monod-b)*u[0]                    #Biomass
    dSsdt = -(1.0/Y)*Expo*Monod*u[0]              #Substrate
    dOdt = kla*(SOeq-u[2])-((1-Y)/Y)*Expo*Monod*u[0]   #Oxygen

    return np.array([dXdt,dSsdt,dOdt])

def RespiroModel(Pars,Init_unc,time):
    '''
    Run the respirometric model
    '''
    
    #Define the constants
    b = 0.62
    kla = 369.7334962
    SOeq = 8.4
    Constt = np.array([b,kla,SOeq])

    #Define the initial conditions (constant) -> Ss0
    Ss0 = 58.4899
    #Define the initial conditions (Uncertain) -> X0
    X0=Init_unc[0]
    yinit = np.array([X0,Ss0,SOeq])  

    #Define the necessary parameters
    mumax = np.float64(Pars[0])
    Y = np.float64(Pars[1])
    Ks = np.float64(Pars[2])
    tau = np.float64(Pars[3])

    #Solve with LSODA scheme
    y,infodic=odeint(deriv_works,yinit,time,full_output=True, 
                     printmessg=False, args=(Pars,Constt))

    #Get outputs
    X = y[:,0]
    Ss = y[:,1]
    O = y[:,2]

    OUR_ex=((1-np.exp(-time/tau))*mumax*(1-Y)/Y*Ss/(Ss+Ks)*X)/(24*60)  #converted from per day to per minute
    #when using this definition, we get the time, biomass X, substrate S, 
    #oxygen O and oxygen uptake rate OUR_ex back, 
    #together with some information about the integration 
    return [time,X,Ss,O,OUR_ex,infodic]

Once we have the model implementation, we can now run the model based on the RespiroModel-function:


In [52]:
#SET TIME
Mtime_d = np.arange(0.,0.05,0.0005)
#Since everything is in days, 
#this represents more or less 1 hour and 12 minutes of time.

#set X0 as initial condition
X0=6.75632395e+02
#set the parameter values
mumax=4.
Y=0.78
Ks=0.4
tau=2.25e-04
Pars=[mumax,Y,Ks,tau]
Init_unc=np.array([X0])
#Run the model with the RespiroModel function, using the specified parameter values 
#and X0 value; everything is saved in outputmodel, which is a list with the outputs
outputmodel=RespiroModel(Pars,Init_unc,Mtime_d)
#check if the integration was successful:
print outputmodel[-1]['message']


Integration successful.

We can make a plot with the different outputs of the model:


In [51]:
fig = plt.figure(figsize=(10,10))
fig.subplots_adjust(hspace=0.2, wspace = 0.3)

ax1 = fig.add_subplot(221)
ax1.plot(outputmodel[0],outputmodel[1],'k',label = 'X')
ax1.set_xticklabels([])
ax1.legend(loc=4)

ax2 = fig.add_subplot(222)
ax2.plot(outputmodel[0],outputmodel[2],'k',label = 'S')
ax2.set_xticklabels([])
ax2.legend(loc=4)

ax3 = fig.add_subplot(223)
ax3.plot(outputmodel[0],outputmodel[3],'k',label = 'O')
ax3.legend(loc=4)
ax3.set_xlabel('Time')

ax4 = fig.add_subplot(224)
ax4.plot(outputmodel[0],outputmodel[4],'k',label = 'OUR')
ax4.legend(loc=4)
ax4.set_xlabel('Time')


Out[51]:
<matplotlib.text.Text at 0x70a1530>

Interpretation of the model:

  • Biomass $X$ first increases, since the bacteria are consuming the acetate
  • At the same time, the acetate substrate $S$ decreases, since it is consumed
  • The bacteria use oxygen to consume the substrate, so the oxygen present in the reactor decreases during the acetate consumption
  • The calculated (and measured) oxygen uptake rate $OUR$ is high during this period and drops to zero when the acetate is completely removed

Monte Carlo Analysis

In [53]:
#run Monte Carlo for OUR
#########################
MCruns = 1000

#We save the output in an array with the different runs in the rows
#and the timesteps in the columns
MCoutputs = np.zeros((MCruns, Mtime_d.size))

for i in range(MCruns):
    #Sample the uncertain parameters, assume we know these values! -> shot
    X0 = np.random.uniform(650.,700.,1)    
    mumax=np.random.uniform(3.5,4.5,1)
    Y=np.random.uniform(0.77,0.79,1)
    Ks=np.random.uniform(0.3,0.5,1)
    tau=np.random.uniform(0.00001,0.001,1)
    
    #Run the model for the sampled parameters
    Init_unc=X0
    Pars=[mumax[0],Y[0],Ks[0],tau[0]]
    modeloutput = RespiroModel(Pars,Init_unc,Mtime_d)
    #Save only the OUR output of the model to analyze
    MCoutputs[i,:]=modeloutput[4]

We have now stored all the OUR model outputs in one array. This can be analysed comparably to the linear model, by calculating the percentile values for every timestep and plotting these in time. But first, let's just plot all the model runs:


In [54]:
for i in range(MCruns):
    plt.plot(modeloutput[0],MCoutputs[i,:])


We can get more information about the distribution of the model output uncertainty for every timestep and can also plot percentile intervals as a function of time:


In [55]:
#Choose a timestep and plot the histogram for this timestep:
timestep = 0.01
#plot a histogram at the place where the modelled timestep equals the timestep given 
plt.hist(MCoutputs[:,np.where(Mtime_d==timestep)[0]], bins = 20, color='k')

#We plot again the percentile values:
p1 = np.percentile(MCoutputs[:,np.where(Mtime_d==timestep)[0]], 2.5) #2.5 percentile
p2 = np.percentile(MCoutputs[:,np.where(Mtime_d==timestep)[0]], 97.5) #97.5 percentile
plt.vlines(p1,0.,100.,color='white',linestyle='--')
plt.vlines(p2,0.,100.,color='white',linestyle='--')


Out[55]:
<matplotlib.collections.LineCollection at 0x7f68890>

Calculating these percentile values for every timestep gives us the uncertainty of the model as a function of time


In [56]:
p25 = np.percentile(MCoutputs, 2.5, axis=0) 
#axis=0 means the percentile is calculated along the rows (over all MC runs), 
#so every column (timestep) gets a summarized value
p975 = np.percentile(MCoutputs, 97.5, axis=0)
plt.plot(modeloutput[0], p25,'k--')
plt.plot(modeloutput[0], p975,'k--')
plt.fill_between(modeloutput[0], p25, p975, color='gray')


Out[56]:
<matplotlib.collections.PolyCollection at 0x86a02d0>

------------------------------------------------------------------------------------------

EXERCISE: ERROR PROPAGATION WITH MONTE CARLO APPROACH

In the python file ex2_respiromodel.py, the respirometric model is implemented and a setup as presented above is already prepared.

  • Run the script (which runs the model and returns a plot of the model)
  • Change the parameter value of $\mu_{max}$ into $8.0\; d^{-1}$; what changes in the model output?
  • The MC-run part of the script is commented out; remove the comment tags and run the Monte Carlo simulation
  • Implement the histogram output at time 0.01 and 0.04 yourself; what is the difference and why?
  • Assume we can measure the initial biomass $X_0$ with a new lab measurement (we measure $675\; mg/l$). How would you change this and what is the effect on the uncertainty? Tip: X0 = np.array([675.])

------------------------------------------------------------------------------------------

2.4 Sampling

Monte Carlo sampling needs enough 'shots' of the distribution to be able to characterize the input and output distributions. The standard random sampling implementation is not always the most efficient way to sample the entire distribution. Other sampling procedures exist that cover the distribution reasonably accurately with fewer samples:

  • Latin Hypercube or stratified sampling procedures
  • Pseudorandom sampling
  • ...

We will focus today also on Latin Hypercube/stratified sampling:

Latin Hypercube or stratified sampling

To perform the stratified sampling, the cumulative probability (100%) is divided into segments, one for each iteration of the Monte Carlo simulation. A probability is randomly picked within each segment using a uniform distribution, and then mapped to the corresponding representative value of the variable's actual distribution.

A simulation with 500 iterations would split the probability into 500 segments, each representing 0.2% of the total distribution. For the first segment, a number would be chosen between 0.0% and 0.2%. For the second segment, a number would be chosen between 0.2% and 0.4%. This number would be used to calculate the actual variable value based upon its distribution (inverse of the cumulative distribution).

Let's start with an example for 10 segments: 0%-10%, 10%-20%,...,90%-100%


In [57]:
#If we take 10 samples, we divide the cumulative distribution into 10 equal bins; 
#so values 0.1,0.2,... 1.0
nsegments = 10

segments = np.linspace(0,1.,nsegments+1) #linspace divides [0,1] into nsegments equal segments
print segments


[ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1. ]

In [58]:
#In each of these segments, we now sample as we would sample in a uniform distribution:
samples=np.zeros(nsegments) #here we will store our sampled values
for i in range(nsegments):
    #uniform between these values
    samples[i]=np.random.uniform(segments[i],segments[i+1],1) 

print samples


[ 0.00160979  0.15168352  0.24846795  0.30873392  0.44471662  0.57139712
  0.68652724  0.72194842  0.82565734  0.99433705]

In [23]:
#So what have we done now?
fig,ax = plt.subplots()
ax.hlines(segments,0.,1.)
ax.plot(0.5*np.ones(nsegments), samples,'ko',markersize=5)
ax.set_xticklabels([])

#compare this with a standard random sampling procedure:
#ax.plot(0.5*np.ones(nsegments), np.random.uniform(0.,1.,nsegments),'r*',markersize=8)


Out[23]:
[]

This is basically a LH-sampling of 10 values between 0 and 1. However, in reality we also want to

  • sample other distributions
  • sample between other boundaries

To do so, $P(X \le x) = n$ is solved for $x$ (you're searching the $x$ value for which the probability is $n$), where $n$ is the random point in the segment. How this is done differs for every distribution and it's basically a matter of inverting the cumulative probability function. However, this is not always straightforward: for some continuous distributions, the inverse is not directly available and needs to be approximated. Here, we're not going into detail about the inverse calculation, but we will use the .ppf function of scipy to solve this problem; more information/distributions: http://docs.scipy.org/doc/scipy/reference/stats.html

For a normal distribution:


In [59]:
#Let's plot what we actually want to do -> normal distribution!
from scipy.stats import norm  
#other distributions: http://docs.scipy.org/doc/scipy/reference/stats.html

#mean 8 and std 2.
norm_samp =  norm.ppf(samples, 8.,2.) #ppf-function of python-scipy solves the inverse distribution! 
fig,ax = plt.subplots()
ax.scatter(norm_samp,samples,c='k',s=30)
ax.set_xlim(0,14),ax.set_ylim(0,1)
ax.hlines(segments,0,2., linestyles='--')

#compare this with a standard random sampling procedure:
standardsam=np.random.uniform(0.,1.,nsegments)
ax.scatter(norm.ppf(standardsam, 8.,2.),standardsam,c='r',marker='*',s=40, edgecolor='None' )


Out[59]:
<matplotlib.collections.PathCollection at 0x71e4730>

In [25]:
#Let's plot what we actually want to do -> uniform distribution
#min value 2; max value 14: [2,14]
#---------------------------------------------
unif_samples = 2. + samples* (14.-2.)
#---------------------------------------------

fig,ax = plt.subplots()
ax.scatter(unif_samples,samples,c='k',s=30)
ax.set_xlim(2,14),ax.set_ylim(0,1)
ax.hlines(segments,2.,14., linestyles='--')
ax.vlines(2+segments*(14.-2.),0,1., linestyles='--')

#compare this with a standard random sampling procedure:
standardsam=np.random.uniform(0.,1.,nsegments)
ax.scatter(2+standardsam*(14-2),standardsam,c='r',marker='*',s=40, edgecolor='None' )


Out[25]:
<matplotlib.collections.PathCollection at 0x563cdb0>

For a higher-dimensional space, we mix the combinations of the samples of the multiple parameters with each other to get a sampling of the multi-dimensional space:


In [5]:
from IPython.display import Image
Image(filename='LatinH_works (Small).png', width=500)


Out[5]:

Caption: 2-dimensional view of the Latin Hypercube sampling strategy.


In [26]:
#Shuffling the elements:
#--------------------------------------------------------
np.random.shuffle(unif_samples)
#--------------------------------------------------------
fig,ax = plt.subplots()
ax.scatter(unif_samples,samples,c='k',s=30)
ax.set_xlim([2,14]), ax.set_yticklabels([])


Out[26]:
((2, 14), [])

We can then make a histogram of the sampling, to check if the sampling corresponds to the distribution we want:


In [27]:
from ex3_latinhypercube import lhs_uniform, lhs_normal

#FILL IN NUMBER OF SAMPLES-----------------------------------------------------
N = 1500
#------------------------------------------------------------------------------

#UNIFORM DISTRIBUTION
un_samp = lhs_uniform(N,2.,4.)
#fig
fig,ax=plt.subplots(1,2)
ax[0].hist(un_samp,facecolor='k', edgecolor='w',bins=20)
#classic random sampling in red
ax[0].hist(np.random.uniform(2,4,N),facecolor='r', edgecolor='w',bins=20)

#Normal distribution
norm_samp = lhs_normal(N,6.,0.3)
#fig
ax[1].hist(norm_samp,facecolor='gray', edgecolor='w',bins=20)


Out[27]:
(array([  3,   4,  12,  24,  45,  78, 118, 158, 190, 204, 194, 165, 124,
        83,  50,  27,  12,   6,   2,   1]),
 array([ 5.01712848,  5.11975303,  5.22237758,  5.32500213,  5.42762667,
        5.53025122,  5.63287577,  5.73550032,  5.83812487,  5.94074942,
        6.04337397,  6.14599852,  6.24862307,  6.35124762,  6.45387217,
        6.55649671,  6.65912126,  6.76174581,  6.86437036,  6.96699491,
        7.06961946]),
 <a list of 20 Patch objects>)

We will focus only on normal and uniform distributions with LH:

For a normal distribution, the calculation is as in the previous example:

  • Sample a random value for each segment, call this $y$
  • from scipy.stats import norm
  • your sampled value $x$ :
    x = norm.ppf(y, mean_of_distribution,std_of_distribution)

For a uniform distribution, the calculation is then straightforward:

  • Sample a random value for each segment, call this $y$
  • The value of the uniform distribution with lower value $a$ and upper value $b$ is: $$x = a + y (b − a)$$

Shuffling the elements for higher dimensionalities:

np.random.shuffle(array_to_shuffle)
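A minimal sketch of how such lhs_uniform/lhs_normal definitions could look (the actual implementations are provided in the exercise file; this is only an assumption of their structure):


In [ ]:
#Sketch: Latin Hypercube sampling definitions
import numpy as np
from scipy.stats import norm

def lhs_uniform(nsamples, a, b):
    '''Latin Hypercube sampling of a uniform distribution on [a,b]'''
    lower = np.linspace(0., 1., nsamples + 1)[:-1]            #segment lower bounds
    y = lower + np.random.uniform(0., 1., nsamples)/nsamples  #one draw per segment
    samples = a + y*(b - a)                                   #inverse uniform cdf
    np.random.shuffle(samples)                                #decouple from segment order
    return samples

def lhs_normal(nsamples, loc, scale):
    '''Latin Hypercube sampling of a normal distribution'''
    lower = np.linspace(0., 1., nsamples + 1)[:-1]
    y = lower + np.random.uniform(0., 1., nsamples)/nsamples
    samples = norm.ppf(y, loc, scale)                         #inverse normal cdf
    np.random.shuffle(samples)
    return samples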

------------------------------------------------------------------------------------------

EXERCISE: LATIN HYPERCUBE SAMPLING

In the python file ex3_latinhypercube.py, the previous description of the Latin Hypercube setup is implemented in 2 definitions:

  • lhs_uniform: for uniform sampling of parameters with Latin Hypercube sampling
  • lhs_normal: for normal sampling of parameters with Latin Hypercube sampling

Run the script. Two figures with the histogram of the sampling should appear, one for each sampling. Questions:

  • Adapt the number $N$ to change the number of samples ('shots')
  • Adapt the boundaries of the lhs uniform sampling to [5,10] and do a sampling with N=2000
  • Adapt the loc=9. and scale=3 of the lhs normal sampling and do a sampling with N=700
  • Try to compare the histogram with a histogram of a classic random sampling procedure

------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------

EXTRA EXERCISE: LATIN HYPERCUBE SAMPLING FOR THE RESPIROMODEL

In the python script ex4_repiro_lhs.py, the model of the respiromodel is used, but the script is now adapted to apply the lhs sampling. Instead of taking the 'shot' in each iteration of the for-loop, we now want to prepare the samples before the for-loop.

  • Why is the from ... import ... statement included? What does this mean?
  • Check the file and compare the differences with the file ex2_respiromodel in the run Monte Carlo part. What is changed?
  • Visualize the MCoutputs; how would you do this?
  • Make a plot of the histogram at one timestep over the different MC-loops. Try to do this also for the exercise 2 file and compare the output figures for different numbers of Monte Carlo runs.
  • You explicitly know that the uncertainty of $X_0$ follows a normal distribution with mean 700 and std 20. Change this in the script and see if it has any effect.
  • Calculate percentile values and show the uncertainty bounds.

------------------------------------------------------------------------------------------

FINAL REMARK: take care of large dimensionalities!

<img src="sampling_comp.png" width=600>

Summary

  • When the model is linear or can be approximated by a linear (Taylor) approximation
  • When the input uncertainties have a normal distribution (or can be transformed into a normal distribution)
  • When the number of equations is not too large

The 'classic' error propagation or differential analysis give a good estimate of the output uncertainty. The main advantage is computational time, since the Monte Carlo technique is always more computationally intensive!

However, since most environmental (spatial) models have many equations or can't be accessed, Monte Carlo based propagation is usually the most generic solution.

So, always consider first if linear propagation is possible; if not, do Monte Carlo based error propagation, preferably with Latin Hypercube sampling.

Main drawback of the uncertainty propagation:

The definition of these uncertain inputs is mostly NOT straightforward! Many distributions exist and the choice of the distribution has a direct effect on the model output (prediction) uncertainty. Definition of these distributions/ranges can be done based on

(1) repeated experiments (not easy in the field...);
(2) confidence intervals of the parameters (see later this day; we need data to do this),
(3) literature values or
(4) expert knowledge.

Next step: get more information about the model behaviour, by checking how the uncertain inputs contribute to the output uncertainty => Sensitivity Analysis

After that: Introduce (uncertain) DATA to evaluate the model behaviour!