SCCS2014 Workshop: The Power of Model Validation

So Martin and I (OK, mainly me. I'm "special".) thought it would be a wonderful idea to go through some concepts in time-series analysis.

1: Determining the correlation strength between two 1D timeseries using cross-correlation.

2: Assessing the degree to which two 2D timeseries are related using EOF (or SVD) decomposition.

Applying these ideas we hope to illustrate how to utilise these concepts in an manner designed to illustrate the strength of a correlation and investigate possible mechanistic causes of similar patterns. These methods are ideal for model validation, and thus we hope to touch on both 1D and 2D examples since these need to be treated quite differently. In a nutshell, what we hope you can take away from this little introduction is that accessible methods exist which can boost the impact of your work.

We've chosen to use python to present these ideas, but they are transferrable to other languages such as octave, R, matlab or mathematica.

1D Timeseries: The significance of a cross-correlation

In [4]:
import numpy as np
import matplotlib.pyplot as pyp
import pandas as pnd
from IPython.display import Image
import scipy.sparse.linalg as sp
import h5py
from IPython.display import Math

First off, we're going to generate some data. We'll do this using combinations of the sin function, and adding a little random noise:

In [290]:
a = np.sin(np.arange(1,100,0.1)/np.pi*3.5)+np.sin(np.arange(1,100,0.1)/2.5)*np.random.rand(990)*1.7
b = np.sin(np.arange(1,100,0.1)/np.pi*3.5)+np.sin(np.arange(1,100,0.1)/7)*np.random.rand(990)*1.5

In [291]:
title('Two timeseries')

<matplotlib.text.Text at 0x7fe36319a550>

Now that we have some data, lets look at the correlation between the two:

In [292]:
a = pnd.Series(a) #Here I turn the timeseries into a "Series" to use the pandas library
b = pnd.Series(b)
correlation = a.corr(b)

print "The correlation between a and b is: ", correlation

The correlation between a and b is:  0.559795942816

However, the value of the correlation does not tell us anything about how significant the link/correlation between the two timeseries is. Thus, reporting only the value of the correlation is reasonably meaningless, without also quoting the level of significance of the correlation. This is because the points in the dataset could both be coupled tightly to a third process, which swamps the independent variability the two timeseries actually have in common. An allegory here is a solar cycle, which is easily imagined for the sinusoidal nature of the timeseries a and b. Here the "sun" drowns out the independent variability in timeseries a and timeseries b, so it is difficult to say if it is timeseries a and b correlating significantly.

To assess this independence, we determine the "degrees of freedom". This is a measure of the number of independent datapoints, or points that are free to vary. We assess these by looking at the cross correlation (xcorr). This is a way of telling how similar the two timeseries are as a function of the the time-lag between them:

In [293]:
k = pyp.xcorr(a, b, maxlags = 200)
title('xcorr of timeseries a and b')
ylabel('Normalised correlation strength')

<matplotlib.text.Text at 0x7fe363165110>

In the figure above, a correlation of "0" denotes that the two timeseries are unrelated with that timelag applied. This wave-like pattern is expected for a signal with a sinusoidal nature, since we move between the timeseries being in and out of phase. This is seen in the positive (in phase) and negative (out of phase) correlations. In a negative phase we referr to the timeseries being anti-correlated with that timelag applied.

The degrees of freedom are the number of steps away from the zero lag we take before the correlation becomes zero. We can read this off the figure above, or decide to be difficult and write something to do it for us:

In [294]:
ind = np.where(np.abs(k[:][0][np.where(np.abs(k[:][1]) < 0.03)]) == np.abs(k[:][0][np.where(np.abs(k[:][1]) < 0.03)]).min())
df = k[:][0][np.where(np.abs(k[:][1]) < 0.03)][ind]

print "The degrees of freedom are: ", df

The degrees of freedom are:  [-14  14]

Thus, we have now determined that the degrees of freedom are 16 (we ignore the negative if it's there, if there are two numbers this is because the xcorr is symetric around zero). Now with this, statisticians have gone through huge empirical effort to put together tables like that below:

In [123]:


With this, we can loo at the degrees of freedom (denoted N in the table), and look up what values are appropriate for our example.

Thus, with just a little xcorr function and a table we have been able to significantly improve on the level of sense our statistical test makes! In essesnce, we can now say we have two timeseries that have a correlation of 0.56 significant at the .025 level. Thus, we are 99.975% sure this correlation in not non-sense. This is good, since we did literally just make up some data.

2D timeseries: Empirical Orthogonal Functions (EOFs)

A lot of the time we don't have just one timeseries. Specifically, we can suspect that patterns in out simulations and in real world data are "similar", but quantifying the similarity is complicated by:

1: The spatial patterns not overlapping perfectly

2: Modes of opperation of the natural system outside the research question

3: Determining the degrees of freedom; How many truly independent datapoints exsist?

To combat situations where you have a lot of data, with a huge amount going on in it, the idea is that we try to reduce the "complexity". We want to look at only certain modes of variability.

This type of problem is so common among such a diverse range of fields, that you find this type of analysis under other names as well, such as: EOFs, principal components, proper orthogonal decomposition, singular vectors, Karhunen-Loève functions, optimals, etc. These are all basically the same thing, but we'll go with EOFs. Here we'll go with the most general approach of a Singular Value Decomposition, to get at the EOFs.

EOF/SVD analysis is not limited to timeseries analysis, but rather is a hugely versatile tool. Here we present a simple example of it's usage but it is worth looking into further!

First off, lets get some data:

In [5]:
f = h5py.File('/noc/msm/scratch/valor2/maike/beringDump/concatenateData/N1/')
ssh = squeeze(SSH[:])
sst = squeeze(SST[:])
#Some land data, just por plotting. Oceanographers tend to politely ignore the land.
land = h5py.File("/noc/msm/scratch/valor2/maike/beringDump/land_N1_global.h5", 'r')

This is sea surface height (m) data, from a global general circulation model called NEMO (Nucleus of European Models of the Ocean). We have a global dataset at 1 degree resolution, with 30 years of monthly data from 1978 to 2007. We have a quick look to see how the data look:

In [58]:
cclim = np.array([-2,2])
imshow(flipud(ssh[0,...]), cmap=cm.coolwarm)
title('Sea Surface Height (m) January 1978')

<matplotlib.text.Text at 0x7fec9f480dd0>

Suppose we can write our sea surface height data matrix as:

In [310]:
Math(r'\mathrm{M} = \mathrm{U} \Sigma \mathrm{V}^{*}')

$$\mathrm{M} = \mathrm{U} \Sigma \mathrm{V}^{*}$$

This factorization is called a singular value decomposition, and has many nifty properties! We'll explore some now...

First off, we need to make life easier for ourselves, and collapse the latitude and longitude information into a single vector. This is just a mathematical trick, and does not cost us the spatial information. We then determine the U, Σ and V matrixes using the sparse methods in the ARPACK eigensolver (See "Further Notes" for a more detailed explanation and validation of the sparse approach). The matrix U, Σ and V are:

U: An (m x m) unitary matrix, where the m columns are the left-singular vectors of M.

Σ: An (m × n) rectangular diagonal matrix with non-negative real numbers on the diagonal. The diagonal entries of Σ are the singular values of M

V: An (n x n) unitary matrix, where the n columns are the right-singular vectors of M.

In [6]:
u_ssh,s_ssh,v_ssh=sp.svds(np.transpose(ssh.reshape((360, 292*362))), k=300)

To retrieve the EOFs, we need to understand that the "singular vectors" are what we refer to as EOFs. We will use the entries of U to retrieve the EOFs:

In [61]:
eof1issh = u_ssh[:, -1]
eof1ssh=eof1issh.reshape(292, 362)
eof2issh = u_ssh[:, -2]
eof2ssh=eof2issh.reshape(292, 362)
eof3issh = u_ssh[:, -3]
eof3ssh=eof3issh.reshape(292, 362)

In [62]:
cclim = np.array([-0.01,0.01])
imshow(flipud(eof1ssh), cmap=cm.coolwarm)
title('Sea Surface Height EOF1')
imshow(flipud(eof2ssh), cmap=cm.coolwarm)
title('Sea Surface Height EOF2')
imshow(flipud(eof3ssh), cmap=cm.coolwarm)
title('Sea Surface Height EOF3')

<matplotlib.text.Text at 0x7fec40d48d50>

The figures above tell us about patterns, but one should be careful interpreting and attributing these to physical patterns. We look at the variability caused by a huge collection of different processes, rather than assessing individual processes using the EOF analysis. We can use the EOFs to guide out intuition.

We can now look at what percentage of the variability explained by the different EOFs:

In [7]:
s2_ssh = s_ssh[:-1]**2
percent_ssh = (100*s2_ssh)/sum(s2_ssh)
sumvar_ssh = sum(percent_ssh[:])

In [8]:
plot(flipud(percent_ssh[259:]), 'k*')
title('The first 40 EOFs SSH')
xlabel('EOF number')
ylabel('Percentage of variance (%)')
print sumvar_ssh


Here we see the percentage of the variance in the SSH fields contained in the seperate EOFs. Again, this can confirm simple intuition such as that the yearly signal in solar insolation creates a lot of variance.

Now lets look at comparing two datasets!

The second dataset we'll look at is a sea surface temperature field. First off, we plot the data again:

In [69]:
cclim = np.array([-1,31])
imshow(flipud(sst[0,...]), cmap=cm.coolwarm)
title('Sea Surface Temperature (m) January 1978')

<matplotlib.text.Text at 0x7fec40e01f50>

We repeat the analysis from the SSH again with the SST:

In [9]:
u_sst,s_sst,v_sst=sp.svds(np.transpose(sst.reshape((360, 292*362))), k=300)

In [72]:
eof1isst = u_sst[:, -1]
eof1sst=eof1isst.reshape(292, 362)
eof2isst = u_sst[:, -2]
eof2sst=eof2isst.reshape(292, 362)
eof3isst = u_sst[:, -3]
eof3sst=eof3isst.reshape(292, 362)

In [73]:
cclim = np.array([-0.01,0.01])
imshow(flipud(eof1sst), cmap=cm.coolwarm)
title('Sea Surface Temperature EOF1')
imshow(flipud(eof2sst), cmap=cm.coolwarm)
title('Sea Surface Temperature EOF2')
imshow(flipud(eof3sst), cmap=cm.coolwarm)
title('Sea Surface Temperature EOF3')

<matplotlib.text.Text at 0x7fec40a84090>

In [10]:
s2_sst = s_sst[:-1]**2
percent_sst = (100*s2_sst)/sum(s2_sst)
sumvar_sst = sum(percent_sst[:])

In [11]:
plot(flipud(percent_sst[259:]), 'k*')
title('The first 40 EOFs SST')
xlabel('EOF number')
ylabel('Percentage of variance (%)')
print sumvar_sst


Here we see that there is a very different distribution in the SST field. We can put the two timeseries on the same plot for easier comparison:

In [12]:
plot(flipud(percent_sst[259:]), 'k*')
plot(flipud(percent_ssh[259:]), 'k*')
title('The first 40 EOFs SST and SSH')
xlabel('EOF number')
ylabel('Percentage of variance (%)')

<matplotlib.text.Text at 0x7f62792f5fd0>

In the above plot we see that the percentage of the variance the different EOFs can account for is very different! Thus, looking at the data in this way we have a firmer grasp of how the variability is actually distributet across the data.

For further comparison, it is also possible to project the SST EOFs back using the SSH S and V matrixes. This gives a more quantitative was of camparing the datasets.

In [ ]:

Illustrating the skill of the Sparse methods and Illustrating the Reconstruction

Due to the numerics involved, the EOF calculations with the singular value decomposition can be hugely memory intensive, and thus take a vary long time to compute. This can be hugely problematic if larger datasets need to be handled.

To illustrate, the "gesdd" function the LAPACK library implements uses a workspace O(N^2), assuming M=N for simplicity. However, in this practical we have employed the sparse methods from the ARPACK library. This may seem curious, since we have illustrated that our ssh timeseries is most definitely not sparse. The covariance matrix does looks quite different however, and below we illustrate the reconstruction of our ssh field and it's fit to the original. This is motivated in part by the workspace of ARPACK scaling O(ncv), where ncv are the number of basisvectors used for the Eigenvalue Algorithm. The image below illustrates the substantial difference in memory usage here.

(Thanks to Ian Bush, Edward Smyth (NAG) and Peter Challenor (NOCS). Image from:

In [395]: