Text provided under a Creative Commons Attribution license, CC-BY. All code is made available under the FSF-approved MIT license. (c) Marc Spiegelman

In [ ]:
%matplotlib inline
import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt

Principal Component/EOF analysis

GOAL: Demonstrate the use of the SVD to calculate principal components or "Empirical Orthogonal Functions" in a geophysical data set. This example is modified from a paper by Chris Small (LDEO)

Small, C., 1994. A global analysis of mid-ocean ridge axial topography. Geophys J Int 116, 64–84. doi:10.1111/j.1365-246X.1994.tb02128.x

The Data

Here we will consider a set of topography profiles taken across the global mid-ocean ridge system where the Earth's tectonic plates are spreading apart.

The data consist of 156 profiles spanning a range of spreading rates. Each profile contains 80 samples, so each is in effect a vector in $\mathbb{R}^{80}$.


In [ ]:
# read the data from the csv file
data = np.genfromtxt('m80.csv', delimiter=',')
data_mean = np.mean(data, 0)

# and plot out a few profiles and the mean depth.
plt.figure()
rows = [ 9,59,99]
labels = [ 'slow','medium','fast']
for i, row in enumerate(rows):
    plt.plot(data[row,:], label=labels[i])
plt.plot(data_mean,'k--',label='mean')
plt.xlabel('Distance across axis (km)')
plt.ylabel('Relative Elevation (m)')
plt.legend(loc='best')
plt.title('Example cross-axis topography of mid-ocean ridges')
plt.show()

EOF analysis

While each profile lives in an 80-dimensional space, we would like to see whether the variability can be described with many fewer components. To begin, we form the de-meaned data matrix $X$, whose rows are the individual profiles.


In [ ]:
plt.figure()
X = data - data_mean
plt.imshow(X)
plt.xlabel('Distance across axis (km)')
plt.ylabel('Relative Spreading Rate')
plt.colorbar()
plt.show()
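
As a quick sanity check (an addition, not in the original notebook), every column of the de-meaned matrix $X$ should now have (numerically) zero mean:

In [ ]:
# each column of X should average to ~0 after removing the mean profile
print(X.shape)                           # (number of profiles, samples per profile)
print(np.allclose(X.mean(axis=0), 0.0))  # expect True, up to round-off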

Applying the SVD

We now use the SVD to factor the data matrix as $X = U\Sigma V^T$


In [ ]:
# now calculate the SVD of the de-meaned data matrix
U,S,Vt = la.svd(X,full_matrices=False)
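
With full_matrices=False this is the "economy" SVD, so for the $156\times 80$ data matrix, $U$ is $156\times 80$, $\Sigma$ holds 80 singular values, and $V^T$ is $80\times 80$. A small check (added here) that the factorization reproduces $X$:

In [ ]:
# verify shapes and that U*Sigma*Vt reproduces the de-meaned data
print(U.shape, S.shape, Vt.shape)
print(np.allclose(X, np.dot(U*S, Vt)))  # U*S scales each column of U by a singular value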

We begin by looking at the spectrum of singular values. Defining the variance contributed by the $i$-th component as $\sigma_i^2$, we can calculate the cumulative fraction of the total variance explained by the first $k$ components as

$$ g_k = \frac{\sum_{i=0}^k \sigma_i^2}{\sum_{i=0}^n \sigma_i^2} $$

Plotting both the singular values and $g$ shows that $\sim$ 80% of the total variance can be explained by the first 4-5 components.


In [ ]:
# plot the singular values
plt.figure()
plt.semilogy(S,'bo')
plt.grid()
plt.title('Singular Values')
plt.show()

# and cumulative percent of variance
g = np.cumsum(S*S)/np.sum(S*S)
plt.figure()
plt.plot(g,'bx-')
plt.title('Cumulative fraction of variance explained')
plt.grid()
plt.show()
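
To make the 80% figure concrete, we can compute the smallest $k$ with $g_k \ge 0.8$ (a small check added here; the exact count depends on the data):

In [ ]:
# smallest number of components explaining at least 80% of the variance
k80 = np.searchsorted(g, 0.8) + 1
print('{} components explain {:.1f}% of the variance'.format(k80, 100*g[k80-1]))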

Plotting the first few singular vectors in $V$ (here the first three rows of $V^T$) shows that they reflect some commonly occurring patterns in the data.


In [ ]:
plt.figure()
num_EOFs=3
for row in range(num_EOFs):
    plt.plot(Vt[row,:],label='EOF{}'.format(row+1))
plt.grid()
plt.xlabel('Distance (km)')
plt.title('First {} EOFs '.format(num_EOFs))
plt.legend(loc='best')
plt.show()

For example, the first EOF is primarily a symmetric pattern with an axial high flanked by two off-axis troughs (or, equivalently, an axial low with two flanking highs; the EOFs are just unit basis vectors for the row space and can be weighted by coefficients of either sign). The second EOF is broader and of one sign, while the third EOF encodes asymmetry across the axis.
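
Because they come from the SVD, the rows of $V^T$ are exactly orthonormal; a quick numerical confirmation (added here) of the "unit vector bases" claim:

In [ ]:
# the EOFs form an orthonormal basis: Vt Vt^T should be the identity
print(np.allclose(np.dot(Vt, Vt.T), np.eye(Vt.shape[0])))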

Reconstruction

Using the SVD, we can also decompose each profile into a weighted linear combination of EOFs, i.e.

$$ X = U\Sigma V^T = C V^T $$

where $C = U\Sigma$ is a matrix of coefficients that describes how each data row is decomposed into the EOF basis vectors. We can then produce a rank-$k$ truncated representation of the data by

$$ X_k = C_k V_k^T $$

where $C_k$ contains the first $k$ columns of $C$ and $V_k$ contains the first $k$ EOFs.

Here we show the data reconstructed using the first 5 EOFs.


In [ ]:
# reconstruct the data using the first 5 EOFs
k=5
Ck = np.dot(U[:,:k],np.diag(S[:k]))
Vtk = Vt[:k,:]
data_k = data_mean + np.dot(Ck,Vtk)

plt.figure()
plt.imshow(data_k)
plt.colorbar()
plt.title('Reconstructed data (k={})'.format(k))
plt.show()
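
As a check (added here), the Eckart-Young theorem says the Frobenius-norm error of the rank-$k$ truncation equals the energy in the discarded singular values, $\|X - X_k\|_F = \left(\sum_{i>k} \sigma_i^2\right)^{1/2}$:

In [ ]:
# truncation error should match the discarded singular values exactly
err = la.norm(X - np.dot(Ck, Vtk), 'fro')
print(err, np.sqrt(np.sum(S[k:]**2)))  # the two numbers should agree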

We can also compare a few reconstructed profiles with the original data.


In [ ]:
# show the original 3 profiles and their reconstructed values using the first k EOFs
for i, row in enumerate(rows):
    plt.figure()
    plt.plot(data_k[row,:], label='k={}'.format(k))
    plt.plot(data[row,:], label='original data')
    Cstring = ['{:3.0f},  '.format(Ck[row,j]) for j in range(k)]
    plt.title('Reconstruction profile {}:\n C_{}='.format(row, k) + ''.join(Cstring))
    plt.legend(loc='best')
    plt.show()

Projection of the data onto a subspace

We can also use the principal components to look at the projection of the data onto a lower-dimensional space, as the coefficients $C$ are simply the coordinates of the data along each principal component. For example, we can view the data in the two-dimensional space defined by the first 2 EOFs by plotting $C_1$ against $C_2$.


In [ ]:
# plot the data in the plane defined by the first two principal components
plt.figure()
plt.scatter(Ck[:,0],Ck[:,1])
plt.xlabel('$C_1$')
plt.ylabel('$C_2$')
plt.grid()
plt.title('Projection onto the first two principal components')
plt.show()

# or consider the degree of asymmetry (EOF 3) as a function of spreading rate
plt.figure()
plt.plot(Ck[:,2],'bo')
plt.xlabel('Profile index (ordered by spreading rate)')
plt.ylabel('$C_3$')
plt.grid()
plt.title('Degree of asymmetry')
plt.show()
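
Note (an added aside) that these coefficients are equivalently obtained by projecting the de-meaned data directly onto the EOFs, since $XV = U\Sigma V^T V = U\Sigma = C$:

In [ ]:
# projecting X onto the first two EOFs recovers the plotted coefficients
print(np.allclose(np.dot(X, Vt[:2,:].T), Ck[:,:2]))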
