PCA Techniques:

Notes on different ways of doing PCA and some pitfalls associated with them

1. Standard PCA: Covariance Eigenvalues and Eigenvectors

Let us look at two standard ways to do PCA, both of which rely on simple eigenvalue decompositions, using a toy example. First, generate the data.


In [53]:
# code for the toy data lives in ../consomme
import sys
sys.path.append('../')

import numpy as np
import matplotlib.pyplot as pl

from consomme.data_faker import *

N = 100       # number of samples
noise = 0.01  # noise level, as a fraction of the mean spectrum

wavelength, clean_spec, noisy_spec, uncert = default_fake(N,noise)

# restrict the wavelength range and subtract the mean spectrum;
# the spectra arrays now have shape (100, 400)
wavelength = wavelength[400:800]
clean_spec = clean_spec[:,400:800]
noisy_spec = noisy_spec[:,400:800]
clean_spec -= np.mean(clean_spec,axis=0)
noisy_spec -= np.mean(noisy_spec,axis=0)

The clean_spec object above is an array of spectra: 100 examples, each with 400 values. The variation among the spectra is determined by some set of low-dimensional factors, which we will recover with PCA. One method is the following:


In [54]:
# compute the covariance matrix
covariance = 1. / (N-1) * np.dot(clean_spec.T,clean_spec)
print 'Covariance matrix has shape = ',covariance.shape

eigval, eigvec = np.linalg.eigh(covariance)
print 'Eigenvalue matrix has shape = ',eigval.shape
print 'Eigenvector matrix has shape = ',eigvec.shape


Covariance matrix has shape =  (400, 400)
Eigenvalue matrix has shape =  (400,)
Eigenvector matrix has shape =  (400, 400)

NumPy's eigh returns the eigenvalues in ascending order, with the corresponding eigenvectors as the columns of the eigvec matrix. Examining the last five eigenvalues:


In [55]:
print eigval[-5:]


[  1.73491869e-15   1.83962469e-15   2.18693240e-15   2.79182304e-15
   8.22123830e+00]

Clearly, essentially all of the variance is in the last component, so the toy data used here has a single factor of interest. The corresponding eigenvector looks like:


In [56]:
from IPython.display import Image

fig = pl.figure()
pl.plot(wavelength,eigvec[:,-1])
pl.xlabel('Wavelength ($\AA$)')
pl.ylabel('$F_\lambda$ (Arb.)')
fig.savefig('../plots/foo.png',dpi=100)
Image(filename='../plots/foo.png')


Out[56]: [figure: the leading eigenvector plotted against wavelength]
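
As a quick sanity check (an added cell, not part of the original notebook), we can reorder eigh's ascending output into the conventional descending order and confirm that the leading component carries essentially all of the variance; the printed fraction should be very close to 1.


In [ ]:
# reorder eigenvalues/eigenvectors into descending order
order = np.argsort(eigval)[::-1]
eigval_desc = eigval[order]
eigvec_desc = eigvec[:,order]

# fraction of the total variance carried by the leading component
print(eigval_desc[0] / np.sum(eigval_desc))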

2. Standard PCA: Singular Value Decomposition

A second common method for calculating principal components is to perform a singular value decomposition (SVD) of the data.


In [57]:
U, S, V = np.linalg.svd(clean_spec)
print U.shape
print S.shape
print V.shape


(100, 100)
(100,)
(400, 400)

In [58]:
# np.linalg.svd returns V transposed, so taking V.T here puts the right
# singular vectors (the principal directions) into columns, ordered by
# decreasing singular value
eigvec_svd = V.T

# The sign of an eigenvector is arbitrary, so this may differ from the
# result in approach 1 by a minus sign; here is a hacky fix
if np.sum(eigvec_svd[:10,0]) < 0.0:
    eigvec_svd *= -1.0

In [59]:
fig = pl.figure()
pl.plot(wavelength,eigvec_svd[:,0])
pl.xlabel('Wavelength ($\AA$)')
pl.ylabel('$F_\lambda$ (Arb.)')
fig.savefig('../plots/foo.png',dpi=100)
Image(filename='../plots/foo.png')


Out[59]: [figure: the leading SVD eigenvector plotted against wavelength]

In [60]:
print np.sum(eigvec_svd[:,0]-eigvec[:,-1])


-2.98475437074e-15

In [61]:
# the covariance eigenvalues are related to the singular values by
# eigval = S**2 / (N - 1)
eigval_svd = S ** 2. / (N-1.)

In [62]:
print eigval_svd[0],eigval[-1]


8.22123830286 8.22123830286
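
One more connection worth making explicit (an added cell, not part of the original notebook): the per-sample amplitudes ("scores") along the leading component can be read off either decomposition, and they agree up to the usual sign ambiguity, so the comparison below should print True.


In [ ]:
# project each spectrum onto the leading eigh eigenvector
scores_eig = np.dot(clean_spec, eigvec[:,-1])

# equivalently, the first left singular vector scaled by its singular value
scores_svd = U[:,0] * S[0]

# the two agree up to an overall sign
print(np.allclose(np.abs(scores_eig), np.abs(scores_svd)))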

3. Expectation-Maximization PCA (EMPCA)

One problem with the standard approaches to PCA (1 & 2 above) is that they can be computationally expensive when the number of samples and/or the number of dimensions becomes large. When N_samples ~ N_dimensions ~ N, the computation time scales as $O(N^3)$. If it is known (or can be assumed/modeled) that the underlying dimensionality of the data is low, or at least substantially lower than the dimensionality of the observations, then EMPCA is computationally more efficient. We will skip the details here, but see Roweis 1998 for a clear explanation. In short, EMPCA takes advantage of the low latent dimensionality, allowing the expensive matrix inversions to be reduced to much smaller ones using a linear algebraic lemma.
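
To make the E-step/M-step structure concrete, here is a minimal sketch of the noise-free EM iterations from Roweis 1998 in plain numpy (an added cell, and not the empca.py implementation used below; the function name, random initialization, and fixed iteration count are purely illustrative). Note that each iteration only solves or inverts M x M systems, which is where the savings come from when M is small.


In [ ]:
def em_pca_sketch(Y, M, n_iter=200, seed=0):
    """Illustrative noise-free EM-PCA (Roweis 1998); not the empca.py API.

    Y : (N_dim, N_samp) mean-subtracted data
    M : assumed latent dimensionality
    Returns an (N_dim, M) basis spanning the leading M-dimensional
    subspace, defined only up to a rotation within that subspace.
    """
    np.random.seed(seed)
    C = np.random.randn(Y.shape[0], M)  # random initial basis
    for i in range(n_iter):
        # E-step: latent coordinates X = (C^T C)^{-1} C^T Y  (M x M solve)
        X = np.linalg.solve(np.dot(C.T, C), np.dot(C.T, Y))
        # M-step: new basis C = Y X^T (X X^T)^{-1}  (M x M inverse)
        C = np.dot(np.dot(Y, X.T), np.linalg.inv(np.dot(X, X.T)))
    return C

# e.g., em_pca_sketch(clean_spec.T, 1) should recover the same direction
# as eigvec[:,-1], up to scale and sign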

Grabbing a copy of empca.py from github.com/rossfadely/EMPCA, we can run EMPCA on the example above.


In [63]:
from empca import EMPCA

# EMPCA takes data as (N_dim,N_samp), following Roweis 1998
# M = Latent dimensionality
M = 1
model = EMPCA(clean_spec.T,M)

# As before, the recovered direction may differ from the result in
# approach 1 by a minus sign; here is a hacky fix
if np.sum(model.lam[:10,0]) < 0.0:
    model.lam *= -1.0

This produces a set of vectors similar (more on this below) to the eigenvectors produced by standard PCA.


In [64]:
fig = pl.figure()
pl.plot(wavelength,model.lam[:,0])
pl.xlabel('Wavelength ($\AA$)')
pl.ylabel('$F_\lambda$ (Arb.)')
fig.savefig('../plots/foo.png',dpi=100)
Image(filename='../plots/foo.png')


Out[64]: [figure: the M=1 EMPCA basis vector plotted against wavelength]
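
Since the columns of model.lam are not necessarily unit-normalized, a robust comparison against the standard PCA result (an added check, not part of the original notebook) is the absolute cosine between the two directions, which should come out very close to 1.


In [ ]:
# cosine between the EMPCA direction and the eigh eigenvector;
# the absolute value removes the arbitrary sign
v = model.lam[:,0] / np.sqrt(np.sum(model.lam[:,0] ** 2.))
print(np.abs(np.dot(v, eigvec[:,-1])))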

Notice above that we had to specify the latent dimensionality $M$. What happens if we don't know the exact value of $M$ and over-guess?


In [66]:
M = 4
model = EMPCA(clean_spec.T,M)
fig = pl.figure()
pl.plot(wavelength,model.lam[:,0],'g')
pl.plot(wavelength,model.lam[:,1],'r')
pl.plot(wavelength,model.lam[:,2],'b')
pl.plot(wavelength,model.lam[:,3],'c')
pl.xlabel('Wavelength ($\AA$)')
pl.ylabel('$F_\lambda$ (Arb.)')
fig.savefig('../plots/foo.png',dpi=100)
Image(filename='../plots/foo.png')


Out[66]: [figure: the four M=4 EMPCA basis vectors plotted against wavelength]

The above will generally look discouraging: while some of the 'eigenvectors' look like we might expect (i.e., consistent with noise), we seem to have lost the principal component we care about. This is one downside of EMPCA and, more generally, of iterative methods that estimate low-dimensional factors: they converge to a solution that is equivalent to PCA only up to an arbitrary orthogonal rotation within the recovered subspace. Let's apply a varimax rotation to our results.


In [70]:
import sys
sys.path.append('../')
from consomme.data_faker import *

R = ortho_rotation(model.lam,method='varimax')
model.lam = np.dot(model.lam,R)
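
To check that the rotation has recovered the direction we care about (an added cell, not part of the original notebook), project the standard PCA eigenvector onto each rotated, normalized basis vector: if varimax has isolated the physical component, one entry should be close to 1 in absolute value and the rest close to 0.


In [ ]:
# normalize each rotated basis vector, then compare against the eigh eigenvector
lam_unit = model.lam / np.sqrt(np.sum(model.lam ** 2., axis=0))
print(np.abs(np.dot(lam_unit.T, eigvec[:,-1])))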

In [ ]: