Principal Components Analysis (PCA)

a.k.a. Singular Value Decomposition (SVD) a.k.a. Eigenvalue Decomposition (EVD) a.k.a. Empirical Orthogonal Functions (EOF) a.k.a. Karhunen–Loève transform (KLT) a.k.a. Proper Orthogonal Decomposition (POD) a.k.a. the Hotelling transform a.k.a. factor analysis a.k.a. Eckart–Young theorem a.k.a. Schmidt–Mirsky theorem etc.

Instructor: Alessandro Gagliardi
TA: Kevin Perko

Recall

Polynomial Regression & Multicollinearity


In [1]:
%load_ext rmagic

In [2]:
%%R
library(ggplot2)

In [3]:
%%R 
x <- seq(1, 10, 0.1)
print(cor(x^9, x^10))
qplot(x^9, x^10)


[1] 0.9987608

In [4]:
%%R -w 960 -h 480 -u px
x <- seq(1, 10, 0.1)
p <- data.frame(rbind(poly(x, 10, raw=T)[,c(9,10)], 
                      poly(x, 10, raw=F)[,c(9,10)]), 
                raw = rep(c(TRUE, FALSE), each = length(x)))
ggplot(p, aes(X9, X10)) + geom_point() + facet_wrap(~raw, scales="free")


Multicollinearity

poly() prevents multicollinearity when using multiple powers of the same variable, because (with raw = FALSE) it generates orthogonal polynomial terms.
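
For comparison with the near-perfect correlation printed above, a quick check (using the same x) that the orthogonal-polynomial columns produced by poly() are essentially uncorrelated:

x <- seq(1, 10, 0.1)
round(cor(poly(x, 10, raw = FALSE)[, 9], poly(x, 10, raw = FALSE)[, 10]), 10)  # ~0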

But what about when the multicollinearity comes from correlations between separate variables?

Example

Results of the Olympic Heptathlon Competition, Seoul, 1988.

from D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway and E. Ostrowski (1994). A Handbook of Small Datasets, Chapman and Hall/CRC, London


In [5]:
%%R 
data("heptathlon", package = "HSAUR")
summary(heptathlon)


    hurdles         highjump          shot          run200m     
 Min.   :12.69   Min.   :1.500   Min.   :10.00   Min.   :22.56  
 1st Qu.:13.47   1st Qu.:1.770   1st Qu.:12.32   1st Qu.:23.92  
 Median :13.75   Median :1.800   Median :12.88   Median :24.83  
 Mean   :13.84   Mean   :1.782   Mean   :13.12   Mean   :24.65  
 3rd Qu.:14.07   3rd Qu.:1.830   3rd Qu.:14.20   3rd Qu.:25.23  
 Max.   :16.42   Max.   :1.860   Max.   :16.23   Max.   :26.61  
    longjump        javelin         run800m          score     
 Min.   :4.880   Min.   :35.68   Min.   :124.2   Min.   :4566  
 1st Qu.:6.050   1st Qu.:39.06   1st Qu.:132.2   1st Qu.:5746  
 Median :6.250   Median :40.28   Median :134.7   Median :6137  
 Mean   :6.152   Mean   :41.48   Mean   :136.1   Mean   :6091  
 3rd Qu.:6.370   3rd Qu.:44.54   3rd Qu.:138.5   3rd Qu.:6351  
 Max.   :7.270   Max.   :47.50   Max.   :163.4   Max.   :7291  

from A Handbook of Statistical Analyses Using **R**:

To begin it will help to score all the seven events in the same direction, so that ‘large’ values are ‘good’. We will recode the running events to achieve this.


In [6]:
%%R
heptathlon$hurdles <- max(heptathlon$hurdles) - heptathlon$hurdles
heptathlon$run200m <- max(heptathlon$run200m) - heptathlon$run200m
heptathlon$run800m <- max(heptathlon$run800m) - heptathlon$run800m

In [7]:
%R print(round(cor(heptathlon[,1:7]), 2))


         hurdles highjump shot run200m longjump javelin run800m
hurdles     1.00     0.81 0.65    0.77     0.91    0.01    0.78
highjump    0.81     1.00 0.44    0.49     0.78    0.00    0.59
shot        0.65     0.44 1.00    0.68     0.74    0.27    0.42
run200m     0.77     0.49 0.68    1.00     0.82    0.33    0.62
longjump    0.91     0.78 0.74    0.82     1.00    0.07    0.70
javelin     0.01     0.00 0.27    0.33     0.07    1.00   -0.02
run800m     0.78     0.59 0.42    0.62     0.70   -0.02    1.00

One of these things is not like the other. Do you see it?

Scores in all of the events, with the exception of javelin, are correlated with one another.

javelin is considered a 'technical' event while the others are considered 'power'-based events.

Principal Components Analysis (PCA)

PCA was invented multiple times independently to solve different but related problems.

PCA is equivalent to what is known as the singular value decomposition (SVD) of $\bf{X}$ and the eigenvalue decomposition (EVD), a.k.a. the spectral decomposition of $\bf{X^TX}$ in linear algebra.

FYI: SVD and EVD are slightly different, and if your linear algebra skills are up to the task, I encourage you to see how they compare; for our purposes we may treat them as equivalent.
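
As a starting point for that comparison, here is a minimal numerical sketch (the matrix M is random, made up purely for illustration): the squared singular values of M equal the eigenvalues of $\bf{M^TM}$, and the right singular vectors match its eigenvectors up to sign.

set.seed(42)
M <- matrix(rnorm(100), nrow = 20)   # a made-up 20 x 5 data matrix
all.equal(svd(M)$d^2, eigen(t(M) %*% M, symmetric = TRUE)$values)        # TRUE
range(abs(svd(M)$v) - abs(eigen(t(M) %*% M, symmetric = TRUE)$vectors))  # ~0 (signs may flip)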

_from [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis):_

Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering . . . factor analysis, Eckart–Young theorem (Harman, 1960), or Schmidt–Mirsky theorem in psychometrics, empirical orthogonal functions (EOF) in meteorological science, empirical eigenfunction decomposition (Sirovich, 1987), empirical component analysis (Lorenz, 1956), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics.

Factor Analysis differs from PCA in that Factor Analysis tests an a priori model, while PCA makes no assumptions about the underlying structure of the data. The other methods mentioned may also differ in minor ways from what follows, but they are all mathematically related.

Why so many names for the same thing?

Principal Components Analysis (PCA)

PCA reveals a fundamental mathematical relationship between variables.

It has many uses, but the ones we are most interested in are:

  • transforming collinear variables into orthogonal factors
  • dimensionality reduction

Side note: another, arguably better way to orthogonalize vectors is to use Independent Component Analysis (ICA). The main problem with ICA is that you need to know how many independent components there are to begin with. For that reason, its utility is limited and we will focus on PCA.

Dimensionality Reduction

What is dimensionality reduction?

What are the motivations for dimensionality reduction?

What are different ways to perform dimensionality reduction?

Q: What is dimensionality reduction?

A: A set of techniques for reducing the size (in terms of features, records, and/or bytes) of the dataset under examination.

In general, the idea is to regard the dataset as a matrix and to decompose the matrix into simpler, meaningful pieces.

Dimensionality reduction is frequently performed as a pre-processing step before another learning algorithm is applied.

Q: What are the motivations for dimensionality reduction?

The number of features in our dataset can be difficult to manage, or even misleading (e.g. if the relationships are actually simpler than they appear).

  • reduce computational expense
  • reduce susceptibility to overfitting
  • reduce noise in the dataset
  • enhance our intuition

Q: What are different ways to perform dimensionality reduction?

A: There are two approaches: feature selection and feature extraction.

feature selection – selecting a subset of features using an external criterion (filter) or the learning algorithm accuracy itself (wrapper)

feature extraction – mapping the features to a lower dimensional space

Feature selection is important, but typically when people say dimensionality reduction, they are referring to feature extraction.

The goal of feature extraction is to create a new set of coordinates that simplify the representation of the data.

This is what we are doing with PCA.

How does it work?

We want to find a new matrix $P$ so that, applied to the original data $X$, the covariance matrix of $PX$ is diagonal (i.e. the resulting vectors are all orthogonal to one another) and the entries on the diagonal are in descending order. In the process, we:

  • Maximize the variance.
  • Minimize the projection error.
$$ P_{m\times m}X_{m\times n} = Y_{m\times n} $$

If this works out, $Y_{m\times n}$ is our new data set! (A quick numerical sketch of this follows the notes below.)

  • N.B.: The X here is the transpose of a typical design matrix (variables in rows, observations in columns).
    • See the Shlens paper for more info.
  • Its goal is to extract the important information from the data and to express this information as a set of new orthogonal variables called principal components.
  • A linear transformation! This is a big assumption.
  • Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?
  • This transformation will become the principal components of X.
  • What does the transformation boil down to?
    • Rotation and scale... so how does that help us?
    • What should our P be doing?
    • What do we want our Y to look like?
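
Here is a minimal numerical sketch of that goal, using simulated data (the variable names mirror the notation above, but the numbers are made up): after applying $P$, the covariance matrix of the new variables is diagonal, with variances in descending order.

set.seed(1)
raw <- matrix(rnorm(50 * 3), ncol = 3) %*% matrix(c(1, 2, 0, 0, 1, 3, 1, 0, 1), 3, 3)  # correlated columns
X <- t(scale(raw))                                   # 3 x 50: standardized variables in rows
P <- t(eigen(X %*% t(X), symmetric = TRUE)$vectors)
Y <- P %*% X
round(cov(t(Y)), 10)                                 # diagonal; entries decrease down the diagonal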

Every dataset has noise and signal... How can we bring out the signal?

$$ SNR = \frac{\sigma^2_{signal}}{\sigma^2_{noise}} $$

Rotate to maximize variance

Heptathlon


In [8]:
%%R
library(GGally)
ggpairs(heptathlon[,1:7])


Loading required package: reshape
Loading required package: plyr

Attaching package: ‘reshape’

The following objects are masked from ‘package:plyr’:

    rename, round_any

Principal Components

The hard way:


In [9]:
%%R
# transpose and scale the original data
X <- t(scale(heptathlon[,1:7]))

# Calculate P
A <- X %*% t(X)
E <- eigen(A, symmetric=TRUE)
P <- t(E$vectors)

# Find the new data and standard deviations of the principal components
newdata <- P %*% X
sdev <-  sqrt(diag((1/(dim(X)[2]-1)* P %*% A %*% t(P))))
print(sdev)
print(E$vectors)


[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617
           [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
[1,] -0.4528710  0.15792058 -0.04514996 -0.02653873  0.09494792 -0.78334101
[2,] -0.3771992  0.24807386 -0.36777902 -0.67999172 -0.01879888  0.09939981
[3,] -0.3630725 -0.28940743  0.67618919 -0.12431725 -0.51165201 -0.05085983
[4,] -0.4078950 -0.26038545  0.08359211  0.36106580  0.64983404  0.02495639
[5,] -0.4562318  0.05587394  0.13931653 -0.11129249  0.18429810  0.59020972
[6,] -0.0754090 -0.84169212 -0.47156016 -0.12079924 -0.13510669 -0.02724076
[7,] -0.3749594  0.22448984 -0.39585671  0.60341130 -0.50432116  0.15555520
            [,7]
[1,] -0.38024707
[2,]  0.43393114
[3,]  0.21762491
[4,]  0.45338483
[5,] -0.61206388
[6,] -0.17294667
[7,]  0.09830963

Principal Components

The easy way:


In [10]:
%%R
heptathlon_pca <- prcomp(heptathlon[,1:7], scale = T) # don't forget to set scale = TRUE
heptathlon_pca


Standard deviations:
[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617

Rotation:
                PC1         PC2         PC3         PC4         PC5         PC6
hurdles  -0.4528710  0.15792058 -0.04514996  0.02653873 -0.09494792 -0.78334101
highjump -0.3771992  0.24807386 -0.36777902  0.67999172  0.01879888  0.09939981
shot     -0.3630725 -0.28940743  0.67618919  0.12431725  0.51165201 -0.05085983
run200m  -0.4078950 -0.26038545  0.08359211 -0.36106580 -0.64983404  0.02495639
longjump -0.4562318  0.05587394  0.13931653  0.11129249 -0.18429810  0.59020972
javelin  -0.0754090 -0.84169212 -0.47156016  0.12079924  0.13510669 -0.02724076
run800m  -0.3749594  0.22448984 -0.39585671 -0.60341130  0.50432116  0.15555520
                 PC7
hurdles   0.38024707
highjump -0.43393114
shot     -0.21762491
run200m  -0.45338483
longjump  0.61206388
javelin   0.17294667
run800m  -0.09830963
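
A quick consistency check (assuming the two cells above have been run in this session): the prcomp() results agree with the "hard way", up to sign flips in some columns.

all.equal(heptathlon_pca$sdev, sdev, check.attributes = FALSE)  # TRUE
range(abs(heptathlon_pca$rotation) - abs(E$vectors))            # both values ~0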

In [11]:
%%R 
ggpairs(heptathlon_pca$x)


Notice the scale on each principal component.
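
One way to see this numerically (continuing with heptathlon_pca from above): the standard deviation of each column of the scores equals the sdev reported by prcomp().

rbind(sd = apply(heptathlon_pca$x, 2, sd), sdev = heptathlon_pca$sdev)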

Percent Variance Accounted For


In [12]:
%R print(summary(heptathlon_pca))


Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6    PC7
Standard deviation     2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
Cumulative Proportion  0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000
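
The proportions reported by summary() can be recomputed directly from the component standard deviations (each component's variance over the total variance):

round(heptathlon_pca$sdev^2 / sum(heptathlon_pca$sdev^2), 4)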

Visually:


In [13]:
%R plot(heptathlon_pca)


Recall: How we fix multicollinearity with polynomials

Replace the correlated predictors with uncorrelated predictors

$$ y = \alpha + \beta_1 f_1(x) + \beta_2 f_2(x^2) + \ldots + \beta_n f_n(x^n) + \epsilon $$

Technical Note: These polynomial functions form an orthogonal basis of the function space.

Where do Principal Components Come From?

$$ PC_1 = W_{1,1} \times x_1 + W_{1,2} \times x_2 + \cdots + W_{1,n} \times x_n \\ PC_2 = W_{2,1} \times x_1 + W_{2,2} \times x_2 + \cdots + W_{2,n} \times x_n \\ \vdots \\ PC_n = W_{n,1} \times x_1 + W_{n,2} \times x_2 + \cdots + W_{n,n} \times x_n $$

...actually more like:

$$ PC_1 = W_{1,1} \times \frac{x_1-\bar{x_1}}{\sigma_{x_1}} + W_{1,2} \times \frac{x_2-\bar{x_2}}{\sigma_{x_2}} + \cdots + W_{1,n} \times \frac{x_n-\bar{x_n}}{\sigma_{x_n}} \\ PC_2 = W_{2,1} \times \frac{x_1-\bar{x_1}}{\sigma_{x_1}} + W_{2,2} \times \frac{x_2-\bar{x_2}}{\sigma_{x_2}} + \cdots + W_{2,n} \times \frac{x_n-\bar{x_n}}{\sigma_{x_n}} \\ \vdots \\ PC_n = W_{n,1} \times \frac{x_1-\bar{x_1}}{\sigma_{x_1}} + W_{n,2} \times \frac{x_2-\bar{x_2}}{\sigma_{x_2}} + \cdots + W_{n,n} \times \frac{x_n-\bar{x_n}}{\sigma_{x_n}} $$

Rotations (or Loadings)

These weights are called "rotations" because they rotate the vector space. (They are also called "loadings.")
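
One property worth checking (a quick sketch using heptathlon_pca from above): the columns of the rotation matrix are orthonormal, so applying it rotates the standardized data without stretching it.

round(t(heptathlon_pca$rotation) %*% heptathlon_pca$rotation, 10)  # the 7 x 7 identity matrix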


In [14]:
%R print(heptathlon_pca)


Standard deviations:
[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617

Rotation:
                PC1         PC2         PC3         PC4         PC5         PC6
hurdles  -0.4528710  0.15792058 -0.04514996  0.02653873 -0.09494792 -0.78334101
highjump -0.3771992  0.24807386 -0.36777902  0.67999172  0.01879888  0.09939981
shot     -0.3630725 -0.28940743  0.67618919  0.12431725  0.51165201 -0.05085983
run200m  -0.4078950 -0.26038545  0.08359211 -0.36106580 -0.64983404  0.02495639
longjump -0.4562318  0.05587394  0.13931653  0.11129249 -0.18429810  0.59020972
javelin  -0.0754090 -0.84169212 -0.47156016  0.12079924  0.13510669 -0.02724076
run800m  -0.3749594  0.22448984 -0.39585671 -0.60341130  0.50432116  0.15555520
                 PC7
hurdles   0.38024707
highjump -0.43393114
shot     -0.21762491
run200m  -0.45338483
longjump  0.61206388
javelin   0.17294667
run800m  -0.09830963

Notice:

  1. the standard deviation of each principal component.
  2. the magnitude of the loadings, particularly the first two components with regard to javelin.
    Annoying side effect: PCA implementations often reverse signs during the decomposition. Keep this in mind, because it may reverse your interpretation of the results (a sketch of flipping a sign back consistently follows).
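
If a flipped sign makes interpretation awkward, one option is to flip it back in both the rotation and the scores. A minimal sketch, done on a copy (flipped is a hypothetical name) so the cells below are unaffected:

flipped <- heptathlon_pca
flipped$rotation[, 1] <- -flipped$rotation[, 1]  # flip PC1's loadings...
flipped$x[, 1]        <- -flipped$x[, 1]         # ...and its scores, so they stay consistent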

In [15]:
%R print(scale(heptathlon[,1:7]) %*% heptathlon_pca$rotation[,1:2])


                             PC1         PC2
Joyner-Kersee (USA) -4.121447626 -1.24240435
John (GDR)          -2.882185935 -0.52372600
Behmer (GDR)        -2.649633766 -0.67876243
Sablovskaite (URS)  -1.343351210 -0.69228324
Choubenkova (URS)   -1.359025696 -1.75316563
Schulz (GDR)        -1.043847471  0.07940725
Fleming (AUS)       -1.100385639  0.32375304
Greiner (USA)       -0.923173639  0.80681365
Lajbnerova (CZE)    -0.530250689 -0.14632191
Bouraga (URS)       -0.759819024  0.52601568
Wijnsma (HOL)       -0.556268302  1.39628179
Dimitrova (BUL)     -1.186453832  0.35376586
Scheider (SWI)       0.015461226 -0.80644305
Braun (FRG)          0.003774223 -0.71479785
Ruotsalainen (FIN)   0.090747709 -0.76304501
Yuping (CHN)        -0.137225440  0.53724054
Hagger (GB)          0.171128651  1.74319472
Brown (USA)          0.519252646 -0.72696476
Mulliner (GB)        1.125481833  0.63479040
Hautenauve (BEL)     1.085697646  1.84722368
Kytola (FIN)         1.447055499  0.92446876
Geremias (BRA)       2.014029620  0.09304121
Hui-Ing (TAI)        2.880298635  0.66150588
Jeong-Mi (KOR)       2.970118607  0.95961101
Launa (PNG)          6.270021972 -2.83919926

Biplot

A biplot displays the observations (scores) and the original variables (loadings) together, making it particularly useful for comparing two principal components against each other:


In [16]:
%R biplot(heptathlon_pca)


Scaling

For historical reasons (i.e. backwards compatibility with S), prcomp() does not scale inputs by default.

  • Recall: what is scaling?
  • Why is it important? (in other words, why might unscaled variables cause a problem with PCA?)
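
As a reminder, scale() centers each column to mean 0 and rescales it to standard deviation 1, which is effectively what prcomp(..., scale = TRUE) does before the decomposition:

round(colMeans(scale(heptathlon[, 1:7])), 10)  # every column has mean ~0
apply(scale(heptathlon[, 1:7]), 2, sd)         # and standard deviation 1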

In [17]:
%%R
heptathlon_unscaled_pca <- prcomp(heptathlon[,1:7], scale = FALSE)
summary(heptathlon_unscaled_pca)


Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     8.3646 3.5910 1.38570 0.58571 0.32382 0.14712 0.03325
Proportion of Variance 0.8207 0.1513 0.02252 0.00402 0.00123 0.00025 0.00001
Cumulative Proportion  0.8207 0.9720 0.99448 0.99850 0.99973 0.99999 1.00000

For comparison:


In [18]:
%R print(summary(heptathlon_pca))


Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6    PC7
Standard deviation     2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
Cumulative Proportion  0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000

At first glance, this looks better than the scaled version. PC1 accounts for 82% of the variance (while in the scaled PCA, PC1 only accounted for 64%).
Similarly, with only two components, we account for 97% of the variance. In the scaled version, the first four components only got us to 95%.

What could go wrong?


In [19]:
%%R
sapply(heptathlon[,1:7], var)


   hurdles   highjump       shot    run200m   longjump    javelin    run800m 
 0.5426500  0.0060750  2.2257190  0.9400410  0.2248773 12.5716773 68.7421417 

If one wanted to account for the majority of the overall variance, one need only look at run800m.

In fact, that's exactly what the unscaled PCA does:


In [20]:
%R print(heptathlon_unscaled_pca$rotation[,1:5])


                  PC1           PC2         PC3         PC4         PC5
hurdles  -0.069508692  0.0094891417 -0.22180829  0.32737674  0.80702932
highjump -0.005569781  0.0005647147 -0.01451405  0.02123856  0.14013823
shot     -0.077906090  0.1359282330 -0.88374045 -0.42500654 -0.10442207
run200m  -0.072967545  0.1012004268 -0.31005700  0.81585220 -0.46178680
longjump -0.040369299  0.0148845034 -0.18494319  0.20419828  0.31899315
javelin   0.006685584  0.9852954510  0.16021268 -0.03216907  0.04880388
run800m  -0.990994208 -0.0127652701  0.11655815 -0.05827720 -0.02784756

What else could go wrong?

What if the relationship between the variables is nonlinear?

Alternative Uses for PCA

As illustrated by the many names and variations on PCA, this method of analysis has many uses.

e.g. Hanson, S. J., Gagliardi, A. D., & Hanson, C. (2009). Solving the brain synchrony eigenvalue problem: conservation of temporal dynamics (fMRI) over subjects doing the same task. Journal of Cognitive Neuroscience.


Using Roy’s Largest Root Statistic to determine significance of the percent variance accounted for by the first principal component.


PC1 results as a function of complexity and goal directed aspects of the action sequence.

Lab

Use any (or all) of the following on your data:

  • Linear Regression - lm() or glm(family='gaussian')
  • Logistic Regression - glm(family='binomial')
  • K-Nearest Neighbors - knn() in library(class)
  • Naive Bayes - naiveBayes() in library(e1071)
  • Principal Component Analysis - prcomp()

What happens when you put them together? e.g.

  • use knn for imputation of missing category labels and then use naiveBayes or glm on the result
  • use prcomp to reduce and orthogonalize dimensions and then use knn or glm on the result (see the sketch below)
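
For instance, a minimal sketch of the prcomp-then-glm combination; mydata, predictor_cols, and outcome are hypothetical placeholders to be replaced with your own data:

pca <- prcomp(mydata[, predictor_cols], scale = TRUE)    # orthogonalize the predictors
reduced <- data.frame(pca$x[, 1:3], y = mydata$outcome)  # keep, say, the first 3 components
fit <- glm(y ~ ., data = reduced, family = "gaussian")   # regress the outcome on the components
summary(fit)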