Instructor: Alessandro Gagliardi
TA: Kevin Perko
In [1]:
%load_ext rmagic
In [2]:
%%R
library(ggplot2)
In [3]:
%%R
x <- seq(1, 10, 0.1)
print(cor(x^9, x^10))
qplot(x^9, x^10)
In [4]:
%%R -w 960 -h 480 -u px
x <- seq(1, 10, 0.1)
p <- data.frame(rbind(poly(x, 10, raw=T)[,c(9,10)],
poly(x, 10, raw=F)[,c(9,10)]),
raw = rep(c(TRUE, FALSE), each = length(x)))
ggplot(p, aes(X9, X10)) + geom_point() + facet_wrap(~raw, scales="free")
But what about when the multicollinearity comes from correlations between separate variables?
In [5]:
%%R
data("heptathlon", package = "HSAUR")
summary(heptathlon)
from A Handbook of Statistical Analyses Using **R**:
To begin it will help to score all the seven events in the same direction, so that ‘large’ values are ‘good’. We will recode the running events to achieve this.
In [6]:
%%R
heptathlon$hurdles <- max(heptathlon$hurdles) - heptathlon$hurdles
heptathlon$run200m <- max(heptathlon$run200m) - heptathlon$run200m
heptathlon$run800m <- max(heptathlon$run800m) - heptathlon$run800m
In [7]:
%R print(round(cor(heptathlon[,1:7]), 2))
One of these things is not like the other. Do you see it?
Scores in all of the events, with the exception of javelin, are correlated with one another.
javelin is considered a 'technical' event while the others are considered 'power'-based events.
PCA is equivalent to what is known as the singular value decomposition (SVD) of $\bf{X}$ and the eigenvalue decomposition (EVD), a.k.a. the spectral decomposition of $\bf{X^TX}$ in linear algebra.
FYI: SVD and EVD are slightly different and if your linear algebra skills are up to the task, I encourage you to see how they compare, but for our purposes we may treat them as equivalent.
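As a minimal sketch of that connection (using a small random matrix rather than the heptathlon data): the squared singular values from svd() match the eigenvalues that eigen() finds for $\bf{X^TX}$.
# Sketch: relate the SVD of a matrix M to the EVD of t(M) %*% M
set.seed(1)
M <- matrix(rnorm(20), nrow = 5)            # a small 5 x 4 example matrix
s <- svd(M)                                 # singular value decomposition of M
e <- eigen(t(M) %*% M, symmetric = TRUE)    # spectral decomposition of M'M
print(s$d^2)       # squared singular values of M ...
print(e$values)    # ... equal the eigenvalues of M'M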
_from [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis):_
Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering . . . factor analysis, Eckart–Young theorem (Harman, 1960), or Schmidt–Mirsky theorem in psychometrics, empirical orthogonal functions (EOF) in meteorological science, empirical eigenfunction decomposition (Sirovich, 1987), empirical component analysis (Lorenz, 1956), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics.
Factor Analysis differs from PCA in that Factor Analysis tests an a priori model, while PCA makes no assumptions about the underlying structure of the data. Other methods mentioned may also differ in minor ways from what follows, but they are all mathematically related.
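To make that concrete, here is a sketch using base R's factanal (not used elsewhere in this lesson, and assuming the maximum-likelihood fit converges on this small sample): factor analysis makes you commit to a model, here two factors, before fitting.
# Sketch: factor analysis tests an a priori model (here, 2 factors),
# whereas PCA simply re-expresses all seven variables
fa <- factanal(heptathlon[, 1:7], factors = 2)
print(fa$loadings)   # compare with the rotation matrix prcomp produces below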
It has many uses, but the ones we are most interested in are dimensionality reduction and the orthogonalization of correlated features.
Side note: another, arguably better way to orthogonalize vectors is to use Independent Component Analysis (ICA). The main problem with ICA is that you need to know how many independent components there are to begin with. For that reason, its utility is limited and we will focus on PCA.
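For completeness, a sketch of that limitation (assuming the fastICA package, which is not used elsewhere here, is installed): you must commit to n.comp before fitting.
# Sketch: ICA requires choosing the number of components up front
library(fastICA)
ica <- fastICA(scale(heptathlon[, 1:7]), n.comp = 2)  # n.comp must be specified
print(dim(ica$S))   # the two estimated independent components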
Q: What is dimensionality reduction?
A: A set of techniques for reducing the size (in terms of features, records, and/or bytes) of the dataset under examination.
In general, the idea is to regard the dataset as a matrix and to decompose the matrix into simpler, meaningful pieces.
Dimensionality reduction is frequently performed as a pre-processing step before another learning algorithm is applied.
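As a tiny illustration of "decomposing the matrix into simpler pieces" (a sketch on the scaled heptathlon data; keeping two components is an arbitrary choice here):
# Sketch: a rank-2 reconstruction of the scaled data from its SVD pieces
Xs <- scale(heptathlon[, 1:7])
s <- svd(Xs)
X2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])  # keep only the 2 largest pieces
print(mean((Xs - X2)^2))   # how much is lost by the simpler representation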
Q: What are the motivations for dimensionality reduction?
A: The number of features in our dataset can be difficult to manage, or even misleading (e.g. if the relationships are actually simpler than they appear).
Q: What are different ways to perform dimensionality reduction?
A: There are two approaches: feature selection and feature extraction.
feature selection – selecting a subset of features using an external criterion (filter) or the learning algorithm accuracy itself (wrapper)
feature extraction – mapping the features to a lower dimensional space
Feature selection is important, but typically when people say dimensionality reduction, they are referring to feature extraction.
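To make the distinction concrete, a minimal sketch of each on the heptathlon data (the 0.5 correlation cutoff is arbitrary, and prcomp is previewed here before we derive it below):
# Feature selection: keep a subset of the original columns (a simple filter)
cors <- cor(heptathlon[, 1:7], heptathlon$score)[, 1]
selected <- heptathlon[, 1:7][, abs(cors) > 0.5]
print(colnames(selected))    # original variables survive, the rest are dropped

# Feature extraction: map all seven columns into a lower-dimensional space
extracted <- prcomp(heptathlon[, 1:7], scale = TRUE)$x[, 1:2]
print(colnames(extracted))   # new, derived coordinates (PC1, PC2)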
The goal of feature extraction is to create a new set of coordinates that simplify the representation of the data.
This is what we are doing with PCA.
We want to find a new matrix $P$ from the original data $X$ so that the covariance matrix of $Y = PX$ is diagonal (i.e. the resulting vectors are all orthogonal to one another) and the entries on the diagonal are in descending order.
If this works out, $Y_{m\times n}$ is our new data set!
Every dataset has noise and signal... How can we bring out the signal?
In [8]:
%%R
library(GGally)
ggpairs(heptathlon[,1:7])
In [9]:
%%R
# transpose and scale the original data
X <- t(scale(heptathlon[,1:7]))
# Calculate P
A <- X %*% t(X)
E <- eigen(A, symmetric=TRUE)
P <- t(E$vectors)
# Find the new data and standard deviations of the principal components
newdata <- P %*% X
sdev <- sqrt(diag((1/(dim(X)[2]-1)* P %*% A %*% t(P))))
print(sdev)
print(E$vectors)
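As a quick check on the claim above (a sketch using the objects just defined), the covariance matrix of the rotated data should be diagonal, with the squared standard deviations on the diagonal:
# Off-diagonal entries should be (numerically) zero; the diagonal equals sdev^2
print(round(cov(t(newdata)), 4))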
In [10]:
%%R
heptathlon_pca <- prcomp(heptathlon[,1:7], scale = T) # don't forget to set scale = TRUE
heptathlon_pca
In [11]:
%%R
ggpairs(heptathlon_pca$x)
Notice the scale on each principal component.
In [12]:
%R print(summary(heptathlon_pca))
In [13]:
%R plot(heptathlon_pca)
...actually more like:
$$ PC_1 = W_{1,1} \times \frac{x_1-\bar{x_1}}{\sigma_{x_1}} + W_{1,2} \times \frac{x_2-\bar{x_2}}{\sigma_{x_2}} + \cdots + W_{1,n} \times \frac{x_n-\bar{x_n}}{\sigma_{x_n}} \\ PC_2 = W_{2,1} \times \frac{x_1-\bar{x_1}}{\sigma_{x_1}} + W_{2,2} \times \frac{x_2-\bar{x_2}}{\sigma_{x_2}} + \cdots + W_{2,n} \times \frac{x_n-\bar{x_n}}{\sigma_{x_n}} \\ \vdots \\ PC_n = W_{n,1} \times \frac{x_1-\bar{x_1}}{\sigma_{x_1}} + W_{n,2} \times \frac{x_2-\bar{x_2}}{\sigma_{x_2}} + \cdots + W_{n,n} \times \frac{x_n-\bar{x_n}}{\sigma_{x_n}} $$
In [14]:
%R print(heptathlon_pca)
Notice javelin: its loading on the first principal component is much smaller than those of the other events, consistent with the correlation matrix above.
In [15]:
%R print(scale(heptathlon[,1:7]) %*% heptathlon_pca$rotation[,1:2])
In [16]:
%R biplot(heptathlon_pca)
In [17]:
%%R
heptathlon_unscaled_pca <- prcomp(heptathlon[,1:7], scale = FALSE)
summary(heptathlon_unscaled_pca)
For comparison:
In [18]:
%R print(summary(heptathlon_pca))
At first glance, this looks better than the scaled version. PC1 accounts for 82% of the variance (while in the scaled PCA, PC1 only accounted for 64%).
Similarly, with only two components, we account for 97% of the variance. In the scaled version, the first four components only got us to 95%.
In [19]:
%%R
sapply(heptathlon[,1:7], var)
If one wanted to account for the majority of the overall variance, one need only look at run800m.
In fact, that's exactly what the unscaled PCA does:
In [20]:
%R print(heptathlon_unscaled_pca$rotation[,1:5])
What if the relationship between the variables is nonlinear?
e.g. Hanson, S. J., Gagliardi, A. D., & Hanson, C. (2009). Solving the brain synchrony eigenvalue problem: conservation of temporal dynamics (fMRI) over subjects doing the same task. Journal of Cognitive Neuroscience.
Roy's Largest Root statistic is used to determine the significance of the percent variance accounted for by the first principal component.
PC1 results as a function of complexity and goal-directed aspects of the action sequence.
What happens when you put them together? e.g.
knn for imputation of missing category labels, and then naiveBayes or glm on the result
prcomp to reduce and orthogonalize dimensions, and then knn or glm on the result
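As a minimal sketch of the second combination (using heptathlon_pca from above; glm with its default gaussian family is just ordinary least squares here):
# Orthogonalize with prcomp, then model the total score on the first two components
pc_df <- data.frame(heptathlon_pca$x[, 1:2], score = heptathlon$score)
fit <- glm(score ~ PC1 + PC2, data = pc_df)
print(summary(fit))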