PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is performed either using the EVD of the covariance matrix or the SVD of the mean-centered data matrix.
The reader should be familiar with linear algebra and statistics concepts.
The reader should be able to perform PCA on a given data set.
For more details see
A data matrix is a matrix $X\in\mathbb{R}^{m\times n}$, where each column corresponds to a feature (say, a certain gene), and each row corresponds to an observation (say, an individual).
A mean of a vector $x\in\mathbb{R}^{n}$ is $\mu(x)=\displaystyle \frac{x_1+x_2+\cdots +x_n}{n}$.
A standard deviation of a vector $x$ is $\sigma(x)=\displaystyle \sqrt{\frac{\sum_{i=1}^n (x_i-\mu(x))^2}{n-1}}$. A variance of a vector $x$ is $\mathop{\mathrm{var}}(x)=\sigma^2(x)$.
A vector of means of a data matrix $X$ is a row-vector of means of the columns of $X$, $\mu(X)=\begin{bmatrix}\mu(X_{:,1}) & \mu(X_{:,2}) & \ldots & \mu(X_{:,n})\end{bmatrix}$.
A zero-mean centered data matrix is a matrix $\bar X$ obtained from a data matrix $X$ by subtracting from each column the mean of this column, $$ \bar X= \begin{bmatrix} X_{:,1}-\mu(X_{:,1}) & X_{:,2}-\mu(X_{:,2}) & \cdots & X_{:,n}-\mu(X_{:,n}) \end{bmatrix}\equiv X-\mathbf{1}\mu(X), $$ where $\mathbf{1}=[1,1,\ldots,1]^T$.
A covariance matrix of a data matrix $X$ is the matrix $$ \mathop{\mathrm{cov}}(X)=\displaystyle \frac{1}{m-1}[X-\mathbf{1}\mu(X)]^T[X-\mathbf{1}\mu(X)] \equiv \frac{\bar X^T \bar X}{m-1}, $$ where $m$ is the number of observations (rows).
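As a quick check, these definitions agree with the mean() and cov() functions from Julia's Statistics standard library. A minimal sketch on a small toy matrix (the matrix is chosen purely for illustration):

using Statistics
A=[1.0 2.0; 3.0 5.0; 5.0 8.0]    # toy data matrix: 3 observations, 2 features
μA=mean(A,dims=1)                 # vector of means μ(A)
Ā=A.-μA                          # zero-mean centered data matrix
C=Ā'*Ā/(size(A,1)-1)             # covariance matrix from the definition
C≈cov(A)                          # true - agrees with Statistics.cov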
Given a data matrix $X$, let $\mathop{\mathrm{cov}}(X)=U\Lambda U^T$ be the EVD with non-increasingly ordered eigenvalues, $\lambda_1\geq \lambda_2\geq \cdots \geq \lambda_n$.
$\mathop{\mathrm{cov}}(X)$ is a symmetric PSD matrix.
$\mathop{\mathrm{cov}}(X)=\mathop{\mathrm{cov}}(\bar X)$.
Let $T=\bar X U$. The columns of $T$ are the principal components of $\bar X$. In particular:
Let $\bar X=\bar U \Sigma V^T$ be the SVD of $\bar X$. Then $V=U$ and $T=\bar U\Sigma V^T V\equiv \bar U \Sigma$.
Reconstruction of the data from the principal components is $X=TU^T+\mathbf{1}\mu(X)$; keeping only the first $k$ principal components gives the partial reconstruction $X\approx T_{:,1:k}U_{:,1:k}^T+\mathbf{1}\mu(X)$.
Partial reconstructions can help in obtaining various insights about the data. For example, the rows of the matrix $T_{:,1:k}$ can be clustered by the $k$-means algorithm, and the points defined by the first three columns of $T$ can be plotted to visualize the projections of the clusters. Afterwards, the computed clusters can be mapped back to the original data.
A heuristic guess for the number of important clusters is given by the location of the "knee" in the plot of the singular values of $\bar X$.
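Before applying these facts to generated and real data below, here is a minimal sketch on a small random matrix (the size and seed are chosen purely for illustration) which verifies them numerically: the eigenvalues of $\mathop{\mathrm{cov}}(X)$ satisfy $\lambda_i=\sigma_i^2/(m-1)$, where $\sigma_i$ are the singular values of $\bar X$, the principal components $T=\bar X U$ coincide with $\bar U\Sigma$ up to the ordering and signs of the columns, and $TU^T+\mathbf{1}\mu(X)$ reconstructs $X$:

using LinearAlgebra, Random, Statistics
Random.seed!(123)
m,n=10,4
X=rand(m,n)
μ=mean(X,dims=1)
X̄=X.-μ
λ,U=eigen(cov(X))        # eigenvalues in non-decreasing order
F=svd(X̄)                # singular values in non-increasing order
# λᵢ = σᵢ²/(m-1), after matching the orderings
λ[end:-1:1]≈F.S.^2/(m-1)
# T = X̄*U equals Ū*Σ up to column order and signs
T=X̄*U[:,end:-1:1]
abs.(T)≈abs.(F.U*Diagonal(F.S))
# exact reconstruction of the data
T*U[:,end:-1:1]'.+μ≈X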
In [1]:
# Generate data points
using Random
Random.seed!(456)
n=3
m=500
ax=[8,3,1]
X=Array{Float64}(undef,m,n)
# Random scales for each coordinate, bounded by the semi-axes 8, 3 and 1
for i=1:n
    X[:,i]=rand(m)*ax[i]
end
# Parameters - random spherical angles
u=(rand(m).-0.5)*π
v=(rand(m).-0.5)*2*π
for i=1:m
    X[i,1]=X[i,1]*cos(u[i])*cos(v[i])
    X[i,2]=X[i,2]*cos(u[i])*sin(v[i])
    X[i,3]=X[i,3]*sin(u[i])
end
In [2]:
X₀=copy(X)
Out[2]:
In [3]:
using LinearAlgebra
# Column-wise sums of absolute values - a rough check of the relative coordinate scales
sum(abs.(X₀),dims=1)
Out[3]:
In [4]:
using PyPlot
In [5]:
plot3D(X₀[:,1],X₀[:,2],X₀[:,3],"+")
Out[5]:
In [6]:
# Compute the means. How good is the RNG?
using Statistics
μ₀=mean(X,dims=1)
Out[6]:
In [7]:
# Subtract the means
using LinearAlgebra
X=X.-μ₀
# Rotate by a random orthogonal matrix
Q,r=qr(rand(3,3))
X=X*Q;
In [8]:
# Translate
S=[3,-2,4]
Out[8]:
In [9]:
X.+=S'
# plot3D(X0[:,1],X0[:,2],X0[:,3],"b+")
plot3D(X[:,1],X[:,2],X[:,3],"r+" )
Out[9]:
In [10]:
C=cov(X)
Out[10]:
In [11]:
μ=mean(X,dims=1)
Out[11]:
In [12]:
# Fact 2
display(cov(X.-μ))
(X.-μ)'*(X.-μ)/(m-1)
Out[12]:
In [13]:
# EVD of the covariance matrix - eigen() returns the eigenvalues in non-decreasing
# order, so the largest principal components correspond to the last columns of U
λ,U=eigen(C)
Out[13]:
In [14]:
# Largest principal component
T₁=(X.-μ)*U[:,3]
subplot(221)
plot(T₁,zero(T₁),"rx")
subplot(222)
plot(T₁,"b+")
Out[14]:
In [15]:
T₁
Out[15]:
In [16]:
# Two largest principal components
T₂=(X.-μ)*U[:,[3,2]]
axis("equal")
plot(T₂[:,1],T₂[:,2],"r+")
Out[16]:
In [17]:
# All three principal components
T=(X.-μ)*U[:,[3,2,1]]
plot3D(T[:,1],T[:,2],T[:,3],"rx" )
Out[17]:
In [18]:
# Fact 5 - Recovery of the largest component
Y₁=T₁*U[:,3]'.+μ
plot3D(Y₁[:,1],Y₁[:,2],Y₁[:,3],"rx" )
Out[18]:
In [19]:
# Recovery of the two largest components
Y₂=T₂*U[:,[3,2]]'.+μ
plot3D(Y₂[:,1],Y₂[:,2],Y₂[:,3],"rx" )
Out[19]:
In [20]:
# Recovery of all three components (exact)
Y₃=T*U[:,[3,2,1]]'.+μ
plot3D(Y₃[:,1],Y₃[:,2],Y₃[:,3],"r+" )
plot3D(X[:,1],X[:,2],X[:,3],"b+" )
Out[20]:
In [21]:
# Fact 4 - PCA using SVD
function myPCA(X::Array{T}, k::Int) where T
    # Center the columns and return the first k principal components Ū*Σ (Fact 4)
    μ=mean(X,dims=1)
    U=svd(X.-μ)
    U.U[:,1:k]*Diagonal(U.S[1:k])
end
Out[21]:
In [22]:
# The largest component using SVD (the sign may differ from the EVD-based T₁)
T₁s=myPCA(X,1)
[T₁s T₁]
Out[22]:
In [23]:
# The two largest components using SVD
T₂s=myPCA(X,2)
[T₂s T₂]
Out[23]:
We will cluster three datasets from Workshop "Principal Manifolds-2006":
Data set | # of genes (m) | # of samples (n) |
---|---|---|
D1 | 17816 | 286 |
D2 | 3036 | 40 |
D3 | 10383 | 103 |
In [24]:
# Data set D1
using DelimitedFiles
f=readdlm("files/d1.txt")
Out[24]:
In [25]:
sizeof(f)
Out[25]:
In [27]:
X=map(Float64,f[2:end,2:end])
Out[27]:
In [31]:
sizeof(X)
Out[31]:
In [32]:
# Clear a big variable
f=[]
Out[32]:
In [33]:
minimum(μ)
Out[33]:
In [34]:
# Fact 7 - Plot σ's and observe the knee
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)
Out[34]:
In [35]:
# PCA on X, keep 20 singular values
k=20
T=myPCA(X,k)
Out[35]:
In [36]:
# Find k clusters
using Clustering
out=kmeans(T',k)
Out[36]:
In [37]:
# Plot points defined by the first three principal components
function myPlot(C::Vector, T::Array, k::Int)
    P=Array{Any}(undef,k)
    for j=1:k
        P[j]=T[C.==j,1:3]
        plot3D(P[j][:,1],P[j][:,2],P[j][:,3],"x")
    end
end
Out[37]:
In [38]:
myPlot(out.assignments,T,k)
In [39]:
# Data set D2
X=readdlm("files/d2fn.txt")
Out[39]:
In [40]:
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)
Out[40]:
In [41]:
k=4
T=myPCA(X,k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)
We use the package CSV.jl and replace the "NULL" string with missing.
In [42]:
using CSV, DataFrames
In [43]:
sf = CSV.read("files/d3efn.txt", DataFrame, header=0, missingstring="NULL")
Out[43]:
In [44]:
# sf[:,103]
In [45]:
typeof(sf)
Out[45]:
In [46]:
# Set missing values to zero
X=coalesce.(Matrix(sf[:,1:end-1]),0.0)
Out[46]:
In [47]:
μ=mean(X,dims=1)
σ=svdvals(X.-μ)
plot(σ)
Out[47]:
In [48]:
k=4
T=myPCA(X,k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)
In [49]:
# Clustering the transpose
k=20
T=myPCA(Matrix(X'),k)
out=kmeans(T',k)
myPlot(out.assignments,T,k)