Principal Component Analysis (PCA) is widely used in Machine Learning pipelines as a means to compress data or to aid visualization. This notebook walks through the basic idea behind PCA and builds the algorithm from scratch in Python.
Before diving into PCA, let's first talk about a few important concepts - "eigenvectors & eigenvalues" and "Singular Value Decomposition (SVD)".
An eigenvector of a square matrix is a column vector that satisfies:
$$Av=\lambda v$$where $A$ is an $[n\times n]$ square matrix, $v$ is an $[n\times 1]$ eigenvector, and $\lambda$ is a scalar value also known as the eigenvalue.
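To make this concrete, we can verify the relationship numerically with NumPy. This is a minimal sketch using an arbitrary symmetric matrix chosen only for illustration:

import numpy as np

# an arbitrary 2x2 symmetric matrix, used only to illustrate Av = lambda * v
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# each column of `eigenvectors` is an eigenvector v paired with eigenvalues[i]
for i in range(A.shape[0]):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(np.allclose(A.dot(v), lam * v))  # True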
If A is both a square and symmetric matrix (like a typical variance-covariance matrix), then we can write A as:
$$A=U\Sigma U^T$$Here the columns of matrix $U$ are the eigenvectors of $A$, and $\Sigma$ is a diagonal matrix containing the corresponding eigenvalues.
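Since the covariance matrices we will work with are symmetric and positive semi-definite, np.linalg.svd recovers exactly this decomposition. A minimal sketch, reusing the illustrative matrix from above:

import numpy as np

# the same illustrative symmetric (positive semi-definite) matrix as above
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# for a symmetric PSD matrix, the SVD coincides with the eigendecomposition
U, S, Vt = np.linalg.svd(A)

print(np.allclose(A, U.dot(np.diag(S)).dot(U.T)))  # A == U Sigma U^T, prints True
print(S)  # the eigenvalues of A, sorted in descending order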
This is a special case of the well-known Singular Value Decomposition (SVD), in which any rectangular matrix $M$ can be expressed as:
$$M=U\Sigma V^T$$The goal of PCA is to find a lower-dimensional surface onto which to project the data so that the total variance of the projection is maximized, or equivalently, the projection error is minimized. The entire algorithm can be summarized as follows:
1) Given a data matrix $X$ with $m$ rows (number of records) and $n$ columns (number of dimensions), we first subtract the column mean for each dimension.
2) Then we can calculate the variance-covariance matrix ($X$ here already has zero mean for each column from step 1): $$cov=\frac{1}{m}X^TX$$
3) We can then use SVD to compute the eigenvectors and corresponding eigenvalues of the above covariance matrix $cov$: $$cov=U\Sigma V^T$$where the columns of $U$ are the eigenvectors and the diagonal entries of $\Sigma$ are the corresponding eigenvalues.
4) If our target dimension is $p$ ($p<n$), then we will select the first $p$ columns of the $U$ matrix and get matrix $U_{reduce}$.
5) To get the compressed data set, we can do the transformation as below: $$X_{reduce}=X\,U_{reduce}$$
6) To approximate the original data set given the compressed data, we can use: $$X_{approx}=X_{reduce}\,U_{reduce}^T$$
Note this works because the columns of $U_{reduce}$ are orthonormal unit eigenvectors, so $U_{reduce}^TU_{reduce}=I$ and $U_{reduce}^T$ acts as the inverse of $U_{reduce}$ on the projected space.
7) Finally, the proportion of variance retained by the first $p$ components is: $$\frac{\sum_{i=1}^{p}S_{ii}}{\sum_{i=1}^{n}S_{ii}}$$where $S_{ii}$ is the $i$th diagonal element of the $\Sigma$ matrix, $p$ is the number of reduced dimensions, and $n$ is the dimension of the original data. A compact NumPy sketch of these steps is shown below, before we wrap them into a class.
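Here is a minimal end-to-end sketch of the procedure on a small random matrix; the toy data and the target dimension of 2 are arbitrary choices for illustration:

import numpy as np

np.random.seed(0)
X_toy = np.random.rand(100, 4)                       # toy data: 100 records, 4 dimensions
p = 2                                                # target dimension

X_centered = X_toy - np.mean(X_toy, axis=0)          # step 1: subtract column means
cov = X_centered.T.dot(X_centered) / X_toy.shape[0]  # step 2: covariance matrix
U, S, _ = np.linalg.svd(cov)                         # step 3: SVD of the covariance
U_reduce = U[:, :p]                                  # step 4: first p eigenvectors
X_reduce = X_centered.dot(U_reduce)                  # step 5: compressed data
X_approx = X_reduce.dot(U_reduce.T)                  # step 6: approximation (still centered)
retained = np.sum(S[:p]) / np.sum(S)                 # step 7: retained variance ratio
print(X_reduce.shape, round(retained, 4))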
In [1]:
import numpy as np

class PCA():
    def __init__(self, num_components):
        self.num_components = num_components
        self.U = None
        self.S = None
        self.mean = None

    def fit(self, X):
        # center the data, then run SVD on its covariance matrix
        m = X.shape[0]
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean  # avoid modifying the caller's array in place
        cov = X_centered.T.dot(X_centered) * 1.0 / m
        self.U, self.S, _ = np.linalg.svd(cov)
        return self

    def project(self, X):
        # project the data onto the first num_components principal directions
        U_reduce = self.U[:, :self.num_components]
        X_reduce = (X - self.mean).dot(U_reduce)
        return X_reduce

    def inverse(self, X_reduce):
        # recover an approximation of the original data from the reduced form
        U_reduce = self.U[:, :self.num_components]
        X_approx = X_reduce.dot(U_reduce.T) + self.mean
        return X_approx

    def explained_variance(self):
        # ratio of variance retained by the selected components
        explained = np.sum(self.S[:self.num_components])
        total = np.sum(self.S)
        return explained * 1.0 / total
We will use the Iris dataset, as always.
In [2]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris['data']
y = iris['target']
print(X.shape)
We can see that the original $X$ matrix has 4 dimensions. We can compress it to 2 with the PCA() class that we defined above.
In [3]:
pca = PCA(num_components=2)
pca.fit(X)
X_reduce = pca.project(X)
print(X_reduce.shape)
Now that the data has been compressed, we can check the retained variance ratio.
In [4]:
print "{:.2%}".format(pca.explained_variance())
We have 97.76% of the variance retained, which is fine for data visualization purposes. But if we used PCA in a supervised learning pipeline, we might want to add more dimensions so that more than 99% of the variance of the original data is kept.
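As a quick illustration, one way to choose the number of components is to scan the cumulative sum of the singular values and keep the smallest $p$ that reaches a chosen threshold. The helper below is a hypothetical sketch, not part of the PCA class above; it assumes pca.S holds the singular values in descending order (as returned by np.linalg.svd):

def smallest_num_components(S, threshold=0.99):
    # cumulative fraction of variance retained by the first 1, 2, ... components
    ratios = np.cumsum(S) / np.sum(S)
    # index of the first ratio reaching the threshold, converted to a 1-based count
    return int(np.argmax(ratios >= threshold)) + 1

print(smallest_num_components(pca.S, threshold=0.99))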
Finally, with the compressed dimensions, we can plot the distribution of the Iris dataset.
In [5]:
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)
from matplotlib import pyplot as plt
for c, marker, class_num in zip(['green', 'r', 'cyan'], ['o', '^', 's'], np.unique(y)):
    plt.scatter(x=X_reduce[:, 0][y == class_num], y=X_reduce[:, 1][y == class_num], c=c, marker=marker,
                label="Class {}".format(class_num), alpha=0.7, s=30)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.legend()
plt.show()
From the above example, we can see that PCA helps us visualize data with more than 3 feature dimensions. The general use of PCA is dimensionality reduction in Machine Learning pipelines: it can speed up the learning process and save memory when running supervised and unsupervised algorithms on large datasets. However, it also throws away some information when reducing the feature dimension. Thus it is always worth testing the pipeline both with and without PCA, since PCA is easy to set up.