PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. It is an unsupervised learning technique, which means we don't have a target variable.
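Concretely, for a mean-centered data matrix $X$ with $n$ rows (observations) and $d$ columns (features), PCA diagonalizes the sample covariance matrix: the eigenvectors give the new axes (the principal components) and the eigenvalues give the variance along each of them. The cells below walk through exactly this recipe, first by hand and then with scikit-learn; as a sketch of the math being computed:
$$ C = \frac{1}{n-1} X^\top X, \qquad C\,\vec{v}_j = \lambda_j \vec{v}_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d $$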
In [1]:
import numpy as np
import pandas as pd
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)
In [3]:
np.random.seed(123)
In [4]:
a = np.arange(12, 56, 0.5)
e = np.random.normal(0, 100, a.size)
b = 500 + 20*a + e
In [5]:
X = np.c_[a,b]
In [6]:
def plot2var(m, xlabel, ylabel):
    x = m[:, 0]
    y = m[:, 1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s=40, alpha=0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
In [7]:
plot2var(X, 'a', 'b')
In [8]:
X_mean = np.mean(X, axis=0)
In [9]:
X_mean
Out[9]:
In [10]:
X_sd = np.std(X, axis=0)
In [11]:
X_sd
Out[11]:
In [12]:
X_std = np.subtract(X, X_mean) / X_sd
In [13]:
def plot2var_std(m, xlabel, ylabel):
    x = m[:, 0]
    y = m[:, 1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s=40, alpha=0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
In [14]:
plot2var_std(X_std, "a", "b")
In [15]:
cov_mat_2var = np.cov(X_std.T)
In [16]:
cov_mat_2var
Out[16]:
So now this is the symmetric matrix $A$ whose eigenvalue problem we want to solve:
$$ Ax = \lambda x $$ where $$ A \approx \begin{bmatrix} 1.01 & 0.92 \\ 0.92 & 1.01 \end{bmatrix} $$
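As a quick check, the eigenvalues of this $2 \times 2$ matrix follow directly from the characteristic equation (using the rounded entries above, so the numbers are approximate):
$$ \det(A - \lambda I) = (1.01 - \lambda)^2 - 0.92^2 = 0 \;\Rightarrow\; \lambda = 1.01 \pm 0.92 $$which gives $\lambda_1 \approx 1.93$ and $\lambda_2 \approx 0.09$, matching what `np.linalg.eig` returns below.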
In [17]:
eigen_val_2var, eigen_vec_2var = np.linalg.eig(cov_mat_2var)
In [18]:
eigen_val_2var
Out[18]:
In [19]:
eigen_vec_2var
Out[19]:
In [20]:
# dot product of the two eigenvectors (the *columns* of eigen_vec_2var) -- should be ~0
eigen_vec_2var[:, 0].dot(eigen_vec_2var[:, 1])
Out[20]:
So our eigenvalues and eigenvectors are: $$ \lambda_1 = 1.93, \quad \lambda_2 = 0.09 $$
$$ \vec{v_1} = \begin{bmatrix} 0.707 \\ 0.707\end{bmatrix} $$$$ \vec{v_2} = \begin{bmatrix} 0.707 \\ -0.707\end{bmatrix} $$(the overall sign of each eigenvector is arbitrary). These are orthogonal to each other. Let us plot the data together with these eigenvectors.
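Before plotting, a quick sanity check (a small added sketch): for a symmetric matrix with distinct eigenvalues, the normalized eigenvectors returned as the columns of `eigen_vec_2var` are orthonormal, so $V^\top V$ should be numerically the identity.
In [ ]:
# the eigenvector matrix should be orthonormal: V^T V ~ identity
print(eigen_vec_2var.T.dot(eigen_vec_2var))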
In [21]:
def plot2var_eigen(m, xlabel, ylabel):
    x = m[:, 0]
    y = m[:, 1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s=40, alpha=0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
    cov_mat = np.cov(m.T)
    eigen_val, eigen_vec = np.linalg.eig(cov_mat)
    # the eigenvectors are the *columns* of eigen_vec; draw each one as an arrow from the origin
    plt.quiver(0, 0, eigen_vec[0, 0], eigen_vec[1, 0], angles='xy', scale_units='xy', scale=1, color='brown')
    plt.quiver(0, 0, eigen_vec[0, 1], eigen_vec[1, 1], angles='xy', scale_units='xy', scale=1, color='brown')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xlim(-3, 3)
    plt.ylim(-3, 3)
In [22]:
plot2var_eigen(X_std, 'a' ,'b')
In [23]:
eigen_vec_2var
Out[23]:
In [24]:
X_std.T.shape
Out[24]:
In [25]:
eigen_vec_2var.shape
Out[25]:
In [26]:
# project the standardized data onto the eigenvectors (columns of eigen_vec_2var): Y^T = V^T X^T
X_proj = eigen_vec_2var.T.dot(X_std.T)
In [27]:
plot2var_eigen(X_proj.T, 'pca1' ,'pca2')
In [28]:
from sklearn.decomposition import PCA
In [29]:
pca = PCA(n_components=2)
In [30]:
pca.fit(X_std)
Out[30]:
In [31]:
X_pca_proj = pca.transform(X_std)
In [32]:
plot2var_std(X_pca_proj, 'pca1', 'pca2')
In [33]:
pca.explained_variance_
Out[33]:
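As a cross-check, sklearn's `explained_variance_` should be exactly the eigenvalues we computed from `np.cov` and `np.linalg.eig`, just reported in decreasing order:
In [ ]:
# sklearn's explained variances are the covariance-matrix eigenvalues, sorted
print(np.sort(pca.explained_variance_)[::-1])
print(np.sort(eigen_val_2var)[::-1])
print(np.allclose(np.sort(pca.explained_variance_), np.sort(eigen_val_2var)))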
In [34]:
pop = pd.read_csv('data/cars_small.csv')
In [35]:
pop.head()
Out[35]:
In [36]:
pop = pop.drop(['model'], axis = 1)
In [37]:
from sklearn import preprocessing
# encode every column (including the categorical 'type') as integer codes 0..k-1
le = preprocessing.LabelEncoder()
df = pop.apply(le.fit_transform)
In [38]:
df.head()
Out[38]:
In [39]:
g = sns.PairGrid(df, hue = 'type')
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter, alpha = 0.8)
Out[39]:
In [40]:
X = df.iloc[:,:4]
In [41]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
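`StandardScaler` does the same thing we did manually for the two-variable example: subtract each column's mean and divide by its (population, `ddof=0`) standard deviation. A quick way to convince ourselves:
In [ ]:
# StandardScaler is equivalent to (X - mean) / std with the population std
print(np.allclose(X_std, (X.values - X.values.mean(axis=0)) / X.values.std(axis=0)))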
In [42]:
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)
In [43]:
# Doing this directly using np.cov
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))
In [44]:
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
In [45]:
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
How do we select which two axes to keep?
In order to decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, so those are the ones that can be dropped.
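A common way to quantify this is the proportion of variance explained by each component, which the next few cells compute (as a percentage) and plot:
$$ \text{explained variance of } PC_j = \frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i} $$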
In [46]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
In [47]:
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
In [48]:
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
In [49]:
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
In [50]:
plt.bar(range(4), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance (%)')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
The “projection matrix” is just a matrix of our concatenated top $k$ eigenvectors. Here we reduce the 4-dimensional feature space to a 2-dimensional feature subspace by choosing the “top 2” eigenvectors with the highest eigenvalues to construct our $d \times k$-dimensional eigenvector matrix $W$ (here $4 \times 2$).
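In matrix form, the projection onto the new subspace is simply $Y = XW$; with the dimensions written out for this example:
$$ Y_{n \times 2} = X_{n \times 4} \, W_{4 \times 2} $$where $n$ is the number of cars.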
In [51]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                      eig_pairs[1][1].reshape(4, 1)))
print('Matrix W:\n', matrix_w)
In [52]:
X_proj = X_std.dot(matrix_w)
In [53]:
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_proj[:,0], X_proj[:,1], c = df.type, s = 100, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
Out[53]:
In [54]:
from sklearn.decomposition import PCA
In [55]:
pca = PCA(n_components=2)
In [56]:
pca.fit(X_std)
Out[56]:
In [57]:
X_proj_sklearn = pca.transform(X_std)
In [58]:
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_proj_sklearn[:,0], X_proj_sklearn[:,1], c = df.type,
s = 100, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
Out[58]:
In [59]:
pca.explained_variance_
Out[59]:
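`explained_variance_` gives the variances themselves; `explained_variance_ratio_` gives each component's share of the total variance, which should line up with the cumulative explained-variance plot above:
In [ ]:
# fraction of total variance captured by each of the two components, and their sum
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())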
In [60]:
digits = pd.read_csv('data/digits.csv')
In [61]:
digits.head()
Out[61]:
In [62]:
digits.shape
Out[62]:
In [63]:
digitsX = digits.iloc[:,1:785]
In [64]:
digitsX.head()
Out[64]:
In [65]:
pca = PCA(n_components=2)
In [66]:
pca.fit(digitsX)
Out[66]:
In [67]:
digits_trans = pca.transform(digitsX)
In [68]:
digits_trans
Out[68]:
In [69]:
plt.scatter(digits_trans[:,0], digits_trans[:,1], c = digits.num,
s = 20, alpha = 0.8, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
Out[69]:
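With 784 pixel features squeezed into just 2 components, it is worth checking how much of the total variance this 2-D view actually retains; `explained_variance_ratio_` tells us:
In [ ]:
# share of the 784-dimensional variance retained by the two components
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())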