Principal Component Analysis (PCA)

Key Equation: $Ax = \lambda x$ for an $n \times n$ symmetric matrix $A$

PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. It is an unsupervised learning technique, which means we don't have a target variable.
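The key equation comes from asking for the unit direction $w$ that maximizes the variance of the projected data. A minimal sketch of the connection, with $\Sigma$ denoting the covariance matrix of the (centered) data:

$$ \max_{\lVert w \rVert = 1} \operatorname{Var}(Xw) \;=\; \max_{\lVert w \rVert = 1} w^{T}\Sigma w \quad\Longrightarrow\quad \Sigma w = \lambda w $$

so the best direction is an eigenvector of $\Sigma$, and the variance captured along it equals the corresponding eigenvalue $\lambda$.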


In [1]:
import numpy as np
import pandas as pd

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)

From 2 Dimensions to 1 Dimension

Let us generate a two-variable data set $a, b$ where

$$ b = 500 + 20a + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 100^2) $$


In [3]:
np.random.seed(123)

In [7]:
a = np.arange(12, 56, 0.5)
e = np.random.normal(0, 100, a.size)
b = 500 + 20*a + e

In [8]:
X = np.c_[a,b]

In [9]:
def plot2var (m, xlabel, ylabel):
    x = m[:,0]
    y = m[:,1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s = 40, alpha = 0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
       
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

In [10]:
plot2var(X, 'a', 'b')


Standardizing the Variables

Center and scale each variable (subtract the mean and divide by the standard deviation)
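In symbols, each variable $x$ is replaced by

$$ z = \frac{x - \bar{x}}{\sigma_x} $$

which is exactly what the next few cells do column by column.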


In [11]:
X_mean = np.mean(X, axis=0)

In [12]:
X_mean


Out[12]:
array([   33.75      ,  1176.10149293])

In [13]:
X_sd = np.std(X, axis=0)

In [14]:
X_sd


Out[14]:
array([  12.7008858 ,  277.28193421])

In [15]:
X_std = np.subtract(X, X_mean) / X_sd

In [16]:
def plot2var_std (m, xlabel, ylabel):
    x = m[:,0]
    y = m[:,1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s = 40, alpha = 0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
       
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.xlim([-3,3])
    plt.ylim([-3,3])

In [17]:
plot2var_std(X_std, "a", "b")


Calculate the Covariance Matrix
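For the standardized data matrix $X_{std}$ with $n = 88$ observations as rows, np.cov computes the sample covariance matrix

$$ \Sigma = \frac{1}{n-1} X_{std}^{T} X_{std} $$

The diagonal comes out slightly above 1 because the standardization above used the population standard deviation (denominator $n$) while np.cov divides by $n-1$.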


In [20]:
cov_mat_2var = np.cov(X_std.T)

In [21]:
cov_mat_2var


Out[21]:
array([[ 1.01149425,  0.92226013],
       [ 0.92226013,  1.01149425]])

So this is the symmetric matrix $A$ for which we solve the eigenvalue problem

$$ Ax = \lambda x $$

where $$ A = \begin{bmatrix} 1.01 & 0.92 \\ 0.92 & 1.01 \end{bmatrix} $$

Get Eigenvectors and Eigenvalues

Let's get the eigenvectors and eigenvalues for this matrix


In [22]:
eigen_val_2var, eigen_vec_2var = np.linalg.eig(cov_mat_2var)

In [23]:
eigen_val_2var


Out[23]:
array([ 1.93375438,  0.08923413])

In [24]:
eigen_vec_2var


Out[24]:
array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]])

In [25]:
eigen_vec_2var[1].dot(eigen_vec_2var[0])


Out[25]:
0.0
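Note that np.linalg.eig returns the eigenvectors as the columns of the second return value. The row-wise product above also comes out to 0 because the eigenvector matrix of a symmetric matrix is orthogonal, so its rows are orthonormal too; the column-wise check would be the sketch below.

# Columns of eigen_vec_2var are the eigenvectors; their dot product should be ~0
eigen_vec_2var[:, 0].dot(eigen_vec_2var[:, 1])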

So our eigenvalues and eigenvectors (the columns of the matrix above) are: $$ \lambda_1 = 1.93, \lambda_2 = 0.09 $$

$$ \vec{v_1} = \begin{bmatrix} 0.707 \\ 0.707\end{bmatrix} $$$$ \vec{v_2} = \begin{bmatrix} -0.707 \\ 0.707\end{bmatrix} $$

These are orthogonal to each other. Let us plot these eigenvectors


In [26]:
def plot2var_eigen(m, xlabel, ylabel):
    x = m[:,0]
    y = m[:,1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s = 40, alpha = 0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)

    cov_mat = np.cov(m.T)
    eigen_val, eigen_vec = np.linalg.eig(cov_mat)

    # Eigenvectors are the columns of eigen_vec: draw each one as an arrow from the origin
    plt.quiver(eigen_vec[0, 0], eigen_vec[1, 0], angles='xy', scale_units='xy', scale=1, color='brown')
    plt.quiver(eigen_vec[0, 1], eigen_vec[1, 1], angles='xy', scale_units='xy', scale=1, color='brown')

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    plt.xlim(-3,3)
    plt.ylim(-3,3)

In [27]:
plot2var_eigen(X_std, 'a' ,'b')


Projection Matrix

Let us project our original (standardized) values onto the eigenvector basis to see the new coordinates
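With the eigenvectors stacked as the columns of a matrix $W$, the projection of the standardized data is

$$ Y = X_{std}\, W $$

so each row of $Y$ holds one observation's coordinates along the principal components.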


In [28]:
eigen_vec_2var


Out[28]:
array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]])

In [29]:
X_std.T.shape


Out[29]:
(2, 88)

In [30]:
eigen_vec_2var.shape


Out[30]:
(2, 2)

In [31]:
# Columns of eigen_vec_2var are the eigenvectors, so project with its transpose:
# X_proj = W.T @ X_std.T  (equivalently (X_std @ W).T)
X_proj = eigen_vec_2var.T.dot(X_std.T)

In [32]:
plot2var_eigen(X_proj.T, 'pca1' ,'pca2')


Using PCA from sklearn


In [82]:
from sklearn.decomposition import PCA

In [134]:
pca = PCA(n_components=2)

In [135]:
pca.fit(X_std)


Out[135]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [136]:
X_pca_proj = pca.transform(X_std)

In [141]:
plot2var_std(X_pca_proj, 'pca1', 'pca2')



In [87]:
pca.explained_variance_


Out[87]:
array([ 1.9117799,  0.0882201])
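The variance explained as a proportion of the total, and the principal directions themselves, are also available as attributes; a quick sketch (outputs not shown):

# Proportion of the total variance captured by each component
pca.explained_variance_ratio_

# The principal directions (one component per row)
pca.components_

(The small difference between these variances, 1.91 and 0.088, and the eigenvalues computed by hand above, 1.93 and 0.089, is just the $n$ versus $n-1$ denominator in the variance estimate.)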

From 4 Dimensions to 2 Dimensions

Run PCA with 2 dimensions on the cars dataset


In [33]:
pop = pd.read_csv('data/cars_small.csv')

In [34]:
pop.head()


Out[34]:
       brand           model  price  kmpl  bhp       type
0  Chevrolet            Beat    421  18.6   79  Hatchback
1  Chevrolet            Sail    551  18.2   82      Sedan
2  Chevrolet  Sail Hatchback    468  18.2   82  Hatchback
3  Chevrolet           Spark    345  16.2   62  Hatchback
4       Fiat   Linea Classic    612  14.9   89      Sedan

Preprocessing - brand, price, kmpl, bhp


In [35]:
pop = pop.drop(['model'], axis = 1)

In [36]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df = pop.apply(le.fit_transform)

In [37]:
df.head()


Out[37]:
   brand  price  kmpl  bhp  type
0      0     13    20   11     0
1      0     28    18   12     1
2      0     17    18   12     0
3      0      6     9    4     0
4      1     31     2   16     1
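Note that applying LabelEncoder column by column also converts the numeric columns (price, kmpl, bhp) into ordinal ranks, which throws away their scale. An alternative sketch that encodes only the categorical columns and leaves the numeric ones untouched:

# Encode only the categorical columns; keep price, kmpl, bhp as real values
df_alt = pop.copy()
for col in ['brand', 'type']:
    df_alt[col] = le.fit_transform(df_alt[col])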

In [39]:
g = sns.PairGrid(df, hue = 'type')
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter, alpha = 0.8)


Out[39]:
<seaborn.axisgrid.PairGrid at 0x1110b85c0>

Standardizing


In [40]:
X = df.iloc[:,:4]

In [41]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

Eigendecomposition - Computing Eigenvectors and Eigenvalues


In [42]:
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)


Covariance matrix 
[[ 1.02439024 -0.01272799  0.25803108 -0.07738538]
 [-0.01272799  1.02439024 -0.63611083  0.82877403]
 [ 0.25803108 -0.63611083  1.02439024 -0.46279843]
 [-0.07738538  0.82877403 -0.46279843  1.02439024]]

In [43]:
# Doing this directly using np.cov
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))


NumPy covariance matrix: 
[[ 1.02439024 -0.01272799  0.25803108 -0.07738538]
 [-0.01272799  1.02439024 -0.63611083  0.82877403]
 [ 0.25803108 -0.63611083  1.02439024 -0.46279843]
 [-0.07738538  0.82877403 -0.46279843  1.02439024]]

In [44]:
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

In [45]:
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)


Eigenvectors 
[[ 0.14151503 -0.92968525 -0.31696215  0.12326319]
 [-0.61287306 -0.22032736  0.0791158  -0.75470733]
 [ 0.52320382 -0.22131442  0.77419401 -0.27910809]
 [-0.57499381 -0.19534868  0.54212482  0.58079402]]

Eigenvalues 
[ 2.34792167  1.06653843  0.52967314  0.15342773]

How do we select which two axes to keep?

Sorting the Eigenvalues and Eigenvectors

In order to decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, and those are the ones that can be dropped.


In [46]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

In [47]:
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

In [48]:
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])


Eigenvalues in descending order:
2.34792167381
1.06653842541
0.529673143408
0.153427732985

Explained Variance

The explained variance tells us how much information (variance) can be attributed to each of the principal components.
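Concretely, each component's share of the variance is its eigenvalue divided by the sum of all eigenvalues:

$$ \text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j} \lambda_j} $$

which is what the next cell computes (as a percentage).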


In [49]:
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

In [50]:
plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()


Projection Matrix

The “projection matrix” is just the matrix of our concatenated top $k$ eigenvectors. Here we reduce the 4-dimensional feature space to a 2-dimensional subspace by choosing the two eigenvectors with the highest eigenvalues to construct the $n \times k$ (here $4 \times 2$) eigenvector matrix $W$.


In [51]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),
                      eig_pairs[1][1].reshape(4,1)))

print('Matrix W:\n', matrix_w)


Matrix W:
 [[ 0.14151503 -0.92968525]
 [-0.61287306 -0.22032736]
 [ 0.52320382 -0.22131442]
 [-0.57499381 -0.19534868]]

In [52]:
X_proj = X_std.dot(matrix_w)

In [53]:
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_proj[:,0], X_proj[:,1], c = df.type, s = 100, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')


Out[53]:
<matplotlib.text.Text at 0x11377cef0>

PCA using sklearn


In [60]:
from sklearn.decomposition import PCA

In [61]:
pca = PCA(n_components=2)

In [62]:
pca.fit(X_std)


Out[62]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [63]:
X_proj_sklearn = pca.transform(X_std)

In [64]:
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_proj_sklearn[:,0], X_proj_sklearn[:,1], c = df.type, 
            s = 100, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')


Out[64]:
<matplotlib.text.Text at 0x113d86a58>

In [65]:
pca.explained_variance_


Out[65]:
array([ 2.29201878,  1.04114465])

From 784 Dimensions to 2 Dimensions


In [66]:
digits = pd.read_csv('data/digits.csv')

In [67]:
digits.head()


Out[67]:
num x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x775 x776 x777 x778 x779 x780 x781 x782 x783 x784
0 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns


In [68]:
digits.shape


Out[68]:
(5234, 785)

In [69]:
digitsX = digits.iloc[:,1:785]

In [70]:
digitsX.head()


Out[70]:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... x775 x776 x777 x778 x779 x780 x781 x782 x783 x784
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 784 columns


In [71]:
pca = PCA(n_components=2)

In [72]:
pca.fit(digitsX)


Out[72]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [73]:
digits_trans = pca.transform(digitsX)

In [74]:
digits_trans


Out[74]:
array([[ 1068.06173056,   117.11012503],
       [ -547.99896591,   650.78733687],
       [ -823.76820597,   -60.24173439],
       ..., 
       [ -806.20311998,   123.3731127 ],
       [  415.26422894,   325.05441395],
       [ 1054.28476296,   387.85690578]])

In [75]:
plt.scatter(digits_trans[:,0], digits_trans[:,1], c = digits.num, 
            s = 20, alpha = 0.8, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')


Out[75]:
<matplotlib.text.Text at 0x114364748>

Exercise - Run PCA on the Bank Data


In [ ]:
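A possible starting point for the exercise (a sketch only; the file name data/bank.csv and its columns are assumptions, so adjust them to the actual dataset):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the bank data (path assumed; replace with the actual file)
bank = pd.read_csv('data/bank.csv')

# Keep numeric columns only (or encode the categorical columns first, as above)
bank_num = bank.select_dtypes(include='number')

# Standardize, then project onto the first two principal components
bank_std = StandardScaler().fit_transform(bank_num)
pca = PCA(n_components=2)
bank_proj = pca.fit_transform(bank_std)

plt.scatter(bank_proj[:, 0], bank_proj[:, 1], s=20, alpha=0.8)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')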