Dimensionality Reduction via Principal Component Analysis (PCA)

  • analyze the data to find patterns (the directions of largest variance)
  • use those patterns to reduce the number of dimensions with minimal loss of information

Sections

  • The step-by-step approach
  • Using the PCA() function from the matplotlib.mlab library

The step-by-step approach


In [228]:
%pylab inline

import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.patches import FancyArrowPatch
from mpl_toolkits.mplot3d import proj3d

np.random.seed(234234782384239784)  # note: newer NumPy versions only accept seeds < 2**32


Populating the interactive namespace from numpy and matplotlib

Generating some (3-dimensional) sample data for 2 classes


In [229]:
mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20)

mu_vec2 = np.array([1,1,1])
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20)
    
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')
mpl.rcParams['legend.fontsize'] = 10   
ax.plot(class1_sample[:,0], class1_sample[:,1], class1_sample[:,2], 'o', markersize=10, color='blue', alpha=0.5, label='class1')
ax.plot(class2_sample[:,0], class2_sample[:,1], class2_sample[:,2], '^', markersize=10, alpha=0.5, color='red', label='class2')

plt.title('Samples for class 1 and class 2')
ax.legend(loc = 'upper right')
plt.draw()


plt.show()


Dropping the class labels


In [230]:
samples = np.concatenate((class1_sample, class2_sample), axis=0)
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')
ax.plot(samples[:,0], samples[:,1], samples[:,2], 'o', markersize=10, color='green', alpha=0.5)

ax.set_xlabel('x_values')
ax.set_ylabel('y_values')
ax.set_zlabel('z_values')

plt.title('All samples')

plt.draw()
plt.show()


Calculating the Covariance matrix

$ \Sigma = \Bigg[ \begin{array}{ccc} \sigma_{11}^2 & \sigma_{12}^2 & \sigma_{13}^2\\ \sigma_{21}^2 & \sigma_{22}^2 & \sigma_{23}^2\\ \sigma_{31}^2 & \sigma_{32}^2 & \sigma_{33}^2\\ \end{array} \Bigg]$


In [231]:
cov_mat = np.cov([samples[:,0],samples[:,1],samples[:,2]])
print('Covariance Matrix:\n', cov_mat)


Covariance Matrix:
 [[ 1.25425468  0.1824987   0.18482315]
 [ 0.1824987   0.97253923  0.07018075]
 [ 0.18482315  0.07018075  0.91375323]]
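
The same matrix can be obtained directly from the mean-centered data via $\Sigma = \frac{1}{n-1} X_c^T X_c$; a quick cross-check (a one-off sketch reusing the samples array from above):


In [ ]:
# cross-check: covariance from the mean-centered data, Sigma = X_c^T X_c / (n - 1)
centered = samples - samples.mean(axis=0)
print(np.allclose(centered.T.dot(centered) / (samples.shape[0] - 1), cov_mat))  # expected: True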

Calculating Eigenvectors and Eigenvalues of the Covariance matrix


In [232]:
eig_val, eig_vec = np.linalg.eig(cov_mat)
for i in range(len(eig_val)):
    eigv = eig_vec[:,i].reshape(1,3).T
    print('Eigenvector {}: \n{}'.format(i+1, eigv))
    print('Eigenvalue {}: {}'.format(i+1, eig_val[i]))
    print(40 * '-')


Eigenvector 1: 
[[-0.84190486]
 [-0.39978877]
 [-0.36244329]]
Eigenvalue 1: 1.4204834860846778
----------------------------------------
Eigenvector 2: 
[[-0.44565232]
 [ 0.13637858]
 [ 0.88475697]]
Eigenvalue 2: 0.8314755900749454
----------------------------------------
Eigenvector 3: 
[[ 0.30428639]
 [-0.90640489]
 [ 0.29298458]]
Eigenvalue 3: 0.888588059693973
----------------------------------------

Checking the Eigenvector and Eigenvalue calculation

We expect $ \Sigma\vec{v} = \lambda\vec{v} $
where $\Sigma$ is the covariance matrix, $\vec{v}$ an eigenvector, and $\lambda$ the corresponding eigenvalue.


In [233]:
for i in range(len(eig_val)):
    eigv = eig_vec[:,i].reshape(1,3).T
    print('{}\n = \n{}'.format(cov_mat.dot(eigv), eig_val[i] * eigv))
    np.testing.assert_array_almost_equal(cov_mat.dot(eigv), eig_val[i] * eigv, 
                                         decimal=6, err_msg='', verbose=True)
    print(40 * '-')


[[-1.19591196]
 [-0.56789334]
 [-0.51484471]]
 = 
[[-1.19591196]
 [-0.56789334]
 [-0.51484471]]
----------------------------------------
[[-0.37054903]
 [ 0.11339546]
 [ 0.73565382]]
 = 
[[-0.37054903]
 [ 0.11339546]
 [ 0.73565382]]
----------------------------------------
[[ 0.27038526]
 [-0.80542056]
 [ 0.2603426 ]]
 = 
[[ 0.27038526]
 [-0.80542056]
 [ 0.2603426 ]]
----------------------------------------
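
Because the covariance matrix is symmetric and its eigenvalues here are distinct, the eigenvectors are mutually orthogonal, and np.linalg.eig returns them with unit length. This is what later lets us project the data with a plain matrix multiplication; a quick check reusing the arrays computed above:


In [ ]:
# the eigenvectors (columns of eig_vec) should form an orthonormal basis,
# i.e. eig_vec.T @ eig_vec is (numerically) the identity matrix
print(np.allclose(eig_vec.T.dot(eig_vec), np.eye(3)))  # expected: True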

Visualizing the Eigenvectors

Calculating the mean of every dimension so that the eigenvector arrows can be centered at the data mean


In [234]:
mean_x = np.mean(samples[:,0])
mean_y = np.mean(samples[:,1])
mean_z = np.mean(samples[:,2])

class Arrow3D(FancyArrowPatch):
    def __init__(self, xs, ys, zs, *args, **kwargs):
        FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
        self._verts3d = xs, ys, zs

    def draw(self, renderer):
        xs3d, ys3d, zs3d = self._verts3d
        xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        FancyArrowPatch.draw(self, renderer)

fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')

ax.plot(samples[:,0], samples[:,1], samples[:,2], 'o', markersize=10, color='green', alpha=0.2)
ax.plot([mean_x], [mean_y], [mean_z], 'o', markersize=10, color='red', alpha=0.5)
# the eigenvectors are the columns of eig_vec, so iterate over eig_vec.T;
# each arrow starts at the sample mean and points along one eigenvector
for v in eig_vec.T:
    a = Arrow3D([mean_x, mean_x + v[0]], [mean_y, mean_y + v[1]], [mean_z, mean_z + v[2]], mutation_scale=20, lw=3, arrowstyle="-|>", color="r")
    ax.add_artist(a)
ax.set_xlabel('x_values')
ax.set_ylabel('y_values')
ax.set_zlabel('z_values')

plt.title('Eigenvectors')

plt.draw()
plt.show()


Choosing the components of the new feature vector

  • the eigenvector with the highest eigenvalue is the first principal component of the data set
  • it is the direction along which the data varies the most
  • usually one orders the eigenvectors by their eigenvalues from high to low and keeps the $k$ eigenvectors with the highest eigenvalues, where $k$ is the dimensionality of the transformed data set
  • to reduce the 3D dataset to a 2D dataset, we can simply keep all eigenvectors but the one with the lowest eigenvalue

In [235]:
import operator

max_index, max_eig_val = max(enumerate(eig_val), key=operator.itemgetter(1))
print('max. eigenvalue:', max_eig_val)
print('max. eigenvector:\n',  eig_vec[:,max_index].reshape(1,3).T)

min_index, min_eig_val = min(enumerate(eig_val), key=operator.itemgetter(1))
print('\nmin. eigenvalue:', min_eig_val)
print('min. eigenvector:\n',  eig_vec[:,min_index].reshape(1,3).T)


max. eigenvalue: 1.42048348608
max. eigenvector:
 [[-0.84190486]
 [-0.39978877]
 [-0.36244329]]

min. eigenvalue: 0.831475590075
min. eigenvector:
 [[-0.44565232]
 [ 0.13637858]
 [ 0.88475697]]

Instead of ordering the eigenvectors from highest to lowest eigenvalue and dropping the last column, I chose a slightly different approach, since the implementation is easier when only one eigenvector has to be dropped. If we want to drop more than one eigenvector, we can either repeat this step or order the eigenvectors by eigenvalue and keep the $k$ eigenvectors with the highest eigenvalues (a sketch of that variant follows below). Both approaches give the same PCA result.
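
A minimal sketch of the sort-and-keep-$k$ variant, reusing the eig_val and eig_vec arrays computed above (the printed numbers depend on the random sample):


In [ ]:
# pair each eigenvalue with its eigenvector (columns of eig_vec) and sort descending
k = 2
eig_pairs = sorted(zip(eig_val, eig_vec.T), key=lambda p: p[0], reverse=True)

# fraction of the total variance captured by each component
print('explained variance ratios:', [val / eig_val.sum() for val, vec in eig_pairs])

# stack the k top eigenvectors as columns -> up to column order, the same as feature_vect below
top_k = np.array([vec for val, vec in eig_pairs[:k]]).T
print('feature vector matrix (k = {}):\n{}'.format(k, top_k))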

Form a feature vector (a matrix of the eigenvectors) from all eigenvectors except the one with the lowest eigenvalue, where each eigenvector represents one column of the matrix


In [236]:
feature_vect = np.zeros(shape=(eig_vec.shape[0], eig_vec.shape[1]-1))

# copy every eigenvector (column of eig_vec) except the one with the smallest eigenvalue
index = 0
for i in range(eig_vec.shape[1]):
    if i != min_index:
        feature_vect[:,index] = eig_vec[:,i]
        index += 1
print('Feature vector matrix:\n', feature_vect)


Feature vector matrix:
 [[-0.84190486  0.30428639]
 [-0.39978877 -0.90640489]
 [-0.36244329  0.29298458]]

Transforming the dataset

1) Subtract the mean

  • the adjusted dataset will have column means 0 afterwards

In [237]:
adjusted_samples = np.zeros(shape=(samples.shape[0], samples.shape[1]))
for i in range(samples.shape[1]):
    adjusted_samples[:,i] = samples[:,i] - np.mean(samples[:,i])
print('Adjusted dataset:\n', adjusted_samples)


Adjusted dataset:
 [[-1.19886766 -1.73953131 -0.19488401]
 [ 1.45386838  0.73275671  0.32093193]
 [-0.01957561 -0.09765938  0.22728493]
 [-1.99330532 -0.05029589 -0.57520179]
 [-1.68827921 -0.37632299  0.07286815]
 [ 0.15065018 -0.10509763  0.91822244]
 [ 1.08371188 -0.16400839 -0.47114851]
 [-0.74453555 -0.21712991  0.14275885]
 [ 0.25924434  1.10678761 -0.77757694]
 [-0.48335312 -1.14150187 -0.81189972]
 [ 0.78745097 -0.44348085 -0.17903902]
 [-1.06549359 -0.85623312 -0.14001637]
 [-1.88412396 -0.75034527 -1.61353791]
 [ 1.80289763 -0.32872896 -1.34718012]
 [-1.11919027  1.03614966 -0.38072179]
 [-0.06319857 -0.82098599 -0.29419998]
 [-0.77316424 -0.71454299 -0.04668327]
 [-1.23482567 -0.71635901 -0.45113143]
 [-0.14494314  0.86172721 -2.19084525]
 [-0.7331331  -1.2847809   0.45645997]
 [ 1.02247512 -1.1796506  -0.32712619]
 [-0.47221204  1.70749592  1.87889725]
 [ 0.69107197  0.10923993  0.84247461]
 [-1.09217118 -0.25230097 -0.7538699 ]
 [ 1.25963423  1.5842181   1.45858013]
 [ 0.62146264 -0.407502    0.84057653]
 [-0.37287386 -1.4121997   0.6085398 ]
 [ 1.80574638 -2.04777022  0.74619908]
 [ 0.55664668  0.57333998  0.66067394]
 [ 1.40599911  0.96616486  1.06593341]
 [-1.20289685  0.66967245 -1.47352242]
 [-0.43883072  1.61911124 -1.2238501 ]
 [-0.46625215  1.36759124  1.70634845]
 [-1.80232302 -0.63977628  1.19639681]
 [ 0.04416145  0.52093637  0.60521467]
 [ 0.67303936  1.38593914 -1.12187965]
 [ 1.35000147 -0.43736431  1.20218217]
 [ 1.1090882   0.3067161  -0.16098353]
 [ 1.97376686  0.05611339 -1.04407449]
 [ 0.942632    1.57960861  0.62882923]]
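
A quick sanity check (a one-liner reusing the adjusted_samples array) that every column mean is now numerically zero:


In [ ]:
# each column of the mean-adjusted data should average to (numerically) zero
print(np.allclose(adjusted_samples.mean(axis=0), 0))  # expected: True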

In [238]:
print(feature_vect.shape)
print(adjusted_samples.shape)


(3, 2)
(40, 3)

2) Project the adjusted data onto the feature vector matrix: multiplying the $(40 \times 3)$ adjusted dataset by the $(3 \times 2)$ feature vector matrix yields the $(40 \times 2)$ transformed dataset


In [209]:
transformed_samples = adjusted_samples.dot(feature_vect)
print('Transformed samples:\n', transformed_samples )


Transformed samples:
 [[ 1.89836163  1.92732261]
 [ 1.4352144  -1.39744334]
 [ 0.77864833 -0.33141983]
 [ 0.74272264 -1.36650875]
 [ 1.93239755  0.36948181]
 [-0.29884906  0.19585348]
 [-0.67094083 -0.10847891]
 [ 0.34593766 -1.00750695]
 [ 1.24744019 -0.04827783]
 [ 1.68737995  1.85307189]
 [ 0.08255139 -0.60450288]
 [ 1.87433854  0.25488856]
 [ 1.1225675   0.33459471]
 [ 0.58389391  0.18371489]
 [ 2.1825471  -0.54859686]
 [ 1.05489597 -0.90791139]
 [ 0.71873966  0.90507714]
 [ 1.56824565  0.10443086]
 [ 0.34280323  0.28184558]
 [-0.39431995 -1.32622128]
 [-1.68008715 -0.28825022]
 [ 0.56007219 -1.19383425]
 [ 1.63134481  0.27655449]
 [-2.22478231  1.25086132]
 [-0.03568121 -2.11361756]
 [-3.84054448 -0.14256683]
 [-0.98957227 -0.13792958]
 [-0.8901378   1.63412296]
 [-0.47218875  1.67083039]
 [-0.17106862  2.2154261 ]
 [-2.1375134   0.17642789]
 [-0.86238945  0.51361933]
 [ 0.00986093 -1.08485621]
 [-3.28114387 -0.42998085]
 [-0.57046358  0.27167903]
 [ 0.20780669 -0.72382118]
 [-0.93290163  1.21991756]
 [-0.12725036 -0.41836153]
 [-2.13011074 -0.4031446 ]
 [-0.29782447 -1.05648979]]

In [239]:
plt.plot(transformed_samples[:,0], transformed_samples[:,1], 'o', markersize=7, color='green', alpha=0.5)

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4,6])
plt.ylim([-4,6])

plt.title('Transformed samples')

plt.draw()
plt.show()



In [240]:
plt.plot(transformed_samples[0:20,0], transformed_samples[0:20,1], 'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(transformed_samples[20:40,0], transformed_samples[20:40,1], '^', markersize=7, color='red', alpha=0.5, label='class2')

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4,6])
plt.ylim([-4,6])
plt.legend()
plt.title('Transformed samples with class labels')

plt.draw()
plt.show()
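
Before moving on, it is worth checking how much information the 2D representation preserves. Since the columns of feature_vect are orthonormal, multiplying the transformed samples by feature_vect.T and adding the mean back maps them into the original 3D space again; the following is a small sketch reusing the variables defined above (the exact error depends on the random sample):


In [ ]:
# approximate reconstruction of the original 3D data from the 2D projection
reconstructed = transformed_samples.dot(feature_vect.T) + np.array([mean_x, mean_y, mean_z])

# the residual error corresponds to the variance along the discarded eigenvector
print('mean squared reconstruction error:', np.mean((samples - reconstructed)**2))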


Using the PCA() function from the matplotlib.mlab library


In [241]:
from matplotlib.mlab import PCA
 
results = PCA(samples) 

# results.Y holds the data projected into PCA space
print(results.Y)


[[-1.76464879 -1.13653839  0.05852699]
 [ 1.43905455  0.29026097  0.50516462]
 [ 0.066269   -0.23806384 -0.08537818]
 [-1.53123575  0.41198791 -1.05293767]
 [-1.15593938 -0.3077984  -1.02719971]
 [ 0.56045658 -0.74950164 -0.3173577 ]
 [ 0.27663747  0.20433893  1.05811821]
 [-0.47389475 -0.25446784 -0.48637399]
 [ 0.30711377  1.38755191  0.07657736]
 [-1.37368069 -0.25220715  0.59540703]
 [ 0.11851677 -0.21225391  0.83147638]
 [-1.17361337 -0.52003766 -0.27515812]
 [-2.44649206  0.64476095 -0.13377678]
 [ 0.10518897  0.70864667  2.07263704]
 [-0.31345039  1.06690754 -1.04050878]
 [-0.65373151 -0.39678731  0.47584995]
 [-0.87074027 -0.4868642  -0.18572844]
 [-1.37630964 -0.18654887 -0.2957547 ]
 [-0.8783789   2.24048441  0.6296909 ]
 [-0.86779196 -1.27909097 -0.1582361 ]
 [-0.22943285 -0.65680342  1.39947908]
 [ 1.73463062 -0.08843235 -2.0500832 ]
 [ 0.95138683 -0.54280409 -0.00405028]
 [-1.21415092  0.37774702 -0.2521997 ]
 [ 2.4439545   0.0993283  -0.59113121]
 [ 0.62797432 -0.92559277  0.18154456]
 [-0.63764945 -1.49037223  0.06893655]
 [ 0.37700722 -2.09780437  1.78067087]
 [ 1.02027964 -0.06232996 -0.21243173]
 [ 1.96726016 -0.07766722 -0.01246942]
 [-1.19190493  1.59098606 -0.37607617]
 [-0.08164937  2.10524681 -0.40572706]
 [ 1.45362963 -0.21623914 -1.80514398]
 [-0.71910161 -1.32085344 -1.5597507 ]
 [ 0.65835645 -0.05315859 -0.51250722]
 [ 0.50411734  1.84012533  0.41067921]
 [ 1.24858026 -1.22240105  0.51092084]
 [ 0.72659744  0.32895233  0.70570308]
 [ 0.58989275  0.77214999  1.86235804]
 [ 1.77689231  0.70514369 -0.38375984]]

In [242]:
# rows 0-19 of results.Y correspond to class 1, rows 20-39 to class 2 (same order as samples)
plt.plot(results.Y[0:20,0], results.Y[0:20,1], 'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(results.Y[20:40,0], results.Y[20:40,1], '^', markersize=7, color='red', alpha=0.5, label='class2')

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4,6])
plt.ylim([-4,6])
plt.legend()
plt.title('Transformed samples with class labels')

plt.draw()
plt.show()
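
Note that matplotlib.mlab.PCA was later deprecated and removed from matplotlib, so on current installations the usual replacement is sklearn.decomposition.PCA. A minimal sketch of the equivalent call (assuming scikit-learn is available; the exact values and signs may differ from the manual projection, e.g. because mlab's PCA also standardizes the columns):


In [ ]:
# the same 2-component projection with scikit-learn (assumes scikit-learn is installed)
from sklearn.decomposition import PCA as sklearnPCA

sklearn_pca = sklearnPCA(n_components=2)
sklearn_transformed = sklearn_pca.fit_transform(samples)

print('explained variance ratio:', sklearn_pca.explained_variance_ratio_)
print(sklearn_transformed[:5])  # first five projected samples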


