Dimensionality Reduction via Principal Component Analysis (PCA)

  • analyze the data to find patterns (the directions of largest variance)
  • use those patterns to reduce the number of dimensions with minimal loss of information

Sections

  • The step-by-step approach
  • Using the PCA() function from the matplotlib.mlab library

The step-by-step approach


In [228]:
%pylab inline

import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.patches import FancyArrowPatch
from mpl_toolkits.mplot3d import proj3d

np.random.seed(234234782384239784)  # note: newer NumPy versions only accept seeds < 2**32


Populating the interactive namespace from numpy and matplotlib

Generating some (3-dimensional) sample data for 2 classes


In [229]:
mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20)

mu_vec2 = np.array([1,1,1])
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20)
    
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')
mpl.rcParams['legend.fontsize'] = 10   
ax.plot(class1_sample[:,0], class1_sample[:,1], class1_sample[:,2], 'o', markersize=10, color='blue', alpha=0.5, label='class1')
ax.plot(class2_sample[:,0], class2_sample[:,1], class2_sample[:,2], '^', markersize=10, alpha=0.5, color='red', label='class2')

plt.title('Samples for class 1 and class 2')
ax.legend(loc = 'upper right')
plt.draw()


plt.show()


Dropping the class labels


In [230]:
samples = np.concatenate((class1_sample, class2_sample), axis=0)
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')
ax.plot(samples[:,0], samples[:,1], samples[:,2], 'o', markersize=10, color='green', alpha=0.5)

ax.set_xlabel('x_values')
ax.set_ylabel('y_values')
ax.set_zlabel('z_values')

plt.title('All samples')

plt.draw()
plt.show()


Calculating the Covariance matrix

$ \Sigma = \Bigg[ \begin{array}{ccc} \sigma_{11}^2 & \sigma_{12}^2 & \sigma_{13}^2\\ \sigma_{21}^2 & \sigma_{22}^2 & \sigma_{23}^2\\ \sigma_{31}^2 & \sigma_{32}^2 & \sigma_{33}^2\\ \end{array} \Bigg]$


In [231]:
cov_mat = np.cov([samples[:,0],samples[:,1],samples[:,2]])
print('Covariance Matrix:\n', cov_mat)


Covariance Matrix:
 [[ 1.25425468  0.1824987   0.18482315]
 [ 0.1824987   0.97253923  0.07018075]
 [ 0.18482315  0.07018075  0.91375323]]
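
The same matrix can be obtained directly from the mean-centered data via $\Sigma = \frac{1}{n-1} X_c^T X_c$; a quick cross-check (a one-off sketch reusing the samples array from above):


In [ ]:
# cross-check: covariance from the mean-centered data, Sigma = X_c^T X_c / (n - 1)
centered = samples - samples.mean(axis=0)
print(np.allclose(centered.T.dot(centered) / (samples.shape[0] - 1), cov_mat))  # expected: True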

Calculating Eigenvectors and Eigenvalues of the Covariance matrix


In [232]:
eig_val, eig_vec = np.linalg.eig(cov_mat)
for i in range(len(eig_val)):
    eigv = eig_vec[:,i].reshape(1,3).T
    print('Eigenvector {}: \n{}'.format(i+1, eigv))
    print('Eigenvalue {}: {}'.format(i+1, eig_val[i]))
    print(40 * '-')


Eigenvector 1: 
[[-0.84190486]
 [-0.39978877]
 [-0.36244329]]
Eigenvalue 1: 1.4204834860846778
----------------------------------------
Eigenvector 2: 
[[-0.44565232]
 [ 0.13637858]
 [ 0.88475697]]
Eigenvalue 2: 0.8314755900749454
----------------------------------------
Eigenvector 3: 
[[ 0.30428639]
 [-0.90640489]
 [ 0.29298458]]
Eigenvalue 3: 0.888588059693973
----------------------------------------

Checking the Eigenvector and Eigenvalue calculation

We expect $ \Sigma\vec{v} = \lambda\vec{v} $
where $\Sigma$ is the covariance matrix, $\vec{v}$ an eigenvector, and $\lambda$ the corresponding eigenvalue.


In [233]:
for i in range(len(eig_val)):
    eigv = eig_vec[:,i].reshape(1,3).T
    print('{}\n = \n{}'.format(cov_mat.dot(eigv), eig_val[i] * eigv))
    np.testing.assert_array_almost_equal(cov_mat.dot(eigv), eig_val[i] * eigv, 
                                         decimal=6, err_msg='', verbose=True)
    print(40 * '-')


[[-1.19591196]
 [-0.56789334]
 [-0.51484471]]
 = 
[[-1.19591196]
 [-0.56789334]
 [-0.51484471]]
----------------------------------------
[[-0.37054903]
 [ 0.11339546]
 [ 0.73565382]]
 = 
[[-0.37054903]
 [ 0.11339546]
 [ 0.73565382]]
----------------------------------------
[[ 0.27038526]
 [-0.80542056]
 [ 0.2603426 ]]
 = 
[[ 0.27038526]
 [-0.80542056]
 [ 0.2603426 ]]
----------------------------------------
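
Because the covariance matrix is symmetric and its eigenvalues here are distinct, the eigenvectors are mutually orthogonal, and np.linalg.eig returns them with unit length. This is what later lets us project the data with a plain matrix multiplication; a quick check reusing the arrays computed above:


In [ ]:
# the eigenvectors (columns of eig_vec) should form an orthonormal basis,
# i.e. eig_vec.T @ eig_vec is (numerically) the identity matrix
print(np.allclose(eig_vec.T.dot(eig_vec), np.eye(3)))  # expected: True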

Visualizing the Eigenvectors

Calculating the mean of every dimension so that the eigenvector arrows can be centered at the data mean


In [234]:
mean_x = np.mean(samples[:,0])
mean_y = np.mean(samples[:,1])
mean_z = np.mean(samples[:,2])

class Arrow3D(FancyArrowPatch):
    def __init__(self, xs, ys, zs, *args, **kwargs):
        FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
        self._verts3d = xs, ys, zs

    def draw(self, renderer):
        xs3d, ys3d, zs3d = self._verts3d
        xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
        self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
        FancyArrowPatch.draw(self, renderer)

fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')

ax.plot(samples[:,0], samples[:,1], samples[:,2], 'o', markersize=10, color='green', alpha=0.2)
ax.plot([mean_x], [mean_y], [mean_z], 'o', markersize=10, color='red', alpha=0.5)
# the eigenvectors are the columns of eig_vec, so iterate over eig_vec.T;
# each arrow starts at the sample mean and points along one eigenvector
for v in eig_vec.T:
    a = Arrow3D([mean_x, mean_x + v[0]], [mean_y, mean_y + v[1]], [mean_z, mean_z + v[2]], mutation_scale=20, lw=3, arrowstyle="-|>", color="r")
    ax.add_artist(a)
ax.set_xlabel('x_values')
ax.set_ylabel('y_values')
ax.set_zlabel('z_values')

plt.title('Eigenvectors')

plt.draw()
plt.show()


Choosing the components of the new feature vector

  • the eigenvector with the highest eigenvalue is the first principal component of the data set
  • it is the direction along which the data varies the most
  • usually one orders the eigenvectors by their eigenvalues from high to low and keeps the $k$ eigenvectors with the highest eigenvalues, where $k$ is the dimensionality of the transformed data set
  • to reduce the 3D dataset to a 2D dataset, we can simply keep all eigenvectors but the one with the lowest eigenvalue

In [235]:
import operator

max_index, max_eig_val = max(enumerate(eig_val), key=operator.itemgetter(1))
print('max. eigenvalue:', max_eig_val)
print('max. eigenvector:\n',  eig_vec[:,max_index].reshape(1,3).T)

min_index, min_eig_val = min(enumerate(eig_val), key=operator.itemgetter(1))
print('\nmin. eigenvalue:', min_eig_val)
print('min. eigenvector:\n',  eig_vec[:,min_index].reshape(1,3).T)


max. eigenvalue: 1.42048348608
max. eigenvector:
 [[-0.84190486]
 [-0.39978877]
 [-0.36244329]]

min. eigenvalue: 0.831475590075
min. eigenvector:
 [[-0.44565232]
 [ 0.13637858]
 [ 0.88475697]]

Instead of ordering the eigenvectors from highest to lowest eigenvalue and dropping the last column, I chose a slightly different approach, since the implementation is easier when only one eigenvector has to be dropped. If we want to drop more than one eigenvector, we can either repeat this step or order the eigenvectors by eigenvalue and keep the $k$ eigenvectors with the highest eigenvalues (a sketch of that variant follows below). Both approaches give the same PCA result.
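
A minimal sketch of the sort-and-keep-$k$ variant, reusing the eig_val and eig_vec arrays computed above (the printed numbers depend on the random sample):


In [ ]:
# pair each eigenvalue with its eigenvector (columns of eig_vec) and sort descending
k = 2
eig_pairs = sorted(zip(eig_val, eig_vec.T), key=lambda p: p[0], reverse=True)

# fraction of the total variance captured by each component
print('explained variance ratios:', [val / eig_val.sum() for val, vec in eig_pairs])

# stack the k top eigenvectors as columns -> up to column order, the same as feature_vect below
top_k = np.array([vec for val, vec in eig_pairs[:k]]).T
print('feature vector matrix (k = {}):\n{}'.format(k, top_k))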

Form a feature vector (a matrix of the eigenvectors) from all eigenvectors except the one with the lowest eigenvalue, where each eigenvector represents one column of the matrix


In [236]:
feature_vect = np.zeros(shape=(eig_vec.shape[0], eig_vec.shape[1]-1))

# copy every eigenvector (column of eig_vec) except the one with the smallest eigenvalue
index = 0
for i in range(eig_vec.shape[1]):
    if i != min_index:
        feature_vect[:,index] = eig_vec[:,i]
        index += 1
print('Feature vector matrix:\n', feature_vect)


Feature vector matrix:
 [[-0.84190486  0.30428639]
 [-0.39978877 -0.90640489]
 [-0.36244329  0.29298458]]

Transforming the dataset

1) Subtract the mean

  • the adjusted dataset will have column means 0 afterwards

In [237]:
adjusted_samples = np.zeros(shape=(samples.shape[0], samples.shape[1]))
for i in range(samples.shape[1]):
    adjusted_samples[:,i] = samples[:,i] - np.mean(samples[:,i])
print('Adjusted dataset:\n', adjusted_samples)


Adjusted dataset:
 [[-1.19886766 -1.73953131 -0.19488401]
 [ 1.45386838  0.73275671  0.32093193]
 [-0.01957561 -0.09765938  0.22728493]
 [-1.99330532 -0.05029589 -0.57520179]
 [-1.68827921 -0.37632299  0.07286815]
 [ 0.15065018 -0.10509763  0.91822244]
 [ 1.08371188 -0.16400839 -0.47114851]
 [-0.74453555 -0.21712991  0.14275885]
 [ 0.25924434  1.10678761 -0.77757694]
 [-0.48335312 -1.14150187 -0.81189972]
 [ 0.78745097 -0.44348085 -0.17903902]
 [-1.06549359 -0.85623312 -0.14001637]
 [-1.88412396 -0.75034527 -1.61353791]
 [ 1.80289763 -0.32872896 -1.34718012]
 [-1.11919027  1.03614966 -0.38072179]
 [-0.06319857 -0.82098599 -0.29419998]
 [-0.77316424 -0.71454299 -0.04668327]
 [-1.23482567 -0.71635901 -0.45113143]
 [-0.14494314  0.86172721 -2.19084525]
 [-0.7331331  -1.2847809   0.45645997]
 [ 1.02247512 -1.1796506  -0.32712619]
 [-0.47221204  1.70749592  1.87889725]
 [ 0.69107197  0.10923993  0.84247461]
 [-1.09217118 -0.25230097 -0.7538699 ]
 [ 1.25963423  1.5842181   1.45858013]
 [ 0.62146264 -0.407502    0.84057653]
 [-0.37287386 -1.4121997   0.6085398 ]
 [ 1.80574638 -2.04777022  0.74619908]
 [ 0.55664668  0.57333998  0.66067394]
 [ 1.40599911  0.96616486  1.06593341]
 [-1.20289685  0.66967245 -1.47352242]
 [-0.43883072  1.61911124 -1.2238501 ]
 [-0.46625215  1.36759124  1.70634845]
 [-1.80232302 -0.63977628  1.19639681]
 [ 0.04416145  0.52093637  0.60521467]
 [ 0.67303936  1.38593914 -1.12187965]
 [ 1.35000147 -0.43736431  1.20218217]
 [ 1.1090882   0.3067161  -0.16098353]
 [ 1.97376686  0.05611339 -1.04407449]
 [ 0.942632    1.57960861  0.62882923]]
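
A quick sanity check (a one-liner reusing the adjusted_samples array) that every column mean is now numerically zero:


In [ ]:
# each column of the mean-adjusted data should average to (numerically) zero
print(np.allclose(adjusted_samples.mean(axis=0), 0))  # expected: True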

In [238]:
print(feature_vect.shape)
print(adjusted_samples.shape)


(3, 2)
(40, 3)

2) Project the adjusted data onto the feature vector matrix: multiplying the $(40 \times 3)$ adjusted dataset by the $(3 \times 2)$ feature vector matrix yields the $(40 \times 2)$ transformed dataset


In [209]:
transformed_samples = adjusted_samples.dot(feature_vect)
print('Transformed samples:\n', transformed_samples )


Transformed samples:
 [[ 1.89836163  1.92732261]
 [ 1.4352144  -1.39744334]
 [ 0.77864833 -0.33141983]
 [ 0.74272264 -1.36650875]
 [ 1.93239755  0.36948181]
 [-0.29884906  0.19585348]
 [-0.67094083 -0.10847891]
 [ 0.34593766 -1.00750695]
 [ 1.24744019 -0.04827783]
 [ 1.68737995  1.85307189]
 [ 0.08255139 -0.60450288]
 [ 1.87433854  0.25488856]
 [ 1.1225675   0.33459471]
 [ 0.58389391  0.18371489]
 [ 2.1825471  -0.54859686]
 [ 1.05489597 -0.90791139]
 [ 0.71873966  0.90507714]
 [ 1.56824565  0.10443086]
 [ 0.34280323  0.28184558]
 [-0.39431995 -1.32622128]
 [-1.68008715 -0.28825022]
 [ 0.56007219 -1.19383425]
 [ 1.63134481  0.27655449]
 [-2.22478231  1.25086132]
 [-0.03568121 -2.11361756]
 [-3.84054448 -0.14256683]
 [-0.98957227 -0.13792958]
 [-0.8901378   1.63412296]
 [-0.47218875  1.67083039]
 [-0.17106862  2.2154261 ]
 [-2.1375134   0.17642789]
 [-0.86238945  0.51361933]
 [ 0.00986093 -1.08485621]
 [-3.28114387 -0.42998085]
 [-0.57046358  0.27167903]
 [ 0.20780669 -0.72382118]
 [-0.93290163  1.21991756]
 [-0.12725036 -0.41836153]
 [-2.13011074 -0.4031446 ]
 [-0.29782447 -1.05648979]]

In [239]:
plt.plot(transformed_samples[:,0], transformed_samples[:,1], 'o', markersize=7, color='green', alpha=0.5)

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4,6])
plt.ylim([-4,6])

plt.title('Transformed samples')

plt.draw()
plt.show()



In [240]:
plt.plot(transformed_samples[0:20,0], transformed_samples[0:20,1], 'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(transformed_samples[20:40,0], transformed_samples[20:40,1], '^', markersize=7, color='red', alpha=0.5, label='class2')

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4,6])
plt.ylim([-4,6])
plt.legend()
plt.title('Transformed samples with class labels')

plt.draw()
plt.show()
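
Before moving on, it is worth checking how much information the 2D representation preserves. Since the columns of feature_vect are orthonormal, multiplying the transformed samples by feature_vect.T and adding the mean back maps them into the original 3D space again; the following is a small sketch reusing the variables defined above (the exact error depends on the random sample):


In [ ]:
# approximate reconstruction of the original 3D data from the 2D projection
reconstructed = transformed_samples.dot(feature_vect.T) + np.array([mean_x, mean_y, mean_z])

# the residual error corresponds to the variance along the discarded eigenvector
print('mean squared reconstruction error:', np.mean((samples - reconstructed)**2))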


Using the PCA() function from the matplotlib.mlab library


In [241]:
from matplotlib.mlab import PCA
 
results = PCA(samples) 

# results.Y holds the data projected into PCA space
print(results.Y)


[[-1.76464879 -1.13653839  0.05852699]
 [ 1.43905455  0.29026097  0.50516462]
 [ 0.066269   -0.23806384 -0.08537818]
 [-1.53123575  0.41198791 -1.05293767]
 [-1.15593938 -0.3077984  -1.02719971]
 [ 0.56045658 -0.74950164 -0.3173577 ]
 [ 0.27663747  0.20433893  1.05811821]
 [-0.47389475 -0.25446784 -0.48637399]
 [ 0.30711377  1.38755191  0.07657736]
 [-1.37368069 -0.25220715  0.59540703]
 [ 0.11851677 -0.21225391  0.83147638]
 [-1.17361337 -0.52003766 -0.27515812]
 [-2.44649206  0.64476095 -0.13377678]
 [ 0.10518897  0.70864667  2.07263704]
 [-0.31345039  1.06690754 -1.04050878]
 [-0.65373151 -0.39678731  0.47584995]
 [-0.87074027 -0.4868642  -0.18572844]
 [-1.37630964 -0.18654887 -0.2957547 ]
 [-0.8783789   2.24048441  0.6296909 ]
 [-0.86779196 -1.27909097 -0.1582361 ]
 [-0.22943285 -0.65680342  1.39947908]
 [ 1.73463062 -0.08843235 -2.0500832 ]
 [ 0.95138683 -0.54280409 -0.00405028]
 [-1.21415092  0.37774702 -0.2521997 ]
 [ 2.4439545   0.0993283  -0.59113121]
 [ 0.62797432 -0.92559277  0.18154456]
 [-0.63764945 -1.49037223  0.06893655]
 [ 0.37700722 -2.09780437  1.78067087]
 [ 1.02027964 -0.06232996 -0.21243173]
 [ 1.96726016 -0.07766722 -0.01246942]
 [-1.19190493  1.59098606 -0.37607617]
 [-0.08164937  2.10524681 -0.40572706]
 [ 1.45362963 -0.21623914 -1.80514398]
 [-0.71910161 -1.32085344 -1.5597507 ]
 [ 0.65835645 -0.05315859 -0.51250722]
 [ 0.50411734  1.84012533  0.41067921]
 [ 1.24858026 -1.22240105  0.51092084]
 [ 0.72659744  0.32895233  0.70570308]
 [ 0.58989275  0.77214999  1.86235804]
 [ 1.77689231  0.70514369 -0.38375984]]

In [242]:
# rows 0-19 of results.Y correspond to class 1, rows 20-39 to class 2 (same order as samples)
plt.plot(results.Y[0:20,0], results.Y[0:20,1], 'o', markersize=7, color='blue', alpha=0.5, label='class1')
plt.plot(results.Y[20:40,0], results.Y[20:40,1], '^', markersize=7, color='red', alpha=0.5, label='class2')

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.xlim([-4,6])
plt.ylim([-4,6])
plt.legend()
plt.title('Transformed samples with class labels')

plt.draw()
plt.show()
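
Note that matplotlib.mlab.PCA was later deprecated and removed from matplotlib, so on current installations the usual replacement is sklearn.decomposition.PCA. A minimal sketch of the equivalent call (assuming scikit-learn is available; the exact values and signs may differ from the manual projection, e.g. because mlab's PCA also standardizes the columns):


In [ ]:
# the same 2-component projection with scikit-learn (assumes scikit-learn is installed)
from sklearn.decomposition import PCA as sklearnPCA

sklearn_pca = sklearnPCA(n_components=2)
sklearn_transformed = sklearn_pca.fit_transform(samples)

print('explained variance ratio:', sklearn_pca.explained_variance_ratio_)
print(sklearn_transformed[:5])  # first five projected samples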


