Ndèye Gagnessiry Ndiaye and Christin Seifert
This work is licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/
This notebook applies Principal Component Analysis (PCA) to the Iris dataset and visualizes the projections onto all pairs of principal components.
In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
The Iris dataset contains three kinds of Iris flowers (Setosa, Versicolour and Virginica), each described by four attributes: sepal length, sepal width, petal length and petal width.
In [8]:
from sklearn import datasets
iris = datasets.load_iris()
x = pd.DataFrame(iris.data)
x.columns = ['SepalLength','SepalWidth','PetalLength','PetalWidth']
x.head()
Out[8]:
We apply Principal Component Analysis to the 4-dimensional Iris dataset, keeping all four components.
In [9]:
pca = PCA(n_components=4)
pca.fit(iris.data)
Out[9]:
In [10]:
eigen_values = pca.explained_variance_
print(eigen_values)
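The raw eigenvalues do not show what fraction of the total variance each component captures. A short sketch (not part of the original notebook) using scikit-learn's `explained_variance_ratio_` makes the proportions explicit:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=4).fit(iris.data)

# Fraction of total variance captured by each component,
# and the running (cumulative) total across components.
ratio = pca.explained_variance_ratio_
cumulative = np.cumsum(ratio)
print(ratio)
print(cumulative)
```

For Iris, the first component alone accounts for the large majority of the variance, which is why the PC1-based scatter plots below separate the classes so well.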
In [11]:
eigen_vectors = pca.components_
print(eigen_vectors)
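The rows of `pca.components_` form an orthonormal basis of the original 4-dimensional feature space. A quick check (an addition, not in the original notebook) confirms this by computing the pairwise dot products of the eigenvectors:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=4).fit(iris.data)

V = pca.components_   # rows are the eigenvectors
gram = V @ V.T        # pairwise dot products between rows
print(np.allclose(gram, np.eye(4)))  # True: rows are orthonormal
```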
We project the data into the 4-dimensional PCA space.
In [12]:
projection = pca.transform(iris.data)
x = pd.DataFrame(projection)
x.columns = ['PC1','PC2','PC3','PC4']
x.head()
Out[12]:
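To make explicit what `transform` does, the following sketch (not in the original notebook) reproduces the projection by hand: the data is centered with the fitted mean and multiplied by the transposed eigenvector matrix:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=4).fit(iris.data)

projection = pca.transform(iris.data)
# transform() centers the data and projects onto the eigenvectors:
manual = (iris.data - pca.mean_) @ pca.components_.T
print(np.allclose(projection, manual))  # True: same result
```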
The following figure shows the projections onto (x=PC1, y=PC2), (x=PC1, y=PC3), (x=PC1, y=PC4), (x=PC2, y=PC3), (x=PC2, y=PC4) and (x=PC3, y=PC4). The data is best separated along the components with the largest eigenvalues (highest variance).
In [13]:
# Show projections of the data onto each pair of principal components
from itertools import combinations

y = iris.target
target_names = iris.target_names
colors = ['navy', 'turquoise', 'darkorange']
lw = 2

plt.figure(figsize=(25, 30))
for plot_idx, (a, b) in enumerate(combinations(range(4), 2), start=1):
    plt.subplot(2, 3, plot_idx)
    for color, i, target_name in zip(colors, [0, 1, 2], target_names):
        plt.scatter(projection[y == i, a], projection[y == i, b],
                    color=color, alpha=.8, lw=lw, label=target_name)
    plt.legend(loc='best', shadow=False, scatterpoints=1)
    plt.xlabel('PC%d' % (a + 1))
    plt.ylabel('PC%d' % (b + 1))
    plt.title('PCA (x=PC%d, y=PC%d)' % (a + 1, b + 1))
plt.show()
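A rough way to quantify the claim that the leading components separate the classes best is to compare, for each component, the spread of the class means (between-class variance) to the average spread within each class (within-class variance). This check is an addition to the notebook, not part of the original analysis:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=4).fit(iris.data)
projection = pca.transform(iris.data)
y = iris.target

# For each principal component, ratio of between-class variance
# (variance of the three class means) to mean within-class variance.
ratios = []
for k in range(4):
    col = projection[:, k]
    means = np.array([col[y == c].mean() for c in np.unique(y)])
    between = means.var()
    within = np.mean([col[y == c].var() for c in np.unique(y)])
    ratios.append(between / within)
    print("PC%d: between/within = %.2f" % (k + 1, ratios[-1]))
```

PC1 should yield by far the largest ratio, matching the visual impression from the scatter plots above.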