To understand PCA, check out the slides that accompany this code sample at bitly.com/adi-pca-slides.
We'll be using PCA from scikit-learn, a commonly used machine learning library in Python.
You will need to install some dependencies first if you haven't already. On Linux or macOS, you can run
sudo easy_install pip
to get a Python package manager called pip. Then run
pip install -U numpy
pip install -U scipy
pip install -U scikit-learn
to install NumPy, SciPy, and scikit-learn.
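To confirm everything installed correctly, you can print the package versions from Python; any reasonably recent versions should work for this notebook:
import numpy, scipy, sklearn
print(numpy.__version__, scipy.__version__, sklearn.__version__)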
In [1]:
from sklearn.decomposition import PCA
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # 3D Plotting
from scipy import stats
import ipywidgets # interactions
%matplotlib inline
In [2]:
np.random.seed(4) # Reproducible results
mean = np.zeros(2)
cov = [[1, 0.9],
       [0.9, 1]]
data = np.random.multivariate_normal(mean, cov, size=500)
xs = data[:, 0]
ys = data[:, 1]
plt.scatter(xs, ys, marker=".", color="blue")
Out[2]:
As you can see, most of the variation is in just one direction. We can formalize this intuition by using PCA to draw out the appropriate axes:
In [3]:
pca = PCA(n_components=2)
pca.fit(data)
axis1 = pca.components_[0] # axis of most variation
axis2 = pca.components_[1] # axis of second-most variation
plt.scatter(xs, ys, marker='.', color="blue")
plt.plot(axis1[0] * np.arange(-4, 5), axis1[1] * np.arange(-4, 5), linewidth=4,
         color="red")
plt.plot(axis2[0] * np.arange(-1, 2), axis2[1] * np.arange(-1, 2), linewidth=2,
         color="red")
Out[3]:
The two red lines define our new, more informative axes. As you can see, most of the information is captured by how far along the thicker red line each point lies.
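If you want to quantify this rather than just eyeball it, the fitted PCA object exposes explained_variance_ratio_, the fraction of the total variance captured by each axis. For this covariance matrix the population values are 0.95 and 0.05 (the eigenvalues are 1 ± 0.9), so the sample estimates should land close to that:
# Fraction of variance explained by each principal axis; the first should dominate.
print(pca.explained_variance_ratio_)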
PCA doesn't just work for two dimensions; we can apply it to higher-dimensional data as well. Here's a visualization with three dimensions:
In [4]:
from ipywidgets import interact
mean = np.zeros(3)
cov = [[1, 0,   0],
       [0, 1,   0.9],
       [0, 0.9, 1]]
# here dimension 1 is independent but dimensions 2 and 3 covary
data = np.random.multivariate_normal(mean, cov, 1000)
xs = data[:, 0]
ys = data[:, 1]
zs = data[:, 2]
figure1 = plt.figure()
ax1 = Axes3D(figure1)
ax1.scatter(xs, ys, zs, marker='.')
plt.close(figure1) # prevent double-display with interact
# You must be running the Jupyter notebook for interactions to work.
# It will just be a static image when viewed on GitHub or Nbviewer
@interact(elev=(0, 180), azim=(0, 180))
def plot_point_cloud(elev, azim):
    ax1.view_init(elev=elev, azim=azim)
    return figure1
In [5]:
# Apply PCA: fit the data and keep the two highest-variance principal components
pca = PCA(n_components=2)
pca.fit(data)
axis1 = pca.components_[0]
axis2 = pca.components_[1]
In [6]:
a, b, c = np.cross(axis1, axis2)
# By definition of cross product, <a, b, c> is orthogonal to the plane
# spanned by axis1 and axis2 through (0, 0, 0). The plane's equation is thus:
# ax + by + cz = 0
# or z = -(ax + by) / c
xx, yy = np.meshgrid(np.arange(-4, 4), np.arange(-4, 4))
zz = -(a * xx + b * yy) / c
figure2 = plt.figure()
ax2 = Axes3D(figure2)
ax2.plot_surface(xx, yy, zz, color="red", alpha=0.5)
ax2.scatter(xs, ys, zs, marker='.')
plt.close(figure2)
@interact(elev=(0, 180), azim=(0, 180))
def plot_point_cloud(elev, azim):
    ax2.view_init(elev=elev, azim=azim)
    return figure2
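The red plane is where PCA projects the points. To actually reduce the data from three dimensions to two, you can call transform on the fitted estimator; a minimal sketch using the pca object fitted above:
# Project the 3D point cloud onto the two principal axes (shape becomes (1000, 2)).
reduced = pca.transform(data)
plt.scatter(reduced[:, 0], reduced[:, 1], marker='.', color="blue")
plt.xlabel("first principal component")
plt.ylabel("second principal component")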
We'll be using an old dataset of creepy grayscale faces from AT&T called the Olivetti Faces. Images of human faces are really complex. If we want to do facial recognition, we need to simplify them first. PCA is a great way to capture the distinctive features of a person's face that might help us in classifying them. So in this example, we're going to extract and plot only the principal components from the faces. Be warned, the results look a bit spooky.
In [7]:
from numpy.random import RandomState
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn import decomposition
# Load faces data
faces = fetch_olivetti_faces(shuffle=True, random_state=RandomState(1)).data
n_samples, n_features = faces.shape
# Center the faces
faces_centered = faces - faces.mean(axis=0)  # global centering: subtract the mean face
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)  # local centering: subtract each image's own mean
print("Dataset consists of %d faces" % n_samples)
def plot_gallery(title, images, n_col=3, n_row=2):
    """
    Helper function to plot images.
    """
    image_shape = (64, 64)
    plt.figure(figsize=(2. * n_col, 2.26 * n_row))
    plt.suptitle(title, size=16)
    for i, comp in enumerate(images):
        plt.subplot(n_row, n_col, i + 1)
        vmax = max(comp.max(), -comp.min())
        plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,
                   interpolation='nearest',
                   vmin=-vmax, vmax=vmax)
        plt.xticks([])
        plt.yticks([])
    plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)
n_components = 6
# Plot a sample of the input data
plot_gallery("Input", faces_centered[:n_components])
# Apply PCA and plot results
print("Extracting the top %d components" % (n_components))
data = faces_centered
# For efficiency we use PCA with a randomized (stochastic) SVD solver.
estimator = decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True)
estimator.fit(data)
plot_gallery('PCA', estimator.components_[:n_components])
plt.show()
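To get a feel for how much facial detail six components preserve, one option is to compress each face into its six component scores and then map those scores back to pixels; a rough sketch reusing the estimator and plot_gallery defined above (inverse_transform also undoes the whitening):
# Each face becomes just 6 numbers...
codes = estimator.transform(faces_centered)
# ...and is then reconstructed back to 64x64 pixels from those 6 numbers.
reconstructed = estimator.inverse_transform(codes)
plot_gallery("Reconstructed from %d components" % n_components, reconstructed[:n_components])
plt.show()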