In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
PCA is a useful preprocessing technique both for visualizing data in 2 or 3 dimensions and for improving the performance of downstream algorithms such as classifiers. We will see more details about using PCA as part of a machine learning pipeline in the next section; here we explore the intuition behind what PCA does and why it is useful for certain tasks.
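As a quick preview of that pipeline idea (a minimal sketch only: the synthetic dataset and the choice of LogisticRegression here are arbitrary and not used elsewhere in this notebook), PCA typically sits in front of an estimator like this:
In [ ]:
# Sketch: dimensionality reduction followed by a classifier in one pipeline.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
clf = make_pipeline(PCA(n_components=5), LogisticRegression())
clf.fit(X_demo, y_demo)
print(clf.score(X_demo, y_demo))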
The goal of PCA is to find the directions of maximum variation in the data and project onto them. This is especially helpful for data that is stretched along a particular direction. Here we show a two-dimensional example to build intuition for how PCA can help classification.
In [ ]:
import numpy as np
random_state = np.random.RandomState(1999)
X = random_state.randn(500, 2)  # use the seeded RandomState so the figure is reproducible
red_idx = np.where(X[:, 0] < 0)[0]
blue_idx = np.where(X[:, 0] >= 0)[0]
# Stretching
s_matrix = np.array([[1, 0],
                     [0, 20]])
# Rotation
r_angle = 33
r_rad = np.pi * r_angle / 180
r_matrix = np.array([[np.cos(r_rad), -np.sin(r_rad)],
                     [np.sin(r_rad), np.cos(r_rad)]])
X = np.dot(X, s_matrix).dot(r_matrix)
plt.scatter(X[red_idx, 0], X[red_idx, 1], color="darkred")
plt.scatter(X[blue_idx, 0], X[blue_idx, 1], color="steelblue")
# Hide the axes so only the shape of the point cloud is visible
plt.axis('off')
plt.title("Skewed Data")
In [ ]:
from sklearn.decomposition import PCA
pca = PCA()
X_t = pca.fit_transform(X)
plt.scatter(X_t[red_idx, 0], X_t[red_idx, 1], color="darkred")
plt.scatter(X_t[blue_idx, 0], X_t[blue_idx, 1], color="steelblue")
plt.axis('off')
plt.title("PCA Corrected Data")
We can also use PCA to visualize complex data in low dimensions, in order to see how "close" and "far" different data points are in a 2D space. There are many ways to do this kind of visualization, and several common algorithms are found in sklearn.manifold. PCA is one of the simplest and most common methods for quickly visualizing a dataset.
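As a small illustration (a sketch that assumes the digits dataset bundled with scikit-learn and t-SNE as a representative manifold method; neither is used elsewhere in this notebook), here is what a linear 2D PCA projection looks like next to a nonlinear manifold embedding:
In [ ]:
# Sketch: compare a linear 2D projection (PCA) with a nonlinear one (t-SNE).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
digits_pca = PCA(n_components=2).fit_transform(digits.data)
digits_tsne = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(digits_pca[:, 0], digits_pca[:, 1], c=digits.target, cmap=plt.cm.Spectral, s=5)
axes[0].set_title("PCA")
axes[1].scatter(digits_tsne[:, 0], digits_tsne[:, 1], c=digits.target, cmap=plt.cm.Spectral, s=5)
axes[1].set_title("t-SNE")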
Now we'll take a look at unsupervised learning on a facial recognition example. This uses a dataset available within scikit-learn consisting of a subset of the Labeled Faces in the Wild data. Note that this is a relatively large download (~200MB) so it may take a while to execute.
In [ ]:
from sklearn import datasets
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4,
                                       data_home='datasets')
lfw_people.data.shape
If you're on a Unix-based system such as Linux or Mac OS X, these shell commands can be used to inspect the downloaded dataset:
In [ ]:
!ls datasets
In [ ]:
!du -sh datasets/lfw_home
Let's visualize these faces to see what we're working with:
In [ ]:
fig = plt.figure(figsize=(8, 6))
# plot several images
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)
We'll do a typical train-test split on the images before performing unsupervised learning:
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(lfw_people.data, lfw_people.target, random_state=0)
print(X_train.shape, X_test.shape)
We can use PCA to reduce the original 1850 features of the face images to a more manageable size, while retaining most of the information in the dataset. Here it is useful to use a randomized variant of PCA (historically RandomizedPCA, now available as PCA with svd_solver='randomized'), an approximation of PCA that can be much faster for large datasets.
In [ ]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=150, whiten=True, svd_solver='randomized')
pca.fit(X_train)
One byproduct of PCA is that it computes the "mean" face, which can be interesting to examine:
In [ ]:
plt.imshow(pca.mean_.reshape((50, 37)), cmap=plt.cm.bone)
The principal components measure deviations about this mean along orthogonal axes. It is also interesting to visualize these principal components:
In [ ]:
print(pca.components_.shape)
In [ ]:
fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(pca.components_[i].reshape((50, 37)), cmap=plt.cm.bone)
The components ("eigenfaces") are ordered by their importance from top-left to bottom-right. We see that the first few components seem to primarily take care of lighting conditions; the remaining components pull out certain identifying features: the nose, eyes, eyebrows, etc.
With this projection computed, we can now project our original training and test data onto the PCA basis:
In [ ]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
In [ ]:
print(X_train_pca.shape)
print(X_test_pca.shape)
Each projected sample is a set of coefficients for a linear combination of the component images; that combination approximates the original face.
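One way to see this concretely (a sketch added here, using pca.inverse_transform) is to rebuild a face from its 150 projected coefficients and compare it with the original image: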
In [ ]:
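# Sketch: reconstruct a training face from its low-dimensional representation
# and show it next to the original 50x37 image.
fig, (ax_orig, ax_rec) = plt.subplots(1, 2, figsize=(6, 3))
ax_orig.imshow(X_train[0].reshape((50, 37)), cmap=plt.cm.bone)
ax_orig.set_title("Original")
ax_orig.set_xticks([]); ax_orig.set_yticks([])
reconstruction = pca.inverse_transform(X_train_pca[:1])[0]
ax_rec.imshow(reconstruction.reshape((50, 37)), cmap=plt.cm.bone)
ax_rec.set_title("150-component reconstruction")
ax_rec.set_xticks([]); ax_rec.set_yticks([])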