In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
Now we'll take a look at unsupervised learning on a facial recognition example. This uses a dataset available within scikit-learn consisting of a subset of the Labeled Faces in the Wild data. Note that this is a relatively large download (~200MB) so it may take a while to execute.
In [4]:
from sklearn import datasets
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4,
data_home='datasets')
lfw_people.data.shape
Out[4]:
Let's visualize these faces to see what we're working with:
In [ ]:
fig = plt.figure(figsize=(8, 6))
# plot several images
for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)
We'll do a typical train-test split on the images before performing unsupervised learning:
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(lfw_people.data, lfw_people.target, random_state=0)
print(X_train.shape, X_test.shape)
We can use PCA to reduce the original 1850 features of the face images to a manageable size, while maintaining most of the information in the dataset. Here it is useful to use a randomized variant of PCA (available in scikit-learn as PCA with svd_solver='randomized'), which is an approximation of PCA that can be much faster for large datasets.
In [ ]:
from sklearn import decomposition
pca = decomposition.PCA(n_components=150, whiten=True, svd_solver='randomized')
pca.fit(X_train)
One interesting part of PCA is that it computes the "mean" face, which can be interesting to examine:
In [ ]:
plt.imshow(pca.mean_.reshape((50, 37)), cmap=plt.cm.bone)
The principal components measure deviations about this mean along orthogonal axes. It is also interesting to visualize these principal components:
In [ ]:
print(pca.components_.shape)
In [ ]:
fig = plt.figure(figsize=(16, 6))
for i in range(30):
    ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])
    ax.imshow(pca.components_[i].reshape((50, 37)), cmap=plt.cm.bone)
The components ("eigenfaces") are ordered by their importance from top-left to bottom-right. We see that the first few components seem to primarily take care of lighting conditions; the remaining components pull out certain identifying features: the nose, eyes, eyebrows, etc.
With this projection computed, we can now project our original training and test data onto the PCA basis:
In [ ]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
In [ ]:
print(X_train_pca.shape)
print(X_test_pca.shape)
These projected coefficients are the weights of a linear combination of the component images ("eigenfaces") that approximates the original face.
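To make this concrete, the sketch below uses pca.inverse_transform to rebuild a face from its 150 projected coefficients and compares it with the original test image; the reconstruction is only approximate, since the discarded components are lost.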
In [ ]:
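# reconstruct faces from their 150 PCA coefficients
X_test_approx = pca.inverse_transform(X_test_pca)

fig = plt.figure(figsize=(4, 3))
ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
ax.imshow(X_test[0].reshape((50, 37)), cmap=plt.cm.bone)
ax.set_title('original')
ax = fig.add_subplot(1, 2, 2, xticks=[], yticks=[])
ax.imshow(X_test_approx[0].reshape((50, 37)), cmap=plt.cm.bone)
ax.set_title('150-component approximation')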