In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
We'll use this code to load the data:
In [2]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X, y = faces.data, faces.target
In [3]:
from sklearn.decomposition import PCA
pca = PCA().fit(X)
pca
Out[3]:
In [4]:
pca.n_components_
Out[4]:
In [5]:
plt.axes(xscale='log')
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative variance ratio');
We see that given about 100 components, we'll retain 90% of the variance.
Note that we could also have determined this automatically, using the following:
In [6]:
pca = PCA(n_components=0.90)
pca.fit(X)
pca.n_components_
Out[6]:
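To make the threshold selection concrete, here is a minimal NumPy-only sketch of how a component count can be read off the cumulative variance ratio. It runs on synthetic data rather than the faces, and the helper name n_components_for is hypothetical. It illustrates the idea only and is not scikit-learn's exact implementation, which may treat the boundary case differently.

```python
import numpy as np

def n_components_for(X, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold` (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    ratio = s ** 2 / np.sum(s ** 2)              # explained variance ratio
    cumulative = np.cumsum(ratio)
    # index of the first entry >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(cumulative)), cumulative

rng = np.random.RandomState(0)
# synthetic data: 200 samples, 50 features with steadily decaying scales
X_demo = rng.randn(200, 50) * np.linspace(10, 0.1, 50)
k, cumulative = n_components_for(X_demo, 0.90)
print(k, cumulative[k - 1])
```

The returned count is the first position at which the cumulative curve plotted earlier crosses the chosen threshold.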
The mean of the data (found in the mean_ attribute) and each component of the data (found in the rows of the components_ attribute) can be reshaped and interpreted as an image.
Use plt.imshow to visualize the mean and the first several rows of the components_ matrix. You'll have to play around with the colormap and grid settings to make this look OK.
In [7]:
imshape = faces.images.shape[-2:]
In [8]:
plt.axes(xticks=[], yticks=[])
plt.imshow(pca.mean_.reshape(imshape), cmap='binary_r');
In [9]:
fig, ax = plt.subplots(2, 5, figsize=(14, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
for i in range(10):
    ax.flat[i].imshow(pca.components_[i].reshape(imshape),
                      cmap='binary_r')
We see that the main components measure things like how off-center the face is, how much shadow there is, how deep the eye sockets are, etc.
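The reshape used above works because each image is flattened row by row into one row of the data matrix, so any vector in that pixel space (the mean, or a row of components_) can be reshaped back to the image shape. The sketch below illustrates this with small synthetic "images" and NumPy's SVD standing in for the fitted PCA object; all array names are made up for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
images = rng.rand(30, 6, 4)                 # 30 synthetic images, 6 x 4 pixels
flat = images.reshape(len(images), -1)      # one flattened row per image

mean_vec = flat.mean(axis=0)
# rows of Vt are the principal axes of the centered data
_, _, Vt = np.linalg.svd(flat - mean_vec, full_matrices=False)

img_shape = images.shape[-2:]
mean_image = mean_vec.reshape(img_shape)
first_component_image = Vt[0].reshape(img_shape)

# reshaping the flattened mean recovers the per-pixel mean image
print(np.allclose(mean_image, images.mean(axis=0)))
# the principal axes are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(len(Vt))))
```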
For several faces, plot the true image plus the reconstruction (computed using inverse_transform) for several different values of n_components. You might even use IPython's interactive functions to make this exploration easier.
Does the 90% variance choice seem to correspond to a good visual representation of each picture?
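Note: As you experiment with this, you may want to use a randomized solver rather than the default PCA for this task. Older versions of scikit-learn provided this as RandomizedPCA; that class was removed in version 0.20, and the equivalent is now PCA(svd_solver='randomized'). It is an approximate method with the same interface as PCA, and it runs much more quickly when only a limited number of components is requested.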
In [10]:
pca = PCA().fit(X)
def plot_face(i=279):
    fig, ax = plt.subplots(1, 6, figsize=(14, 3),
                           subplot_kw=dict(xticks=[], yticks=[]))
    # first panel: the original image
    ax[0].imshow(X[i].reshape(imshape), cmap='binary_r')
    for j, ncomp in enumerate([10, 20, 40, 80, 100]):
        # reconstruction from the first ncomp principal components
        approx = pca.mean_ + np.dot(pca.transform(X[i:i + 1])[:, :ncomp],
                                    pca.components_[:ncomp])
        ax[j + 1].imshow(approx.reshape(imshape), cmap='binary_r')
        ax[j + 1].set_title('{0} components'.format(ncomp))
In [11]:
plot_face(700)
In [12]:
# IPython.html.widgets has moved to the separate ipywidgets package
from ipywidgets import interact
interact(plot_face, i=(0, X.shape[0] - 1));
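The expression in plot_face keeps the first ncomp projection coefficients and maps them back to pixel space. Below is a minimal NumPy sketch of the same idea on synthetic data; reconstruct is a hypothetical helper, and SVD stands in for the fitted PCA object. It assumes an unwhitened PCA, in which case this matches calling inverse_transform on coefficients whose trailing entries are set to zero.

```python
import numpy as np

rng = np.random.RandomState(0)
# synthetic correlated data: 100 samples, 20 features
X_demo = rng.randn(100, 20) @ rng.randn(20, 20)
mean = X_demo.mean(axis=0)
# rows of `components` are the principal axes, ordered by variance
_, _, components = np.linalg.svd(X_demo - mean, full_matrices=False)

def reconstruct(x, ncomp):
    """Approximate x from its first ncomp principal components."""
    coeffs = (x - mean) @ components[:ncomp].T   # project onto the axes
    return mean + coeffs @ components[:ncomp]    # map back to feature space

x = X_demo[0]
errors = [np.linalg.norm(x - reconstruct(x, k)) for k in (1, 5, 10, 20)]
print(errors)
```

The reconstruction error never increases as components are added, and it vanishes (up to floating-point error) once all components are used, which is why the panels in plot_face sharpen from left to right.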