Breakout: EigenFaces

In this breakout, we'll use Principal Component Analysis (PCA) to explore the faces dataset that we saw earlier.



In [1]:

    
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
import seaborn as sns; sns.set()

We'll use this code to load the data:



In [2]:

    
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

X, y = faces.data, faces.target

1. Compute a PCA of the data

Compute a Principal Component Analysis of the data, using all components
Plot the cumulative explained variance ratio. How many components do we need to recover 90% of the variance?
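
For intuition about what the explained variance ratio measures, here is a minimal NumPy sketch on synthetic data. The array X_demo and its column scales are made up for illustration, and this is a hand-rolled calculation rather than scikit-learn's own implementation: the variance along each principal direction is the squared singular value of the centered data divided by n - 1, and the ratio is that variance divided by the total.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic stand-in for the data: 100 samples, 8 features with different spreads
X_demo = rng.normal(size=(100, 8)) * np.array([4, 3, 2, 1, 1, 0.5, 0.5, 0.1])

# center the data, then take singular values (returned in descending order)
Xc = X_demo - X_demo.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)

# variance captured along each principal direction, and its share of the total
variance = s ** 2 / (len(X_demo) - 1)
ratio = variance / variance.sum()

# running total: the fraction of variance kept by the first k components
cumulative = np.cumsum(ratio)
print(cumulative)
```

The last entry of cumulative is 1 (up to rounding), since keeping every component keeps all of the variance.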



In [3]:

    
from sklearn.decomposition import PCA
pca = PCA().fit(X)
pca









    Out[3]:





PCA(copy=True, n_components=None, whiten=False)



In [4]:

    
pca.n_components_









    Out[4]:





1850



In [5]:

    
plt.axes(xscale='log')
plt.plot(pca.explained_variance_ratio_.cumsum())
plt.xlabel('number of components')
plt.ylabel('cumulative variance ratio');

We see that roughly 100 components are enough to retain 90% of the variance.
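
The count can also be read off the cumulative curve numerically rather than by eye. The helper below is a hypothetical sketch (the name n_components_for_variance and the toy ratios are not part of the original notebook); it returns the smallest number of leading components whose cumulative ratio reaches a threshold.

```python
import numpy as np

def n_components_for_variance(explained_variance_ratio, threshold=0.90):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index where cumulative >= threshold;
    # adding 1 turns that zero-based index into a count of components
    return int(np.searchsorted(cumulative, threshold) + 1)

# toy variance ratios for illustration only (not taken from the faces data)
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
print(n_components_for_variance(ratios, 0.90))  # prints 4
```

With a fitted model, the same helper could be applied to pca.explained_variance_ratio_.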

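Note that we could also have determined this automatically: passing a fraction between 0 and 1 as n_components makes PCA keep just enough components to explain that share of the variance: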



In [6]:

    
pca = PCA(n_components=0.90)
pca.fit(X)
pca.n_components_









    Out[6]:





105

2. Plot the "eigenfaces"

The mean of the data (found in the mean_ attribute) and each component of the data (found in the rows of the components_ attribute) can be reshaped and interpreted as an image.

Display the mean face using plt.imshow
Display the first few "eigenfaces" (given by the rows of the components_ matrix)

You'll have to play around with the colormap and grid settings to make this look OK.
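
The reshaping step can be tried on a toy array first. This is a minimal sketch with made-up data: the images array below is a hypothetical stand-in for faces.images, showing how each image is flattened into one row and how any vector of that length can be turned back into an image.

```python
import numpy as np

# hypothetical stack of 3 tiny "images", each 4 x 5 pixels
images = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)
imshape = images.shape[-2:]              # (4, 5): height and width of one image

# flatten each image into one row, the 2D layout that PCA expects
flat = images.reshape(len(images), -1)   # shape (3, 20)

# any length-20 vector (a data row, the mean, a component) maps back to an image
mean_image = flat.mean(axis=0).reshape(imshape)
restored = flat[1].reshape(imshape)
print(flat.shape, mean_image.shape)
```

The same reshape applies unchanged to pca.mean_ and to each row of pca.components_.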



In [7]:

    
imshape = faces.images.shape[-2:]



In [8]:

    
plt.axes(xticks=[], yticks=[])
plt.imshow(pca.mean_.reshape(imshape), cmap='binary_r');



In [9]:

    
fig, ax = plt.subplots(2, 5, figsize=(14, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
for i in range(10):
    ax.flat[i].imshow(pca.components_[i].reshape(imshape),
                      cmap='binary_r')

We see that the main components measure things like how off-center the face is, how much shadow there is, how deep the eye sockets are, etc.

3. Plot the reconstructed faces

For several faces, plot the true image plus the reconstruction (computed using inverse_transform) for several different values of n_components. You might even use IPython's interactive functions to make this exploration easier.

Does the 90% variance choice seem to correspond to a good visual representation of each picture?

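Note: As you experiment with this, you may want to use RandomizedPCA rather than PCA for this task. RandomizedPCA is an approximate method with the same interface as PCA, but it runs much more quickly. (RandomizedPCA has been removed from recent scikit-learn releases; there, PCA(svd_solver='randomized') selects the same kind of approximate solver.)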
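
Before turning to the faces, it may help to see what a truncated reconstruction does numerically. The following is a minimal NumPy sketch on synthetic data; X_demo, reconstruct, and the chosen component counts are illustrative assumptions, and the SVD here stands in for a fitted PCA. A sample is centered, projected onto the first few principal directions, and mapped back by adding the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for X: 200 samples, 30 features with decaying scale
X_demo = rng.normal(size=(200, 30)) * np.linspace(5.0, 0.1, 30)

mean = X_demo.mean(axis=0)
# rows of Vt are orthonormal directions, playing the role of pca.components_
_, _, Vt = np.linalg.svd(X_demo - mean, full_matrices=False)

def reconstruct(x, n_components):
    """Project x onto the first n_components directions and map back."""
    coeffs = (x - mean) @ Vt[:n_components].T
    return mean + coeffs @ Vt[:n_components]

x = X_demo[0]
# reconstruction error for an increasing number of components
errors = [np.linalg.norm(x - reconstruct(x, k)) for k in (5, 10, 20, 30)]
print(errors)
```

The error can only shrink as components are added, and it vanishes (up to rounding) once all 30 directions are used.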



In [10]:

    
pca = PCA().fit(X)

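def plot_face(i=279):
    """Show face i next to its reconstructions from a growing number of components."""
    fig, ax = plt.subplots(1, 6, figsize=(14, 3),
                           subplot_kw=dict(xticks=[], yticks=[]))
    # first panel: the original image
    ax[0].imshow(X[i].reshape(imshape), cmap='binary_r')
    ax[0].set_title('original')

    # project once onto all components, then truncate for each panel
    coeffs = pca.transform(X[i:i + 1])
    for j, ncomp in enumerate([10, 20, 40, 80, 100]):
        # mean face plus the weighted sum of the first ncomp eigenfaces
        approx = pca.mean_ + np.dot(coeffs[:, :ncomp],
                                    pca.components_[:ncomp])
        ax[j + 1].imshow(approx.reshape(imshape), cmap='binary_r')
        ax[j + 1].set_title('{0} components'.format(ncomp))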



In [11]:

    
plot_face(700)



In [12]:

    
# requires the ipywidgets package (the old IPython.html.widgets module is gone)
from ipywidgets import interact
interact(plot_face, i=(0, X.shape[0] - 1));