Machine Learning Breakout: Facial Recognition

This exercise will walk you through the process of using machine learning for facial recognition.


In [1]:
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# use seaborn for better matplotlib styles
import seaborn; seaborn.set(style='white')

1. Fetch & explore the data

The data we'll use is a number of snapshots of the faces of world leaders. We'll fetch the data as follows:


In [2]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
  • Explore this data, which is laid out very similarly to the digits data we saw earlier. How many samples are there? How many features? How many classes, or targets?
  • Use subplots and plt.imshow to plot several of the images. How many pixels are in each image?
  • Use sklearn.model_selection.train_test_split to split the data into a training set and a test set.

In [3]:
faces.keys()


Out[3]:
dict_keys(['DESCR', 'target', 'images', 'target_names', 'data'])

In [4]:
n_samples, n_features = faces.data.shape
print(n_samples, n_features)


1288 1850

In [5]:
print(faces.target_names)


['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Tony Blair']
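
To answer the remaining exploration questions, we can also check the shape of the image array and count how many snapshots we have of each person (a quick sketch using the objects defined above):

In [ ]:
# each image is stored both as a flat feature vector and as a 2D array
print(faces.images.shape)   # (n_samples, height, width); height * width == n_features

# number of snapshots per person
for name, count in zip(faces.target_names, np.bincount(faces.target)):
    print("{0:20s} {1}".format(name, count))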

In [6]:
fig, axes = plt.subplots(4, 8, figsize=(12, 9))

for i, ax in enumerate(axes.flat):
    ax.imshow(faces.images[i], cmap='binary_r')
    ax.set_title(faces.target_names[faces.target[i]], fontsize=10)
    ax.set_xticks([]); ax.set_yticks([])


2. Projecting the Data

Let's use some dimensionality reduction routines to try to understand the data. Just a warning: you'll probably find that, unlike in the case of the handwritten digits, the projections will be a bit too jumbled to gain much insight. Still, it's always a useful step in understanding your data!

  • Project the data to two dimensions with Principal Component Analysis, and scatter-plot the results
  • Project the data to two dimensions with Isomap and scatter-plot the results

In [7]:
X = faces.data
y = faces.target

In [8]:
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X_pca = PCA(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2).fit_transform(X)

In [9]:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=faces.target,
            cmap='Blues')
plt.title('PCA projection');



In [10]:
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=faces.target,
            cmap='Blues')
plt.title('Isomap projection');


It's not obvious from these projections that the data can be well-separated; on the other hand, we've reduced our 1850-dimensional data to just two dimensions!
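
To get a rough sense of how much information the two-dimensional PCA projection keeps, we can inspect the explained variance ratios (a quick check; the exact fractions will depend on the data):

In [ ]:
# keep the fitted PCA object so we can inspect how much variance 2 components retain
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total fraction retained in two dimensions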

3: Classification of unknown images

Here we'll perform a classification task on our data. Given a training set, we want to build a classifier that will accurately predict the labels of the test set.

  • Start by splitting your data into a training set and a test set (you can use sklearn.model_selection.train_test_split)
  • We'll use a support vector classifier (sklearn.svm.SVC) to classify the data. Import this and instantiate the estimator.
  • Perform an initial fit on the data, predict the test labels, and use sklearn.metrics.accuracy_score to see how well you're doing.
  • The estimator can be tuned to make the fit better. We'll do this by adjusting the C parameter of SVC. Look at the SVC docstring and try some choices for the kernel, for C, and for gamma. What's the best accuracy you can find?
  • For this best estimator, print the sklearn.metrics.classification_report and sklearn.metrics.confusion_matrix, and plot some of the images with the true and predicted label. How well does it do?

In [11]:
# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)


(966, 1850) (322, 1850)

In [12]:
# instantiate the estimator
from sklearn.svm import SVC
clf = SVC()

In [13]:
# Do a fit and check accuracy
from sklearn.metrics import accuracy_score

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred)


Out[13]:
0.40993788819875776

In [14]:
# Note that we can also do this:
clf.score(X_test, y_test)


Out[14]:
0.40993788819875776

In [15]:
# Try out various hyperparameters
for kernel in ['linear', 'rbf', 'poly']:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("{0}: accuracy = {1}".format(kernel, score))


linear: accuracy = 0.8260869565217391
rbf: accuracy = 0.40993788819875776
poly: accuracy = 0.8012422360248447

It looks like the linear kernel gives the best results.
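
The exercise also suggests adjusting C and gamma. Rather than looping by hand, we can let a grid search do it; here's a sketch using sklearn.model_selection.GridSearchCV, where the parameter ranges are just reasonable starting guesses (this can take a minute or two to run):

In [ ]:
from sklearn.model_selection import GridSearchCV

# search a small grid of kernels and regularization strengths,
# cross-validating on the training set only
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [1, 10, 100, 1000], 'gamma': [1e-4, 1e-3, 1e-2]},
]
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))

For the rest of this section we'll stick with the simple linear-kernel classifier found above.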


In [16]:
best_clf = SVC(kernel='linear').fit(X_train, y_train)
y_pred = best_clf.predict(X_test)

In [17]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=faces.target_names))


                   precision    recall  f1-score   support

     Ariel Sharon       0.76      0.79      0.77        28
     Colin Powell       0.79      0.84      0.82        63
  Donald Rumsfeld       0.65      0.71      0.68        24
    George W Bush       0.91      0.86      0.88       132
Gerhard Schroeder       0.76      0.80      0.78        20
      Hugo Chavez       0.90      0.82      0.86        22
       Tony Blair       0.77      0.82      0.79        33

      avg / total       0.83      0.83      0.83       322


In [18]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)


Out[18]:
array([[ 22,   4,   0,   2,   0,   0,   0],
       [  3,  53,   1,   2,   0,   1,   3],
       [  2,   2,  17,   1,   1,   0,   1],
       [  0,   7,   8, 113,   0,   1,   3],
       [  1,   0,   0,   2,  16,   0,   1],
       [  0,   1,   0,   1,   2,  18,   0],
       [  1,   0,   0,   3,   2,   0,  27]])
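
The raw counts are a bit easier to read as an image. Here's a quick visualization sketch (rows are true labels, columns are predicted labels, in the order of faces.target_names):

In [ ]:
cm = confusion_matrix(y_test, y_pred)

plt.imshow(cm, interpolation='nearest', cmap='Blues')
plt.colorbar()
plt.xticks(range(len(faces.target_names)), faces.target_names, rotation=90)
plt.yticks(range(len(faces.target_names)), faces.target_names)
plt.xlabel('predicted label')
plt.ylabel('true label');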

In [19]:
shape = faces.images.shape[-2:]
last_names = [label.split()[-1] for label in faces.target_names]

titles = ["True: {0}\nPred: {1}".format(last_names[i_test],
                                        last_names[i_pred])
          for (i_test, i_pred) in zip(y_test, y_pred)]
    
fig, axes = plt.subplots(4, 8, figsize=(12, 9),
                         subplot_kw=dict(xticks=[], yticks=[]))

for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i].reshape(shape), cmap='binary_r')
    ax.set_title(titles[i], fontsize=10)


It still amazes me that with such a simple algorithm, we can get ~80% prediction accuracy on data like this!