Machine Learning Breakout: Facial Recognition

This exercise will walk you through the process of using machine learning for facial recognition.


In [ ]:
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# use seaborn for better matplotlib styles
import seaborn; seaborn.set(style='white')

1. Fetch & explore the data

The data we'll use is a number of snapshots of the faces of world leaders. We'll fetch the data as follows:


In [ ]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
  • Explore this data, which is laid out very similarly to the digits data we saw earlier. How many samples are there? How many features? How many classes, or targets?
  • Use subplots and plt.imshow to plot several of the images. How many pixels are in each image?
  • Use sklearn.model_selection.train_test_split to split the data into a training set and a test set.
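The steps above might be sketched as follows (variable names are illustrative; note that fetch_lfw_people downloads the dataset on first use):

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Like the digits data, faces has a (n_samples, n_features) data array,
# a target array of class labels, and the raw 2D images.
print(faces.data.shape)     # (n_samples, n_features)
print(faces.images.shape)   # (n_samples, height, width) -- height * width = n_features
print(faces.target_names)   # one name per class

# Plot a few of the images with their labels
fig, axes = plt.subplots(2, 4, figsize=(8, 5))
for ax, image, target in zip(axes.flat, faces.images, faces.target):
    ax.imshow(image, cmap='gray')
    ax.set_title(faces.target_names[target], fontsize=8)
    ax.axis('off')

# Hold out a test set for the classification task in part 3
Xtrain, Xtest, ytrain, ytest = train_test_split(
    faces.data, faces.target, random_state=0)
```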

2. Projecting the Data

Let's use some dimensionality reduction routines to try to understand the data. Just a warning: you'll probably find that, unlike in the case of the handwritten digits, the projections will be a bit too jumbled to gain much insight. Still, it's always a useful step in understanding your data!

  • Project the data to two-dimensions with Principal Component Analysis, and scatter-plot the results
  • Project the data to two dimensions with Isomap and scatter-plot the results
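One way these two projections might look in code (a sketch, coloring each point by its target label):

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Linear projection: principal component analysis
pca_proj = PCA(n_components=2).fit_transform(faces.data)

# Nonlinear projection: Isomap (default n_neighbors=5)
iso_proj = Isomap(n_components=2).fit_transform(faces.data)

# Scatter-plot both projections side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for ax, proj, title in [(ax1, pca_proj, 'PCA'), (ax2, iso_proj, 'Isomap')]:
    ax.scatter(proj[:, 0], proj[:, 1], c=faces.target, s=10)
    ax.set_title(title)
```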

3. Classification of unknown images

Here we'll perform a classification task on our data. Given a training set, we want to build a classifier that will accurately predict the labels of the test set.

  • Start by splitting your data into a train and test set (you can use sklearn.model_selection.train_test_split)
  • We'll use a support vector classifier (sklearn.svm.SVC) to classify the data. Import this and instantiate the estimator.
  • Perform an initial fit on the data, predict the test labels, and use sklearn.metrics.accuracy_score to see how well you're doing.
  • The estimator can be tuned to improve the fit. We'll do this by adjusting the C parameter of SVC. Look at the SVC docstring and try some choices for the kernel, for C, and for gamma. What's the best accuracy you can find?
  • For this best estimator, print the sklearn.metrics.classification_report and sklearn.metrics.confusion_matrix, and plot some of the images with the true and predicted label. How well does it do?
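A sketch of the whole workflow is below. The C and gamma values in the loop are illustrative starting points, not known-best settings, and strictly speaking a validation set or cross-validation should be used for tuning; scoring on the test set here simply mirrors the informal exploration the exercise asks for:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    faces.data, faces.target, random_state=0)

# Initial fit with default settings
clf = SVC().fit(Xtrain, ytrain)
print('default accuracy:', accuracy_score(ytest, clf.predict(Xtest)))

# Try a few kernel parameters by hand and keep the best
best_score, best_clf = 0.0, None
for C in [1, 5, 10, 50]:
    for gamma in ['scale', 0.0005, 0.001]:
        clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(Xtrain, ytrain)
        score = accuracy_score(ytest, clf.predict(Xtest))
        if score > best_score:
            best_score, best_clf = score, clf

ypred = best_clf.predict(Xtest)
print('best accuracy:', best_score)
print(classification_report(ytest, ypred, target_names=faces.target_names))
print(confusion_matrix(ytest, ypred))

# Plot some test images with their true and predicted labels
h, w = faces.images.shape[1:]
fig, axes = plt.subplots(2, 4, figsize=(8, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(Xtest[i].reshape(h, w), cmap='gray')
    ax.set_title('true: %s\npred: %s' % (faces.target_names[ytest[i]],
                                         faces.target_names[ypred[i]]),
                 fontsize=7)
    ax.axis('off')
```

If the accuracy plateaus, a common next step (used in scikit-learn's eigenfaces example) is to reduce the data with PCA before the SVC, which speeds up fitting and often improves the score.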