Classifiers Exercises

Timothy Helton

Labeled Faces in the Wild

These exercises use pictures of famous people collected from the internet. See the Scikit-Learn reference for the Labeled Faces in the Wild dataset.



NOTE:
This notebook uses code found in the k2datascience.classification module. To execute all the cells, do one of the following:

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHONPATH environment variable.
  • Create a link to the classification.py file in the same directory as this notebook.


Imports


In [ ]:
from k2datascience import classification
from k2datascience import plotting

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Load Data


Exercise 1: Explore Data

1.1: Open the dataset and select only the faces for which we have 70 or more images.


In [ ]:
wf = classification.WildFaces(n_faces=70)
wf.data.shape

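The wrapper presumably builds on scikit-learn's LFW loader. A minimal direct equivalent, sketched here with an assumed lfw name and resize factor (the wrapper's actual arguments may differ):


In [ ]:
# Fetch the LFW faces, keeping only people with at least 70 images each.
from sklearn.datasets import fetch_lfw_people

lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
lfw.data.shape
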
1.2: Print a few of the faces to familiarize yourself with the data.


In [ ]:
wf.faces_plot()

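A rough matplotlib equivalent of faces_plot, assuming the lfw object from the sketch above:


In [ ]:
import matplotlib.pyplot as plt

# Show the first 15 faces with their labels.
fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for ax, image, target in zip(axes.ravel(), lfw.images, lfw.target):
    ax.imshow(image, cmap='gray')
    ax.set_title(lfw.target_names[target], fontsize=8)
    ax.axis('off')
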
1.3: Plot the number of images for each label.


In [ ]:
wf.targets_barplot()

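The same bar plot can be hand-rolled from the target array (again assuming the lfw object from the sketch above):


In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Count how many images belong to each person and plot the totals.
labels, counts = np.unique(lfw.target, return_counts=True)
plt.barh(lfw.target_names[labels], counts)
plt.xlabel('Image Count')
plt.title('Images per Person')
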
1.4: Notice that the number of features in our dataset is fairly large. This is a good moment to apply PCA to reduce its dimensionality. Let's choose 150 components.


In [ ]:
wf.calc_pca()
wf.var_pct[wf.var_pct.cumsum() < .99].tail(1).index[0]

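calc_pca presumably wraps scikit-learn's PCA. A sketch of the equivalent calls (the pca and X_pca names, and the whiten setting, are assumptions of this sketch):


In [ ]:
import numpy as np
from sklearn.decomposition import PCA

# Project the faces onto the top 150 principal components.
pca = PCA(n_components=150, whiten=True).fit(lfw.data)
X_pca = pca.transform(lfw.data)

# Number of components needed to capture 99% of the variance,
# if that threshold is reached within the 150 kept components.
(np.cumsum(pca.explained_variance_ratio_) < 0.99).sum() + 1
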
1.5: A really cool thing about PCA is that it computes the mean of each feature, which we can then use to reconstruct the 'average' face in our dataset.


In [ ]:
wf.avg_face_plot()

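The 'average' face is just the fitted PCA's mean vector reshaped back to image dimensions; a sketch reusing the pca and lfw objects assumed above:


In [ ]:
import matplotlib.pyplot as plt

# Reshape the PCA mean vector back into a 2-D image and display it.
h, w = lfw.images.shape[1:]
plt.imshow(pca.mean_.reshape(h, w), cmap='gray')
plt.axis('off')
plt.title('Average Face')
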
1.6: Plot the components of the PCA. Notice that these are always ordered by importance (explained variance).


In [ ]:
wf.components_plot()

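An equivalent 'eigenfaces' plot can be sketched by reshaping each row of pca.components_ (reusing h, w and the pca object assumed above):


In [ ]:
import matplotlib.pyplot as plt

# The components are stored in decreasing order of explained variance.
fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for n, ax in enumerate(axes.ravel()):
    ax.imshow(pca.components_[n].reshape(h, w), cmap='gray')
    ax.set_title(f'Component {n + 1}', fontsize=8)
    ax.axis('off')
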
Exercise 2: Models

2.1a: Logistic Regression


In [ ]:
wf.classify_data(model='LR')
print(wf.score)
print(wf.log_loss)
wf.confusion
print(wf.classification)

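classify_data presumably wraps a standard scikit-learn fit/predict/score pipeline. A sketch of the equivalent steps, assuming the X_pca features from the PCA sketch above and a hypothetical 75/25 split (the wrapper's actual split and solver settings are unknown):


In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, log_loss)
from sklearn.model_selection import train_test_split

# Hold out a quarter of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, lfw.target, test_size=0.25, random_state=0)

# Fit the classifier and report the same four metrics as the wrapper.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(log_loss(y_test, lr.predict_proba(X_test), labels=lr.classes_))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, labels=lr.classes_,
                            target_names=lfw.target_names))
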
2.1b: K-Neighbors Classifier


In [ ]:
wf.accuracy_vs_k()

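accuracy_vs_k presumably sweeps n_neighbors and scores each fit. A sketch using the split assumed in the Logistic Regression example above:


In [ ]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Score a KNN classifier for each candidate k.
ks = range(1, 21)
scores = [KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
          .score(X_test, y_test) for k in ks]
plt.plot(ks, scores, marker='o')
plt.xlabel('k (n_neighbors)')
plt.ylabel('Accuracy')
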
In [ ]:
wf.classify_data(model='KNN', n=9)
print(wf.score)
print(wf.log_loss)
wf.confusion
print(wf.classification)

2.1c: Linear Discriminant Analysis


In [ ]:
wf.classify_data(model='LDA')
print(wf.score)
print(wf.log_loss)
wf.confusion
print(wf.classification)

2.1d: Naive Bayes


In [ ]:
wf.classify_data(model='NB')
print(wf.score)
print(wf.log_loss)
wf.confusion
print(wf.classification)

2.1e: Quadratic Discriminant Analysis


In [ ]:
wf.classify_data(model='QDA')
print(wf.score)
print(wf.log_loss)
wf.confusion
print(wf.classification)

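For reference, the whole comparison can be sketched as one loop over the corresponding scikit-learn estimators (reusing the split assumed above; the wrapper's internal estimator settings may differ):


In [ ]:
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# One estimator per model flag used by the wrapper.
models = {'LR': LogisticRegression(max_iter=1000),
          'KNN': KNeighborsClassifier(n_neighbors=9),
          'LDA': LinearDiscriminantAnalysis(),
          'NB': GaussianNB(),
          'QDA': QuadraticDiscriminantAnalysis()}
for name, model in models.items():
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(f'{name}: {score:.3f}')
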
2.2: Which one had the best performance? Which one had the worst performance?

FINDINGS
  • Logistic Regression had the best score of 0.820.
  • Naive Bayes had the worst score of 0.453.

2.3: Any idea why the score on the top two differs so drastically from the last two?

The linear decision boundaries of Logistic Regression and LDA describe this dataset well. Naive Bayes and QDA, by contrast, must estimate per-class variance (and, for QDA, full covariance) parameters from relatively few images per person, which is hard in the 150-dimensional PCA space and likely explains their poor scores.

2.4: Find the log loss, precision, recall, and f-score of the best model.


In [ ]:
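# Logistic Regression scored highest above; its wrapper attributes already
# hold the requested metrics (the classification report includes per-class
# precision, recall, and f-score).
wf.classify_data(model='LR')
print(wf.log_loss)
print(wf.classification)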

2.5: Plot the Confusion Matrix of the best model.


In [ ]:
plotting.confusion_heatmap_plot(wf.confusion, wf.target_names,
                                title='Labeled Faces in the Wild')

2.6 (optional): Edit the code from Step 2 to display not only the image but also the label, color-coding the label red if your model got it wrong and black if it got it right.


In [ ]:
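# Sketch only: wf.X_test, wf.y_test, and wf.predicted are hypothetical
# attribute names -- substitute whatever the wrapper actually exposes.
# h and w are the image dimensions from the PCA sketches above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for n, ax in enumerate(axes.ravel()):
    ax.imshow(wf.X_test[n].reshape(h, w), cmap='gray')
    correct = wf.predicted[n] == wf.y_test[n]
    ax.set_title(wf.target_names[wf.predicted[n]],
                 color='black' if correct else 'red', fontsize=8)
    ax.axis('off')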