Breakout: Model Validation

Here we'll practice the process of model validation and evaluate how to improve a model. We'll return to the Labeled Faces in the Wild dataset that we saw previously, and use the cross-validation techniques we covered to find the best possible model.


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
# If this causes an error, you can comment it out.
import seaborn as sns
sns.set()

In [2]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

X, y = faces.data, faces.target

1. Validation with Random Forests

  • Use a RandomForestClassifier with the default parameters, and use 10-fold cross-validation to determine the accuracy.
  • Construct validation curves for the random forest classifier on this data, exploring the effect of max_depth on the result.
  • What is the best value for max_depth (approximately)? What is the best score for this estimator?
  • Construct a learning curve for the Random Forest Classifier using this value for max_depth.
  • Given the validation and learning curves, how do you think you could improve this classifier – should you seek a better model/more features, or should you seek more data?
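One way these steps might look with scikit-learn's model_selection utilities. This is a sketch only: a small synthetic dataset (X_demo, y_demo, names of our choosing) stands in for the faces data so it runs quickly; substitute the X, y loaded above for the real exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, validation_curve, learning_curve

# Synthetic stand-in for the faces data (swap in X, y from above)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, random_state=0)

# 10-fold cross-validation with default parameters
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_demo, y_demo, cv=10)
print("mean CV accuracy:", scores.mean())

# Validation curve: vary max_depth, record training/validation accuracy
depths = np.arange(1, 16)
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5)
plt.plot(depths, train_scores.mean(axis=1), label="training score")
plt.plot(depths, val_scores.mean(axis=1), label="validation score")
plt.xlabel("max_depth"); plt.ylabel("accuracy"); plt.legend()

# Best depth = where the mean validation score peaks
best_depth = depths[val_scores.mean(axis=1).argmax()]

# Learning curve at the chosen depth
sizes, lc_train, lc_val = learning_curve(
    RandomForestClassifier(max_depth=best_depth, random_state=0),
    X_demo, y_demo, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
```

If the training and validation curves converge to a similar score as the training size grows, more data is unlikely to help; a large gap between them suggests the opposite.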

2. Validation with Support Vector Machines

The Support Vector Classifier is often a much more powerful model than random forests, especially for smaller datasets.

Here we'll repeat the above exercise, but use sklearn.svm.SVC instead. The support vector classifier that we'll use below does not scale well with data dimension. For this reason, we'll start by doing a dimensionality reduction of the data.

  • Use the SVC with the default parameters, and use 3-fold cross-validation to determine the accuracy.
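A minimal sketch of this step, again on a small synthetic stand-in (swap in the faces X, y above to see the timing difference for real):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the faces data (swap in X, y from above)
X_demo, y_demo = make_classification(n_samples=300, n_features=50, random_state=0)

# 3-fold cross-validation of SVC with default parameters
svc_scores = cross_val_score(SVC(), X_demo, y_demo, cv=3)
print("SVC mean 3-fold accuracy:", svc_scores.mean())
```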

You'll notice that this computation takes a relatively long time in comparison to the Random Forest Classifier. This is because the data has a very high dimension, and SVC does not scale well with data dimension. In order to make the remaining tasks computationally viable, we'll reduce the dimension of the data.

  • Use the PCA estimator to project the data down to 100 dimensions.
  • Re-compute the SVC cross-validation on this result. Is the score similar?
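These two steps might be sketched as follows, again with a synthetic high-dimensional stand-in in place of the faces data:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic high-dimensional stand-in (swap in the faces X, y from above)
X_demo, y_demo = make_classification(n_samples=300, n_features=150, random_state=0)

# Project down to 100 dimensions with PCA,
# then repeat the 3-fold cross-validation on the projection
pca = PCA(n_components=100, random_state=0)
X_proj = pca.fit_transform(X_demo)
proj_scores = cross_val_score(SVC(), X_proj, y_demo, cv=3)
print("mean accuracy on projected data:", proj_scores.mean())
```

Note that fit_transform learns the projection from the data and applies it in one step; the projected array has shape (n_samples, 100).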

Now we'll carry on with the learning/validation curves using this projected data.

  • Construct validation curves for SVC on this (projected) data, using a linear kernel (kernel='linear') and exploring the effect of C on the result. Note that the effect of C only appears over very large scales: you should try logarithmically-spaced values between, say, $10^{-10}$ and $10^{-1}$.
  • What is the optimal value for C? What is the best score for this estimator? What is the score for this value if you use the entire dataset? Is this much different than for the projected data?
  • Construct a learning curve for the Support Vector Machine using this value for C.
  • Given the validation and learning curves, how do you think you could improve this classifier – should you seek a better model/more features, or should you seek more data?
  • Overall, how does this compare to the Random Forest Classifier?
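The SVC validation and learning curves can be sketched along the same lines as before. As above, a synthetic 100-dimensional stand-in replaces the projected faces data, and the variable names are of our choosing:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve, learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the 100-dimensional projected data
X_demo, y_demo = make_classification(n_samples=300, n_features=100, random_state=0)

# Logarithmically spaced C values between 1e-10 and 1e-1
C_range = np.logspace(-10, -1, 10)
train_scores, val_scores = validation_curve(
    SVC(kernel="linear"), X_demo, y_demo,
    param_name="C", param_range=C_range, cv=3)

# Best C = where the mean validation score peaks
best_C = C_range[val_scores.mean(axis=1).argmax()]
print("best C:", best_C)

# Learning curve for the linear SVC at the chosen C
sizes, lc_train, lc_val = learning_curve(
    SVC(kernel="linear", C=best_C), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=3)
```

Because np.logspace spaces the values by exponent, each step in C_range multiplies C by a factor of 10, which is what "the effect of C only appears over very large scales" calls for.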