Lesson 13 - Cross Validation

We need a way to test that an algorithm is doing what it is supposed to do

Why do we use separate training and testing data sets?

  • gives estimate of performance on an independent data set
  • serves as a check for overfitting

Training/Testing split

  • sklearn has libraries for doing the split

In [1]:
#!/usr/bin/python

""" this example borrows heavily from the example
    shown on the sklearn documentation:

    http://scikit-learn.org/stable/modules/cross_validation.html
"""

from sklearn import datasets
from sklearn.svm import SVC

# load the iris data set and pull out the features and labels
iris = datasets.load_iris()
features = iris.data
labels = iris.target

# grader helper: returns accuracy on the held-out test set
# (clf, features_test, and labels_test are defined in the next cell)
def submitAcc():
    return clf.score(features_test, labels_test)

In [2]:
###############################################################
### YOUR CODE HERE
###############################################################

### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test

### set the random_state to 0 and the test_size to 0.4 so
### we can exactly check your result
from sklearn import cross_validation

features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)


###############################################################

clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)

print clf.score(features_test, labels_test)


0.966666666667
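
Note that newer sklearn releases moved train_test_split out of the deprecated cross_validation module into sklearn.model_selection. As a minimal sketch, assuming a current sklearn version, the same split would look like this:

from sklearn.model_selection import train_test_split

# same 60/40 split with a fixed random seed so the result is reproducible
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)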

Where to use training and testing data?

K-Fold Cross Validation

Problems with splitting the data into training and testing sets:

  • more training data => better learning results
  • more test data => better validation
  • there is a tradeoff between these two

  • so we do something different: we split the data into k bins of equal size
  • we run the experiment k times, each time holding out a different bin as the test set and training on the other k-1 bins (see the sketch after this list)
  • we average the k results
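
Here is a minimal sketch of k-fold cross validation on the iris data, assuming a current sklearn where KFold lives in sklearn.model_selection (the lesson itself uses the older sklearn.cross_validation interface shown below):

from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.svm import SVC
import numpy as np

iris = datasets.load_iris()
features = iris.data
labels = iris.target

# split the data into k=5 bins of (roughly) equal size
kf = KFold(n_splits=5)

scores = []
for train_index, test_index in kf.split(features):
    # each bin serves as the test set exactly once;
    # the other k-1 bins are used for training
    clf = SVC(kernel="linear", C=1.)
    clf.fit(features[train_index], labels[train_index])
    scores.append(clf.score(features[test_index], labels[test_index]))

# average the k results
print(np.mean(scores))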

K-fold in sklearn

  • if accuracy is low, check how the data was split: the majority of one class of events may be going into the test set while the majority of another class goes into the training set
  • k-fold in sklearn does not shuffle your data by default

There's a simple way to randomize the events in sklearn k-fold CV: set the shuffle flag to true.

Then you'd go from something like this:

cv = KFold( len(authors), 2 )

To something like this:

cv = KFold( len(authors), 2, shuffle=True )
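
In current sklearn versions the KFold constructor takes n_splits instead of the number of samples, but the shuffle flag works the same way. A hedged equivalent of the line above would be:

from sklearn.model_selection import KFold

# shuffle=True randomizes which events land in each fold;
# random_state just makes the shuffle reproducible
cv = KFold(n_splits=2, shuffle=True, random_state=0)
# the train/test indices then come from cv.split(features)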

Cross Validation for Parameter Tuning

  • so far we have been tuning the parameters of many algorithms by guess and check. That is clunky and does not really tell us whether we have found the best values. Cross validation can help with that

GridSearchCV is a way of systematically working through multiple combinations of parameter tunes, cross-validating as it goes to determine which tune gives the best performance. The beauty is that it can work through many combinations in only a couple extra lines of code.

Here's an example from the sklearn documentation:

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)

Let's break this down line by line.

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

A dictionary of the parameters, and the possible values they may take. In this case, they're playing around with the kernel (possible choices are 'linear' and 'rbf'), and C (possible choices are 1 and 10).

Then a 'grid' of all the following combinations of values for (kernel, C) is automatically generated:

('rbf', 1)
('rbf', 10)
('linear', 1)
('linear', 10)

Each is used to train an SVM, and the performance is then assessed using cross-validation.

svr = svm.SVC()

This looks kind of like creating a classifier, just like we've been doing since the first lesson. But note that the "clf" isn't made until the next line; this is just saying what kind of algorithm to use. Another way to think about this is that the "classifier" isn't just the algorithm in this case, it's the algorithm plus the parameter values. Note that there's no monkeying around with the kernel or C; all that is handled in the next line.

clf = grid_search.GridSearchCV(svr, parameters)

This is where the first bit of magic happens; the classifier is being created. We pass the algorithm (svr) and the dictionary of parameters to try (parameters) and it generates a grid of parameter combinations to try.

clf.fit(iris.data, iris.target)

And the second bit of magic. The fit function now tries all the parameter combinations, and returns a fitted classifier that's automatically tuned to the optimal parameter combination. You can now access the parameter values via clf.best_params_.
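
Putting the pieces together, here is a self-contained sketch of the same example, assuming a current sklearn release where GridSearchCV lives in sklearn.model_selection rather than the older grid_search module quoted above:

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# the parameter grid: every (kernel, C) combination will be tried
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# the bare algorithm; no parameter values chosen yet
svr = svm.SVC()

# GridSearchCV cross-validates each combination and keeps the best one
clf = GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)

print(clf.best_params_)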