Lesson 2 - Support Vector Machines

This is the code to accompany the Lesson 2 (SVM) mini-project.

Use an SVM to identify emails from the Enron corpus by their authors:

  1. Sara has label 0
  2. Chris has label 1

In [2]:
%pylab inline

In [1]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture, output_image


import copy



features_train and features_test are the features for the training and testing datasets, respectively.

labels_train and labels_test are the corresponding item labels.


In [3]:
features_train, features_test, labels_train, labels_test = preprocess()
#features_train, labels_train, features_test, labels_test = makeTerrainData()


no. of Chris training emails: 7936
no. of Sara training emails: 7884
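
A quick way to sanity-check what preprocess() hands back (a sketch; it only assumes the returned objects support len()):

print "training examples:", len(features_train)
print "test examples:    ", len(features_test)
print "features per email:", len(features_train[0])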

In [12]:
from sklearn import svm
# RBF kernel with gamma=1.0 and C=2, trained on the full training set (slow)
clf = svm.SVC(kernel="rbf", gamma=1.0, C=2)

t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"


training time: 208.973 s

Now use a linear classifier.


In [14]:
from sklearn import svm
clf = svm.SVC(kernel="linear")


t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"


training time: 0.11 s

Speeding up the algorithm

One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.

features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100]

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. What’s the accuracy now?
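
Note that the slicing above relies on Python 2's integer division; under Python 3, len(features_train)/100 would be a float and fail as a slice index. A version that works in either interpreter uses floor division:

features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]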


In [22]:
from sklearn import svm
clf = svm.SVC(kernel="linear")

features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100]

t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"


training time: 0.107 s

In [23]:
t0 = time()
pred = clf.predict(features_test)
print "testing time:", round(time()-t0, 3), "s"


testing time: 1.107 s

In [24]:
from sklearn.metrics import accuracy_score


def submitAccuracy():
    # fraction of test emails whose predicted author matches the true label
    return accuracy_score(labels_test, pred)

In [25]:
submitAccuracy()


Out[25]:
0.88452787258248011

Use a different kernel

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. What’s the accuracy now, with this more complex kernel?


In [27]:
clf = svm.SVC(kernel="rbf")

t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t0 = time()
pred = clf.predict(features_test)
print "testing time:", round(time()-t0, 3), "s"

submitAccuracy()


training time: 0.115 s
testing time: 1.312 s
Out[27]:
0.61604095563139927

Tune parameters for better accuracy

Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). Which one gives the best accuracy?


In [32]:
clf = svm.SVC(kernel="rbf", C = 10000)

t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t0 = time()
pred = clf.predict(features_test)
print "testing time:", round(time()-t0, 3), "s"

submitAccuracy()


training time: 0.112 s
testing time: 0.973 s
Out[32]:
0.89249146757679176

Optimized SVM

With C = 1000 the accuracy was 0.821; with C = 10,000 it went up to about 0.892.
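
To compare the candidate C values directly, one option is a small loop that refits and scores each one (a sketch; it assumes the 1%-sliced features_train/labels_train from the earlier cells are still in scope):

from sklearn import svm
from sklearn.metrics import accuracy_score

for c in [10., 100., 1000., 10000.]:
    clf = svm.SVC(kernel="rbf", C=c)
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    print "C =", c, "-> accuracy:", accuracy_score(labels_test, pred)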

Now that you’ve optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. What is the accuracy of the optimized SVM?


In [35]:
features_train, features_test, labels_train, labels_test = preprocess() # full training set

clf2 = svm.SVC(kernel="rbf", C = 10000)

t0 = time()
clf2.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t0 = time()
pred = clf2.predict(features_test)
print "testing time:", round(time()-t0, 3), "s"

submitAccuracy()


no. of Chris training emails: 7936
no. of Sara training emails: 7884
training time: 120.323 s
testing time: 11.709 s
Out[35]:
0.99089874857792948

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th?

(Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer for element #100 would be found using something like answer=predictions[100]


In [34]:
from sklearn import svm
# per the quiz instructions: RBF kernel, C=10000, trained on the 1% slice below
clf = svm.SVC(kernel="rbf", C=10000)

features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100]

t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t0 = time()
pred = clf.predict(features_test)
print "testing time:", round(time()-t0, 3), "s"

print "class for element 10 is ", pred[10]
print "class for element 26 is ", pred[26]
print "class for element 50 is ", pred[50]


training time: 0.102 s
testing time: 1.21 s
class for element 10 is  1
class for element 26 is  0
class for element 50 is  1
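
To report those predictions as author names instead of 0/1, a small lookup built from the label definitions at the top works (the author dict here is just an illustration, not part of the project code):

author = {0: "Sara", 1: "Chris"}
for i in (10, 26, 50):
    print "element", i, "was written by", author[int(pred[i])]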

There are over 1700 test events--how many are predicted to be in the “Chris” (1) class?

(Use the RBF kernel, C=10000., and the full training set.)


In [37]:
pred = clf2.predict(features_test)
len(pred)



In [43]:
chrisPred = 0
for item in range(len(pred)):
    if pred[item] == 1:
        chrisPred += 1
    
print "The number of emails predicted to be in the Chris (1) class is: ", chrisPred


The number of emails predicted to be in the Chris (1) class is:  877
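
Since pred is an array of 0s and 1s, the same count can be computed without an explicit loop (a sketch; it assumes clf2.predict returned a numpy array):

import numpy as np

chrisPred = int(np.sum(pred == 1))   # number of test emails predicted as Chris (1)
print "Chris (1) predictions:", chrisPred
print "Sara (0) predictions: ", len(pred) - chrisPred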

Hopefully it’s becoming clearer what Sebastian meant when he said Naive Bayes is great for text--it’s faster and generally gives better performance than an SVM for this particular problem. Of course, there are plenty of other problems where an SVM might work better. Knowing which one to try when you’re tackling a problem for the first time is part of the art and science of machine learning. In addition to picking your algorithm, depending on which one you try, there are parameter tunes to worry about as well, and the possibility of overfitting (especially if you don’t have lots of training data).

Our general suggestion is to try a few different algorithms for each problem. Tuning the parameters can be a lot of work, but just sit tight for now--toward the end of the class we will introduce you to GridSearchCV, a great sklearn tool that can find an optimal parameter tune almost automatically.


In [44]:
### draw the decision boundary with the test points overlaid
# This only works for the terrain (driving) dataset. Skip it for the email dataset.
# prettyPicture(clf, features_test, labels_test)

GridSearchCV in sklearn

GridSearchCV is a way of systematically working through multiple combinations of parameter tunes, cross-validating as it goes to determine which tune gives the best performance. The beauty is that it can work through many combinations in only a couple extra lines of code.

Here's an example from the sklearn documentation:

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)

Let's break this down line by line.

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

A dictionary of the parameters, and the possible values they may take. In this case, they're playing around with the kernel (possible choices are 'linear' and 'rbf'), and C (possible choices are 1 and 10).

Then all the following combinations of values for (kernel, C) are automatically generated: [('rbf', 1), ('rbf', 10), ('linear', 1), ('linear', 10)]. Each is used to train an SVM, and the performance is then assessed using cross-validation.
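
If you want to see those combinations yourself, sklearn exposes the same expansion as ParameterGrid (a sketch; in this sklearn version it lives in sklearn.grid_search, in newer releases it moved to sklearn.model_selection):

from sklearn.grid_search import ParameterGrid

for combo in ParameterGrid({'kernel': ('linear', 'rbf'), 'C': [1, 10]}):
    print combo   # each combo is a dict holding one (kernel, C) pair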

svr = svm.SVC()

This looks kind of like creating a classifier, just like we've been doing since the first lesson. But note that the "clf" isn't made until the next line--this is just saying what kind of algorithm to use. Another way to think about this is that the "classifier" isn't just the algorithm in this case, it's algorithm plus parameter values. Note that there's no monkeying around with the kernel or C; all that is handled in the next line.

clf = grid_search.GridSearchCV(svr, parameters)

This is where the first bit of magic happens; the classifier is being created. We pass the algorithm (svr) and the dictionary of parameters to try (parameters) and it generates a grid of parameter combinations to try.

clf.fit(iris.data, iris.target)

And the second bit of magic. The fit function now tries all the parameter combinations and returns a fitted classifier that's automatically tuned to the optimal parameter combination. You can now access the parameter values via clf.best_params_.
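
For example, rerunning the documentation snippet end to end and reading back the winner might look like this (a self-contained sketch on the iris data; nothing here is specific to the email project):

from sklearn import datasets, grid_search, svm

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = grid_search.GridSearchCV(svm.SVC(), parameters)
clf.fit(iris.data, iris.target)

print "best parameters:", clf.best_params_    # dict with the winning kernel and C
print "best CV score:  ", clf.best_score_     # cross-validated score of that combination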


In [ ]:
from sklearn.grid_search import GridSearchCV
from sklearn import svm


#features_train = features_train[:len(features_train)/100] 
#labels_train = labels_train[:len(labels_train)/100]

features_train, features_test, labels_train, labels_test = preprocess() # full training set

parameters = {'kernel':('linear', 'rbf'), 'C':[10, 100, 1000, 10000]}
svr = svm.SVC()
clf = GridSearchCV(svr, parameters)


t0 = time()
clf.fit(features_train, labels_train)
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_)


no. of Chris training emails: 7936
no. of Sara training emails: 7884

In [ ]:
len(features_train)

In [1]:
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
features = iris.data
labels = iris.target


In [2]:
###############################################################
### YOUR CODE HERE
###############################################################

### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test

### set the random_state to 0 and the test_size to 0.4 so
### we can exactly check your result

from sklearn import cross_validation

iris.data.shape, iris.target.shape

### We can now quickly sample a training set while holding out 40% of the data for testing 
### (evaluating) our classifier:
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(iris.data, 
                                iris.target, test_size=0.4, random_state=0)

In [3]:
features_train.shape, labels_train.shape


Out[3]:
((90, 4), (90,))

In [4]:
features_test.shape, labels_test.shape


Out[4]:
((60, 4), (60,))

In [5]:
###############################################################

clf = SVC(kernel="linear", C=1.)
clf.fit(features_train, labels_train)

print clf.score(features_test, labels_test)


##############################################################
def submitAcc():
    return clf.score(features_test, labels_test)


0.966666666667

In [6]:
from sklearn.cross_validation import KFold

t0 = time()
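
The KFold import above never gets used; a minimal sketch of where it could go, cross-validating the linear SVC on the iris data by hand (variable names follow the cells above; the choice of 5 folds is an assumption):

from sklearn.cross_validation import KFold
from sklearn.svm import SVC

t0 = time()
kf = KFold(len(labels), n_folds=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf:
    clf = SVC(kernel="linear", C=1.)
    clf.fit(features[train_idx], labels[train_idx])
    scores.append(clf.score(features[test_idx], labels[test_idx]))

print "mean accuracy over 5 folds:", sum(scores) / len(scores)
print "done in:", round(time() - t0, 3), "s"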
