Udacity Machine Learning mini-project 2

Prep stuff


In [1]:
import sys
from sklearn.svm import SVC
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess

Training and Testing data:


In [2]:
features_train, features_test, labels_train, labels_test = preprocess()


no. of Chris training emails: 7936
no. of Sara training emails: 7884
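
For reference, preprocess lives in the course's ../tools/ directory; roughly, it splits the email word data into training and test sets, vectorizes it with tf-idf, and keeps only the most informative ~10% of features so the SVM stays tractable. A minimal sketch of that pipeline (the parameter values here are my assumptions, not the exact course code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

def preprocess_sketch(word_data, authors):
    # hold out 10% of the emails for testing
    features_train, features_test, labels_train, labels_test = train_test_split(
        word_data, authors, test_size=0.1, random_state=42)
    # tf-idf vectorization, dropping English stop words
    vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
    features_train = vectorizer.fit_transform(features_train)
    features_test = vectorizer.transform(features_test)
    # keep only the top 10% most informative features
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train, labels_train)
    features_train = selector.transform(features_train).toarray()
    features_test = selector.transform(features_test).toarray()
    return features_train, features_test, labels_train, labels_test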

Fitting the model:


In [3]:
clf = SVC(kernel="linear")
clf.fit(features_train,labels_train)


Out[3]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Model accuracy:


In [4]:
clf.score(features_test,labels_test)


Out[4]:
0.98407281001137659

Timing model training

Not using %timeit here because a single fit already takes minutes, so repeated runs would be impractical.


In [5]:
%time clf.fit(features_train,labels_train)


CPU times: user 2min 3s, sys: 168 ms, total: 2min 4s
Wall time: 2min 4s
Out[5]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Unsurprisingly, this is much slower to train than something like naive Bayes.
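
For comparison, a naive Bayes classifier (as in mini-project 1) fits the same features far faster; a minimal sketch, assuming GaussianNB as in that project:

from time import time
from sklearn.naive_bayes import GaussianNB

clf_nb = GaussianNB()
t0 = time()
clf_nb.fit(features_train, labels_train)
print "GaussianNB training time:", round(time() - t0, 3), "s"
print "GaussianNB accuracy:", clf_nb.score(features_test, labels_test)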

Accuracy with a reduced training set:


In [6]:
# keep only the first 1% of the training data (note: Python 2 integer division)
features_train = features_train[:len(features_train)/100]
labels_train = labels_train[:len(labels_train)/100]
clf.fit(features_train,labels_train)
clf.score(features_test,labels_test)


Out[6]:
0.88452787258248011

Switching to a radial basis function kernel


In [7]:
clf_rbf = SVC(kernel="rbf")
clf_rbf.fit(features_train,labels_train)


Out[7]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Accuracy:


In [8]:
clf_rbf.score(features_test,labels_test)


Out[8]:
0.61604095563139927

Assessing parameter choices

Grid search is probably the better way to do this in general, but I had some weirdness with scikit-learn's grid search functions, so I compare a few values of C by hand below (a grid search sketch follows for reference).
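
A minimal GridSearchCV sketch of the same search (the import path depends on the scikit-learn version, and the parameter grid is just the values tried by hand below):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

param_grid = {"C": [10.0, 100.0, 1000.0, 10000.0]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid)
grid.fit(features_train, labels_train)
print "best C:", grid.best_params_["C"]
print "best cross-validated accuracy:", grid.best_score_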


In [9]:
clf10 = SVC(C=10.0,kernel="rbf")
clf10.fit(features_train,labels_train)
clf100 = SVC(C=100.0,kernel="rbf")
clf100.fit(features_train,labels_train)
clf1000 = SVC(C=1000.0,kernel="rbf")
clf1000.fit(features_train,labels_train)
clf10000 = SVC(C=10000.0,kernel="rbf")
clf10000.fit(features_train,labels_train)


Out[9]:
SVC(C=10000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

In [10]:
print "C = 10: ", clf10.score(features_test,labels_test)
print "C = 100: ", clf100.score(features_test,labels_test)
print "C = 1000: ", clf1000.score(features_test,labels_test)
print "C = 10,000: ", clf10000.score(features_test,labels_test)


C = 10:  0.616040955631
C = 100:  0.616040955631
C = 1000:  0.821387940842
C = 10,000:  0.892491467577

Trying C=10,000 with the full training data


In [11]:
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(C=10000,kernel="rbf")
clf.fit(features_train,labels_train)


no. of Chris training emails: 7936
no. of Sara training emails: 7884
Out[11]:
SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

Accuracy with the full training set:


In [12]:
clf.score(features_test,labels_test)


Out[12]:
0.99089874857792948

Answering questions about specific data points with the RBF kernel:


In [13]:
pred = clf.predict(features_test)

In [14]:
for i in [10,26,50]:
    print 'test point',i,'--predicted:',pred[i],'real value:',labels_test[i]


test point 10 --predicted: 1 real value: 1
test point 26 --predicted: 0 real value: 0
test point 50 --predicted: 1 real value: 1

Proportion of emails attributed to Chris (label = 1)


In [15]:
# Raw count:
chrisCount = sum(pred)
chrisCount


Out[15]:
877

In [17]:
# Proportion:
chrisCount/float(len(pred))


Out[17]:
0.49886234357224118
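
Since the labels are just 0 and 1, the same proportion can also be read off as the mean of the predictions; a minimal sketch, assuming pred is a numpy integer array:

import numpy as np

np.bincount(pred)   # counts of predicted Sara (0) and Chris (1) emails
np.mean(pred)       # fraction of test emails predicted as Chris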