Udacity Machine Learning mini-project 2

Prep stuff


In [1]:
import sys
from sklearn.svm import SVC
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess

Training and Testing data:


In [2]:
features_train, features_test, labels_train, labels_test = preprocess()


no. of Chris training emails: 7936
no. of Sara training emails: 7884
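
For reference, preprocess lives in the course's ../tools/ directory; roughly, it splits the email word data into training and test sets, vectorizes it with tf-idf, and keeps only the most informative ~10% of features so the SVM stays tractable. A minimal sketch of that pipeline (the parameter values here are my assumptions, not the exact course code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

def preprocess_sketch(word_data, authors):
    # hold out 10% of the emails for testing
    features_train, features_test, labels_train, labels_test = train_test_split(
        word_data, authors, test_size=0.1, random_state=42)
    # tf-idf vectorization, dropping English stop words
    vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
    features_train = vectorizer.fit_transform(features_train)
    features_test = vectorizer.transform(features_test)
    # keep only the top 10% most informative features
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train, labels_train)
    features_train = selector.transform(features_train).toarray()
    features_test = selector.transform(features_test).toarray()
    return features_train, features_test, labels_train, labels_test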

Fitting the model:


In [3]:
clf = SVC(kernel="linear")
clf.fit(features_train,labels_train)


Out[3]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Model accuracy:


In [4]:
clf.score(features_test,labels_test)


Out[4]:
0.98407281001137659

Timing model training

Not using %timeit here because a single fit already takes minutes, so repeated runs would be impractical.


In [5]:
%time clf.fit(features_train,labels_train)


CPU times: user 2min 3s, sys: 168 ms, total: 2min 4s
Wall time: 2min 4s
Out[5]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Unsurprisingly, this is much slower to train than something like naive Bayes.
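
For comparison, a naive Bayes classifier (as in mini-project 1) fits the same features far faster; a minimal sketch, assuming GaussianNB as in that project:

from time import time
from sklearn.naive_bayes import GaussianNB

clf_nb = GaussianNB()
t0 = time()
clf_nb.fit(features_train, labels_train)
print "GaussianNB training time:", round(time() - t0, 3), "s"
print "GaussianNB accuracy:", clf_nb.score(features_test, labels_test)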

Accuracy with a reduced training set:


In [6]:
# keep only the first 1% of the training data (note: Python 2 integer division)
features_train = features_train[:len(features_train)/100]
labels_train = labels_train[:len(labels_train)/100]
clf.fit(features_train,labels_train)
clf.score(features_test,labels_test)


Out[6]:
0.88452787258248011

Switching to a radial basis function kernel


In [7]:
clf_rbf = SVC(kernel="rbf")
clf_rbf.fit(features_train,labels_train)


Out[7]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Accuracy:


In [8]:
clf_rbf.score(features_test,labels_test)


Out[8]:
0.61604095563139927

Assessing parameter choices

Grid search is probably the better way to do this in general, but I had some weirdness with scikit-learn's grid search functions, so I compare a few values of C by hand below (a grid search sketch follows for reference).
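
A minimal GridSearchCV sketch of the same search (the import path depends on the scikit-learn version, and the parameter grid is just the values tried by hand below):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

param_grid = {"C": [10.0, 100.0, 1000.0, 10000.0]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid)
grid.fit(features_train, labels_train)
print "best C:", grid.best_params_["C"]
print "best cross-validated accuracy:", grid.best_score_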


In [9]:
clf10 = SVC(C=10.0,kernel="rbf")
clf10.fit(features_train,labels_train)
clf100 = SVC(C=100.0,kernel="rbf")
clf100.fit(features_train,labels_train)
clf1000 = SVC(C=1000.0,kernel="rbf")
clf1000.fit(features_train,labels_train)
clf10000 = SVC(C=10000.0,kernel="rbf")
clf10000.fit(features_train,labels_train)


Out[9]:
SVC(C=10000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

In [10]:
print "C = 10: ", clf10.score(features_test,labels_test)
print "C = 100: ", clf100.score(features_test,labels_test)
print "C = 1000: ", clf1000.score(features_test,labels_test)
print "C = 10,000: ", clf10000.score(features_test,labels_test)


C = 10:  0.616040955631
C = 100:  0.616040955631
C = 1000:  0.821387940842
C = 10,000:  0.892491467577

Trying C=10,000 with the full training data


In [11]:
features_train, features_test, labels_train, labels_test = preprocess()
clf = SVC(C=10000,kernel="rbf")
clf.fit(features_train,labels_train)


no. of Chris training emails: 7936
no. of Sara training emails: 7884
Out[11]:
SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)

Accuracy with the full training set:


In [12]:
clf.score(features_test,labels_test)


Out[12]:
0.99089874857792948

Answering questions about specific data points with the RBF kernel:


In [13]:
pred = clf.predict(features_test)

In [14]:
for i in [10,26,50]:
    print 'test point',i,'--predicted:',pred[i],'real value:',labels_test[i]


test point 10 --predicted: 1 real value: 1
test point 26 --predicted: 0 real value: 0
test point 50 --predicted: 1 real value: 1

Proportion of emails attributed to Chris (label = 1)


In [15]:
# Raw count:
chrisCount = sum(pred)
chrisCount


Out[15]:
877

In [17]:
# Proportion:
chrisCount/float(len(pred))


Out[17]:
0.49886234357224118
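
Since the labels are just 0 and 1, the same proportion can also be read off as the mean of the predictions; a minimal sketch, assuming pred is a numpy integer array:

import numpy as np

np.bincount(pred)   # counts of predicted Sara (0) and Chris (1) emails
np.mean(pred)       # fraction of test emails predicted as Chris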