In [53]:
from __future__ import division  # must precede other code in a Python 2 module
import sys
from time import time
from os.path import expanduser
sys.path.append(expanduser("~/Documents/ud120-projects-master/tools/"))
from email_preprocess import preprocess
import numpy as np
import pandas as pd
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
In [2]:
import sklearn.naive_bayes as nb
In [3]:
alabels_test=np.array(labels_test)
In [4]:
alabels_test.shape
Out[4]:
In [5]:
features_test.shape
Out[5]:
This indicates that axis 0 indexes emails and axis 1 indexes words (the features?).
In [6]:
features_test[:10].sum(axis=1)
Out[6]:
In [7]:
features_test.sum()/features_test.shape[0]
Out[7]:
I don't understand how this information is encoded. The logical thing to me would be an integer record of whether a word is used in an email, but since at least most of the first ten rows sum to non-integer amounts, that doesn't seem to be the case; the non-integer values suggest tf-idf weights rather than raw counts. Either way, I can't work out the average length of the emails, which would indicate how confidently one could identify the writers of the emails.
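For my own reference, a toy example like the following (not executed here) shows why tf-idf weighting, which I suspect the preprocessing applies, produces non-integer entries; TfidfVectorizer and the toy corpus are purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# a toy corpus, vectorized the way I suspect email_preprocess does it
toy_corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
vectorizer = TfidfVectorizer()
toy_features = vectorizer.fit_transform(toy_corpus).toarray()
print(toy_features)                     # rows are documents, columns are words
print(vectorizer.get_feature_names())   # the word behind each column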
In [8]:
%%time
model=nb.GaussianNB()
model.fit(features_train,labels_train)
In [9]:
%%time
testprediction=model.predict(features_test)
In [11]:
(testprediction==alabels_test).sum()/alabels_test.shape[0]
Out[11]:
This is astoundingly good. I would never have guessed that people are so consistent in their choice of words.
In [12]:
from sklearn import svm
In [49]:
n_sub = int(round(features_train.shape[0] / 100))  # keep 1% of the training set
sub_features_train = features_train[:n_sub]
sub_labels_train = labels_train[:n_sub]
In [50]:
kernels=['linear', 'poly', 'rbf', 'sigmoid']
In [51]:
for i in kernels:
    print(i)
    model = svm.SVC(kernel=i)
    %time model.fit(sub_features_train, sub_labels_train)
    print((model.predict(features_test) == alabels_test).sum() / alabels_test.shape[0])
Wow. I did not think the kernel mattered that much. I guess a highly complex kernel requires some tuning of its hyper-parameters? I really need to understand better how to tailor SVM kernels to the data, though that's hard to do as long as the data is a black box.
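If I wanted to probe that, gamma is the other main knob for the rbf kernel (alongside C); a sweep like this sketch, with arbitrarily chosen values, would show how sensitive the kernel is to it:

# sweep gamma for the rbf kernel on the 1% training subset
for g in [1e-4, 1e-3, 1e-2, 1e-1]:
    model = svm.SVC(kernel='rbf', gamma=g)
    model.fit(sub_features_train, sub_labels_train)
    acc = (model.predict(features_test) == alabels_test).sum() / alabels_test.shape[0]
    print('gamma=%g accuracy=%.3f' % (g, acc))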
In [13]:
model=svm.SVC(kernel='linear')
In [14]:
%%time
model.fit(features_train,labels_train)
Out[14]:
In [15]:
%%time
testprediction=model.predict(features_test)
In [16]:
np.sum(testprediction==alabels_test)/alabels_test.shape[0]
Out[16]:
In [17]:
# ratio of training times: linear SVM (2 min 13 s, in ms) vs. Naive Bayes (805 ms)
lsvm=(2*60+13)*1000
lsvm/805
Out[17]:
In [34]:
# ratio of prediction times: linear SVM (15.9 s, in ms) vs. Naive Bayes (117 ms)
15.9*1000/117
Out[34]:
The SVM compares very poorly: training and prediction times are both well over 100 times longer than for Naive Bayes.
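Rather than reading the %time output by eye, a small helper like this sketch (reusing the time() imported at the top) would make the comparison systematic:

# time fit and predict for any model and return the predictions
def time_model(model, X_train, y_train, X_test):
    t0 = time()
    model.fit(X_train, y_train)
    fit_s = time() - t0
    t0 = time()
    pred = model.predict(X_test)
    pred_s = time() - t0
    print('fit: %.2f s, predict: %.2f s' % (fit_s, pred_s))
    return pred

e.g. time_model(nb.GaussianNB(), features_train, labels_train, features_test).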
In [19]:
n_sub = int(round(features_train.shape[0] / 100))  # keep 1% of the training set
sub_features_train = features_train[:n_sub]
sub_labels_train = labels_train[:n_sub]
In [20]:
model=svm.SVC(kernel='linear')
%time model.fit(sub_features_train,sub_labels_train)
%time testprediction=model.predict(features_test)
np.sum(testprediction==alabels_test)/alabels_test.shape[0]
Out[20]:
That's an awful lot faster, and it doesn't do too badly as far as prediction goes. The prediction time is also dramatically lower.
Flagging credit-card fraud (blocking a transaction before it goes through) and voice recognition, like Siri, would both require quick prediction times. However, the training time for both of these can be long. There are very few applications where a long training time is unacceptable in the final product, though long training times can definitely make development and testing difficult.
In [21]:
model=svm.SVC(kernel='rbf')
%time model.fit(sub_features_train,sub_labels_train)
%time testprediction=model.predict(features_test)
np.sum(testprediction==alabels_test)/alabels_test.shape[0]
Out[21]:
In [22]:
asub_labels_train=np.array(sub_labels_train)
np.sum(model.predict(sub_features_train)==asub_labels_train)/asub_labels_train.shape[0]
Out[22]:
I would have guessed that this kernel might just be overfitting the data, but that doesn't quite seem to be the case: it doesn't even predict the training data well. It must be underfitting, lacking the freedom to change the shape of the decision boundary to match the data.
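To make that diagnosis routine, a helper along these lines (a sketch, reusing the arrays above) would print training and test accuracy side by side; underfitting shows up as low accuracy on both, overfitting as a large gap between them:

# compare accuracy on the training and test sets for a fitted model
def train_test_accuracy(model, X_train, y_train, X_test, y_test):
    y_train = np.array(y_train)
    y_test = np.array(y_test)
    train_acc = (model.predict(X_train) == y_train).sum() / y_train.shape[0]
    test_acc = (model.predict(X_test) == y_test).sum() / y_test.shape[0]
    print('train accuracy: %.3f, test accuracy: %.3f' % (train_acc, test_acc))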
In [24]:
a=10**np.arange(1,10)
for i in a:
    model=svm.SVC(kernel='rbf',C=i)
    print('C='+str(i))
    print('fitting:')
    %time model.fit(sub_features_train,sub_labels_train)
    print('prediction:')
    %time testprediction=model.predict(features_test)
    print('accuracy='+str(np.sum(testprediction==alabels_test)/alabels_test.shape[0]))
There seems to be an ideal value for this parameter, around C=10000. At the default value, C=1, the model is insufficiently complex to follow the data and underfits; somewhere above C=10000 it shifts to being overfit, as the added complexity of the decision boundary lets it simply follow every single point.
However, I should note that this process is bad data science: we are overfitting the parameter C to the test data set. Udacity's suggestion doesn't even do this bad process right; it has us stop at 10000, so we don't even know whether we could do better by going higher.
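A sketch of the better procedure: split the 1% training subset into a fitting part and a validation part, and choose C on the validation part, leaving the test set untouched until the end. (This assumes the old sklearn.cross_validation namespace, consistent with the sklearn.grid_search module mentioned further down.)

from sklearn.cross_validation import train_test_split

# hold out 30% of the training subset as a validation set
X_fit, X_val, y_fit, y_val = train_test_split(
    sub_features_train, sub_labels_train, test_size=0.3, random_state=42)
for C in 10.0 ** np.arange(0, 6):
    model = svm.SVC(kernel='rbf', C=C)
    model.fit(X_fit, y_fit)
    val_acc = (model.predict(X_val) == np.array(y_val)).sum() / len(y_val)
    print('C=%g validation accuracy=%.3f' % (C, val_acc))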
In [23]:
a=np.linspace(10000-5000,10000+5000,num=10,dtype=int)
for i in a:
    model=svm.SVC(kernel='rbf',C=i)
    print('C='+str(i))
    print('fitting:')
    %time model.fit(sub_features_train,sub_labels_train)
    print('prediction:')
    %time testprediction=model.predict(features_test)
    print('accuracy='+str(np.sum(testprediction==alabels_test)/alabels_test.shape[0]))
In [25]:
model=svm.SVC(kernel='rbf',C=10000)
In [26]:
%%time
model.fit(features_train,labels_train)
Out[26]:
That took a while, but it was still an awful lot shorter than the 15 minutes it took with C=1.
In [30]:
%%time
testprediction=model.predict(features_test)
In [28]:
labels_test=np.array(labels_test)
In [31]:
(np.sum(testprediction==labels_test))/labels_test.shape[0]
Out[31]:
That's so good that I have a hard time believing it. Since I was using the test set to choose my C, it's probable that I have overfitted to the test set. To do this process properly, the training set should be divided into subsets, with one subset used for training and another for validation, and the accuracy maximized with respect to the parameter on the validation subset. Only then should the model be retrained on the full training set and evaluated on the test set, to get a real estimate of the accuracy. The sklearn.grid_search.GridSearchCV method uses cross-validation, a slightly more elaborate version of what I just described, to do this.
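A sketch of what that would look like with GridSearchCV (the parameter grid here is my own guess; the module moved to sklearn.model_selection in later sklearn versions):

from sklearn.grid_search import GridSearchCV

# cross-validated search over C; only the training data is touched
param_grid = {'C': [1e2, 1e3, 1e4, 1e5]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3)
search.fit(sub_features_train, sub_labels_train)
print(search.best_params_, search.best_score_)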
In [32]:
print('elm 10 prediction = ' + str(testprediction[10])+', actual = '+ str(labels_test[10]))
print('elm 26 prediction = ' + str(testprediction[26])+', actual = '+ str(labels_test[26]))
print('elm 50 prediction = ' + str(testprediction[50])+', actual = '+ str(labels_test[50]))
In [33]:
np.sum(testprediction)    # number of test emails predicted as class 1
Out[33]:
In [36]:
from sklearn.tree import DecisionTreeClassifier
In [37]:
model=DecisionTreeClassifier(min_samples_split=40)
In [38]:
%%time
model.fit(features_train,labels_train)
Out[38]:
In [39]:
%%time
testprediction=model.predict(features_test)
In [40]:
labels_test=np.array(labels_test)
In [41]:
(testprediction==labels_test).sum()/labels_test.shape[0]
Out[41]:
In [44]:
features_train.shape[1]
Out[44]:
The feature-selection algorithm keeps only the features most strongly correlated with the labels, with the correlation in this case measured by a $\chi^2$ test between each feature and the labels. Here, we're keeping the top 10% of features by that score.
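A sketch of that selection step as described (the exact score function inside email_preprocess may differ; chi2 is what the text above assumes, and applying it here to the already-selected matrices is purely illustrative):

from sklearn.feature_selection import SelectPercentile, chi2

# keep the 10% of features scoring highest against the labels
selector = SelectPercentile(chi2, percentile=10)
selected_train = selector.fit_transform(features_train, labels_train)
selected_test = selector.transform(features_test)
print(selected_train.shape)   # roughly 10% as many columns as before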
In [45]:
from email_preprocess import preprocesssmall
features_train, features_test, labels_train, labels_test = preprocesssmall()
In [46]:
features_train.shape[1]
Out[46]:
With a smaller number of variables, we cannot have a more complex decision surface. If we added a completely random feature, however, an ideal machine-learning algorithm would add no complexity; in general, a good algorithm should only increase the complexity of the decision surface with more features when the new features add useful information about the labels.
Since we are dropping features that are fairly highly correlated with the labels, this will decrease the complexity of the decision surface.
In [48]:
%time model.fit(features_train,labels_train)
%time testprediction=model.predict(features_test)
labels_test=np.array(labels_test)
(testprediction==labels_test).sum()/labels_test.shape[0]
Out[48]:
Based on these two data points (hardly enough to draw firm conclusions from), I'd guess that model fitting and prediction time scale at least linearly with the number of features, if not with some higher power.
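A rough way to check that guess (a sketch; slicing the first columns ignores which features are informative, so only the timing matters):

# time the decision-tree fit on growing column slices of the training matrix
for frac in [0.25, 0.5, 1.0]:
    n_features = max(1, int(features_train.shape[1] * frac))
    model = DecisionTreeClassifier(min_samples_split=40)
    t0 = time()
    model.fit(features_train[:, :n_features], labels_train)
    print('%d features: fit in %.2f s' % (n_features, time() - t0))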