Udacity Machine Learning Decision Tree mini-project

Prep stuff:


In [1]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from sklearn.tree import DecisionTreeClassifier

Data:


In [2]:
features_train, features_test, labels_train, labels_test = preprocess()


no. of Chris training emails: 7936
no. of Sara training emails: 7884

Training:


In [3]:
clf = DecisionTreeClassifier(min_samples_split=40)
clf.fit(features_train, labels_train)


Out[3]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=40, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')
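
Side note: the time import at the top never gets used. A minimal sketch of how it's typically used in this mini-project to time the fit (reusing the objects above, nothing new assumed):

t0 = time()
clf.fit(features_train, labels_train)
print("training time: %0.3fs" % (time() - t0))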

Accuracy:


In [4]:
clf.score(features_test, labels_test)


Out[4]:
0.97781569965870307
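
score() on a classifier is mean accuracy; the same number can be computed explicitly via sklearn.metrics if predictions are needed anyway (a sketch using the objects already defined above):

from sklearn.metrics import accuracy_score

pred = clf.predict(features_test)
accuracy_score(labels_test, pred)  # same value as clf.score above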

Number of features:


In [6]:
features_train.shape


Out[6]:
(15820, 3785)
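
Those 3785 columns come from the feature selection inside email_preprocess. A self-contained sketch of that step, assuming the standard ud120 implementation (TF-IDF vectorization, then SelectPercentile keeping the top 10% of features by ANOVA F-score); the tiny corpus and labels below are illustrative stand-ins, not the course data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# Illustrative stand-in corpus (not the course data).
docs = ["free money now", "meeting at noon tomorrow",
        "win cash prizes now", "project status update"]
labels = [1, 0, 1, 0]

# Vectorize, then keep the top 10% of features ranked by F-score.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words="english")
X = vectorizer.fit_transform(docs)
selector = SelectPercentile(f_classif, percentile=10)
X_reduced = selector.fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)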

Working with a modified version of the data generator:

This one keeps a smaller subset of the features (1% rather than 10%).
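
Presumably the only difference from the original preprocessor is the percentile handed to the selector (an assumption about email_preprocess2, not checked against the file itself):

from sklearn.feature_selection import SelectPercentile, f_classif

# 1% of features instead of 10% -- consistent with 3785 -> 379 columns below.
selector = SelectPercentile(f_classif, percentile=1)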


In [7]:
from email_preprocess2 import preprocess as preprocess2

In [8]:
features_train2, features_test2, labels_train2, labels_test2 = preprocess2()


no. of Chris training emails: 7936
no. of Sara training emails: 7884

In [9]:
features_train2.shape


Out[9]:
(15820, 379)

Accuracy with fewer features (a less complex model):


In [10]:
clf2 = DecisionTreeClassifier(min_samples_split=40)
clf2.fit(features_train2, labels_train2)
clf2.score(features_test2, labels_test2)


Out[10]:
0.96587030716723554

Not all that much worse: accuracy drops only about 1.2 percentage points (0.9778 vs. 0.9659) while the feature count drops by a factor of ten (3785 vs. 379), so the simpler model gives up very little.
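
A quick way to see the complexity difference directly is to time both fits (a sketch reusing the time import and the data already loaded):

for name, X, y in [("10% of features", features_train, labels_train),
                   ("1% of features", features_train2, labels_train2)]:
    t0 = time()
    DecisionTreeClassifier(min_samples_split=40).fit(X, y)
    print(name, "fit in %0.3fs" % (time() - t0))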