Udacity Machine Learning Decision Tree mini-project

Prep stuff:


In [1]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from sklearn.tree import DecisionTreeClassifier

Data:


In [2]:
features_train, features_test, labels_train, labels_test = preprocess()


no. of Chris training emails: 7936
no. of Sara training emails: 7884

Training:


In [3]:
clf = DecisionTreeClassifier(min_samples_split=40)
clf.fit(features_train, labels_train)


Out[3]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=40, min_weight_fraction_leaf=0.0,
            random_state=None, splitter='best')
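
Side note: the time import at the top never gets used. A minimal sketch of how it's typically used in this mini-project to time the fit (reusing the objects above, nothing new assumed):

t0 = time()
clf.fit(features_train, labels_train)
print("training time: %0.3fs" % (time() - t0))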

Accuracy:


In [4]:
clf.score(features_test, labels_test)


Out[4]:
0.97781569965870307
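
score() on a classifier is mean accuracy; the same number can be computed explicitly via sklearn.metrics if predictions are needed anyway (a sketch using the objects already defined above):

from sklearn.metrics import accuracy_score

pred = clf.predict(features_test)
accuracy_score(labels_test, pred)  # same value as clf.score above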

Number of features:


In [6]:
features_train.shape


Out[6]:
(15820, 3785)
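
Those 3785 columns come from the feature selection inside email_preprocess. A self-contained sketch of that step, assuming the standard ud120 implementation (TF-IDF vectorization, then SelectPercentile keeping the top 10% of features by ANOVA F-score); the tiny corpus and labels below are illustrative stand-ins, not the course data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# Illustrative stand-in corpus (not the course data).
docs = ["free money now", "meeting at noon tomorrow",
        "win cash prizes now", "project status update"]
labels = [1, 0, 1, 0]

# Vectorize, then keep the top 10% of features ranked by F-score.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words="english")
X = vectorizer.fit_transform(docs)
selector = SelectPercentile(f_classif, percentile=10)
X_reduced = selector.fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)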

Working with a modified version of the data generator:

This one keeps a smaller subset of the features (1% rather than 10%).
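
Presumably the only difference from the original preprocessor is the percentile handed to the selector (an assumption about email_preprocess2, not checked against the file itself):

from sklearn.feature_selection import SelectPercentile, f_classif

# 1% of features instead of 10% -- consistent with 3785 -> 379 columns below.
selector = SelectPercentile(f_classif, percentile=1)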


In [7]:
from email_preprocess2 import preprocess as preprocess2

In [8]:
features_train2, features_test2, labels_train2, labels_test2 = preprocess2()


no. of Chris training emails: 7936
no. of Sara training emails: 7884

In [9]:
features_train2.shape


Out[9]:
(15820, 379)

Accuracy with fewer features (a less complex model):


In [10]:
clf2 = DecisionTreeClassifier(min_samples_split=40)
clf2.fit(features_train2, labels_train2)
clf2.score(features_test2, labels_test2)


Out[10]:
0.96587030716723554

Not all that much worse: accuracy drops only about 1.2 percentage points (0.9778 vs. 0.9659) while the feature count drops by a factor of ten (3785 vs. 379), so the simpler model gives up very little.
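
A quick way to see the complexity difference directly is to time both fits (a sketch reusing the time import and the data already loaded):

for name, X, y in [("10% of features", features_train, labels_train),
                   ("1% of features", features_train2, labels_train2)]:
    t0 = time()
    DecisionTreeClassifier(min_samples_split=40).fit(X, y)
    print(name, "fit in %0.3fs" % (time() - t0))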