Udacity Machine Learning Evaluation mini-project

Prep stuff:


In [1]:
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r"))

In [2]:
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

Training a decision tree on this starter data:


In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split

In [4]:
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=42)

In [5]:
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

Counts of actual and predicted values:


In [6]:
# ref http://stackoverflow.com/questions/10741346
import numpy as np
unique, counts = np.unique(labels_test, return_counts=True)
print "true labels"
print np.asarray((unique, counts)).T
print "predicted labels"
unique, counts = np.unique(pred, return_counts=True)
print np.asarray((unique, counts)).T


true labels
[[  0.  25.]
 [  1.   4.]]
predicted labels
[[  0.  25.]
 [  1.   4.]]

These turn out to match up very poorly: the class counts agree (25 zeros and 4 ones in each), but there are no true positives. Just guessing 0 for everyone would in fact be more accurate.


In [7]:
print "number of true positives:",sum((labels_test==1) & (pred ==1))


number of true positives: 0
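
A quick check of that claim, as a minimal sketch using sklearn's accuracy_score: with 25 non-POIs among the 29 test points, an all-zero baseline would score 25/29 ≈ 0.862, while the tree, with 4 false positives and 4 false negatives, only gets 21/29 ≈ 0.724.

from sklearn.metrics import accuracy_score

# accuracy of the tree vs. an all-zero "nobody is a POI" baseline
print "tree accuracy:", accuracy_score(labels_test, pred)
print "baseline accuracy:", accuracy_score(labels_test, [0.]*len(labels_test))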

Precision and Recall:


In [8]:
from sklearn.metrics import precision_score, recall_score

This is not even slightly good news:


In [9]:
print "precision:",precision_score(labels_test,pred)
print "recall:",recall_score(labels_test,pred)


precision: 0.0
recall: 0.0
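
Both are zero because there are no true positives at all: every one of the 4 predicted POIs is a false positive (precision = 0/4) and every one of the 4 actual POIs is missed (recall = 0/4).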

Same thing with some fake data for comparison:


In [10]:
predictions = np.array([0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
true_labels = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0])
print "number of true positives:",sum((true_labels==1) & (predictions==1))
print "number of false positives:",sum((true_labels==0) & (predictions==1))
print "number of true negatives:",sum((true_labels==0) & (predictions==0))
print "number of false negatives:",sum((true_labels==1) & (predictions==0))


number of true positives: 6
number of false positives: 3
number of true negatives: 9
number of false negatives: 2
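
From those counts, precision is TP/(TP+FP) and recall is TP/(TP+FN), computed by hand: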

In [11]:
print "precision:", 6/(6+3.)
print "recall:", 6/(6+2.)


precision: 0.666666666667
recall: 0.75
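
As a cross-check, a minimal sketch running sklearn's scorers on the same fake data; they should reproduce the hand-computed values:

from sklearn.metrics import precision_score, recall_score

# sklearn's scorers on the made-up labels, for comparison with the manual arithmetic
print "precision:", precision_score(true_labels, predictions)
print "recall:", recall_score(true_labels, predictions)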