Lesson 14 - Evaluation Metrics

Task: identify Persons of Interest (POIs) in the Enron fraud dataset.


In [13]:
import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../final_project/final_project_dataset.pkl", "r") )

### first element is our labels, any added elements are predictor
### features. Keep this the same for the mini-project, but you'll
### have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)


print len(labels), len(features)


95 95

Create a decision tree classifier (just use the default parameters) and train it on all the data. Print out the accuracy. THIS IS AN OVERFIT TREE, DO NOT TRUST THIS NUMBER! Nonetheless:

  • what’s the accuracy?

In [14]:
from sklearn import tree
from time import time

def submitAcc(features, labels):
    # score the globally defined (already fitted) classifier on the given data
    return clf.score(features, labels)



clf = tree.DecisionTreeClassifier()
t0 = time()
clf.fit(features, labels)
print("done in %0.3fs" % (time() - t0))


done in 0.001s

In [15]:
pred = clf.predict(features)
print "Classifier with accuracy %.2f" % submitAcc(features, labels)


Classifier with accuracy 0.99

Now you’ll add in training and testing, so that you get a trustworthy accuracy number. Use the train_test_split validation available in sklearn.cross_validation; hold out 30% of the data for testing and set the random_state parameter to 42 (random_state controls which points go into the training set and which are used for testing; setting it to 42 means we know exactly which events are in which set, and can check the results you get).

  • What’s your updated accuracy?

In [16]:
from sklearn import cross_validation


X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, labels, test_size=0.30, random_state=42)

print len(X_train), len(y_train)
print len(X_test), len(y_test)


66 66
29 29

In [17]:
clf = tree.DecisionTreeClassifier()
t0 = time()
clf.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))


done in 0.001s

In [18]:
pred = clf.predict(X_test)
print "Classifier with accuracy %.2f" % submitAcc(X_test, y_test)


Classifier with accuracy 0.72

How many POIs are in the test set for your POI identifier?

  • (Note that we said test set! We are not looking for the number of POIs in the whole dataset.)

In [27]:
numPoiInTestSet = len([p for p in y_test if p == 1.0])
print numPoiInTestSet


4

If your identifier predicted 0. (not POI) for everyone in the test set, what would its accuracy be?


In [29]:
from __future__ import division

1.0 - numPoiInTestSet/29


Out[29]:
0.8620689655172413

Aaaand the testing data brings us back down to earth after that 99% accuracy.
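
As a quick sanity check, the same number drops out of sklearn's accuracy_score (a minimal sketch; baseline_pred is just a hypothetical all-negative prediction, not part of the lesson code):

In [ ]:
from sklearn.metrics import accuracy_score

# a pretend identifier that labels everyone in the test set as a non-POI (0.0)
baseline_pred = [0.0] * len(y_test)
print accuracy_score(y_test, baseline_pred)   # 25/29 = 0.862..., matching the hand calculation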

Concerns with Accuracy

  • If you have a skewed dataset, as is the case with this one, the most common class dominates the accuracy score.
  • The problem may be such that it is best to err on the side of guessing innocence.
  • In another case, you may want to err on the side of predicting guilt, in the hope that innocent persons will be cleared through the investigation.

Accuracy is not a particularly good measure if any of these cases apply to you. Precision and recall are better metrics for evaluating the performance of the model.

Picking The Most Suitable Metric

As you may now see, having imbalanced classes like we have in the Enron dataset (many more non-POIs than POIs) introduces a special challenge: you can guess the more common class label for every point (not a very insightful strategy!) and still get pretty good accuracy.

Precision and recall can help illuminate your performance better.
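
As a reminder of the definitions: precision = true positives / (true positives + false positives), i.e. of everyone the model flags as a POI, the fraction who really are POIs; recall = true positives / (true positives + false negatives), i.e. of all the real POIs, the fraction the model actually catches.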

  • Use the precision_score and recall_score available in sklearn.metrics to compute those quantities.
  • What’s the precision?

In [30]:
from sklearn.metrics import precision_score, recall_score, confusion_matrix

In [32]:
precision_score(y_test,clf.predict(X_test))


Out[32]:
0.0

Obviously this isn’t a very optimized machine learning strategy (we haven’t tried any algorithms besides the decision tree, or tuned any parameters, or done any feature selection), and now seeing the precision and recall should make that much more apparent than the accuracy did.


In [33]:
recall_score(y_test,clf.predict(X_test))


Out[33]:
0.0

In [31]:
y_true = y_test
y_pred = clf.predict(X_test)

cM = confusion_matrix(y_true, y_pred)

# sklearn's confusion_matrix: rows are the actual classes, columns are the
# predicted classes; class 0 = non-POI (negative), class 1 = POI (positive)
print "{:>72}".format('Predicted Class')
print "{:>20}{:>20}{:>20}{:>23}".format('Actual', '', 'Negative', 'Positive')
print "{:>20}{:>20}{:>20.3f}{:>23.3f}".format('', 'Negative', cM[0][0], cM[0][1])
print "{:>20}{:>20}{:>20.3f}{:>23.3f}".format('', 'Positive', cM[1][0], cM[1][1])


                                                         Predicted Class
              Actual                                Negative               Positive
                                Negative              21.000                  4.000
                                Positive               4.000                  0.000
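
For reference, the same precision and recall can be read straight off this confusion matrix (a minimal sketch; the variable names below are just for illustration, and the layout assumes sklearn's convention of rows = actual class, columns = predicted class, with class 0 = non-POI and class 1 = POI):

In [ ]:
# pull the four counts out of cM (the values shown in the table above)
true_negatives  = cM[0][0]   # 21 non-POIs correctly labelled non-POI
false_positives = cM[0][1]   #  4 non-POIs wrongly flagged as POI
false_negatives = cM[1][0]   #  4 POIs the model missed
true_positives  = cM[1][1]   #  0 POIs correctly flagged

print float(true_positives) / (true_positives + false_positives)   # precision: 0/4 = 0.0
print float(true_positives) / (true_positives + false_negatives)   # recall:    0/4 = 0.0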

In [ ]: