The simplest evaluation measures to produce are the 'True Positive', 'False Positive' and 'False Negative' counts.
In [41]:
import numpy as np
positive_label = 1 # The label we treat as the positive class.
# An example set of labels for our ground truth
gt_labels = np.array([1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2])
# Some example output from a classifier (i.e. model), with classification errors.
model_output = np.array([1,1,2,2,2,1,1,1,2,1,2,2,2,2,2,2,1,2,1,2])
With the true positive, false positive and false negative counts we can compute 'Precision', 'Recall' and 'F-Measure'. Precision is the ratio of correctly classified positive files to all files classified as positive, that is, precision = #TP / (#TP + #FP). Hence:
In [42]:
# Compute the number of true positives: the model output matches the ground truth and the ground truth is the positive label.
# (Adding 0.0 converts the boolean array to floats so that sum() counts the matches.)
tp = sum((model_output == gt_labels) * (gt_labels == positive_label) + 0.0)
# Precision is the true positive count divided by the number of files claimed as positive.
precision = tp / sum((model_output == positive_label) + 0.0)
precision
Out[42]:
The false positive count is the number of files classified as positive whose ground truth label was not positive.
In [43]:
fp = sum((model_output == positive_label) * (gt_labels != positive_label) + 0.0)
fp
Out[43]:
The false negative count is the number of files classified as negative (i.e. not given the positive label) when the ground truth label was positive.
In [44]:
fn = sum((model_output != positive_label) * (gt_labels == positive_label) + 0.0)
fn
Out[44]:
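As a cross-check, the same counts can be read off scikit-learn's confusion_matrix in one call (a minimal sketch, assuming scikit-learn is installed; it is used for the ROC curves below anyway).
In [ ]:
from sklearn.metrics import confusion_matrix
# Rows are ground truth, columns are model output, ordered as given in `labels`: [positive, negative].
cm = confusion_matrix(gt_labels, model_output, labels=[positive_label, 2])
print(cm)
# cm[0, 0] is the true positive count, cm[0, 1] the false negative count and cm[1, 0] the false positive count.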
Recall is the ratio of correctly classified positive files to all files whose ground truth label is positive: recall = #TP / (#TP + #FN).
In [45]:
recall = tp / (tp + fn)
recall
Out[45]:
The F-score is the harmonic mean of precision and recall, combining both measures:
In [46]:
f_score = 2 * precision * recall / (precision + recall)
f_score
Out[46]:
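scikit-learn also provides these measures directly. As a quick sanity check (a sketch using functions from sklearn.metrics, which should agree with the values computed above):
In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(gt_labels, model_output, pos_label=positive_label))
print(recall_score(gt_labels, model_output, pos_label=positive_label))
print(f1_score(gt_labels, model_output, pos_label=positive_label))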
Alternatively, we often want to understand the performance of a single classifier over a range of threshold parameter settings. A "Receiver Operating Characteristic" (ROC) curve plots the true positive rate against the false positive rate for each threshold setting, depicting the relative trade-off between true positives (benefits) and false positives (costs) at each parameter value.
Plotting this curve is made easy by scikit-learn's ROC functions. It is, however, restricted to binary classification (i.e. snare vs. non-snare).
In [48]:
import numpy as np
from sklearn.metrics import roc_curve
roc_curve?
In [56]:
# roc_curve produces the false and true positive rates at each threshold, given the labels and normalised scores.
# The scores represent the classifier's confidence that each example belongs to the positive class.
# Since our classifier does not output a confidence measure, we use a binary score:
# 1.0 where the model output the positive label and 0.0 otherwise.
scores = (model_output == positive_label) + 0.0
print(scores)
# We indicate to roc_curve that the value 1 in gt_labels is our positive snare label.
fpr, tpr, thresholds = roc_curve(gt_labels, scores, pos_label=positive_label)
print(fpr)
print(tpr)
print(thresholds)
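With purely binary scores the resulting curve only has a few operating points. As an illustration of the threshold sweep described above (a sketch using made-up graded confidences, not output of our classifier), roc_curve yields a finer-grained set of operating points when given graded scores:
In [ ]:
# Hypothetical graded confidences for the positive (snare) class, one per example, purely for illustration.
graded_scores = np.array([0.9, 0.8, 0.4, 0.35, 0.3, 0.7, 0.75, 0.85, 0.2, 0.6,
                          0.45, 0.1, 0.15, 0.25, 0.05, 0.3, 0.55, 0.2, 0.65, 0.1])
fpr_g, tpr_g, thresholds_g = roc_curve(gt_labels, graded_scores, pos_label=positive_label)
# The thresholds are swept from the highest score downwards, giving one operating point per returned threshold.
print(thresholds_g)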
The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by the acronyms AUC or AUROC. This summarises the behaviour of the system in a single number.
In [61]:
import numpy as np
from sklearn.metrics import roc_auc_score, auc
# A small standalone example with binary ground truth labels and graded classifier scores.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc = roc_auc_score(y_true, y_scores)
roc_auc
Out[61]:
In [66]:
# Convert the ground truth to binary {0, 1} labels, with 1 marking the positive (snare) class.
binary_labels = (gt_labels == positive_label) + 0
print(binary_labels)
roc_auc = roc_auc_score(binary_labels, scores)
roc_auc
Out[66]:
In [59]:
# Alternatively, the area under the curve can be computed directly from the false and true positive rates.
roc_auc = auc(fpr, tpr)
roc_auc
Out[59]:
We can then plot the ROC curve with matplotlib.
In [9]:
from matplotlib.pyplot import *
clf()
plot(fpr, tpr, '-x', label='ROC curve (area = %0.2f)' % roc_auc)
# Plot the line of no discrimination
plot([0, 1], [0, 1], 'k--')
xlim([0.0, 1.0])
ylim([0.0, 1.0])
xlabel('False Positive Rate')
ylabel('True Positive Rate')
title('Receiver Operating Characteristic')
legend(loc="lower right")
show()