The simplest evaluation measures to produce are the 'True Positive', 'False Positive' and 'False Negative' counts.
In [41]:
import numpy as np
positive_label = 1 # The label we treat as the positive class.
# An example set of labels for our ground truth
gt_labels = np.array([1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2])
# Some example output from a classifier (i.e. model), with classification errors.
model_output = np.array([1,1,2,2,2,1,1,1,2,1,2,2,2,2,2,2,1,2,1,2])
With the true positive, false positive and false negative counts we can compute 'Precision', 'Recall' and 'F-Measure'. Precision is the ratio of correctly classified positive files to all files classified as positive, that is, precision = #TP / (#TP + #FP). Hence:
In [42]:
# Compute the number of true positives: the model output matches the ground truth and the ground truth is the positive label.
# (Adding 0.0 converts the boolean array to floats so that sum() counts the matches.)
tp = sum((model_output == gt_labels) * (gt_labels == positive_label) + 0.0)
# Precision is the true positive count divided by the number of files claimed as positive.
precision = tp / sum((model_output == positive_label) + 0.0)
precision
Out[42]:
The false positive count is the number of files classified as positive whose ground truth label was not positive.
In [43]:
fp = sum((model_output == positive_label) * (gt_labels != positive_label) + 0.0)
fp
Out[43]:
The false negative count is the number of files classified as negative (i.e. not given the positive label) when the ground truth label was positive.
In [44]:
fn = sum((model_output != positive_label) * (gt_labels == positive_label) + 0.0)
fn
Out[44]:
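As a cross-check, the same counts can be read off scikit-learn's confusion_matrix in one call (a minimal sketch, assuming scikit-learn is installed; it is used for the ROC curves below anyway).
In [ ]:
from sklearn.metrics import confusion_matrix
# Rows are ground truth, columns are model output, ordered as given in `labels`: [positive, negative].
cm = confusion_matrix(gt_labels, model_output, labels=[positive_label, 2])
print(cm)
# cm[0, 0] is the true positive count, cm[0, 1] the false negative count and cm[1, 0] the false positive count.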
Recall is the ratio of correctly classified positive files to all files whose ground truth label is positive: recall = #TP / (#TP + #FN).
In [45]:
recall = tp / (tp + fn)
recall
Out[45]:
The F-score is the harmonic mean of precision and recall, combining both measures:
In [46]:
f_score = 2 * precision * recall / (precision + recall)
f_score
Out[46]:
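scikit-learn also provides these measures directly. As a quick sanity check (a sketch using functions from sklearn.metrics, which should agree with the values computed above):
In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score
print(precision_score(gt_labels, model_output, pos_label=positive_label))
print(recall_score(gt_labels, model_output, pos_label=positive_label))
print(f1_score(gt_labels, model_output, pos_label=positive_label))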
Alternatively, we often want to understand the performance of a single classifier over a range of threshold parameter settings. A "Receiver Operating Characteristic" (ROC) curve plots the true positive rate against the false positive rate for each threshold setting, depicting the relative trade-off between true positives (benefits) and false positives (costs) at each parameter value.
Plotting this curve is made easy by scikit-learn's ROC functions. It is, however, restricted to binary classification (i.e. snare vs. non-snare).
In [48]:
import numpy as np
from sklearn.metrics import roc_curve
roc_curve?
In [56]:
# roc_curve produces the false and true positive rates at each threshold, given the labels and normalised scores.
# The scores represent the classifier's confidence that each example belongs to the positive class.
# Since our classifier does not output a confidence measure, we use a binary score:
# 1.0 where the model output the positive label and 0.0 otherwise.
scores = (model_output == positive_label) + 0.0
print(scores)
# We indicate to roc_curve that the value 1 in gt_labels is our positive snare label.
fpr, tpr, thresholds = roc_curve(gt_labels, scores, pos_label=positive_label)
print(fpr)
print(tpr)
print(thresholds)
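With purely binary scores the resulting curve only has a few operating points. As an illustration of the threshold sweep described above (a sketch using made-up graded confidences, not output of our classifier), roc_curve yields a finer-grained set of operating points when given graded scores:
In [ ]:
# Hypothetical graded confidences for the positive (snare) class, one per example, purely for illustration.
graded_scores = np.array([0.9, 0.8, 0.4, 0.35, 0.3, 0.7, 0.75, 0.85, 0.2, 0.6,
                          0.45, 0.1, 0.15, 0.25, 0.05, 0.3, 0.55, 0.2, 0.65, 0.1])
fpr_g, tpr_g, thresholds_g = roc_curve(gt_labels, graded_scores, pos_label=positive_label)
# The thresholds are swept from the highest score downwards, giving one operating point per returned threshold.
print(thresholds_g)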
The roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by the acronyms AUC or AUROC. This summarises the behaviour of the system in a single number.
In [61]:
import numpy as np
from sklearn.metrics import roc_auc_score, auc
# A small standalone example with binary ground truth labels and graded classifier scores.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc = roc_auc_score(y_true, y_scores)
roc_auc
Out[61]:
In [66]:
# Convert the ground truth to binary {0, 1} labels, with 1 marking the positive (snare) class.
binary_labels = (gt_labels == positive_label) + 0
print(binary_labels)
roc_auc = roc_auc_score(binary_labels, scores)
roc_auc
Out[66]:
In [59]:
# Alternatively, the area under the curve can be computed directly from the false and true positive rates.
roc_auc = auc(fpr, tpr)
roc_auc
Out[59]:
We can then plot the ROC curve with matplotlib.
In [9]:
from matplotlib.pyplot import *
clf()
plot(fpr, tpr, '-x', label='ROC curve (area = %0.2f)' % roc_auc)
# Plot the line of no discrimination
plot([0, 1], [0, 1], 'k--')
xlim([0.0, 1.0])
ylim([0.0, 1.0])
xlabel('False Positive Rate')
ylabel('True Positive Rate')
title('Receiver Operating Characteristic')
legend(loc="lower right")
show()