In [1]:
using ROCAnalysis
This package analyses the results of a probabilistic binary classifier. We'll assume that the classifier generates scores that tend to be higher towards the "target" class, and lower towards the "non-target" class.
In this notebook we will simulate the scores by generating them from two separate distributions. We'll do this in such a way that the scores can be considered well-calibrated log-likelihood-ratio scores, i.e., the scores can be interpreted probabilistically as the llr $\ell$
$$\ell = \log \dfrac{P({\rm data} \mid {\rm target})}{P({\rm data} \mid {\rm nontarget})}$$
In [2]:
## Produce some well-calibrated log-likelihood-ratio scores for target and non-target class:
tar = 2 .+ 2randn(1000)
non = -2 .+ 2randn(100000)
Out[2]:
These scores are well-calibrated, because their distribution follows the defining property
$$\ell = \log \dfrac{P(\ell \mid {\rm target})}{P(\ell \mid {\rm nontarget})}$$
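To see why the simulated scores above are calibrated, note that for the two Gaussians used (means $\pm 2$, common standard deviation $2$) the log-likelihood-ratio of a score $s$ is the score itself, so the defining property holds by construction:
$$\log \dfrac{P(s \mid {\rm target})}{P(s \mid {\rm nontarget})} = \dfrac{(s+2)^2 - (s-2)^2}{2\cdot 2^2} = s$$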
In [3]:
## quick estimate of equal error rate, should be close to pnorm(-1) = 0.5 + 0.5erf(-1/√2)
eer(tar, non)
Out[3]:
The equal error rate is an important single-number summary of the ROC curve (see below for a plot). It is the point at which the false positive and false negative rates are the same.
In [4]:
pnorm(-1)
Out[4]:
For the scores generated above from Gaussian distributions, we can derive the theoretical EER quite easily. It is related to the cumulative distribution function of the normal distribution.
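As a sanity check, here is one way to compute that theoretical value directly. This is a small sketch; it assumes the SpecialFunctions package is available for `erf`, which is not required by ROCAnalysis itself.

```julia
using SpecialFunctions  # for erf; assumed to be installed, not a ROCAnalysis dependency

## For two equal-variance Gaussians with means ±2 and σ = 2, the optimal threshold
## lies midway between the means (at 0), so both error rates equal Φ(-1),
## the standard normal CDF evaluated at -1.
Φ(x) = 0.5 + 0.5 * erf(x / √2)   # standard normal CDF
Φ(-1)                            # ≈ 0.1587; compare with eer(tar, non) above
```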
In [5]:
## compute full ROC statistics
r = roc(tar, non)
Out[5]:
Admittedly not pretty printed---but we can call in the help from DataFrames to understand the data structure of a `Roc` object:

- `pfa`: the false alarm rate
- `pmiss`: the miss rate
- `thres`: the threshold, separating this line's `pfa` and `pmiss` from the next
- `chull`: indicates whether this point lies on the convex hull of the ROC curve
- `llr`: the optimal log-likelihood-ratio score for all data points contributing to the ROC line segment from this line to the next

The last entry, `llr`, corresponds to the negative slope of the ROC convex hull, and has a direct relation to the "minimum" versions of the various metrics below.
In [6]:
using DataFrames
DataFrame(r)
Out[6]:
In [7]:
## accurate computation of the equal error rate, using the convex hull
eerch(r)
Out[7]:
In [9]:
using Winston ## or perhaps another plotting package
plot(r)
Out[9]:
The axes of a ROC plot are named "Probability of False Alarm" (false positive rate) and "Probability of Miss" (false negative rate). There are many names for these (see the README.md). A note about this way of plotting:
In [10]:
## The "Detection Error Tradeoff" plot, this should give a more/less straight line
detplot(r)
Out[10]:
A Detection Error Trade-off plot (DET plot) shows the same information as the ROC plot above---but the scales are warped according to the inverse of the cumulative normal distribution. This way of plotting has many advantages:
- If the distributions of target and non-target scores are both Normal, then the DET curve is a straight line. In practice, many detection problems give rise to more-or-less straight DET curves, and this suggests that there exists a strictly increasing warping function that can make the score distributions (more) Normal.
- Towards better performance (lower error rates), the resolution of the graph is higher. This makes it easier to show multiple systems / performance characteristics over a smaller or wider range of performance in the same graph, and still be able to tell them apart.
- Conventionally, the ranges of the axes are chosen as 0.1%--50%, and the plot area should really be square. This makes it possible to immediately assess the overall performance from the absolute position of the line in the graph, once you have seen a few DET plots in your life.
- The slope of the (straight) line corresponds to the ratio of the σ parameters of the underlying Normal score distributions, namely that of the non-target scores divided by that of the target scores. Often, highly discriminative classifiers show very flat curves, indicating that the target scores have a much larger variance than the non-target scores (see the sketch below).
The origin of this type of plot lies in psychophysics, where graph paper with lines according to this warping was referred to as double probability paper. The diagonal $y=x$ in a DET plot corresponds linearly to a quantity known as $d'$ (d-prime) from psychophysics, ranging from 0 at 50% error to about 6 at 0.1% error.
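As a small illustration of the slope property (with hypothetical parameters chosen only for this sketch): target scores with three times the spread of the non-target scores should still give a straight DET curve, but with slope about $1/3$, i.e., a rather flat line:

```julia
## Hypothetical example: σ_tar = 3, σ_non = 1, so the expected DET slope is σ_non/σ_tar ≈ 1/3
tar_wide   = 2 .+ 3randn(1000)
non_narrow = -2 .+ randn(100000)
detplot(roc(tar_wide, non_narrow))
```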
In [11]:
## compute the Area Under the ROC, should be close to 0.078
auc(r)
Out[11]:
Please note that, traditionally, the AUC is the complement of this. However, in this package all metrics are "error" and "cost" like, so low numbers always indicate better performance. As a courtesy to all other research fields that work with two-class detectors, we have the convenience function `AUC()`, which---like `auc()`---allows for partial integration up to a certain value of `pfa` or `pmiss`.
In [12]:
println(AUC(r))
println(AUC(r, pfa=0.1, normalize=false)) ## Traditional AUC integrating pfa from 0 to 0.1
println(AUC(r, pfa=0.1)) ## normalize partial AUC to range from 0.5 (no discrimination) to 1.0 (perfect discrimination)
In [13]:
## define a decision cost function by its parameters p_tar=0.01, Cfa=1, Cmiss=10 (NIST SRE 2008 setting)
d = DCF(0.01, 1, 10)
Out[13]:
Decision theory analyses the two types of error that can be made in terms of a cost function, weighting one type of error (say, a criminal that is not sentenced) differently from the other type of error (say, an innocent civilian that is sentenced), and a prior for the target hypothesis (the prior that someone is guilty before assessing the evidence). These parameters are called $p_{\rm tar}$, $C_{\rm fa}$ and $C_{\rm miss}$ in this package. Their joint influence on decision making can be nicely summarized in the prior log-odds.
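For the cost parameters chosen here, and using the definition of the prior log-odds $\zeta$ given further below, this works out to
$$\log\Bigl(\dfrac{0.01}{0.99}\cdot\dfrac{10}{1}\Bigr) \approx -2.29,$$
which should match the value `plo(d)` computes in the next cell.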
In [14]:
## `actual costs' using a threshold of scores at -plo(d) (minus prior log odds)
plo(d)
Out[14]:
If the scores are interpretable as calibrated log-likelihood-ratios, then the optimal threshold should be at minus the prior log-odds, `-plo(d)`, and the value of the dcf can be computed as:
In [15]:
dcf(tar, non, d=d)
Out[15]:
If you know the decision cost function you want to work with and it doesn't change, you can set it as the default:
In [16]:
setdcf(d=d)
Out[16]:
In [17]:
dcf(tar, non)
Out[17]:
Perhaps your data requires a different threshold, because the scores have not been calibrated.
In [18]:
dcf(tar, non, thres=0.)
Out[18]:
A lot worse! The default threshold for this data and cost function, `-plo(d)` (minus the prior log-odds of -2.29, i.e., about +2.29), performs better (has a lower dcf). What is the minimal value of the dcf that can be obtained if we vary the threshold?
In [19]:
mindcf(tar, non, d=d)
Out[19]:
The minimum dcf is the value of the decision cost function after choosing the threshold optimally, given that the entire ROC is known. In this case, the data were generated to be well calibrated, meaning that the difference between the actual dcf and the minimum dcf is small.
The `mindcf` is efficiently calculated by computing the ROC first, so if you have that lying around, you can use it in place of `tar, non` in most of the above functions:
In [20]:
mindcf(r)
Out[20]:
Please note that it is also possible to specify multiple cost functions and store them in a single DCF object. For instance, if you want to scan the behaviour of the dcf when the prior varies:
In [21]:
d = DCF(collect(0.01:0.01:0.99), 1, 1)
mindcf(r, d=d)
Out[21]:
So from this analysis it seems that the dcf vanishes at higher and lower priors. This is true: at priors close to 0 or 1, the decision can rely more and more on the prior alone, and less on the classifier. It is therefore interesting to look at the normalized dcf. This is the dcf computed for the classifier scores divided by the dcf based on the prior alone. In this case:
In [22]:
mindcf(r, d=d, norm=true)
Out[22]:
This shows that the utility of the classifier (how much decision costs can be improved) lies in the region of (effective) priors closer to `0.5`.
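The normalization can be sketched by hand. This is a hypothetical check, not part of the package API, and it assumes `mindcf(r, d=d)` returns one value per prior in `d`: the trivial, prior-only classifier either always accepts or always rejects, whichever is cheaper, so its cost is $\min(p_{\rm tar} C_{\rm miss},\;(1-p_{\rm tar}) C_{\rm fa})$.

```julia
## Sketch of the normalization (hypothetical helper code, not part of the package)
ptar = collect(0.01:0.01:0.99)      # the same prior grid as in the DCF above
prior_only = min.(ptar, 1 .- ptar)  # dcf of deciding on the prior alone (Cfa = Cmiss = 1)
mindcf(r, d=d) ./ prior_only        # should be close to mindcf(r, d=d, norm=true)
```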
Instead of the somewhat complicated cost function, which depends on a combination of the three parameters $p_{\rm tar}$, $C_{\rm fa}$ and $C_{\rm miss}$, we can analyze the classifier's performance in terms of the single critical parameter, the prior log-odds $\zeta$
$$\zeta = \log\bigl(\dfrac{p_{\rm tar}}{1-p_{\rm tar}} \dfrac{C_{\rm miss}}{C_{\rm fa}}\bigr)$$
Such a function of $\zeta$ alone is the Bayes Error Rate $E_B$
$$E_B = p_{\rm eff} \ p_{\rm miss} + (1-p_{\rm eff}) \ p_{\rm fa}$$
where $p_{\rm eff}$ is the effective prior (a cost-weighted prior, defined by $\zeta = \log\frac{p_{\rm eff}}{1-p_{\rm eff}}$), and $p_{\rm fa}$ and $p_{\rm miss}$ are the false alarm and miss rates computed by thresholding the scores `tar` and `non` at the optimal Bayes threshold $-\zeta$.
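As a sketch, this formula can be evaluated directly on the simulated scores (only `Statistics.mean` is used; compare with the `ber` call in the next cell):

```julia
using Statistics   # for mean

## Hand computation of the Bayes error rate at ζ = -2.29, following the formula above
ζ = -2.29
peff  = 1 / (1 + exp(-ζ))     # effective prior, inverting ζ = log(peff / (1 - peff))
pmiss = mean(tar .< -ζ)       # miss rate at the Bayes threshold -ζ
pfa   = mean(non .> -ζ)       # false alarm rate at the Bayes threshold -ζ
peff * pmiss + (1 - peff) * pfa   # should be close to ber(tar, non, -2.29) below
```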
In [23]:
ber(tar, non, -2.29)
Out[23]:
In [24]:
## now scan the Bayes error rate (similar to the dcf above) for a range of prior log odds, and plot
## This is known as the Applied Probability of Error plot
apeplot(r)
Out[24]:
The red curve corresponds to the actual Bayes Error Rate, given the classifier's scores and the `plo` parameter; the green curve corresponds to the minimum Bayes Error Rate. This minimum Bayes Error is obtained, for each value of `plo`, by shifting the scores until the Bayes Error Rate is minimal. In this case, because the scores were generated as log-likelihood ratios, this minimum is more-or-less equal to the actual `ber`. The black dotted line is the `ber` of the trivial classifier, which bases its decision on the prior alone (i.e., always decides `true` if `plo > 0`, and `false` otherwise).
You may appreciate this by considering scores that are just twice as big---in the resulting APE plot, the Bayes Error is higher than the minimum value. This is an example of mis-calibrated log-likelihood-ratio scores.
In [25]:
r2 = roc(2tar, 2non)
apeplot(r2)
Out[25]:
For `abs(plo) > 2.5`, the classifier's log-likelihood-ratio scores are so far off that Bayes decisions are made that lead to a higher error than the trivial classifier!
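A quick numerical check of that claim (a sketch, picking an arbitrary prior log-odds of 3, just beyond that range): the trivial classifier either always accepts or always rejects, so its Bayes error rate is $\min(p_{\rm eff}, 1-p_{\rm eff})$.

```julia
## At plo = 3 the over-confident scores should lose from deciding on the prior alone
ζ = 3.0
peff = 1 / (1 + exp(-ζ))
ber(2tar, 2non, ζ), min(peff, 1 - peff)   # the first number should be the larger one
```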
This whole curve of calibration performance as a function of the prior can be summarized as proportional to the integral under the red curve, a quantity known as the "Cost of the log-likelihood-ratio", `cllr`:
In [26]:
cllr(2tar, 2non)
Out[26]:
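For reference, this quantity can be written out by hand. This is a sketch assuming the standard definition of cllr (the average cost, in bits, over target and non-target llr scores); it is not taken from the package source.

```julia
using Statistics   # for mean

## Hand computation of cllr, assuming the standard definition
cllr_hand(t, n) = 0.5 * (mean(log2.(1 .+ exp.(-t))) + mean(log2.(1 .+ exp.(n))))
cllr_hand(2tar, 2non)   # should match cllr(2tar, 2non) above
```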
In a similar way, the area under the green curve can be computed, and this is a summary of the discrimination performance of the classifier, in the same units (which happen to be "bits") as `cllr`:
In [27]:
mincllr(2tar, 2non)
Out[27]:
Another way of viewing the same calibration-over-effective-prior analysis, one that focuses on the utility of the classifier over the trivial classifier, is to show the normalized Bayes error rate:
In [28]:
nbeplot(r)
Out[28]:
Again, green and red are the minimum and actual normalized Bayes error rates. The two dashed/dotted curves correspond to the contributions of the false positives and false negatives to the total normalized error rate. The `plo` range is chosen asymmetrically, because there often are many more non-target trials than target trials, and hence the accuracy of the metric is better there.
Finally, as a last graph in this workbook, a plot that shows the relation between the actual scores and the optimally calibrated scores (corresponding to the minimum curves in `nbe` and `ber`). For well-calibrated log-likelihood-ratio scores this should be a straight line of unit slope. Let's see what happens for our over-confident llr scores:
In [29]:
llrplot(r2)
Out[29]:
Mmm. There still is some work to be done in scaling the graphs. More to come.