In [1]:
using ROCAnalysis
This package analyses the results of a probabilistic binary classifier. We'll assume that the classifier generates scores that tend to be higher towards the "target" class, and lower towards the "non-target" class.
In this notebook we will simulate the scores by generating them from two separate distributions. We'll do this in such a way that the scores can be considered well-calibrated log-likelihood-ratio scores, i.e., the scores can be interpreted probabilistically as the llr $\ell$
$$\ell = \log \dfrac{P({\rm data} \mid {\rm target})}{P({\rm data} \mid {\rm nontarget})}$$
In [2]:
## Produce some well-calibrated log-likelihood-ratio scores for target and non-target class:
tar = 2 .+ 2randn(1000)
non = -2 .+ 2randn(100000)
Out[2]:
These scores are well-calibrated, because their distribution follows the defining property
$$\ell = \log \dfrac{P(\ell \mid {\rm target})}{P(\ell \mid {\rm nontarget})}$$
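To see why the simulated scores above are calibrated, note that for the two Gaussians used (means $\pm 2$, common standard deviation $2$) the log-likelihood-ratio of a score $s$ is the score itself, so the defining property holds by construction:
$$\log \dfrac{P(s \mid {\rm target})}{P(s \mid {\rm nontarget})} = \dfrac{(s+2)^2 - (s-2)^2}{2\cdot 2^2} = s$$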
In [3]:
## quick estimate of equal error rate, should be close to pnorm(-1) = 0.5 + 0.5erf(-1/√2)
eer(tar, non)
Out[3]:
The equal error rate is an important single-number summary of the ROC curve (see below for a plot). It is the point at which the false positive and false negative rates are the same.
In [4]:
pnorm(-1)
Out[4]:
For the scores generated above from Gaussian distributions, we can derive the theoretical EER quite easily. It is related to the cumulative distribution function of the normal distribution.
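As a sanity check, here is one way to compute that theoretical value directly. This is a small sketch; it assumes the SpecialFunctions package is available for `erf`, which is not required by ROCAnalysis itself.

```julia
using SpecialFunctions  # for erf; assumed to be installed, not a ROCAnalysis dependency

## For two equal-variance Gaussians with means ±2 and σ = 2, the optimal threshold
## lies midway between the means (at 0), so both error rates equal Φ(-1),
## the standard normal CDF evaluated at -1.
Φ(x) = 0.5 + 0.5 * erf(x / √2)   # standard normal CDF
Φ(-1)                            # ≈ 0.1587; compare with eer(tar, non) above
```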
In [5]:
## compute full ROC statistics
r = roc(tar, non)
Out[5]:
Admittedly not pretty printed---but we can call in the help from DataFrames to understand the data structure of a `Roc` object:

- `pfa`: the false alarm rate
- `pmiss`: the miss rate
- `thres`: the threshold, separating this line's `pfa` and `pmiss` from the next
- `chull`: indicates whether this point lies on the convex hull of the ROC curve
- `llr`: the optimal log-likelihood-ratio score for all data points contributing to the ROC line segment from this line to the next

The last entry, `llr`, corresponds to the negative slope of the ROC convex hull, and has a direct relation to the "minimum" versions of the various metrics below.
In [6]:
using DataFrames
DataFrame(r)
Out[6]:
In [7]:
## accurate computation of the equal error rate, using the convex hull
eerch(r)
Out[7]:
In [9]:
using Winston ## or perhaps another plotting package
plot(r)
Out[9]:
The axes of a ROC plot are named "Probability of False Alarm" (false positive rate) and "Probability of Miss" (false negative rate). There are many names for these (see the README.md). A note about this way of plotting:
In [10]:
## The "Detection Error Tradeoff" plot, this should give a more/less straight line
detplot(r)
Out[10]:
A Detection Error Trade-off plot (DET plot) shows the same information as the ROC plot above---but the scales are warped according to the inverse of the cumulative normal distribution. This way of plotting has many advantages:
- If the distributions of target and non-target scores are both Normal, then the DET curve is a straight line. In practice, many detection problems give rise to more-or-less straight DET curves, and this suggests that there exists a strictly increasing warping function that can make the score distributions (more) Normal.
- Towards better performance (lower error rates), the resolution of the graph is higher. This makes it easier to show multiple systems / performance characteristics over a smaller or wider range of performance in the same graph, and still be able to tell them apart.
- Conventionally, the ranges of the axes are chosen as 0.1%--50%, and the plot area should really be square. This makes it possible to immediately assess the overall performance from the absolute position of the line in the graph, once you have seen a few DET plots in your life.
- The slope of the (straight) line corresponds to the ratio of the σ parameters of the underlying Normal score distributions, namely that of the non-target scores divided by that of the target scores. Often, highly discriminative classifiers show very flat curves, indicating that the target scores have a much larger variance than the non-target scores (see the sketch below).
The origin of this type of plot lies in psychophysics, where graph paper with lines according to this warping was referred to as double probability paper. The diagonal $y=x$ in a DET plot corresponds linearly to a quantity known as $d'$ (d-prime) from psychophysics, ranging from 0 at 50% error to about 6 at 0.1% error.
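As a small illustration of the slope property (with hypothetical parameters chosen only for this sketch): target scores with three times the spread of the non-target scores should still give a straight DET curve, but with slope about $1/3$, i.e., a rather flat line:

```julia
## Hypothetical example: σ_tar = 3, σ_non = 1, so the expected DET slope is σ_non/σ_tar ≈ 1/3
tar_wide   = 2 .+ 3randn(1000)
non_narrow = -2 .+ randn(100000)
detplot(roc(tar_wide, non_narrow))
```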
In [11]:
## compute the Area Under the ROC, should be close to 0.078
auc(r)
Out[11]:
Please note that, traditionally, the AUC is the complement of this. However, in this package all metrics are "error" and "cost" like, so low numbers always indicate better performance. As a courtesy to all other research fields that work with two-class detectors, we have the convenience function `AUC()`, which---like `auc()`---allows for partial integration up to a certain value of `pfa` or `pmiss`.
In [12]:
println(AUC(r))
println(AUC(r, pfa=0.1, normalize=false)) ## Traditional AUC integrating pfa from 0 to 0.1
println(AUC(r, pfa=0.1)) ## normalize partial AUC to range from 0.5 (no discrimination) to 1.0 (perfect discrimination)
In [13]:
## define a decision cost function by its parameters p_tar=0.01, Cfa=1, Cmiss=10 (NIST SRE 2008 setting)
d = DCF(0.01, 1, 10)
Out[13]:
Decision theory analyses the two types of error that can be made in terms of a cost function, weighting one type of error (say, a criminal that is not sentenced) differently from the other type of error (say, an innocent civilian that is sentenced), and a prior for the target hypothesis (the prior that someone is guilty before assessing the evidence). These parameters are called $p_{\rm tar}$, $C_{\rm fa}$ and $C_{\rm miss}$ in this package. Their joint influence on decision making can be nicely summarized in the prior log-odds.
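For the cost parameters chosen here, and using the definition of the prior log-odds $\zeta$ given further below, this works out to
$$\log\Bigl(\dfrac{0.01}{0.99}\cdot\dfrac{10}{1}\Bigr) \approx -2.29,$$
which should match the value `plo(d)` computes in the next cell.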
In [14]:
## `actual costs' using a threshold of scores at -plo(d) (minus prior log odds)
plo(d)
Out[14]:
If the scores are interpretable as calibrated log-likelihood-ratios, then the optimal threshold should be at minus the prior log-odds, `-plo(d)`, and the value of the dcf can be computed as:
In [15]:
dcf(tar, non, d=d)
Out[15]:
If you know the decision cost function you want to work with and it doesn't change, you can set it as the default:
In [16]:
setdcf(d=d)
Out[16]:
In [17]:
dcf(tar, non)
Out[17]:
Perhaps your data requires a different threshold, because the scores have not been calibrated.
In [18]:
dcf(tar, non, thres=0.)
Out[18]:
A lot worse! The default threshold for this data and cost function, `-plo(d)` (minus the prior log-odds of -2.29, i.e., about +2.29), performs better (has a lower dcf). What is the minimal value of the dcf that can be obtained if we vary the threshold?
In [19]:
mindcf(tar, non, d=d)
Out[19]:
The minimum dcf is the value of the decision cost function after choosing the threshold optimally, given that the entire ROC is known. In this case, the data were generated to be well calibrated, meaning that the difference between the actual dcf and the minimum dcf is small.
The `mindcf` is efficiently calculated by computing the ROC first, so if you have that lying around, you can use it in place of `tar, non` in most of the above functions:
In [20]:
mindcf(r)
Out[20]:
Please note that it is also possible to specify multiple cost functions and store them in a single DCF object. For instance, if you want to scan the behaviour of the dcf when the prior varies:
In [21]:
d = DCF(collect(0.01:0.01:0.99), 1, 1)
mindcf(r, d=d)
Out[21]:
So from this analysis it seems that the dcf vanishes at higher and lower priors. This is true: at priors close to 0 or 1, the decision can rely more and more on the prior alone, and less on the classifier. It is therefore interesting to look at the normalized dcf. This is the dcf computed for the classifier scores divided by the dcf based on the prior alone. In this case:
In [22]:
mindcf(r, d=d, norm=true)
Out[22]:
This shows that the utility of the classifier (how much decision costs can be improved) lies in the region of (effective) priors closer to `0.5`.
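The normalization can be sketched by hand. This is a hypothetical check, not part of the package API, and it assumes `mindcf(r, d=d)` returns one value per prior in `d`: the trivial, prior-only classifier either always accepts or always rejects, whichever is cheaper, so its cost is $\min(p_{\rm tar} C_{\rm miss},\;(1-p_{\rm tar}) C_{\rm fa})$.

```julia
## Sketch of the normalization (hypothetical helper code, not part of the package)
ptar = collect(0.01:0.01:0.99)      # the same prior grid as in the DCF above
prior_only = min.(ptar, 1 .- ptar)  # dcf of deciding on the prior alone (Cfa = Cmiss = 1)
mindcf(r, d=d) ./ prior_only        # should be close to mindcf(r, d=d, norm=true)
```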
Instead of the somewhat complicated cost function, which depends on a combination of the three parameters $p_{\rm tar}$, $C_{\rm fa}$ and $C_{\rm miss}$, we can analyze the classifier's performance in terms of the single critical parameter, the prior log-odds $\zeta$
$$\zeta = \log\bigl(\dfrac{p_{\rm tar}}{1-p_{\rm tar}} \dfrac{C_{\rm miss}}{C_{\rm fa}}\bigr)$$
Such a function of $\zeta$ alone is the Bayes Error Rate $E_B$
$$E_B = p_{\rm eff} \ p_{\rm miss} + (1-p_{\rm eff}) \ p_{\rm fa}$$
where $p_{\rm eff}$ is the effective prior (a cost-weighted prior, defined by $\zeta = \log\frac{p_{\rm eff}}{1-p_{\rm eff}}$), and $p_{\rm fa}$ and $p_{\rm miss}$ are the false alarm and miss rates computed by thresholding the scores `tar` and `non` at the optimal Bayes threshold $-\zeta$.
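As a sketch, this formula can be evaluated directly on the simulated scores (only `Statistics.mean` is used; compare with the `ber` call in the next cell):

```julia
using Statistics   # for mean

## Hand computation of the Bayes error rate at ζ = -2.29, following the formula above
ζ = -2.29
peff  = 1 / (1 + exp(-ζ))     # effective prior, inverting ζ = log(peff / (1 - peff))
pmiss = mean(tar .< -ζ)       # miss rate at the Bayes threshold -ζ
pfa   = mean(non .> -ζ)       # false alarm rate at the Bayes threshold -ζ
peff * pmiss + (1 - peff) * pfa   # should be close to ber(tar, non, -2.29) below
```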
In [23]:
ber(tar, non, -2.29)
Out[23]:
In [24]:
## now scan the Bayes error rate (similar to the dcf above) for a range of prior log odds, and plot
## This is known as the Applied Probability of Error plot
apeplot(r)
Out[24]:
The red curve corresponds to the actual Bayes Error Rate, given the classifier's scores and the `plo` parameter; the green curve corresponds to the minimum Bayes Error Rate. This minimum Bayes Error is obtained, for each value of `plo`, by shifting the scores until the Bayes Error Rate is minimal. In this case, because the scores were generated as log-likelihood ratios, this minimum is more-or-less equal to the actual `ber`. The black dotted line is the `ber` of the trivial classifier, which bases its decision on the prior alone (i.e., always decides `true` if `plo > 0`, and `false` otherwise).
You may appreciate this by considering scores that are just twice as big---in the resulting APE plot, the Bayes Error is higher than the minimum value. This is an example of mis-calibrated log-likelihood-ratio scores.
In [25]:
r2 = roc(2tar, 2non)
apeplot(r2)
Out[25]:
For `abs(plo) > 2.5`, the classifier's log-likelihood-ratio scores are so far off that Bayes decisions are made that lead to a higher error than the trivial classifier!
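A quick numerical check of that claim (a sketch, picking an arbitrary prior log-odds of 3, just beyond that range): the trivial classifier either always accepts or always rejects, so its Bayes error rate is $\min(p_{\rm eff}, 1-p_{\rm eff})$.

```julia
## At plo = 3 the over-confident scores should lose from deciding on the prior alone
ζ = 3.0
peff = 1 / (1 + exp(-ζ))
ber(2tar, 2non, ζ), min(peff, 1 - peff)   # the first number should be the larger one
```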
This whole curve of calibration performance as a function of the prior can be summarized as proportional to the integral under the red curve, a quantity known as the "Cost of the log-likelihood-ratio", `cllr`:
In [26]:
cllr(2tar, 2non)
Out[26]:
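For reference, this quantity can be written out by hand. This is a sketch assuming the standard definition of cllr (the average cost, in bits, over target and non-target llr scores); it is not taken from the package source.

```julia
using Statistics   # for mean

## Hand computation of cllr, assuming the standard definition
cllr_hand(t, n) = 0.5 * (mean(log2.(1 .+ exp.(-t))) + mean(log2.(1 .+ exp.(n))))
cllr_hand(2tar, 2non)   # should match cllr(2tar, 2non) above
```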
In a similar way, the area under the green curve can be computed, and this is a summary of the discrimination performance of the classifier, in the same units (which happen to be "bits") as `cllr`:
In [27]:
mincllr(2tar, 2non)
Out[27]:
Another way of viewing the same calibration-over-effective-prior analysis, one that focuses on the utility of the classifier over the trivial classifier, is to show the normalized Bayes error rate:
In [28]:
nbeplot(r)
Out[28]:
Again, green and red are the minimum and actual normalized Bayes error rates. The two dashed/dotted curves correspond to the contributions of the false positives and false negatives to the total normalized error rate. The `plo` range is chosen asymmetrically, because there often are many more non-target trials than target trials, and hence the accuracy of the metric is better there.
Finally, as a last graph in this workbook, a plot that shows the relation between the actual scores and the optimally calibrated scores (corresponding to the minimum curves in `nbe` and `ber`). For well-calibrated log-likelihood-ratio scores this should be a straight line of unit slope. Let's see what happens for our over-confident llr scores:
In [29]:
llrplot(r2)
Out[29]:
Mmm. There still is some work to be done in scaling the graphs. More to come.