In [1]:
using ROCAnalysis

This package analyses the results of a probabilistic binary classifier. We'll assume that the classifier generates scores that tend to be higher towards the "target" class, and lower towards the "non-target" class.

In this notebook we will simulate the scores by generating them from two separate distributions. We'll do this in such a way that the scores can be considered well-calibrated log-likelihood-ratio scores, i.e., the scores are probabilistically interpretable as an llr $\ell$

$$\ell = \log \dfrac{P({\rm data} \mid {\rm target})}{P({\rm data} \mid {\rm nontarget})}$$

In [2]:
## Produce some well-calibrated log-likelihood-ratio scores for target and non-target class:
tar =  2 + 2randn(1000)    ## 1000 target scores: Gaussian with mean 2, standard deviation 2
non = -2 + 2randn(100000)  ## 100000 non-target scores: Gaussian with mean -2, standard deviation 2


Out[2]:
100000-element Array{Float64,1}:
 -3.65865 
  0.111599
  0.378915
 -4.8397  
 -3.57974 
 -3.60667 
 -2.83765 
 -3.09525 
 -0.88947 
 -0.538918
 -0.847828
 -1.00992 
 -3.0787  
  ⋮       
 -1.33932 
 -2.02718 
 -5.81056 
 -0.525173
 -5.38597 
 -4.12203 
 -3.7793  
 -0.844421
 -1.87812 
 -0.223619
 -2.43359 
 -0.560365

These scores are well-calibrated, because their distribution follows the defining property

$$\ell = \log \dfrac{P(\ell \mid {\rm target})}{P(\ell \mid {\rm nontarget})}$$
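Indeed, for the two Gaussians used above this property can be verified directly: a score $x$ is its own log-likelihood ratio,

$$\log \dfrac{N(x \mid 2, 2)}{N(x \mid -2, 2)} = \dfrac{(x+2)^2 - (x-2)^2}{2\cdot 2^2} = x$$

so the identity score-to-llr mapping is the calibrated one.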

In [3]:
## quick estimate of equal error rate, should be close to pnorm(-1) = 0.5 + 0.5erf(-1/√2)
eer(tar, non)


Out[3]:
0.15815000000176976

The equal error rate is an important one-number summary of the ROC curve (see below for a plot). It is the point at which the false positive and false negative rates are equal.


In [4]:
pnorm(-1)


Out[4]:
0.15865525393145707

For the scores generated above from Gaussian distributions, we can derive the theoretical EER quite easily; it is given by the cumulative distribution function of the normal distribution.
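Specifically, with equal variances the optimal threshold lies midway between the two means, so

$$\mathrm{EER} = \Phi\Bigl(-\dfrac{\mu_{\rm tar} - \mu_{\rm non}}{2\sigma}\Bigr) = \Phi\Bigl(-\dfrac{2 - (-2)}{2\cdot 2}\Bigr) = \Phi(-1) \approx 0.1587$$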


In [5]:
## compute full ROC statistics
r = roc(tar, non)


Out[5]:
ROCAnalysis.Roc{Float64}([1.0,0.964,0.964,0.8508,0.8508,0.73922,0.73922,0.69712,0.69712,0.68447  …  5.0e-5,5.0e-5,3.0e-5,3.0e-5,2.0e-5,2.0e-5,1.0e-5,1.0e-5,0.0,0.0],[0.0,0.0,0.001,0.001,0.002,0.002,0.003,0.003,0.004,0.004  …  0.982,0.984,0.984,0.985,0.985,0.987,0.987,0.995,0.995,1.0],Bool[true,true,false,true,false,true,false,true,false,false  …  false,false,false,false,false,false,true,false,true,true],[-10.7113,-5.57892,-5.57876,-4.07145,-4.07142,-3.28102,-3.28091,-3.02784,-3.0277,-2.95507  …  6.01045,6.05454,6.13288,6.15789,6.18196,6.21583,6.42813,6.45797,6.98265,7.13642],[-Inf,-4.72916,-4.72916,-4.71474,-4.71474,-3.74005,-3.74005,-2.8739,-2.8739,-2.8739  …  5.54344,5.54344,5.54344,5.54344,5.54344,5.54344,5.54344,6.68461,6.68461,Inf])

Admittedly not pretty-printed---but we can call in the help of DataFrames to understand the data structure of a Roc object. Its fields are:

  • pfa: the false alarm rate
  • pmiss: the miss rate
  • thres: the threshold, separating this line's pfa and pmiss from the next
  • chull: indicating whether this point lies on the convex hull of the ROC curve
  • llr: the optimal log-likelihood-ratio score for all data points contributing to the ROC line segment from this line to the next

The last entry, llr, corresponds to the negative slope of the ROC convex hull, and has a direct relation to the "minimum" versions of the various metrics below.
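These fields can also be read off the Roc object directly; a minimal sketch, assuming the field names listed above:

## the first few operating points of the ROC
r.pfa[1:3], r.pmiss[1:3], r.thres[1:3]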


In [6]:
using DataFrames
DataFrame(r)


Out[6]:
     pfa      pmiss  thres                chull  llr
1    1.0      0.0    -10.711334863947837  true   -Inf
2    0.964    0.0    -5.578924344835429   true   -4.729156165769083
3    0.964    0.001  -5.578760545269962   false  -4.729156165769083
4    0.8508   0.001  -4.071446506828367   true   -4.714741822417382
5    0.8508   0.002  -4.071422418269238   false  -4.714741822417382
6    0.73922  0.002  -3.281023040359016   true   -3.740047740688336
7    0.73922  0.003  -3.280912541319257   false  -3.740047740688336
8    0.69712  0.003  -3.027839712052299   true   -2.8739035651888574
9    0.69712  0.004  -3.02769928579535    false  -2.8739035651888574
10   0.68447  0.004  -2.9550692339638793  false  -2.8739035651888574
11   0.68447  0.005  -2.955049365503252   false  -2.8739035651888574
12   0.68373  0.005  -2.9507927393062925  false  -2.8739035651888574
13   0.68373  0.006  -2.950743887080411   false  -2.8739035651888574
14   0.68028  0.006  -2.931498602177074   false  -2.8739035651888574
15   0.68028  0.007  -2.931435660028978   false  -2.8739035651888574
16   0.65068  0.007  -2.7734494606815216  false  -2.8739035651888574
17   0.65068  0.008  -2.7732652640462256  false  -2.8739035651888574
18   0.60859  0.008  -2.553231097149636   true   -2.7444892712936144
19   0.60859  0.009  -2.5532203058185923  false  -2.7444892712936144
20   0.60563  0.009  -2.536933187354933   false  -2.7444892712936144
21   0.60563  0.01   -2.5369279619767133  false  -2.7444892712936144
22   0.59363  0.01   -2.47744186076765    false  -2.7444892712936144
23   0.59363  0.011  -2.477420592967947   false  -2.7444892712936144
24   0.56192  0.011  -2.3135827276046363  true   -2.622492312740559
25   0.56192  0.012  -2.3135798730768524  false  -2.622492312740559
26   0.54815  0.012  -2.2411423612327432  true   -1.6477384699325126
27   0.54815  0.013  -2.241121938275877   false  -1.6477384699325126
28   0.54734  0.013  -2.2373328971133386  false  -1.6477384699325126
29   0.54734  0.014  -2.2372946667468003  false  -1.6477384699325126
30   0.54616  0.014  -2.23085370683574    false  -1.6477384699325126
⋮    ⋮        ⋮      ⋮                    ⋮      ⋮

In [7]:
## accurate computation of the equal error rate, using the convex hull
eerch(r)


Out[7]:
0.15763626251390433

In [9]:
using Winston ## or perhaps another plotting package 
plot(r)


Out[9]:
[ROC plot of r: probability of miss against probability of false alarm]

The axes of a ROC plot are named "Probability of False Alarm" (the false positive rate) and "Probability of Miss" (the false negative rate). There are many names for these (see the README.md). A note about this way of plotting:

  • The vertical axis (like the horizontal) is an error rate, not its complement, the hit rate. This means that a "good" ROC curve here moves towards the lower left corner, not the upper left corner. In the literature one traditionally comes across the true positive rate plotted against the false positive rate, which is an "upper-left" style of ROC.
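If you prefer the traditional orientation, a minimal sketch using the Roc fields (assuming Winston's two-argument plot, as loaded above):

## traditional "upper-left" ROC: hit rate against false alarm rate
plot(r.pfa, 1 .- r.pmiss)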

In [10]:
## The "Detection Error Tradeoff" plot, this should give a more/less straight line
detplot(r)


Out[10]:
[DET plot of r: the same error rates on probit-warped axes]

A Detection Error Trade-off plot (DET plot) shows the same information as the ROC plot above---but the scales are warped according to the inverse of the cumulative normal distribution. This way of plotting has many advantages:

  • If the distributions of target and non-target scores are both Normal, then the DET-curve is a straight line. In practice, many detection problems give rise to more-or-less straight DET curves, and this suggests that there exists a strictly increasing warping function that can make the score distributions (more) Normal.

  • Towards better performance (lower error rates), the resolution of the graph is higher. This makes it easier to show multiple systems / performance characteristics over a smaller or wider range of performance in the same graph, and still be able to tell them apart.

  • Conventionally, the ranges of the axes are chosen as 0.1%--50%, and the plot area should really be square. This makes it possible to assess the overall performance immediately from the absolute position of the line in the graph, once you have seen a few DET plots in your life.

  • The slope of the (straight) line corresponds to the ratio of the σ parameters of the underlying Normal score distributions, namely that of the non-target scores divided by that of the target scores. Often, highly discriminative classifiers show very flat curves, indicating that the target scores have a much larger variance than the non-target scores.

  • The origin of this type of plot lies in psychophysics, where graph paper with lines according to this warping was referred to as double probability paper. The diagonal $y=x$ in a DET plot corresponds linearly to a quantity known as $d'$ (d-prime) from psychophysics, ranging from 0 at 50% error to about 6 at 0.1% error.
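As a small illustration of the $d'$ relation, here is a sketch computing $d'$ at the EER operating point; the qnorm helper is defined here for illustration (erfinv lives in Base on older Julia versions and in the SpecialFunctions package on newer ones):

qnorm(p) = √2 * erfinv(2p - 1)   ## inverse of the cumulative normal
p = eer(tar, non)
qnorm(1 - p) - qnorm(p)          ## d' ≈ 2 for our N(±2, 2) simulation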


In [11]:
## compute the Area Under the ROC, should be close to 0.078
auc(r)


Out[11]:
0.08052228999999997
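For the Gaussian simulation above, the theoretical value is $1 - \Phi(d'/\sqrt{2}) = 1 - \Phi(\sqrt{2}) \approx 0.0786$ (with sensitivity $d' = 2$), consistent with the comment in the cell above.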

Please note that, traditionally, the AUC is the complement of this. However, in this package all metrics are "error"- and "cost"-like, so low numbers always indicate better performance. As a courtesy to all other research fields that work with two-class detectors, we have the convenience function AUC(), which---like auc()---allows for partial integration up to a certain value of pfa or pmiss.


In [12]:
println(AUC(r))
println(AUC(r, pfa=0.1, normalize=false)) ## Traditional AUC integrating pfa from 0 to 0.1
println(AUC(r, pfa=0.1)) ## normalize partial AUC to range from 0.5 (no discrimination) to 1.0 (perfect discrimination)


0.91947771
0.058844110000000005
0.7833900526315789

In [13]:
## define a decision cost function by its parameters p_tar=0.01, Cfa=1, Cmiss=10 (the NIST SRE 2008 setting)
d = DCF(0.01, 1, 10)


Out[13]:
Ptar = 0.01, Cfa = 1, Cmiss = 10
 prior log-odds = -2.292534757140544
 effective prior odds = 0.10101010101010102
 effective prior = 0.09174311926605504

Decision theory analyses the two types of error that can be made in terms of a cost function, weighting one type of error (say, a criminal that is not sentenced) differently from the other (say, an innocent civilian that is sentenced), together with a prior for the target hypothesis (the prior that someone is guilty before assessing the evidence). These parameters are called $p_{\rm tar}$, $C_{\rm fa}$ and $C_{\rm miss}$ in this package. Their joint influence on decision making can be nicely summarized in the prior log-odds.
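For the cost function just defined this works out as

$$\mathrm{plo} = \log\Bigl(\dfrac{p_{\rm tar}}{1-p_{\rm tar}}\,\dfrac{C_{\rm miss}}{C_{\rm fa}}\Bigr) = \log\Bigl(\dfrac{0.01}{0.99}\cdot\dfrac{10}{1}\Bigr) \approx -2.2925,$$

matching the prior log-odds in the printout above.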


In [14]:
## `actual cost': threshold the scores at -plo(d) (minus the prior log-odds)
plo(d)


Out[14]:
-2.292534757140544

If the scores are interpretable as calibrated log-likelihood ratios, then the optimal threshold lies at -plo(d), minus the prior log-odds. The value of the dcf at that threshold can be computed as:


In [15]:
dcf(tar, non, d=d)


Out[15]:
0.07233010000000001

If you know the decision cost function you want to work with and it doesn't change, you can set it as the default:


In [16]:
setdcf(d=d)


Out[16]:
Ptar = 0.01, Cfa = 1, Cmiss = 10
 prior log-odds = -2.292534757140544
 effective prior odds = 0.10101010101010102
 effective prior = 0.09174311926605504

In [17]:
dcf(tar, non)


Out[17]:
0.07233010000000001

Perhaps your data requires a different threshold, because the scores have not been calibrated.


In [18]:
dcf(tar, non, thres=0.)


Out[18]:
0.1732981

That is a lot worse: for this data and cost function, the threshold thres = -2.29 performs better (has a lower dcf). What is the minimal value of the dcf that can be obtained if we vary the threshold?


In [19]:
mindcf(tar, non, d=d)


Out[19]:
0.07157410000000002

The minimum dcf is the value of the decision cost function after choosing the threshold optimally, given that the entire ROC is known. In this case the data were generated to be well-calibrated, meaning that the difference between the actual dcf and the minimum dcf is small.

The mindcf is efficiently calculated by computing the ROC first, so if you have that lying around, you can use it in place of tar, non in most of the above functions:


In [20]:
mindcf(r)


Out[20]:
0.07157410000000002

Please note that it is also possible to specify multiple cost functions and store them in a single DCF object. For instance, if you want to scan the behaviour of the dcf as the prior varies:


In [21]:
d = DCF(collect(0.01:0.01:0.99), 1, 1)
mindcf(r, d=d)


Out[21]:
99-element Array{Float64,1}:
 0.009493 
 0.018154 
 0.0262308
 0.03364  
 0.040773 
 0.0469394
 0.0528443
 0.0587428
 0.0646369
 0.070531 
 0.0758965
 0.080628 
 0.0853595
 ⋮        
 0.076338 
 0.0709765
 0.065615 
 0.0602535
 0.054892 
 0.0495305
 0.0440354
 0.037706 
 0.0307648
 0.0238236
 0.0167444
 0.0093722

So from this analysis it seems that at higher and lower priors the dcf vanishes. This is true because, at priors close to 0 or 1, the decision can rely more and more on the prior alone, and less on the classifier. It is therefore interesting to look at the normalized dcf: the dcf computed from the classifier's scores divided by the dcf of the decision based on the prior alone.
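A sketch of the normalization, where the prior-only dcf is the cost of always deciding for the cheaper error:

$$\mathrm{dcf}_{\rm norm} = \dfrac{\mathrm{dcf}}{\min(p_{\rm tar}\,C_{\rm miss},\ (1-p_{\rm tar})\,C_{\rm fa})}$$

This checks out against the numbers above: at $p_{\rm tar}=0.01$ the prior-only dcf is $\min(0.01, 0.99) = 0.01$, and $0.009493/0.01 = 0.9493$, the first entry below. In this case: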


In [22]:
mindcf(r, d=d, norm=true)


Out[22]:
99-element Array{Float64,1}:
 0.9493  
 0.9077  
 0.87436 
 0.841   
 0.81546 
 0.782323
 0.754919
 0.734285
 0.718188
 0.70531 
 0.689968
 0.6719  
 0.656612
 ⋮       
 0.63615 
 0.645241
 0.65615 
 0.669483
 0.68615 
 0.707579
 0.733923
 0.75412 
 0.76912 
 0.79412 
 0.83722 
 0.93722 

This shows that the utility of the classifier (how much decision costs can be improved) lies in the region of (effective) priors closer to 0.5.

Instead of the somewhat complicated cost function, which depends on a combination of the three parameters $p_{\rm tar}$, $C_{\rm fa}$ and $C_{\rm miss}$, we can analyze the classifier's performance in terms of the single critical parameter, the prior log-odds $\zeta$

$$\zeta = \log\bigl(\dfrac{p_{\rm tar}}{1-p_{\rm tar}} \dfrac{C_{\rm miss}}{C_{\rm fa}}\bigr) $$

One such function of $\zeta$ is the Bayes Error Rate $E_B$

$$E_B = p_{\rm eff} \ p_{\rm miss} + (1-p_{\rm eff}) \ p_{\rm fa}$$

where $p_{\rm eff}$ is the effective prior (a cost-weighted prior, related to $\zeta$), and $p_{\rm fa}$ and $p_{\rm miss}$ are the false alarm and miss rate computed by thresholding the scores tar and non at the optimal Bayes threshold $-\zeta$.
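The effective prior follows from $\zeta$ as

$$p_{\rm eff} = \dfrac{1}{1 + e^{-\zeta}}$$

so for $\zeta \approx -2.2925$ above, $p_{\rm eff} \approx 0.0917$, as shown in the DCF printout earlier.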


In [23]:
ber(tar, non, -2.29)


Out[23]:
0.06651028927562345

In [24]:
## now scan the Bayes error rate (similar to the dcf above) for a range of prior log odds, and plot
## This is known as the Applied Probability of Error plot
apeplot(r)


Out[24]:
[APE plot of r: Bayes error rate against prior log-odds]

The red curve corresponds to the actual Bayes Error Rate, given the classifier's scores and the plo parameter; the green curve corresponds to the minimum Bayes Error Rate. This minimum is obtained, for each value of plo, by shifting the scores until the Bayes Error Rate is minimal. In this case, because the scores were generated as log-likelihood ratios, this minimum is more-or-less equal to the actual ber. The black dotted line is the ber of the trivial classifier, which bases its decision on the prior alone (i.e., it always decides true if plo > 0, and false otherwise).

You may appreciate this by considering scores that are just twice as big---when we draw the APE plot, the Bayes Error is higher than the minimum value. This is an example of mis-calibrated log-likelihood-ratio scores.


In [25]:
r2 = roc(2tar, 2non)
apeplot(r2)


Out[25]:
[APE plot of r2: Bayes error rate for the doubled scores]

For abs(plo) > 2.5, the classifier's log-likelihood-ratio scores are so far off that Bayes decisions are taken that lead to a higher error than that of the trivial classifier!

This whole curve of calibration performance as a function of prior can be summarized as proportional to the integral under the red curve, a quantity known as the "Cost of the log-likelihood-ratio", cllr:


In [26]:
cllr(2tar, 2non)


Out[26]:
0.6340853728577864
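For reference, this corresponds to the standard cllr definition: the mean logarithmic cost, in bits, over target and non-target scores. A minimal sketch (the helper mycllr is hypothetical, written here for illustration; on recent Julia versions mean requires using Statistics):

mycllr(t, n) = (mean([log2(1 + exp(-x)) for x in t]) +
                mean([log2(1 + exp(x)) for x in n])) / 2
mycllr(2tar, 2non)   ## should reproduce the cllr value above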

In a similar way, the area under the green curve can be computed, and this is a summary of the discrimination performance of the classifier, in the same units (which happen to be "bits") as cllr:


In [27]:
mincllr(2tar, 2non)


Out[27]:
0.5103042431991219

Another way of viewing the same calibration-over-effective-prior analysis, one that focuses on the utility of the classifier over the trivial classifier, is to show the normalized Bayes error rate:


In [28]:
nbeplot(r)


Out[28]:
[NBE plot of r: normalized Bayes error rate against prior log-odds]

Again, green and red are the minimum and actual normalized Bayes error rates. The two dashed/dotted curves correspond to the contributions of the false positives and the false negatives to the total normalized error rate. The range of plo is chosen asymmetrically, because there often are many more non-target trials than target trials, and hence the accuracy of the metric is better on that side.

Finally, the last graph in this workbook: a plot that shows the relation between the actual scores and the optimally calibrated scores (corresponding to the minimum curves in nbe and ber). For well-calibrated log-likelihood-ratio scores this should be a straight line of unit slope. Let's see what happens for our over-confident llr scores:


In [29]:
llrplot(r2)


Out[29]:
[llr plot of r2: actual scores against optimally calibrated llr scores]

Mmm. There is still some work to be done in scaling the graphs. More to come.

