Part 3: Non-parametric Bayesian tests

http://ipg.idsia.ch/tutorials/2016/bayesian-tests-ml/


Alessio Benavoli, IDSIA

http://people.idsia.ch/~alessio/



Nonparametric analysis

T-test, correlated t-test are part of parametric statistics

In ML we use these nonparametric tests for

  1. comparing classifiers
  2. many other purposes (e.g., features, graph of a BN)

What is nonparametric statistics?

Frequentist definition

A nonparametric procedure is a statistical procedure that has a certain desirable properties that hold under relatively mild assumptions regarding the underlying populations from which the data are obtained.


Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. Nonparametric statistical methods. John Wiley and Sons, 2013.</sub>

Confused

Frequentist definition

The term nonparametric is imprecise. The related term distribution-free has a more precise meaning...The distribution-free property enables one to obtain the distribution of the statistic under the null hypothesis without specifying the underlying distribution of the data


Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. Nonparametric statistical methods. John Wiley and Sons, 2013.</sub>

comparing classifiers:

  • $m$ runs of the $k$-fold cross-validation accuracy for each dataset for nbc and for aode

How do we do that with NHST?

There is no direct NHST

  • that takes that table as input and
  • returns as output a statistical decision about which classifier is better in all the datasets.

NHST procedure 1/2

The usual way to address this problem with NHST is:

  1. compute the mean difference of accuracy for each dataset (averaging the differences of accuracies obtained in the $m$ runs of the $k$-fold cross-validation);

Data

NHST procedure 1/2

NHST procedure 2/2

2) perform NHST to establish if the two classifiers have different performance or not based on these mean differences of accuracy.

Suggested (Nonparametric) Tests:

  • Wilcoxon signed-rank test or Sign test

Why?

  1. no assumption of normality (distribution-free);
  2. no assumption of commensurability across datasets;
  3. robust to outliers.
    J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research 2006

In [1]:
using Distributions, Gadfly, Compose, DataFrames, Fontconfig,  Cairo
include("/home/benavoli/Data/Work_Julia/Julia/Plots/plot_data1.jl")
include("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Tests/Bsigntest.jl")
ClassID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierID.dat", ',')
ClassNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierNames.dat", ',')
DatasetID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetID.dat", ',');
DatasetNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetNames.dat", ',');
Percent_correct = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/Percent_correct.dat", ',');

cl1a=1 #nbc
cl2a=2 #aode
println("Comparison of ", ClassNames[cl1a,1], " vs. ", ClassNames[cl2a,1])
println()

#Compute mean accuracy
indi=find(x->x==cl1a,ClassID)
indj=find(x->x==cl2a,ClassID)
accNbc=Float64[]
accAode=Float64[]
for d=1:Int32(maximum(DatasetID))
    indd=find(x->x==d,DatasetID)
    indid=intersect(indi,indd)
    indjd=intersect(indj,indd)
    push!(accNbc,mean(Percent_correct[indid])/100)
    push!(accAode,mean(Percent_correct[indjd])/100)
end


Comparison of nbc vs. aode

WARNING: New definition 
    +(AbstractArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury) at /home/benavoli/.julia/v0.4/WoodburyMatrices/src/SymWoodburyMatrices.jl:106
is ambiguous with: 
    +(DataArrays.DataArray, AbstractArray) at /home/benavoli/.julia/v0.4/DataArrays/src/operators.jl:276.
To fix, define 
    +(DataArrays.DataArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury)
before the new definition.
WARNING: New definition 
    +(AbstractArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury) at /home/benavoli/.julia/v0.4/WoodburyMatrices/src/SymWoodburyMatrices.jl:106
is ambiguous with: 
    +(DataArrays.AbstractDataArray, AbstractArray) at /home/benavoli/.julia/v0.4/DataArrays/src/operators.jl:300.
To fix, define 
    +(DataArrays.AbstractDataArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury)
before the new definition.

Wilcoxon signed-rank test

Typically used as follows.

  • null hypothesis is that the median of the distribution is 0;
  • when the test rejects the null, it claims that it is significantly different from 0.

Assumptions

  • $z_1,\dots,z_q$ i.i.d. and generated from a symmetric distribution
  • The test is miscalibrated if the distribution is not symmetric.

Wilcoxon signed-rank test

The test statistic is: $$ \begin{array}{rcl} t=& \sum\limits_{1\leq i \leq j \leq q} t^+_{ij}, \end{array} $$

where

$$ t^+_{ij}=\left\{\begin{array}{ll} 1 & \textit{if } z_i \geq -z_j,\\ 0 & \textit{otherwise. } \\ \end{array}\right. $$

Example: consider the two cases ${z}=\{-2,-1,4,5\}$ or ${z}=\{-1,4,5\}$, statistic is $t=7$ and, respectively, $t=5$.

$p$-value of the statistic is computed against the null (zero median).

NBC - AODE


In [8]:
using HypothesisTests
p=plot_data1(cl1a,cl2a,"all datasets",accNbc-accAode,-0.15,0.1)
display(p) 
p1=pvalue(SignedRankTest(accNbc,accAode))
p2=pvalue(SignTest(accNbc,accAode))
@printf "p-value SignedRank=%0.8f    " p1 
@printf "p-value Sign=%0.8f\n" p2


DeltaAcc -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 -0.40 -0.39 -0.38 -0.37 -0.36 -0.35 -0.34 -0.33 -0.32 -0.31 -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 -0.4 -0.2 0.0 0.2 0.4 -0.40 -0.38 -0.36 -0.34 -0.32 -0.30 -0.28 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 -40 -30 -20 -10 0 10 20 30 40 50 60 70 -30 -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -30 0 30 60 -30 -28 -26 -24 -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60
p-value SignedRank=0.00000163    p-value Sign=0.00000040

In [3]:
ClassID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierID.dat", ',')
ClassNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierNames.dat", ',')
DatasetID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetID.dat", ',');
DatasetNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetNames.dat", ',');
Percent_correct = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/Percent_correct.dat", ',');

cl1b=1 #nbc
cl2b=3 #aode
println("Comparison of ", ClassNames[cl1b,1], " vs. ", ClassNames[cl2b,1])
println()

#Compute mean accuracy
indi=find(x->x==cl1b,ClassID)
indj=find(x->x==cl2b,ClassID)
accNbc=Float64[]
accHnb=Float64[]
for d=1:Int32(maximum(DatasetID))
    indd=find(x->x==d,DatasetID)
    indid=intersect(indi,indd)
    indjd=intersect(indj,indd)
    push!(accNbc,mean(Percent_correct[indid])/100)
    push!(accHnb,mean(Percent_correct[indjd])/100)
end


Comparison of nbc vs. hnb

NBC-HNB


In [4]:
using HypothesisTests
p1=pvalue(SignedRankTest(accNbc,accHnb))
p2=pvalue(SignTest(accNbc,accHnb))
p=plot_data1(cl1b,cl2b,"all datasets",accNbc-accHnb,-0.15,0.1)
display(p) 
@printf "p-value SignedRank=%0.4f  " p1 
@printf "p-value Sign=%0.4f\n" p2


DeltaAcc -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 -0.40 -0.39 -0.38 -0.37 -0.36 -0.35 -0.34 -0.33 -0.32 -0.31 -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 -0.4 -0.2 0.0 0.2 0.4 -0.40 -0.38 -0.36 -0.34 -0.32 -0.30 -0.28 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 -20 -15 -10 -5 0 5 10 15 20 25 30 35 -15.0 -14.5 -14.0 -13.5 -13.0 -12.5 -12.0 -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5 30.0 -20 0 20 40 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
p-value SignedRank=0.0005  p-value Sign=0.0038

In [5]:
ClassID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierID.dat", ',')
ClassNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierNames.dat", ',')
DatasetID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetID.dat", ',');
DatasetNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetNames.dat", ',');
Percent_correct = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/Percent_correct.dat", ',');

cl1b=4 #nbc
cl2b=5 #aode
println("Comparison of ", ClassNames[cl1b,1], " vs. ", ClassNames[cl2b,1])
println()

#Compute mean accuracy
indi=find(x->x==cl1b,ClassID)
indj=find(x->x==cl2b,ClassID)
accJ48=Float64[]
accJ48gr=Float64[]
for d=1:Int32(maximum(DatasetID))
    indd=find(x->x==d,DatasetID)
    indid=intersect(indi,indd)
    indjd=intersect(indj,indd)
    push!(accJ48,mean(Percent_correct[indid])/100)
    push!(accJ48gr,mean(Percent_correct[indjd])/100)
end


Comparison of j48 vs. j48gr

J48 - J48gr (only 30 datasets)


In [6]:
using HypothesisTests
p1=pvalue(SignedRankTest(accJ48[20:50],accJ48gr[20:50]))
p2=pvalue(SignTest(accJ48[20:50],accJ48gr[20:50]))
p=plot_data1(cl1b,cl2b,"all datasets",accJ48[20:50]-accJ48gr[20:50],-0.013,0.011)
display(p)  
@printf "p-value SignedRank=%0.4f  " p1 
@printf "p-value Sign=%0.4f\n" p2


DeltaAcc -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 -0.037 -0.036 -0.035 -0.034 -0.033 -0.032 -0.031 -0.030 -0.029 -0.028 -0.027 -0.026 -0.025 -0.024 -0.023 -0.022 -0.021 -0.020 -0.019 -0.018 -0.017 -0.016 -0.015 -0.014 -0.013 -0.012 -0.011 -0.010 -0.009 -0.008 -0.007 -0.006 -0.005 -0.004 -0.003 -0.002 -0.001 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 -0.04 -0.02 0.00 0.02 0.04 -0.038 -0.036 -0.034 -0.032 -0.030 -0.028 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 -250 0 250 500 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500
p-value SignedRank=0.0822  p-value Sign=0.6636

J48 - J48gr all datasets (cheating?)


In [7]:
using HypothesisTests
p1=pvalue(SignedRankTest(accJ48,accJ48gr))
p2=pvalue(SignTest(accJ48,accJ48gr))
p=plot_data1(cl1b,cl2b,"all datasets",accJ48-accJ48gr,-0.013,0.011)
display(p)  
@printf "p-value SignedRank=%0.4f  " p1 
@printf "p-value Sign=%0.4f\n" p2


DeltaAcc -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 -0.037 -0.036 -0.035 -0.034 -0.033 -0.032 -0.031 -0.030 -0.029 -0.028 -0.027 -0.026 -0.025 -0.024 -0.023 -0.022 -0.021 -0.020 -0.019 -0.018 -0.017 -0.016 -0.015 -0.014 -0.013 -0.012 -0.011 -0.010 -0.009 -0.008 -0.007 -0.006 -0.005 -0.004 -0.003 -0.002 -0.001 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 -0.04 -0.02 0.00 0.02 0.04 -0.038 -0.036 -0.034 -0.032 -0.030 -0.028 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 -500 -480 -460 -440 -420 -400 -380 -360 -340 -320 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880 900 920 940 960 980 1000 -500 0 500 1000 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000
p-value SignedRank=0.0006  p-value Sign=0.0095

Same problems

Black and White thinking

  • reasoning in the opposite direction (data|hypothesis) and not (hypothesis|data)
  • blind decisions $p<\alpha$
    • magnitude of the effect size,
    • the uncertainty,
    • statistical significance vs. practical significance
  • etcetera.

A colour thinking

Comparison through Bayesian analysis: