Part 3: Non-parametric Bayesian tests

^{http://ipg.idsia.ch/tutorials/2016/bayesian-tests-ml/}

Alessio Benavoli, IDSIA

http://people.idsia.ch/~alessio/

Nonparametric analysis

T-test, correlated t-test are part of parametric statistics

In ML we use these nonparametric tests for

comparing classifiers
many other purposes (e.g., features, graph of a BN)

What is nonparametric statistics?

Frequentist definition

A nonparametric procedure is a statistical procedure that has a certain desirable properties that hold under relatively mild assumptions regarding the underlying populations from which the data are obtained.

_{_{Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. Nonparametric statistical methods. John Wiley and Sons, 2013.}</sub>}

Confused

Frequentist definition

The term nonparametric is imprecise. The related term distribution-free has a more precise meaning...The distribution-free property enables one to obtain the distribution of the statistic under the null hypothesis without specifying the underlying distribution of the data

_{_{Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. Nonparametric statistical methods. John Wiley and Sons, 2013.}</sub>}

comparing classifiers:

$m$ runs of the $k$-fold cross-validation accuracy for each dataset for nbc and for aode

How do we do that with NHST?

There is no direct NHST

that takes that table as input and
returns as output a statistical decision about which classifier is better in all the datasets.

NHST procedure 1/2

The usual way to address this problem with NHST is:

compute the mean difference of accuracy for each dataset (averaging the differences of accuracies obtained in the $m$ runs of the $k$-fold cross-validation);

Data

NHST procedure 1/2

NHST procedure 2/2

2) perform NHST to establish if the two classifiers have different performance or not based on these mean differences of accuracy.

Suggested (Nonparametric) Tests:

Wilcoxon signed-rank test or Sign test

Why?

no assumption of normality (distribution-free);
no assumption of commensurability across datasets;
robust to outliers.
_{_{J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research 2006}}



In [1]:

    
using Distributions, Gadfly, Compose, DataFrames, Fontconfig,  Cairo
include("/home/benavoli/Data/Work_Julia/Julia/Plots/plot_data1.jl")
include("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Tests/Bsigntest.jl")
ClassID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierID.dat", ',')
ClassNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierNames.dat", ',')
DatasetID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetID.dat", ',');
DatasetNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetNames.dat", ',');
Percent_correct = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/Percent_correct.dat", ',');

cl1a=1 #nbc
cl2a=2 #aode
println("Comparison of ", ClassNames[cl1a,1], " vs. ", ClassNames[cl2a,1])
println()

#Compute mean accuracy
indi=find(x->x==cl1a,ClassID)
indj=find(x->x==cl2a,ClassID)
accNbc=Float64[]
accAode=Float64[]
for d=1:Int32(maximum(DatasetID))
    indd=find(x->x==d,DatasetID)
    indid=intersect(indi,indd)
    indjd=intersect(indj,indd)
    push!(accNbc,mean(Percent_correct[indid])/100)
    push!(accAode,mean(Percent_correct[indjd])/100)
end









    



Comparison of nbc vs. aode







    



WARNING: New definition 
    +(AbstractArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury) at /home/benavoli/.julia/v0.4/WoodburyMatrices/src/SymWoodburyMatrices.jl:106
is ambiguous with: 
    +(DataArrays.DataArray, AbstractArray) at /home/benavoli/.julia/v0.4/DataArrays/src/operators.jl:276.
To fix, define 
    +(DataArrays.DataArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury)
before the new definition.
WARNING: New definition 
    +(AbstractArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury) at /home/benavoli/.julia/v0.4/WoodburyMatrices/src/SymWoodburyMatrices.jl:106
is ambiguous with: 
    +(DataArrays.AbstractDataArray, AbstractArray) at /home/benavoli/.julia/v0.4/DataArrays/src/operators.jl:300.
To fix, define 
    +(DataArrays.AbstractDataArray{T<:Any, 2}, WoodburyMatrices.SymWoodbury)
before the new definition.

Wilcoxon signed-rank test

Typically used as follows.

null hypothesis is that the median of the distribution is 0;
when the test rejects the null, it claims that it is significantly different from 0.

Assumptions

$z_1,\dots,z_q$ i.i.d. and generated from a symmetric distribution
The test is miscalibrated if the distribution is not symmetric.

Wilcoxon signed-rank test

The test statistic is: $$ \begin{array}{rcl} t=& \sum\limits_{1\leq i \leq j \leq q} t^+_{ij}, \end{array} $$

where

$$ t^+_{ij}=\left\{\begin{array}{ll} 1 & \textit{if } z_i \geq -z_j,\\ 0 & \textit{otherwise. } \\ \end{array}\right. $$

Example: consider the two cases ${z}=\{-2,-1,4,5\}$ or ${z}=\{-1,4,5\}$, statistic is $t=7$ and, respectively, $t=5$.

$p$-value of the statistic is computed against the null (zero median).

NBC - AODE



In [8]:

    
using HypothesisTests
p=plot_data1(cl1a,cl2a,"all datasets",accNbc-accAode,-0.15,0.1)
display(p) 
p1=pvalue(SignedRankTest(accNbc,accAode))
p2=pvalue(SignTest(accNbc,accAode))
@printf "p-value SignedRank=%0.8f    " p1 
@printf "p-value Sign=%0.8f\n" p2









    













    



p-value SignedRank=0.00000163    p-value Sign=0.00000040



In [3]:

    
ClassID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierID.dat", ',')
ClassNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierNames.dat", ',')
DatasetID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetID.dat", ',');
DatasetNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetNames.dat", ',');
Percent_correct = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/Percent_correct.dat", ',');

cl1b=1 #nbc
cl2b=3 #aode
println("Comparison of ", ClassNames[cl1b,1], " vs. ", ClassNames[cl2b,1])
println()

#Compute mean accuracy
indi=find(x->x==cl1b,ClassID)
indj=find(x->x==cl2b,ClassID)
accNbc=Float64[]
accHnb=Float64[]
for d=1:Int32(maximum(DatasetID))
    indd=find(x->x==d,DatasetID)
    indid=intersect(indi,indd)
    indjd=intersect(indj,indd)
    push!(accNbc,mean(Percent_correct[indid])/100)
    push!(accHnb,mean(Percent_correct[indjd])/100)
end









    



Comparison of nbc vs. hnb

NBC-HNB



In [4]:

    
using HypothesisTests
p1=pvalue(SignedRankTest(accNbc,accHnb))
p2=pvalue(SignTest(accNbc,accHnb))
p=plot_data1(cl1b,cl2b,"all datasets",accNbc-accHnb,-0.15,0.1)
display(p) 
@printf "p-value SignedRank=%0.4f  " p1 
@printf "p-value Sign=%0.4f\n" p2









    













    



p-value SignedRank=0.0005  p-value Sign=0.0038



In [5]:

    
ClassID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierID.dat", ',')
ClassNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/ClassifierNames.dat", ',')
DatasetID = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetID.dat", ',');
DatasetNames = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/DatasetNames.dat", ',');
Percent_correct = readdlm("/home/benavoli/Data/Work_Julia/Tutorial/Julia/Data/Percent_correct.dat", ',');

cl1b=4 #nbc
cl2b=5 #aode
println("Comparison of ", ClassNames[cl1b,1], " vs. ", ClassNames[cl2b,1])
println()

#Compute mean accuracy
indi=find(x->x==cl1b,ClassID)
indj=find(x->x==cl2b,ClassID)
accJ48=Float64[]
accJ48gr=Float64[]
for d=1:Int32(maximum(DatasetID))
    indd=find(x->x==d,DatasetID)
    indid=intersect(indi,indd)
    indjd=intersect(indj,indd)
    push!(accJ48,mean(Percent_correct[indid])/100)
    push!(accJ48gr,mean(Percent_correct[indjd])/100)
end









    



Comparison of j48 vs. j48gr

J48 - J48gr (only 30 datasets)



In [6]:

    
using HypothesisTests
p1=pvalue(SignedRankTest(accJ48[20:50],accJ48gr[20:50]))
p2=pvalue(SignTest(accJ48[20:50],accJ48gr[20:50]))
p=plot_data1(cl1b,cl2b,"all datasets",accJ48[20:50]-accJ48gr[20:50],-0.013,0.011)
display(p)  
@printf "p-value SignedRank=%0.4f  " p1 
@printf "p-value Sign=%0.4f\n" p2









    













    



p-value SignedRank=0.0822  p-value Sign=0.6636

J48 - J48gr all datasets (cheating?)



In [7]:

    
using HypothesisTests
p1=pvalue(SignedRankTest(accJ48,accJ48gr))
p2=pvalue(SignTest(accJ48,accJ48gr))
p=plot_data1(cl1b,cl2b,"all datasets",accJ48-accJ48gr,-0.013,0.011)
display(p)  
@printf "p-value SignedRank=%0.4f  " p1 
@printf "p-value Sign=%0.4f\n" p2









    













    



p-value SignedRank=0.0006  p-value Sign=0.0095

Same problems

Black and White thinking

reasoning in the opposite direction (data|hypothesis) and not (hypothesis|data)
blind decisions $p<\alpha$
- magnitude of the effect size,
- the uncertainty,
- statistical significance vs. practical significance
etcetera.

Part 3: Non-parametric Bayesian tests

Alessio Benavoli, IDSIA

http://people.idsia.ch/~alessio/

Nonparametric analysis

What is nonparametric statistics?

Frequentist definition

Confused

Frequentist definition

comparing classifiers:

How do we do that with NHST?

NHST procedure 1/2

Data

NHST procedure 1/2

NHST procedure 2/2

Wilcoxon signed-rank test

Wilcoxon signed-rank test

NBC - AODE

NBC-HNB

J48 - J48gr (only 30 datasets)

J48 - J48gr all datasets (cheating?)

Same problems

Black and White thinking

A colour thinking

Comparison through Bayesian analysis: