Bayesian Hierarchical Test

Module hierarchical in bayesiantests compares the performance of two classifiers that have been assessed by m-runs of k-fold cross-validation on q datasets. It returns probabilities that, based on the measured performance, one model is better than another or vice versa or they are within the region of practical equivalence.

This notebook demonstrates the use of the module.

We will load the 10-folds, 10-runs cross-validation results of the naive Bayesian classifier and HNB on 54 UCI datasets from the file Data/diffNbcHnb.csv. The file includes the differences of accuracies NBC-HNB in the $10 \times 10=100$ (columns) ripetions and 54 datasets (rows). A total of 5400 values.



In [1]:

    
import numpy as np
scores = np.loadtxt('Data/diffNbcHnb.csv', delimiter=',')
names = ("HNB", "NBC")
print(scores)









    



[[-0.02222222  0.01111111 -0.02222222 ..., -0.02222222 -0.04494382  0.        ]
 [ 0.         -0.08695652  0.04347826 ..., -0.09090909  0.         -0.09090909]
 [ 0.06451613  0.03225806  0.         ...,  0.03333333  0.03333333  0.1       ]
 ..., 
 [ 0.01428571  0.          0.         ...,  0.02857143  0.01428571  0.        ]
 [-0.02684564 -0.02684564  0.00671141 ..., -0.02027027  0.01351351
   0.01351351]
 [-0.18181818  0.          0.         ..., -0.2        -0.1         0.        ]]

To analyse this data, we will use the function hierarchical in the module bayesiantests that accepts the following arguments.

scores: a 2-d array of differences.
rope: the region of practical equivalence. We consider two classifiers equivalent if the difference in their      
      performance is smaller than rope.
rho: correlation due to cross-validation
names: the names of the two classifiers; if x is a vector of differences, positive values mean that the second 
       (right) model had a higher score.

The hierarchical function uses STAN through Python module pystan.

Summarizing probabilities

Function hierarchical(scores,rope,rho, verbose, names=names) computes the Bayesian hierarchical test and returns the probabilities that the difference (the score of the first classifier minus the score of the first) is negative, within rope or positive.



In [8]:

    
import bayesiantests as bt    
rope=0.01 #we consider two classifers equivalent when the difference of accuracy is less that 1%
rho=1/10 #we are performing 10 folds, 10 runs cross-validation
pleft, prope, pright=bt.hierarchical(scores,rope,rho)

The first value (left) is the probability that the the differences of accuracies is negative (and, therefore, in favor of HNB). The third value (right) is the probability that the the differences of accuracies are positive (and, therefore, in favor of NBC). The second is the probability of the two classifiers to be practically equivalent, i.e., the difference within the rope.

In the above case, the HNB performs better than naive Bayes with a probability of 0.9965, and they are practically equivalent with a probability of 0.002. Therefore, we can conclude with high probability that HNB is better than NBC.

If we add arguments verbose and names, the function also prints out the probabilities.



In [9]:

    
pl, pe, pr=bt.hierarchical(scores,rope,rho, verbose=True, names=names)









    



P(HNB > NBC) = 0.998, P(rope) = 0.002, P(NBC > HNB) = 0.0

The posterior distribution can be plotted out:

using the function hierarchical_MC(scores,rope,rho, names=names) we generate the samples of the posterior
using the function plot_posterior(samples,names=('C1', 'C2')) we then plot the posterior in the probability simplex



In [10]:

    
%matplotlib inline
import matplotlib.pyplot as plt

samples=bt.hierarchical_MC(scores,rope,rho, names=names)

#plt.rcParams['figure.facecolor'] = 'black'

fig = bt.plot_posterior(samples,names)
plt.savefig('triangle_hierarchical.png',facecolor="black")
plt.show()

It can be seen that the posterior mass is in the region in favor of HNB and so it confirms that the classifier is better than NBC. From the posterior we have also an idea of the magnitude of the uncertainty and the "stability" of our inference.

Checking sensitivity to the prior

To functions hierarchical allow also to test the effect of the prior hyperparameters. We point to the last reference for a discussion about prior sensitivity.

References

@ARTICLE{bayesiantests2016, author = {{Benavoli}, A. and {Corani}, G. and {Demsar}, J. and {Zaffalon}, M.}, title = "{Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis}", journal = {ArXiv e-prints}, archivePrefix = "arXiv", eprint = {1606.04316}, url={https://arxiv.org/abs/1606.04316}, year = 2016, month = jun }

@article{corani2016unpub, title = { Statistical comparison of classifiers through Bayesian hierarchical modelling}, author = {Corani, Giorgio and Benavoli, Alessio and Demsar, Janez and Mangili, Francesca and Zaffalon, Marco}, url = {http://ipg.idsia.ch/preprints/corani2016b.pdf}, year = {2016}, date = {2016-01-01}, institution = {technical report IDSIA}, keywords = {}, pubstate = {published}, tppubtype = {article} }