Comparing multiple classifiers through Bayesian analysis

This is a Python implementation of Bayesian tests for comparing the performance of classifiers (and, more generally, algorithms) assessed via cross-validation. The package bayesiantests includes the following tests:

  • correlated_ttest performs the correlated t-test on the performance of two classifiers that have been assessed by m runs of k-fold cross-validation on the same dataset. It returns the probabilities that, based on the measured performance, one model is better than the other, vice versa, or that the two lie within the region of practical equivalence (see the usage sketch after this list).
  • signtest computes the Bayesian equivalent of the sign test. It returns the probabilities that, based on the measured performance, one classifier is better than the other, vice versa, or that the two lie within the region of practical equivalence.
  • signrank computes the Bayesian equivalent of the Wilcoxon signed-rank test. It returns the probabilities that, based on the measured performance, one model is better than the other, vice versa, or that the two lie within the region of practical equivalence.
  • hierarchical compares the performance of two classifiers that have been assessed by m runs of k-fold cross-validation on q datasets, using a Bayesian hierarchical model.
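
As a quick illustration, here is a minimal usage sketch. The scores below are synthetic, and the argument names (rope, runs, verbose, names) follow the tutorial notebooks; the notebooks remain the authoritative reference for the exact signatures. Each test returns the three probabilities (p_left, p_rope, p_right).

```python
import numpy as np
import bayesiantests as bt

# Synthetic scores: in practice, each row would hold the mean
# cross-validated accuracy of two classifiers on one dataset.
np.random.seed(0)
scores = np.random.uniform(0.75, 0.95, size=(30, 2))  # 30 datasets, 2 classifiers

# Bayesian sign test across datasets: probabilities that the first
# classifier is better, that the two are practically equivalent
# (within the rope), or that the second classifier is better.
left, within, right = bt.signtest(scores, rope=0.01,
                                  verbose=True, names=("C1", "C2"))

# The Bayesian signed-rank test is invoked the same way.
left, within, right = bt.signrank(scores, rope=0.01,
                                  verbose=True, names=("C1", "C2"))

# Correlated t-test on a single dataset: takes the vector of score
# differences from m runs of k-fold cross-validation (here synthetic,
# e.g. 10 runs x 10 folds) plus the number of runs.
diffs = np.random.normal(0.01, 0.02, size=100)
left, within, right = bt.correlated_ttest(diffs, rope=0.01, runs=10,
                                          verbose=True, names=("C1", "C2"))
```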

We have written three notebooks that explain how to use these tests. In addition, the notebook The importance of the Rope uses an example to discuss why setting a region of practical equivalence matters.
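
As a small, hedged illustration of that point (again on synthetic scores), widening the rope shifts posterior mass away from the two "one classifier is better" outcomes and toward practical equivalence:

```python
import numpy as np
import bayesiantests as bt

np.random.seed(0)
scores = np.random.uniform(0.75, 0.95, size=(30, 2))  # synthetic scores

# A wider region of practical equivalence moves probability mass
# toward the "practically equivalent" outcome.
for rope in (0.001, 0.01, 0.05):
    left, within, right = bt.signtest(scores, rope=rope)
    print(f"rope={rope}: p_left={left:.3f}, p_rope={within:.3f}, p_right={right:.3f}")
```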