t-test for comparison of accuracies of two algorithms

(C) 2017 by Damir Cavar

Version: 1.0, January 2017

This is a tutorial related to the discussion of evaluation of machine learning algorithms and classifiers using simple significance tests.

This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.

Using the t-test on two distributions

The task is to compare the accuracy scores of two algorithms over a series of experiments. Imagine that we test two algorithms $a$ and $b$ on the same training and test sets of data. We will apply the t-test as provided in the $stats$ module of $scipy$, so we need to import this module first:

In [36]:
from scipy import stats

Imagine that these are our results from two independent algorithms trained and tested on the same pairs of training and test data sets:

In [37]:
a = [23, 43, 12, 10]
b = [23, 42, 13, 10]

The data sets are the same in both experiments. We could treat the algorithms as two different tests on the same population (of data). The t-test measures whether the average scores differ significantly. We apply the t-test for two related samples of scores as provided in the $stats$ module:

In [38]:
stats.ttest_rel(a, b)
Ttest_relResult(statistic=0.0, pvalue=1.0)

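The returned result contains a $pvalue$ (p-value) of 1.0 in this case. For the related-samples t-test, the Null Hypothesis is that the mean difference between the paired scores in $a$ and $b$ is zero, that is, that the two algorithms perform equally well on average. The p-value is the probability of observing a difference at least as large as the one in our data if this Null Hypothesis were true. A p-value of 1.0 means that our results are fully compatible with the Null Hypothesis: the differences between $a$ and $b$ cancel out exactly, so we have no evidence at all for rejecting it.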
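To see where a statistic of 0.0 comes from, the paired t-statistic can be computed by hand from the per-data-set differences. The following is a minimal sketch, assuming the standard paired t-test formula $t = \bar{d} / (s_d / \sqrt{n})$; the helper name `paired_t_statistic` is hypothetical and not part of scipy.

```python
import math
from scipy import stats


def paired_t_statistic(x, y):
    """Hypothetical helper: mean paired difference divided by its standard error."""
    d = [xi - yi for xi, yi in zip(x, y)]  # per-data-set differences
    n = len(d)
    mean_d = sum(d) / n
    # sample variance of the differences (n - 1 in the denominator)
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)


a = [23, 43, 12, 10]
b = [23, 42, 13, 10]

manual_t = paired_t_statistic(a, b)
scipy_t = stats.ttest_rel(a, b).statistic
print(manual_t, scipy_t)
```

The differences here are 0, 1, -1 and 0, so their mean is zero and the statistic is zero regardless of the standard error.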

Imagine now that the experimental results are:

In [39]:
a = [23, 43, 12, 10]
b = [4, 15, 3, 9]

Applying the t-test again will give us a different result:

In [41]:
stats.ttest_rel(b, a)
Ttest_relResult(statistic=-2.4238865567066052, pvalue=0.093841791155051452)

In this case, the p-value tells us that, if the Null Hypothesis were true, we would observe a difference at least this large in approx. 9% of cases. Remember, the Null Hypothesis is that the mean scores of the two algorithms do not differ. Had we set a significance threshold of 10% before running the experiment, we would reject the Null Hypothesis; with the more common threshold of 5% we would not.
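The decision against a threshold can be sketched in a few lines. This is only an illustration: the function name `is_significant` and the parameter `alpha` are assumptions introduced here, and the thresholds 0.10 and 0.05 are just the two values discussed above.

```python
from scipy import stats


def is_significant(x, y, alpha):
    """Hypothetical helper: True if the paired t-test p-value is below alpha."""
    result = stats.ttest_rel(x, y)
    return bool(result.pvalue < alpha)


a = [23, 43, 12, 10]
b = [4, 15, 3, 9]

reject_at_10 = is_significant(a, b, alpha=0.10)  # threshold of 10%
reject_at_05 = is_significant(a, b, alpha=0.05)  # threshold of 5%
print(reject_at_10, reject_at_05)
```

The threshold should be chosen before looking at the results; picking it afterwards to fit the observed p-value defeats the purpose of the test.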

In [42]:
a = [74, 89, 88, 78]
b = [24, 2, 3, 9]

In [43]:
stats.ttest_rel(a, b)
Ttest_relResult(statistic=8.4725342868682922, pvalue=0.0034519815681217179)
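The p-value in this last result can be reproduced from the statistic itself. The sketch below assumes a two-sided test and a t-distribution with $n - 1$ degrees of freedom, where $n$ is the number of paired scores; the variable names `df` and `p_manual` are chosen here for illustration.

```python
from scipy import stats

a = [74, 89, 88, 78]
b = [24, 2, 3, 9]

result = stats.ttest_rel(a, b)

# degrees of freedom for a paired test: number of pairs minus one
df = len(a) - 1

# two-sided p-value: probability mass in both tails beyond |t|
p_manual = 2 * stats.t.sf(abs(result.statistic), df)
print(result.pvalue, p_manual)
```

With only four pairs the t-distribution has heavy tails, so even a statistic above 8 leaves a p-value of roughly 0.003 rather than something vanishingly small.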