In [1]:
import numpy as np
scores_a = np.array([ 95.95, 71.4 , 83.34, 49.99, 76.17, 86.22, 84.45, 81.87, 52.81, 75.04, 71.94, 50.12, 72.03, 60. , 83.69])
scores_b = np.array([ 97.88, 71.66, 82.87, 50.71, 74.17, 86.68, 85.46, 82.02, 60.08, 75.83, 74.53, 45.76, 72.65, 60. , 84.31])
In [2]:
sum(scores_a > scores_b), sum(scores_a == scores_b), sum(scores_b > scores_a)
Out[2]:
In [3]:
# Disclaimer: This is not an exemplary use of Python.
# The code is written to be easy to follow for readers who are not used to Python.
from random import random

results = []
for match in range(1000):
    wins = 0
    for dataset in range(15):
        if random() < 0.5:
            wins += 1
    results.append(wins)
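For readers comfortable with numpy, a vectorized equivalent of the simulation above could look like this (a sketch, assuming a reasonably recent numpy with np.random.default_rng; the names are illustrative):
rng = np.random.default_rng()               # numpy's random generator
simulated = rng.random((1000, 15)) < 0.5    # 1000 matches x 15 coin flips
results_vec = simulated.sum(axis=1)         # number of wins per simulated match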
In [4]:
wins = np.bincount(results, minlength=16)  # counts of 0..15 wins per simulated match
print(wins)
In [5]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
_ = plt.bar(range(16), wins)
In [6]:
sum(wins[11:])
Out[6]:
Since such a result (11 or more wins out of 15) is so unlikely under the null-hypothesis, we reject the idea that A and B perform the same.
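Instead of simulating, the tail probability can be computed exactly; a minimal check with scipy's binomial distribution, assuming the same fair-coin model as the simulation:
import scipy.stats
p_tail = scipy.stats.binom.sf(10, 15, 0.5)   # P(11 or more wins out of 15 fair coin flips)
print(p_tail, p_tail * 1000)                 # the probability, and the expected count out of 1000 matches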
In [7]:
import scipy.stats
scipy.stats.wilcoxon(scores_a, scores_b)
Out[7]:
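A side note: one pair of scores is tied (60 vs. 60). By default scipy's wilcoxon discards zero differences; a sketch of one alternative, which keeps them in the ranking:
scipy.stats.wilcoxon(scores_a, scores_b, zero_method="pratt")   # keep zero differences in the ranking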
In [8]:
scores = np.array(
[[ 95.94595, 97.88477, 98.19664, 98.64226, 98.65337],
[ 71.39521, 71.6561 , 73.14819, 77.2214 , 77.26883],
[ 83.3377 , 82.87099, 81.95057, 77.1957 , 77.22797],
[ 49.98591, 50.70515, 51.71036, 49.49834, 49.53232],
[ 76.16673, 74.16672, 61.00001, 83.50006, 83.50006],
[ 86.21744, 86.68119, 84.88411, 86.28991, 86.40586],
[ 84.44935, 85.46383, 84.30431, 85.05801, 85.14497],
[ 81.87252, 82.02315, 80.12298, 80.21214, 80.41979],
[ 52.8114 , 60.08016, 63.39823, 64.20668, 64.19279],
[ 75.04 , 75.83 , 75.65 , 72.17 , 72.17 ],
[ 71.93504, 74.5346 , 73.77053, 72.31167, 72.31384],
[ 50.12499, 45.7625 , 38.66669, 39.66248, 40.04585],
[ 72.03123, 72.64526, 72.31939, 71.50976, 71.57642],
[ 60. , 60. , 56.9375 , 60. , 60. ],
[ 83.6875 , 84.3125 , 85.54998, 78.57087, 79.08754],
[ 84.36091, 84.42988, 84.80345, 80.51726, 80.58623],
[ 98.31118, 98.52327, 99.05625, 99.18611, 99.17557],
[ 89.40237, 91.0858 , 91.48102, 90.11586, 90.14524],
[ 93.33322, 93.06656, 92.06654, 93.59991, 93.59991],
[ 87.79086, 91.03252, 92.34689, 99.43702, 99.39943],
[ 87.59993, 88.43326, 91.59994, 85.79996, 85.39996],
[ 56.84723, 56.84723, 56.61866, 56.84723, 56.84723],
[ 85.09998, 86.86189, 85.57139, 76.37137, 77.24757],
[ 62.66532, 64.52785, 66.24374, 64.09357, 64.07691],
[ 74.63639, 84.63801, 100. , 97.80228, 97.80228],
[ 96.38929, 96.73215, 97.99556, 98.91727, 98.91727],
[ 95.76072, 99.9508 , 99.9631 , 100. , 100. ],
[ 90.29939, 92.73304, 94.24765, 97.18137, 97.18986],
[ 92.16903, 96.91635, 96.21347, 78.59252, 80.38254],
[ 93.37478, 96.92301, 96.67273, 96.60513, 96.54662],
[ 80.41671, 79.83337, 83.16672, 73.16674, 73.66674],
[ 87.72004, 97.76273, 97.66812, 88.61625, 89.56246],
[ 75.26002, 75.70347, 74.56967, 74.32345, 74.31062],
[ 68.11142, 66.77809, 65.1114 , 69.77806, 69.77806],
[ 47.19786, 47.87167, 47.8458 , 41.00794, 40.94911],
[ 91.14715, 95.06921, 96.47191, 94.14284, 94.09091],
[ 85.59372, 88.36932, 88.61455, 88.9546 , 88.9546 ],
[ 86.62217, 87.31057, 87.46307, 89.9792 , 90.01045],
[ 93.81435, 97.81069, 97.31636, 97.841 , 97.841 ],
[ 76.71185, 77.04996, 76.12854, 74.67613, 74.86423],
[ 92.19582, 93.30815, 94.6709 , 92.63445, 93.03048],
[ 89.82622, 93.11031, 92.27349, 92.85163, 93.22753],
[ 79.21932, 80.90305, 81.80477, 81.05974, 81.51275],
[ 95.41686, 96.11599, 96.13167, 94.16605, 94.15662],
[ 59.03334, 59.4 , 57.46667, 63.5 , 63.5 ],
[ 62.06667, 67.66667, 71.79999, 77.66667, 77.5 ],
[ 46.32085, 46.72085, 43.93333, 46.78335, 46.78335],
[ 58.78783, 75.69693, 84.60603, 76.53533, 76.68688],
[ 79.968 , 85.008 , 84.874 , 74.872 , 75.522 ],
[ 64.92861, 66.7381 , 77.90477, 60.95248, 60.95248],
[ 98.70582, 98.2058 , 97.58814, 90.72864, 90.94432],
[ 97.19638, 97.05376, 96.36779, 95.03683, 95.28029],
[ 57.61428, 57.81658, 57.73517, 57.0753 , 57.08878],
[ 93.98181, 94.66363, 99.6 , 92.60908, 92.60908]])
In [9]:
scipy.stats.friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2], scores[:, 3], scores[:, 4])
Out[9]:
In [10]:
scipy.stats.friedmanchisquare(*scores.T)
Out[10]:
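The Friedman test works on ranks, not on the scores themselves. A small sketch of the quantity it compares, the algorithms' mean ranks over the datasets (rank 1 = best; column order as in the scores matrix above):
from scipy.stats import rankdata
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, scores)   # rank within each dataset, higher score = better rank
print(ranks.mean(axis=0))                                            # mean rank of each of the five algorithms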
Null-hypothesis significance tests are a magic bullet.
Null-hypothesis significance tests automate science.
If the null-hypothesis was true, algorithm A should have won 7 matches out of 14.
Algorithm A did not win 7 matches.
=> The hypothesis is false.
All trees are green; I am not green; hence, I'm not a tree.
Null-hypothesis testing, however, is probabilistic:
If the null hypothesis is true, the number of wins is unlikely to be 11.
But the number of wins is 11.
=> So the null-hypothesis is unlikely to be true.
We reject the null-hypothesis (as unlikely).
If somebody is a US citizen, he is unlikely to be the US president.
But Barack Obama is the US president.
=> Barack Obama is not a US citizen.
We reject the claim that president Obama is a US citizen (as unlikely).
Birthers are right!
(Scientifically proven; p < 0.05.)
We reject the null-hypothesis (as being too improbable) when it is actually the observed data that was improbable.
We all know our Bayes.
$$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$$

and

$$P(H|D) \ne P(D|H)$$

$P(H|D)$ depends on the prior $P(H)$.
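To make the Obama example above concrete, here is a toy Bayes computation; the numbers are purely illustrative assumptions, not data:
p_president_given_citizen = 1 / 250_000_000           # assumed P(D|H): being president, given citizenship
p_citizen = 0.95                                      # assumed prior P(H): most people are citizens
p_president = p_president_given_citizen * p_citizen   # P(D), assuming only citizens can be president
p_citizen_given_president = p_president_given_citizen * p_citizen / p_president
print(p_citizen_given_president)                      # P(H|D) = 1.0, despite the tiny P(D|H)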
That is, we all know that the p-value is something we don't care about.
... except that it has to be < 0.05 so we can publish.
p-values are meaningless since they represent a probability that nobody cares about.
" (...) p <0.05, therefore the null-hypothesis is unlikely".
Wrong. It's not the hypothesis that's unlikely. Just the data.
" (...) p > 0.05, therefore our superfast method is no worse than the existing slow algorithm."
Wrong. You cannot prove the null. You can only reject it.
NHST is a tool and it's not the tool's fault that it's misused.
Objection. NHST is presented as a tool for researchers interested in whether their hypotheses are true or not.
If you bought a hammer because it was advertised as a tool for cutting meat, you'd send it back to Amazon.
... and write a nasty review.
Fisher: p-value as the evidence against the null-hypothesis.
Something weird is going on here...
Neyman & Pearson: making optimal decisions with respect to probabilities of mutually exclusive hypotheses.
Both approaches are correct.
P-values are not unlike likelihoods, which are related to probabilities.
Deciding between two hypotheses is totally not unlike decision making.
Their combination is unfortunately not.
0.05 is the magic $p$-value. Anything below that is success, anything above that is failure.
Nothing magic about 0.05.
No reason to print numbers related to $p < 0.05$ in bold print.
Not only is there no sacred critical value;
there is no way to reason about critical values at all.
P-values are meaningless.
How do you set a reasonable threshold for a meaningless value?
No, p = 0.05 doesn't mean we accept 5 % of false alternative hypotheses.
P-values are not related to probabilities of hypotheses.
NHST revolves around a meaningless threshold (0.05) for automating decisions about which findings are true and which are random.
No two classifiers are exactly the same (unless they are the same classifier).
Hence, we should always reject the null-hypothesis. We just need to collect enough data.
Why go through collecting all the data then, if we already know the end result?
Sorry for the spoiler, guys:
The null-hypothesis dies in Section 4.
An arbitrarily small difference between a pair of classifiers (the effect size) becomes statistically significant with enough data.
The p-value is a function of the effect size and the sample size.
The sample size is manipulated by the researcher. Now do the math.
The p-value is intuitively understood as an indicator of the effect size.
Wrong. It is a function of the effect size and the sample size: equal p-values do not imply equal effect sizes (see the sketch below).
Non-parametric tests make this even worse by ignoring the magnitude of the differences.
If one classifier consistently beats another, the size of the difference does not matter to the test.
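A sketch of the sample-size point on synthetic data: a fixed, tiny difference between two classifiers becomes arbitrarily significant as the number of datasets grows (the scores below are simulated only to illustrate the claim, not real results):
rng = np.random.default_rng(0)
for n_datasets in (15, 100, 1000, 10000):
    a = rng.normal(80, 5, n_datasets)              # synthetic scores of classifier A
    b = a + 0.1 + rng.normal(0, 1, n_datasets)     # B is better by a fixed, tiny 0.1 points (plus noise)
    print(n_datasets, scipy.stats.wilcoxon(a, b).pvalue)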
The list goes on and on.
Good luck with that.
Next year it will be 50 years since Meehl wrote:
Significance testing is a potent but sterile intellectual rake who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.
The American Statistical Association issued a statement to steer research into a ‘post p < 0.05 era’.
The journal Basic and Applied Social Psychology prohibited the use of the p-word.
(I believe that) NHST was useful while we did not have better tools.
But now we do.