"An investigation of the false discovery rate and the misinterpretation of p-values" by David Colquhoun published in Royal Society Open Science 2014

Some general settings

In [1]:
%matplotlib inline
from pylab import *
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.power import TTestIndPower

In [2]:
def norm_pdf(x, mu, sd):
    dist = stats.norm(mu, sd)
    return dist.pdf(x)

In [3]:
def run_simulation(mean, sigma, nobs, nsim=10000):
    """Simulate nsim two-sample experiments; return the observed differences
    between means and the Welch t-test p-values."""
    pval, diff = np.zeros(nsim), np.zeros(nsim)
    for r in range(nsim):
        # diagonal covariance: the two samples are drawn independently
        s1, s2 = np.random.multivariate_normal(mean, sigma, nobs).T
        t, p = stats.ttest_ind(s1, s2, equal_var=False, nan_policy='omit')
        diff[r] = np.mean(s1) - np.mean(s2)
        pval[r] = p
    return diff, pval

In [4]:
def produce_power_and_mean_effect_size(nobs):
    Psig = 0.05
    mean = np.array([0.0, 1.0])
    sd = 1.0
    sigma = np.array([[sd**2, 0.0], [0.0, sd**2]])  # covariance matrix
    diff, pval = run_simulation(mean, sigma, nobs)
    nsig = pval[pval <= Psig].size
    # sign flipped so the mean 'significant' difference is reported as a positive number
    md = -np.sum(diff[pval <= Psig]) / nsig
    # effect_size is negative here; for a two-sided test the power is unchanged by the sign
    effect_size = (mean[0] - mean[1]) / sd
    power_analysis = TTestIndPower()
    pw = power_analysis.power(effect_size, nobs, Psig)
    print('n=', nobs, 'power=', pw, 'meandiff=', md)
    return pw, md

In [5]:
nsim = 10000

Figure 3. Results of simulated t-tests where the null hypothesis is true. Two groups of observations have identical true means and SD of 1.


In [6]:
mean0 = np.array([0.0, 0.0])
sd0 = 1.0
sigma0 = np.array([[sd0**2, 0.0], [0.0, sd0**2]]) #covariance matrix
diff0, pval0 = run_simulation(mean0, sigma0, 16, nsim=nsim)

A, The distribution of 'observed' differences between means. It is centered on zero because the true difference between the means is zero.


In [7]:
plt.hist(diff0, bins=20);
xlabel('difference between means')
ylabel('frequency');


B, The distribution of the p-values is flat (uniform).


In [8]:
plt.hist(pval0, bins=20);
xlabel('value of P')
ylabel('frequency');
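
The flatness can be checked more formally. The cell below is not part of the original notebook; it is a small sketch applying a one-sample Kolmogorov-Smirnov test of the simulated p-values against the uniform distribution on [0, 1], which is what they should follow when the null hypothesis is true.

In [ ]:
# Sanity check (illustrative, not in the original analysis): under the null
# hypothesis the p-values should be Uniform(0, 1).
ks_stat, ks_p = stats.kstest(pval0, 'uniform')
print('KS statistic = {0:.4f}, KS p-value = {1:.3f}'.format(ks_stat, ks_p))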



In [9]:
Psig = 0.05
nsig0 = pval0[pval0 <= Psig].size
print("{0:.0f}% of 'significant' differences for 'significance' threshold P = {1:.2f}".
     format(100*nsig0/nsim, Psig) + 
     '\nAll these are false positives.')


5% of 'significant' differences for 'significance' threshold P = 0.05
All these are false positives.

Figure 4. The case where the null hypothesis is not true.


In [10]:
mu1, sd1 = 0.0, 1.0 # mean and SD of control group
mu2, sd2 = 1.0, 1.0 # mean and SD of treatment group
n1, n2 = 16, 16     # number of observations per sample

In [11]:
xmin = mu1 - 4 * sd1
xmax = mu1 + 4 * sd1
increase = (xmax - xmin) / 100
x = np.arange(xmin, xmax, increase)

A, Observations in the control group (blue) and the treatment group (red) are normally distributed with means that differ by 1 SD.


In [12]:
plot(x, norm_pdf(x, mu1, sd1), 'b-')
plot(x, norm_pdf(x, mu2, sd2), 'r-')
xlabel('distributions of observations')
ylabel('probability density');


B, The distribution of means of 16 observations.


In [13]:
plot(x, norm_pdf(x, mu1, sd1/sqrt(n1)), 'b-')
plot(x, norm_pdf(x, mu2, sd2/sqrt(n1)), 'r-')
xlabel('distributions of means (16 observations)')
ylabel('probability density');


The standard deviation of means of 16 observations is:


In [14]:
print('Standard deviation of means = ', sd1/sqrt(n1))


Standard deviation of means =  0.25
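
As a quick empirical check (a sketch, not part of the original notebook), simulating many samples of 16 observations and taking the standard deviation of their means should reproduce the theoretical value sd/sqrt(n) = 0.25.

In [ ]:
# Empirical check (illustrative): spread of means of n1 = 16 observations.
sample_means = np.random.normal(mu1, sd1, size=(10000, n1)).mean(axis=1)
print('Empirical SD of means    =', sample_means.std(ddof=1))
print('Theoretical sd1/sqrt(n1) =', sd1 / sqrt(n1))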

In [15]:
effect_size = (mu2 - mu1) / sd1
print('Effect size=', effect_size)


Effect size= 1.0

In [16]:
power_analysis = TTestIndPower()
pwr = power_analysis.power(effect_size, n1, Psig)
print('Power =', pwr, 'for P =', Psig)


Power = 0.7813977845875587 for P = 0.05
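
As a cross-check (not in the original notebook), the same power calculation can be run in reverse: solve_power asks how many observations per group are needed to reach 80% power for this effect size at P = 0.05, which should come out at roughly 17.

In [ ]:
# Inverse power calculation (illustrative cross-check).
n_needed = power_analysis.solve_power(effect_size=effect_size, alpha=Psig, power=0.8)
print('Observations per group needed for 80% power:', n_needed)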

Figure 5. Results of simulated t-tests in the case where the null hypothesis is not true.


In [17]:
mean = np.array([mu1, mu2])
sigma = np.array([[sd1**2, 0.0], [0.0, sd2**2]]) #covariance matrix
nsim = 10000
diff, pval = run_simulation(mean, sigma, n1, nsim=nsim)

A, The distribution of 'observed' differences between means of 16 observations.


In [18]:
plt.hist(diff, bins=20);
xlabel('difference between means')
ylabel('frequency');


B, The distribution of p-values.


In [19]:
plt.hist(pval, bins=20);
xlabel('value of P')
ylabel('frequency');



In [20]:
nsig = pval[pval <= Psig].size
print("{0:.0f}% of 'significant' differences for 'significance' threshold P = {1:.2f}".
     format(100*nsig/nsim, Psig) + 
     '\nas expected from the calculated power {0:.3f}'.format(pwr))


78% of 'significant' differences for 'significance' threshold P = 0.05
as expected from the calculated power 0.781
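
These two simulations provide the ingredients for the paper's false discovery rate argument: a 5% chance of a 'significant' result when the null hypothesis is true (figure 3) and roughly 78% power when it is not. The sketch below is not part of the original notebook; it combines them under the illustrative assumption, as in the paper's worked example, that only 10% of the hypotheses tested have a real effect.

In [ ]:
# False discovery rate sketch. The 10% prevalence of real effects is an
# assumed, illustrative figure, not a simulated quantity.
prevalence = 0.1                       # fraction of tested hypotheses with a real effect
false_pos = (1 - prevalence) * Psig    # expected fraction of tests: null true and 'significant'
true_pos = prevalence * pwr            # expected fraction of tests: real effect and 'significant'
fdr = false_pos / (false_pos + true_pos)
print('False discovery rate = {0:.2f}'.format(fdr))

With these numbers a little over a third of 'significant' results would be false positives, consistent with the paper's headline figure of 36% for a power of 0.8.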

In [21]:
# mean observed difference for experiments with P <= Psig
meandiff = np.sum(diff[pval <= Psig]) / nsig
print("\n","Observed difference between means for 'sig' results = ", 
      meandiff, " True value = ", mu1-mu2)


 Observed difference between means for 'sig' results =  -1.1294952182087719  True value =  -1.0

The measured effect size is too big. This happens because experiments that, by chance, have a larger-than-average effect size are more likely to be found 'significant' than those that happen to have a small effect size.
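
One way to see this selection effect (an illustrative plot, not in the original notebook) is to overlay the observed differences from all simulated experiments with those from the 'significant' subset only: the 'significant' experiments are concentrated in the tail of larger (here, more negative) differences.

In [ ]:
# Visualising the selection effect (illustrative).
plt.hist(diff, bins=20, alpha=0.5, label='all experiments')
plt.hist(diff[pval <= Psig], bins=20, alpha=0.5, label="'significant' only")
axvline(x=mu1 - mu2, color='r', label='true difference')
xlabel('difference between means')
ylabel('frequency')
legend();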

Figure 6. Distribution of p-values from tests like those in figure 5, but with only 4 observations in each group rather than 16.


In [22]:
diff4, pval4 = run_simulation(mean, sigma, 4, nsim=nsim)
plt.hist(pval4, bins=20)
xlabel('value of P')
ylabel('frequency');



In [23]:
pwr4 = power_analysis.power(effect_size, 4, Psig)
print('Power =', pwr4, 'for P =', Psig)


Power = 0.2231880306351374 for P = 0.05

In [24]:
nsig4 = pval4[pval4 <= Psig].size
print("{0:.0f}% of 'significant' differences for 'significance' threshold P = {1:.2f}".
     format(100*nsig4/nsim, Psig))


19% of 'significant' differences for 'significance' threshold P = 0.05

In [25]:
# mean observed difference for experiments with P <= Psig
meandiff4 = np.sum(diff4[pval4 <= Psig]) / nsig4
print("\n","Observed difference between means for 'sig' results = ", 
      meandiff4, " True value = ", mu1-mu2)


 Observed difference between means for 'sig' results =  -1.7920071777663231  True value =  -1.0

Figure 7. The average difference between means for 'significant' tests, plotted against power; the power was varied by changing the number of observations per group.


In [26]:
nobs_list = [4, 6, 8, 10, 12, 14, 16, 20, 50]
power_list, esize_list = [], []
for n in nobs_list:
    pw, md = produce_power_and_mean_effect_size(n)
    power_list.append(pw)
    esize_list.append(md)


n= 4 power= 0.2231880306351374 meandiff= 1.8126934359013769
n= 6 power= 0.3473536967037645 meandiff= 1.5696056839511257
n= 8 power= 0.4612388467243746 meandiff= 1.4083861462849916
n= 10 power= 0.5620066386966456 meandiff= 1.3042109544352138
n= 12 power= 0.6486425480589466 meandiff= 1.2230225897767233
n= 14 power= 0.7214197465133732 meandiff= 1.1670049429639395
n= 16 power= 0.7813977845875587 meandiff= 1.1293262020163095
n= 20 power= 0.8689530131730784 meandiff= 1.0781776579980178
n= 50 power= 0.9986074105881142 meandiff= 1.0009009488438663

In [27]:
plot(power_list, esize_list, 'bo')
axhline(y=1, color='r')
xlabel('power')
ylabel("mean effect size for 'sig' results");


