The statistics of hypothesis tests can be thought of as observations of random variables from known distributions.
We do not prove the alternative hypothesis; we merely support it when we reject the null hypothesis.
In our running example, we want to test whether a coin is fair: the null hypothesis H0 is that the heads probability p equals 0.5, and the alternative H1 is that p ≠ 0.5.
Each coin flip is a Bernoulli trial, which means that X, the number of heads in n flips, is a Binomial(n, p) random variable, which we can approximate using the normal distribution:
In [9]:
from probability import normal_cdf, inverse_normal_cdf
import math, random
In [2]:
def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma
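As a quick illustrative sketch (simulate_binomial is just a throwaway helper, not something used elsewhere in this chapter), we can check these formulas against a simulation: the sample mean and standard deviation of many Binomial(1000, 0.5) draws should land close to mu and sigma.
In [ ]:
import math, random   # already imported above; repeated so this cell is self-contained

def simulate_binomial(n, p):
    """simulate a single Binomial(n, p) draw by summing n Bernoulli trials"""
    return sum(1 if random.random() < p else 0 for _ in range(n))

random.seed(0)
draws = [simulate_binomial(1000, 0.5) for _ in range(10000)]

sample_mean = sum(draws) / len(draws)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in draws) / len(draws))

mu, sigma = normal_approximation_to_binomial(1000, 0.5)
print(sample_mean, mu)     # both should be close to 500
print(sample_std, sigma)   # both should be close to 15.8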
Whenever a random variable follows a normal distribution, we can use normal_cdf to figure out the probability that its realized value lies within (or outside) a particular interval:
In [18]:
# the normal cdf _is_ the probability the variable is below a threshold
normal_probability_below = normal_cdf
In [13]:
# it's above the threshold if it's not below the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# it's between if it's less than hi, but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# it's outside if it's not between
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
We can also do the reverse: find the nontail region, or the (symmetric) interval around the mean, that accounts for a certain level of likelihood.
For example, if we want to find an interval centered at the mean and containing 60% probability, we find the cutoffs where the upper and lower tails each contain 20% of the probability (leaving 60% in the middle):
In [14]:
def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)
In [15]:
def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2

    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)

    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)

    return lower_bound, upper_bound
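For instance, the 60% example above works out like this for a standard normal (an illustrative check; the exact decimals depend on the precision of inverse_normal_cdf):
In [ ]:
# each tail holds 20%, so the central 60% of a standard normal
# lies roughly between -0.84 and +0.84
normal_two_sided_bounds(0.60)   # approximately (-0.84, 0.84)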
In particular, suppose we choose to flip the coin n = 1000 times. If our hypothesis of fairness is true, X should be distributed approximately normally with mean 500 and standard deviation about 15.8:
In [66]:
import numpy as np
np.sqrt(1000 * 0.5 * (1 - 0.5))   # 15.81
In [8]:
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
print("mu_0", mu_0)
print("sigma_0", sigma_0)
Assuming p really equals 0.5 (i.e., H0 is true), there is just a 5% chance we observe an X that lies outside the interval computed below, which is exactly the significance we wanted.
Said differently, if H0 is true, then approximately 19 times out of 20 this test will give the correct result.
In [12]:
print("95% bounds based on assumption p is 0.5")
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
print("lo", lo)   # approximately 469
print("hi", hi)   # approximately 531
Next we measure the power of the test: the probability of not making a type 2 error, i.e. of correctly rejecting H0 when it is false. To measure that, we have to specify what H0 being false means, so let's check what happens if p is really 0.55, in which case the coin is slightly biased toward heads.
In [16]:
print("actual mu and sigma based on p = 0.55")
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
print("mu_1", mu_1)       # 550.0
print("sigma_1", sigma_1) # about 15.73
In [18]:
# a type 2 error means we fail to reject the null hypothesis
# which will happen when X is still in our original interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability # 0.887
print("type 2 probability", type_2_probability)
print("power", power)
Imagine instead that our null hypothesis is that the coin is not biased toward heads, i.e. that p <= 0.5. In that case we want a one-sided test that rejects H0 only when X is much larger than 500:
In [19]:
print("one-sided test")
hi = normal_upper_bound(0.95, mu_0, sigma_0)
print("hi", hi) # is ~526 (< 531, since we need more probability in the upper tail)
type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
power = 1 - type_2_probability # = 0.936
print("type 2 probability", type_2_probability)
print("power", power)
This is a more powerful test, since it no longer rejects H0 when X is below 469 (which is very unlikely to happen if H1 is true) and instead rejects H0 when X is between 526 and 531 (which is somewhat likely to happen if H1 is true).
An alternative way of thinking about the preceding test involves p-values: instead of choosing bounds in advance, we compute the probability, assuming H0 is true, of seeing a value at least as extreme as the one we actually observed.
In [16]:
def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is above x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is below x
        return 2 * normal_probability_below(x, mu, sigma)
In [25]:
# If we were to see 530 heads, we would compute:
print("two_sided_p_value(529.5, mu_0, sigma_0) is ", \
two_sided_p_value(529.5, mu_0, sigma_0))
Why did we use 529.5 instead of 530? This is what's called a continuity correction.
It reflects the fact that normal_probability_between(529.5, 530.5, mu_0, sigma_0)
is a better estimate of the probability of seeing 530 heads than normal_probability_between(530, 531, mu_0, sigma_0)
is. One way to convince yourself that this is a sensible estimate is with a simulation:
In [35]:
def count_extreme_values():
    extreme_value_count = 0
    for _ in range(100000):
        num_heads = sum(1 if random.random() < 0.5 else 0  # count # of heads
                        for _ in range(1000))              # in 1000 flips
        if num_heads >= 530 or num_heads <= 470:           # and count how often
            extreme_value_count += 1                       # the # is 'extreme'
    return extreme_value_count / 100000
In [38]:
count_extreme_values() # 0.062
Since the p-value is greater than our 5% significance level, we don't reject the null hypothesis.
If instead we saw 532 heads, the p-value would be:
In [28]:
print("two_sided_p_value(531.5, mu_0, sigma_0) is",
      two_sided_p_value(531.5, mu_0, sigma_0))   # 0.046
This is smaller than the 5% significance level, so we would reject the null hypothesis.
For our one-sided test, if we saw 525 heads we would compute:
In [39]:
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below
In [42]:
upper_p_value(524.5, mu_0, sigma_0)   # 0.061
# this is above our 5% significance, so we wouldn't reject the null
But if instead we saw 527 heads, we would compute:
In [41]:
upper_p_value(526.5, mu_0, sigma_0)   # 0.047
# this is below our 5% significance, so we would reject the null
Make sure your data is roughly normally distributed before using normal_probability_above to compute p-values.
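One informal way to do that, sketched below, is to simulate the head count under H0 and compare its histogram to the normal density with parameters mu_0 and sigma_0 (normal_pdf here is a small helper defined only for the overlay curve):
In [ ]:
import math, random
import matplotlib.pyplot as plt   # matplotlib is also used later in this notebook

def normal_pdf(x, mu=0, sigma=1):
    """density of the normal distribution, used only for the overlay curve"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

random.seed(0)
# simulate the number of heads in 1000 flips of a fair coin, many times
counts = [sum(1 if random.random() < 0.5 else 0 for _ in range(1000))
          for _ in range(5000)]

plt.hist(counts, bins=40, density=True, alpha=0.5, label='simulated head counts')
xs = range(440, 561)
plt.plot(xs, [normal_pdf(x, mu_0, sigma_0) for x in xs], label='normal approximation')
plt.legend(loc=0, fontsize=10)
plt.show()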
Example: Confidence Intervals
We can estimate the heads probability of our (possibly unfair) coin by looking at the average value of the Bernoulli variables corresponding to each flip: 1 if heads, 0 if tails.
If we observe 525 heads out of 1,000 flips, then we estimate that p equals 0.525.
How confident can we be about this estimate?
If we knew the exact value of p, the central limit theorem tells us that the average of those Bernoulli variables should be approximately normal, with mean p and standard deviation $\sigma = \sqrt{p(1 - p) / 1000}$. Since we don't know p, we instead use our estimate:
In [69]:
p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat)/ 1000) # 0.0158
print(mu, sigma)
Using the normal approximation, we conclude that we are “95% confident” that the following interval contains the true parameter p:
In [47]:
normal_two_sided_bounds(0.95, mu, sigma)   # approximately (0.4940, 0.5560)
In particular, we do not conclude that the coin is unfair, since 0.5 falls within our confidence interval.
If instead we’d seen 540 heads, then we’d have:
In [70]:
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)
print(mu, sigma, normal_two_sided_bounds(0.95, mu, sigma))   # approximately (0.5091, 0.5709)
Here, “fair coin” (p = 0.5) doesn’t lie in the confidence interval.
P-hacking is when researchers repeatedly try statistical analyses until they get p < .05; of course, sometimes this happens without any conscious intent. See:
Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011 Nov;22(11):1359-66.
In [51]:
def run_experiment():
    """flip a fair coin 1000 times, True = heads, False = tails"""
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment):
    """using the 5% significance levels"""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531
In [53]:
print("P-hacking")
random.seed(0)
experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([experiment
                      for experiment in experiments
                      if reject_fairness(experiment)])
print(num_rejections)   # 46
So roughly 5% of our “fair coin” experiments look significant at the 5% level, which is exactly the rate of erroneous rejections we chose.
One common application of hypothesis testing is A/B testing: we show visitors one of two ads, A or B, and count how many of them click each one.
If 990 out of 1,000 A-viewers click their ad while only 10 out of 1,000 B-viewers click theirs, we can be pretty confident that A is the better ad.
But what if the differences are not so stark?
Let’s say that $N_A$ people see ad A, and that $n_A$ of them click it. Each ad view is a Bernoulli trial in which $p_A$ is the probability that someone clicks ad A, so $\frac{n_A}{N_A}$ is approximately a normal random variable with mean $p_A$ and standard deviation
$$\sigma_A = \sqrt{\frac{p_A(1-p_A)}{N_A}}$$
Similarly, $\frac{n_B}{N_B}$ is approximately a normal random variable with mean $p_B$ and standard deviation
$$\sigma_B = \sqrt{\frac{p_B(1-p_B)}{N_B}}$$
In [6]:
# running an A/B test
def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma
If we assume those two normals are independent (which seems reasonable, since the individual Bernoulli trials ought to be), then their difference should also be normal, with mean $p_B - p_A$ and standard deviation $\sqrt{\sigma_A^2 + \sigma_B^2}$.
This means we can test the null hypothesis that $p_A$ and $p_B$ are the same (that is, that $p_A - p_B = 0$) using the statistic:
In [7]:
def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)

# under the null hypothesis, this should be approximately a standard normal
For example, if “tastes great” gets 200 clicks out of 1,000 views and “less bias” gets 180 clicks out of 1,000 views, the statistic equals:
In [21]:
print("A/B testing")
z = a_b_test_statistic(1000, 200, 1000, 180)
print("a_b_test_statistic(1000, 200, 1000, 180)", z)   # -1.14
print("p-value", two_sided_p_value(z))                 # 0.254
The p-value is large enough that we can’t conclude there’s much of a difference between the two ads.
On the other hand, if “less bias” only got 150 clicks, we’d have:
In [22]:
z = a_b_test_statistic(1000, 200, 1000, 150)
print("a_b_test_statistic(1000, 200, 1000, 150)", z)   # -2.94
print("p-value", two_sided_p_value(z))                 # 0.003
This means there is only about a 0.003 probability of seeing such a large difference if the ads were equally effective, so in this case we would reject the null hypothesis.
An alternative approach to inference involves treating the unknown parameters themselves as random variables. We start with a prior distribution for the parameters and then use the observed data and Bayes's theorem to get an updated posterior distribution for the parameters. Rather than making probability judgments about the tests, we make probability judgments about the parameters themselves.
For example, when the unknown parameter is a probability (as in our coin-flipping example), we often use a prior from the Beta distribution, which puts all its probability between 0 and 1:
In [24]:
def B(alpha, beta):
    """a normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x < 0 or x > 1:   # no weight outside of [0, 1]
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)
In general, this distribution centers its weight at alpha / (alpha + beta), and the larger alpha and beta are, the “tighter” the distribution is. We can see this by plotting a few examples:
In [51]:
x = [i / 100 for i in range(101)]   # 0.0, 0.01, ..., 1.0
y1 = [beta_pdf(i, 1, 1) for i in x]
y2 = [beta_pdf(i, 10, 10) for i in x]
y3 = [beta_pdf(i, 4, 16) for i in x]
y4 = [beta_pdf(i, 16, 4) for i in x]
In [56]:
import matplotlib.pyplot as plt
plt.plot(x, y1, label = 'Beta(1, 1)')
plt.plot(x, y2, '--', label = 'Beta(10, 10)')
plt.plot(x, y3, '-*', label = 'Beta(4, 16)')
plt.plot(x, y4, '_-', label = 'Beta(16, 4)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
So let's say we assume a prior distribution on p. Maybe we don't want to take a stand on whether the coin is fair, and we choose alpha and beta to both equal 1. Or maybe we have a strong belief that the coin lands heads 55% of the time, and we choose alpha = 55, beta = 45.
Then we flip our coin a bunch of times and see h heads and t tails. Bayes's theorem tells us that the posterior distribution for p is again a Beta distribution, but with parameters alpha + h and beta + t. (This is no coincidence: whenever you update a Beta prior using observations from the corresponding binomial, you get back a Beta posterior.)
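As a minimal sketch of that conjugate update rule (beta_posterior is our own illustrative helper, not part of the code above), the posterior parameters for the examples plotted below are simply:
In [ ]:
def beta_posterior(alpha, beta, heads, tails):
    """conjugate update: a Beta(alpha, beta) prior plus binomially distributed
    coin-flip data gives a Beta(alpha + heads, beta + tails) posterior"""
    return alpha + heads, beta + tails

# 10 flips with 3 heads and 7 tails, starting from different priors:
print(beta_posterior(1, 1, 3, 7))     # (4, 8)
print(beta_posterior(20, 20, 3, 7))   # (23, 27)
print(beta_posterior(30, 10, 3, 7))   # (33, 17)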
Let's say you flip the coin 10 times and see only 3 heads. If you started with the uniform prior Beta(1, 1) (not taking a stand on whether the coin is fair), your posterior would be a Beta(4, 8), plotted below:
In [61]:
y5 = [beta_pdf(i, 4, 8) for i in x]
plt.plot(x, y5, '--', label = 'Beta(4, 8)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
If instead you had started with a Beta(20, 20) prior (expressing a belief that the coin was roughly fair), your posterior would be a Beta(23, 27), plotted here alongside the previous one:
In [60]:
y5 = [beta_pdf(i, 4, 8) for i in x]
y6 = [beta_pdf(i, 23, 27) for i in x]
plt.plot(x, y5, '--', label = 'Beta(4, 8)')
plt.plot(x, y6, '-*', label = 'Beta(23, 27)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
And if you had started with a Beta(30, 10) prior (expressing a belief that the coin was biased to flip about 75% heads), your posterior would be a Beta(33, 17):
In [58]:
y5 = [beta_pdf(i, 4, 8) for i in x]
y6 = [beta_pdf(i, 23, 27) for i in x]
y7 = [beta_pdf(i, 33, 17) for i in x]
plt.plot(x, y5, '--', label = 'Beta(4, 8)')
plt.plot(x, y6, '-*', label = 'Beta(23, 27)')
plt.plot(x, y7, '_-', label = 'Beta(33, 17)')
plt.legend(loc=0,numpoints=1,fontsize=10)
plt.ylim(0, 6)
plt.show()
What’s interesting is that this allows us to make probability statements about hypotheses:
“Based on the prior and the observed data, there is only a 5% likelihood the coin’s heads probability is between 49% and 51%.”
This is philosophically very different from a statement like
“if the coin were fair we would expect to observe data so extreme only 5% of the time.”
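To see concretely where a posterior probability statement like the first one could come from, here is a rough numerical sketch: beta_probability_between is our own illustrative helper that integrates beta_pdf with a simple Riemann sum, and the Beta(4, 8) posterior is just one of the examples above, so the number it produces need not be 5%.
In [ ]:
def beta_probability_between(lo, hi, alpha, beta, num_points=10000):
    """approximate P(lo < p < hi) under a Beta(alpha, beta) distribution
    using a midpoint Riemann sum (crude, but fine for illustration)"""
    width = (hi - lo) / num_points
    return sum(beta_pdf(lo + (i + 0.5) * width, alpha, beta) * width
               for i in range(num_points))

# under the Beta(4, 8) posterior, how likely is it that the
# coin's heads probability lies between 49% and 51%?
beta_probability_between(0.49, 0.51, 4, 8)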
Using Bayesian inference to test hypotheses is considered somewhat controversial, in part because of the subjective nature of choosing a prior.