In [1]:
# Configure Jupyter so figures appear in the notebook
%matplotlib inline
# Configure Jupyter to display the assigned value after an assignment
%config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
import numpy as np
import matplotlib.pyplot as plt
Here's an example that uses the "There is only one test" framework to solve a problem from the 2019 AP® Statistics Exam
Tumbleweed, commonly found in the western United States, is the dried structure of certain plants that are blown by the wind. Kochia, a type of plant that turns into tumbleweed at the end of the summer, is a problem for farmers because it takes nutrients away from soil that would otherwise go to more beneficial plants. Scientists are concerned that kochia plants are becoming resistant to the most commonly used herbicide, glyphosate.
In 2014, 19.7 percent of 61 randomly selected kochia plants were resistant to glyphosate. In 2017, 38.5 percent of 52 randomly selected kochia plants were resistant to glyphosate. Do the data provide convincing statistical evidence, at the level of $\alpha = 0.05$, that there has been an increase in the proportion of all kochia plants that are resistant to glyphosate?
Because the problem reports the data in terms of percentages, we have to work backwards to get the actual number of plants in each group that were resistant:
In [2]:
0.197 * 61
Out[2]:
In [3]:
0.385 * 52
Out[3]:
It looks like there were 12 resistant plants in the first group and 20 in the second group.
In [4]:
k1 = 12
k2 = 20
k = k1 + k2
Out[4]:
And here are the sample sizes.
In [5]:
n1 = 61
n2 = 52
n = n1 + n2
Out[5]:
The first step toward hypothesis testing is to define a test statistic that measures the size of the effect. In this case, the hypothetical effect is an increase in the fraction of resistant plants, that is, a difference in proportions.
I'll write a function that takes the data as a parameter and computes the difference in proportions.
In [6]:
def test_stat(data):
"""Compute the test statistic."""
k1, n1, k2, n2 = data
p1 = k1/n1
p2 = k2/n2
return p2 - p1
I pass the data to the function as a tuple of 4 numbers.
In [7]:
data = k1, n1, k2, n2
Out[7]:
Now we can compute the observed difference in proportions (and check it against the numbers in the problem).
In [8]:
actual = test_stat(data)
Out[8]:
In [9]:
0.385 - 0.197
Out[9]:
Now we need a null hypothesis. The scientists think there might be an increase in the fraction of resistant plants, so the null hypothesis is that there is no increase.
Under that assumption, the two groups have the same proportion of resistant plants, which we can estimate by pooling the data and computing the proportion among all of the observed plants.
In [10]:
p = k / n
Out[10]:
Now, the fundamental question in hypothesis testing is this: under the null hypothesis, what would be the chance of seeing an apparent effect as big as actual
?
To answer that question, we simulate the null hypothesis and generate "fake data", then compute the test statistic of the fake data.
I'll use the following function, which simulates flipping n
coins, each with probability p
of landing heads, and returns the total number of heads.
In [11]:
def flip(n, p):
"""Flip n coins with probability p.
n: number of coins
p: probability of heads
return: total number of heads
"""
return np.sum(np.random.random(n) < p)
As an example, I'll simulate the two groups and compute the number of resistant plants in each group.
In [12]:
flip(n1, p)
Out[12]:
In [13]:
flip(n2, p)
Out[13]:
Now we can wrap that in a function that returns the results in the same format as the original data, a tuple of 4 numbers.
In [14]:
def run_model(data):
"""Simulate the null hypothesis."""
k1, n1, k2, n2 = data
p = (k1 + k2) / (n1 + n2)
k1 = flip(n1, p)
k2 = flip(n2, p)
return k1, n1, k2, n2
Returning the results in that format makes it possible to pass the "fake data" directly to test_stat
.
In [15]:
test_stat(run_model(data))
Out[15]:
That's the result of one simulated experiment.
Now we run the experiment 1000 times and collect the results.
In [16]:
test_stat_dist = np.array([test_stat(run_model(data))
for i in range(1000)])
Out[16]:
The result is the "sampling distribution of the test statistic under the null hypothesis".
The mean of this distribution is close to zero, which is not suprising because it is based on the assumption that there is actually no difference between the groups.
In [17]:
np.mean(test_stat_dist)
Out[17]:
Here's what the distribution of test statistics looks like.
The vertical gray line shows the observed effect size, actual
.
In [18]:
plt.hist(test_stat_dist)
plt.axvline(actual, color='gray')
plt.xlabel('Difference in proportions')
plt.ylabel('Number of simulations')
plt.title('Sampling distribution of the test statistic under H0');
Now we can compute the probability that the test statistic, under the null hypothesis, exceeds the observed difference.
This probability is the "p-value".
In [19]:
p_value = np.mean(test_stat_dist >= actual)
Out[19]:
The estimated p-value is less than the specified threshold, $\alpha = 0.05$, so the observed difference is considered statistically significant.
When I present the "There is only one test" framework, many of the questions I get are about choosing the test statistic. In many cases, it seems like an arbitrary choice.
And in some ways it is, although rather than "arbitrary", I would say it is a modeling decision informed by the context. In this example, the researchers think there might be an increase in the proportion of resistant plants, so I chose a test statistic that measures an increase.
If the researchers expected a change in resistance without specifying an increase or decrease, I might use the absolute value of the difference as a test statistic, because the absolute value measures magnitude without regard to sign.
Using the difference in proportions was also a choice; another possibility is to use the ratio of proportions. In the context of epidemiology, it is common to use risk ratios to compare proportions.
One benefit of the computational framework is that it is easy to try a different test statistic (or a different model of the null hypothesis) and see what effect it has on the results.
Here's a function that computes the ratio of proportions.
In [20]:
def test_stat(data):
k1, n1, k2, n2 = data
p1 = k1/n1
p2 = k2/n2
return p2 / p1
Here's the observed risk ratio.
In [21]:
actual = test_stat(data)
Out[21]:
Now we can run the simulated experiments with this test statistic, and compute the p-value.
In [22]:
test_stat_dist = np.array([test_stat(run_model(data))
for i in range(1000)])
p_value = np.mean(test_stat_dist >= actual)
Out[22]:
The two test statistics yield pretty much the same p-values. That's good; it means that the interpretation of the results is not sensitive to our modeling choices.
In [ ]: