Copyright 2016 Allen B. Downey

MIT License: https://opensource.org/licenses/MIT

```
In [1]:
```from __future__ import print_function, division
%matplotlib inline
import numpy as np
import random
import thinkstats2
import thinkplot

The following is a version of `thinkstats2.HypothesisTest`

with just the essential methods:

```
In [2]:
```class HypothesisTest(object):
def __init__(self, data):
self.data = data
self.MakeModel()
self.actual = self.TestStatistic(data)
def PValue(self, iters=1000):
self.test_stats = [self.TestStatistic(self.RunModel())
for _ in range(iters)]
count = sum(1 for x in self.test_stats if x >= self.actual)
return count / iters
def TestStatistic(self, data):
raise UnimplementedMethodException()
def MakeModel(self):
pass
def RunModel(self):
raise UnimplementedMethodException()

```
In [3]:
```class CoinTest(HypothesisTest):
def TestStatistic(self, data):
heads, tails = data
test_stat = abs(heads - tails)
return test_stat
def RunModel(self):
heads, tails = self.data
n = heads + tails
sample = [random.choice('HT') for _ in range(n)]
hist = thinkstats2.Hist(sample)
data = hist['H'], hist['T']
return data

The p-value turns out to be about 7%, which is considered on the border of statistical significance.

```
In [4]:
```ct = CoinTest((140, 110))
pvalue = ct.PValue()
pvalue

```
Out[4]:
```

```
In [5]:
```class DiffMeansPermute(thinkstats2.HypothesisTest):
def TestStatistic(self, data):
group1, group2 = data
test_stat = abs(group1.mean() - group2.mean())
return test_stat
def MakeModel(self):
group1, group2 = self.data
self.n, self.m = len(group1), len(group2)
self.pool = np.hstack((group1, group2))
def RunModel(self):
np.random.shuffle(self.pool)
data = self.pool[:self.n], self.pool[self.n:]
return data

```
In [6]:
```import first
live, firsts, others = first.MakeFrames()
data = firsts.prglngth.values, others.prglngth.values

```
In [7]:
```ht = DiffMeansPermute(data)
pvalue = ht.PValue()
pvalue

```
Out[7]:
```

Here's the distrubution of the test statistic (the difference in means) over many simulated samples:

```
In [8]:
```ht.PlotCdf()
thinkplot.Config(xlabel='test statistic',
ylabel='CDF')

```
```

Under the null hypothesis, we often see differences bigger than the observed difference.

```
In [9]:
```class DiffMeansOneSided(DiffMeansPermute):
def TestStatistic(self, data):
group1, group2 = data
test_stat = group1.mean() - group2.mean()
return test_stat

```
In [10]:
```ht = DiffMeansOneSided(data)
pvalue = ht.PValue()
pvalue

```
Out[10]:
```

But in this example, the result is still not statistically significant.

```
In [11]:
```class DiffStdPermute(DiffMeansPermute):
def TestStatistic(self, data):
group1, group2 = data
test_stat = group1.std() - group2.std()
return test_stat

```
In [12]:
```ht = DiffStdPermute(data)
pvalue = ht.PValue()
pvalue

```
Out[12]:
```

But that's not statistically significant either.

```
In [13]:
```class CorrelationPermute(thinkstats2.HypothesisTest):
def TestStatistic(self, data):
xs, ys = data
test_stat = abs(thinkstats2.Corr(xs, ys))
return test_stat
def RunModel(self):
xs, ys = self.data
xs = np.random.permutation(xs)
return xs, ys

Here's an example testing the correlation between birth weight and mother's age.

```
In [14]:
```cleaned = live.dropna(subset=['agepreg', 'totalwgt_lb'])
data = cleaned.agepreg.values, cleaned.totalwgt_lb.values
ht = CorrelationPermute(data)
pvalue = ht.PValue()
pvalue

```
Out[14]:
```

The reported p-value is 0, which means that in 1000 trials we didn't see a correlation, under the null hypothesis, that exceeded the observed correlation. That means that the p-value is probably smaller than $1/1000$, but it is not actually 0.

To get a sense of how unexpected the observed value is under the null hypothesis, we can compare the actual correlation to the largest value we saw in the simulations.

```
In [15]:
```ht.actual, ht.MaxTestStat()

```
Out[15]:
```

```
In [16]:
```class DiceTest(thinkstats2.HypothesisTest):
def TestStatistic(self, data):
observed = data
n = sum(observed)
expected = np.ones(6) * n / 6
test_stat = sum(abs(observed - expected))
return test_stat
def RunModel(self):
n = sum(self.data)
values = [1, 2, 3, 4, 5, 6]
rolls = np.random.choice(values, n, replace=True)
hist = thinkstats2.Hist(rolls)
freqs = hist.Freqs(values)
return freqs

Here's an example using the data from the book:

```
In [17]:
```data = [8, 9, 19, 5, 8, 11]
dt = DiceTest(data)
pvalue = dt.PValue(iters=10000)
pvalue

```
Out[17]:
```

The observed deviance from the expected values is not statistically significant.

By convention, it is more common to test data like this using the chi-squared statistic:

```
In [18]:
```class DiceChiTest(DiceTest):
def TestStatistic(self, data):
observed = data
n = sum(observed)
expected = np.ones(6) * n / 6
test_stat = sum((observed - expected)**2 / expected)
return test_stat

Using this test, we get a smaller p-value:

```
In [19]:
```dt = DiceChiTest(data)
pvalue = dt.PValue(iters=10000)
pvalue

```
Out[19]:
```

```
In [20]:
```class PregLengthTest(thinkstats2.HypothesisTest):
def MakeModel(self):
firsts, others = self.data
self.n = len(firsts)
self.pool = np.hstack((firsts, others))
pmf = thinkstats2.Pmf(self.pool)
self.values = range(35, 44)
self.expected_probs = np.array(pmf.Probs(self.values))
def RunModel(self):
np.random.shuffle(self.pool)
data = self.pool[:self.n], self.pool[self.n:]
return data
def TestStatistic(self, data):
firsts, others = data
stat = self.ChiSquared(firsts) + self.ChiSquared(others)
return stat
def ChiSquared(self, lengths):
hist = thinkstats2.Hist(lengths)
observed = np.array(hist.Freqs(self.values))
expected = self.expected_probs * len(lengths)
stat = sum((observed - expected)**2 / expected)
return stat

```
In [21]:
```data = firsts.prglngth.values, others.prglngth.values
ht = PregLengthTest(data)
p_value = ht.PValue()
print('p-value =', p_value)
print('actual =', ht.actual)
print('ts max =', ht.MaxTestStat())

```
```

```
In [22]:
```def FalseNegRate(data, num_runs=1000):
"""Computes the chance of a false negative based on resampling.
data: pair of sequences
num_runs: how many experiments to simulate
returns: float false negative rate
"""
group1, group2 = data
count = 0
for i in range(num_runs):
sample1 = thinkstats2.Resample(group1)
sample2 = thinkstats2.Resample(group2)
ht = DiffMeansPermute((sample1, sample2))
p_value = ht.PValue(iters=101)
if p_value > 0.05:
count += 1
return count / num_runs

```
In [23]:
```neg_rate = FalseNegRate(data)
neg_rate

```
Out[23]:
```

**Exercise:** As sample size increases, the power of a hypothesis test increases, which means it is more likely to be positive if the effect is real. Conversely, as sample size decreases, the test is less likely to be positive even if the effect is real.

To investigate this behavior, run the tests in this chapter with different subsets of the NSFG data. You can use `thinkstats2.SampleRows`

to select a random subset of the rows in a DataFrame.

What happens to the p-values of these tests as sample size decreases? What is the smallest sample size that yields a positive test?

```
In [24]:
```# Solution
def RunTests(live, iters=1000):
"""Runs the tests from Chapter 9 with a subset of the data.
live: DataFrame
iters: how many iterations to run
"""
n = len(live)
firsts = live[live.birthord == 1]
others = live[live.birthord != 1]
# compare pregnancy lengths
data = firsts.prglngth.values, others.prglngth.values
ht = DiffMeansPermute(data)
p1 = ht.PValue(iters=iters)
data = (firsts.totalwgt_lb.dropna().values,
others.totalwgt_lb.dropna().values)
ht = DiffMeansPermute(data)
p2 = ht.PValue(iters=iters)
# test correlation
live2 = live.dropna(subset=['agepreg', 'totalwgt_lb'])
data = live2.agepreg.values, live2.totalwgt_lb.values
ht = CorrelationPermute(data)
p3 = ht.PValue(iters=iters)
# compare pregnancy lengths (chi-squared)
data = firsts.prglngth.values, others.prglngth.values
ht = PregLengthTest(data)
p4 = ht.PValue(iters=iters)
print('%d\t%0.2f\t%0.2f\t%0.2f\t%0.2f' % (n, p1, p2, p3, p4))

```
In [25]:
```# Solution
n = len(live)
for _ in range(7):
sample = thinkstats2.SampleRows(live, n)
RunTests(sample)
n //= 2

```
```

```
In [26]:
```# Solution
# My results:
# test1: difference in mean pregnancy length
# test2: difference in mean birth weight
# test3: correlation of mother's age and birth weight
# test4: chi-square test of pregnancy length
# n test1 test2 test2 test4
# 9148 0.16 0.00 0.00 0.00
# 4574 0.10 0.01 0.00 0.00
# 2287 0.25 0.06 0.00 0.00
# 1143 0.24 0.03 0.39 0.03
# 571 0.81 0.00 0.04 0.04
# 285 0.57 0.41 0.48 0.83
# 142 0.45 0.08 0.60 0.04
# Conclusion: As expected, tests that are positive with large sample
# sizes become negative as we take away data. But the pattern is
# erratic, with some positive tests even at small sample sizes.

**Exercise:** In Section 9.3, we simulated the null hypothesis by permutation; that is, we treated the observed values as if they represented the entire population, and randomly assigned the members of the population to the two groups.

An alternative is to use the sample to estimate the distribution for the population, then draw a random sample from that distribution. This process is called resampling. There are several ways to implement resampling, but one of the simplest is to draw a sample with replacement from the observed values, as in Section 9.10.

Write a class named `DiffMeansResample`

that inherits from `DiffMeansPermute`

and overrides `RunModel`

to implement resampling, rather than permutation.

Use this model to test the differences in pregnancy length and birth weight. How much does the model affect the results?

```
In [27]:
```# Solution
class DiffMeansResample(DiffMeansPermute):
"""Tests a difference in means using resampling."""
def RunModel(self):
"""Run the model of the null hypothesis.
returns: simulated data
"""
group1 = np.random.choice(self.pool, self.n, replace=True)
group2 = np.random.choice(self.pool, self.m, replace=True)
return group1, group2

```
In [28]:
```# Solution
def RunResampleTest(firsts, others):
"""Tests differences in means by resampling.
firsts: DataFrame
others: DataFrame
"""
data = firsts.prglngth.values, others.prglngth.values
ht = DiffMeansResample(data)
p_value = ht.PValue(iters=10000)
print('\ndiff means resample preglength')
print('p-value =', p_value)
print('actual =', ht.actual)
print('ts max =', ht.MaxTestStat())
data = (firsts.totalwgt_lb.dropna().values,
others.totalwgt_lb.dropna().values)
ht = DiffMeansPermute(data)
p_value = ht.PValue(iters=10000)
print('\ndiff means resample birthweight')
print('p-value =', p_value)
print('actual =', ht.actual)
print('ts max =', ht.MaxTestStat())

```
In [29]:
```# Solution
RunResampleTest(firsts, others)

```
```

```
In [30]:
```# Solution
# Conclusions: Using resampling instead of permutation has very
# little effect on the results.
# The two models are based on slightly difference assumptions, and in
# this example there is no compelling reason to choose one or the other.
# But in general p-values depend on the choice of the null hypothesis;
# different models can yield very different results.