In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
In [2]:
# download from http://roybatty.org/statplots.py
# and upload to your Jupyter instance if you haven't done so already
import statplots
See lecture slides for intro discussion on hypothesis testing.
I've posted a data set at the following URL -- http://roybatty.org/possum.txt -- that contains morphological measurements on 104 brushtail possums (Trichosurus vulpecula). This data set contains information on variables such as sex, age, head length, skull width, etc.
By Bryce McQuillan - originally
posted to Flickr as Brushtail Opossum
Trichosurus vulpecula, CC BY-SA 2.0,
https://commons.wikimedia.org/w/index.php?curid=4195284
Previous studies of brushtail possums in Australia have established that the mean tail length of adult possums is 37.86cm. I am studying an isolated population of possums in the state of Victoria, and I am interested in whether mean tail length of Victorian possums is shorter than that of possums in the rest of Australia.
Since I have an a priori reason to believe the difference in tail length is shorter, this is a "one-tailed" hypothesis test (pun intended!).
In [3]:
possums = pd.read_table("http://roybatty.org/possum.txt")
possums.rename(columns={'pop':'popn'}, inplace=True)
possums.columns
Out[3]:
In [4]:
vic = vic = possums[possums.popn == 'Vic']
print(vic.tailL.describe())
In [5]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(10,4)
ax1.hist(vic.tailL, bins=9, color='steelblue', normed=True)
ax1.set_xlabel("tail length (cm)")
stats.probplot(vic.tailL, plot=ax2)
pass
In [6]:
mean_tl = vic.tailL.mean()
std_tl = vic.tailL.std()
SE_tl = std_tl/np.sqrt(vic.tailL.count())
tl_ci95 = mean_tl - 1.96 * SE_tl, mean_tl + 1.96 * SE_tl
print("sample mean tail length = {:.2f}".format(mean_tl))
print("SE of mean tail length = {:.2f}".format(SE_tl))
print("95% CI for mean = ({:.2f}, {:.2f})".format(tl_ci95[0], tl_ci95[1]))
In [7]:
fig = plt.figure()
fig.set_size_inches(10,4)
mean_null = 37.86
plt.plot([],[]) # draw an empty plot to setup axes
# draw line at x-value corresponding to H0
plt.vlines(mean_null, 0, 1, linestyle='dashed', color='k')
plt.text(mean_null+0.05, 0.2, "$H_0: \mu = {:.2f}$".format(mean_null),
fontsize=18)
plt.errorbar(mean_tl, 0.2, xerr=1.96 * SE_tl, fmt='ro',
label='95% CI of Victoria\nmean tail length')
plt.legend(numpoints=1, loc='upper left')
plt.yticks([]) # get ride of y-ticks
plt.xlabel("Tail length (cm)", fontsize=14)
plt.xlim(35, 40)
pass
The CI approach above let us assess, in a general way, whether there's evidence to reject the null hypothesis. In the above example, we might state our finding something along the lines of "We are 95% confident that the mean tail length of Victoria possums is between 35.44 and 36.43mm. We therefore reject the null hypothesis that Victoria possums have the same mean tail length as other Australian possums." We would make the same statement for any other estimated 95% CI that did not overlap 37.86. Thus the CI approach doesn't really tell us anything about the strength of evidence against the null hypothesis, it just gives us a basis by which to reject it (or fail to reject it).
To assess the strength of the evidence against the null hypothesis, we can ask "what is the probability of observing data (or a statistic of interest) at least as favorable to the alternative hypothesis, if the null hypothesis were true?"
When considering a statistic of interest the usual way to phrase this question is with respect to the expected sampling distribution of the statistic of interest under the null hypothesis. For example, when testing a hypothesis involving means, we would ask "What is the probability of observing my sample mean, with respect to the expected distribution of sample means, if the null hypothesis was true."
Let's apply this to our possum tail example. Our null hypothesis, $H_0$, is that the mean tail length in Victoria possums should be the same as the rest of Australia ($\mu=37.86$). Let's assume the standard deviation of tail length under the null and alternative hypothesis is the same, so that we can estimate the sampling distribution of the mean under the null hypothesis by using our sample standard deviation.
First we calculate the expected distribution of the sample means under the null hypothesis and plot this estimate.
In [8]:
mean_null = 37.86
se_null = SE_tl # because we assumed H0 and HA have same SE
fig = plt.figure()
fig.set_size_inches(10,4)
null_distn, ax = statplots.norm_plot(mean_null, se_null, nstds=4, color='k',
label='Distn of Sample Means\nUnder $H_0$')
plt.xlabel("Mean tail length (cm)")
# make a fancy legend box
plt.legend(loc='upper right', fancybox=True, shadow=True, borderpad=1)
pass
Now let's modify this figure to draw the observed sample mean relative to the expected distribution of sample means under $H_0$.
Most of the following code is involved in generating a nice plot, not carrying out any statistical calculations. One thing to note is that I've written the statplots.norm_plot
function so that it returns the "frozen" scipy.stats.norm
distribution that corresponds to the plot you asked it to generate. This is convenient because you can then carry out subsequent calculations with this distribution.
In [9]:
fig = plt.figure()
fig.set_size_inches(10,4)
distn, ax = statplots.norm_plot(mean_null, SE_tl, nstds=8, color='k',
label='Distn of Sample Means\nUnder $H_0$')
ax.annotate("Observed $\overline{x}$", xy=(mean_tl,0),
xytext=(0,40),
textcoords='offset points',
horizontalalignment="center",
verticalalignment="bottom",
arrowprops=dict(arrowstyle="->",color='black'))
plt.xlabel("Mean tail length (cm)")
plt.legend(fancybox=True, shadow=True, borderpad=1)
pass
It looks like our observed mean is very far to the left of the expected distribution of the samples mean for the given null hypothesis. Let's express that as a Z-score. Recall that Z-scores are defined in terms of the number of standard deviations an observation sits from the mean $\frac{x - \mu}{\sigma}$. In the present example, $x = \overline{x}$, $\mu = \mu_{H_0}$, and $\sigma = {SE}_{H_0}$.
In [10]:
z = (mean_tl - mean_null)/se_null
z
Out[10]:
Our calculation confirms our graphical intuition. Our observed sample mean is 7.6 standard errors to the left of the the mean under the null hypothesis!
Finally, let's calculate the probability of observing a value this extreme under the null hypothesis. Since this is a one-sided hypothesis test, with our alternative hypothesis to the left of the null hypotehsis, we only have to the sum probability to the left of our observed value of the mean.
In [11]:
# probability to the left of observed mean(x) under H0
pval = stats.norm.cdf(z) # since Z score we can use the std normal to compute
print("p-value =", pval)
That is a vanishingly small probability, so we consider this strong evidence to reject the null-hypothesis.
Let's test another hypothesis, this time a two-side one. We wish to test the hypothesis that the mean total body length of Victoria possums is not equal to the mean total length of other Australian possums, which previous studies have shown to be 86.8 cm$
In [12]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(10,4)
ax1.hist(vic.totalL, bins=10, color='steelblue', normed=True)
ax1.set_xlabel("head length (mm)")
stats.probplot(vic.totalL, plot=ax2)
pass
Our histogram and probability plot raise some questions about the assumption of approximately normality. However we have a decently large sample size ($n = 46$) and we're testing hypotheses about the equality of the mean (which tends to be fairly robust to departures from normality), so we decide to proceed (cautiously) under the normal assumption.
In [13]:
vic.totalL.describe()
Out[13]:
In [14]:
mean_totL = vic.totalL.mean()
std_totL = vic.totalL.std()
se_totL = std_totL/np.sqrt(vic.totalL.count())
totL_ci95 = mean_totL - 1.96 * se_totL, mean_totL + 1.96 * se_totL
print("sample mean total length = {:.3f}".format(mean_totL))
print("SE of mean total length = {:.2f}".format(se_totL))
print("95% CI for mean = ({:.2f}, {:.2f})".format(totL_ci95[0], totL_ci95[1]))
In [15]:
fig = plt.figure()
fig.set_size_inches(10,4)
mean_null = 86.8
plt.plot([],[]) # draw an empty plot to setup axes
# draw line at x-value corresponding to H0
plt.vlines(mean_null, 0, 1, linestyle='dashed', color='k')
plt.text(mean_null+0.05, 0.5, "$H_0: \mu = {:.2f}$".format(mean_null),
fontsize=18)
plt.errorbar(mean_totL, 0.2, xerr=1.96 * se_totL, fmt='ro',
label='95% CI of Victoria\nmean total length')
plt.legend(numpoints=1, loc='upper left')
plt.yticks([]) # get ride of y-ticks
plt.xlabel("Total length (cm)", fontsize=14)
plt.xlim(85, 90)
pass
In [16]:
se_null = se_totL # because we assumed H0 and HA have same SE
fig = plt.figure()
fig.set_size_inches(10,4)
null_distn, ax = statplots.norm_plot(mean_null, se_null, nstds=4, color='k',
label='Distn of Sample Means\nUnder $H_0$')
ax.annotate("Observed $\overline{x}$", xy=(mean_totL,0),
xytext=(0,20),
textcoords='offset points',
horizontalalignment="center",
verticalalignment="bottom",
arrowprops=dict(arrowstyle="->",color='black'))
plt.xlabel("Mean tail length (cm)")
# make a fancy legend box
plt.legend(loc='upper right', fancybox=True, shadow=True, borderpad=1)
plt.xlim(84,90)
pass
In [17]:
z = (mean_totL - mean_null)/se_null
z
Out[17]:
In [18]:
fig = plt.figure()
fig.set_size_inches(10,4)
distn,ax = statplots.norm_plot(mean_null, se_null, nstds=4, color='k',
label='Distn of Sample Means\nUnder $H_0$')
# left tail
statplots.area_under_distn(distn, mean_null - 4*se_null, mean_null - z*se_null, color='gray')
# right tail
statplots.area_under_distn(distn, mean_null + 4*se_null, mean_null + z*se_null, color='gray')
plt.xlim(84,90)
pass
In [20]:
### p-value of two-tail hypothesis test
In [19]:
pleft = stats.norm.cdf(-z) # left tail = cdf
pright = stats.norm.sf(z) # right tail = sf
pval = pleft + pright
print("p-value =", pval)