In [1]:
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import pandas as pd
The scipy.stats module has three functions for carrying out t-tests:
ttest_1samp(a, popmean) -- carries out a one-sample t-test, comparing the mean of a to the given popmean.
ttest_ind(a, b) -- carries out a t-test for the means of two independent samples a and b.
ttest_rel(a, b) -- carries out a paired t-test for related samples a and b.
We will illustrate the use of the functions using sample data sets from the OpenIntro Statistics textbook.
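Before turning to the textbook data, here is a minimal sketch of the three calls side by side. The arrays a and b are synthetic values drawn from a seeded random generator purely for illustration; they are assumptions of this sketch and are not part of the OpenIntro data sets.

```python
import numpy as np
import scipy.stats as stats

# Synthetic data for illustration only (not from the textbook).
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=30)
b = rng.normal(loc=11.0, scale=2.0, size=30)

# Each function returns a result with .statistic and .pvalue fields.
one_sample = stats.ttest_1samp(a, 10.0)  # mean of a vs. a fixed value
independent = stats.ttest_ind(a, b)      # a and b treated as unrelated samples
paired = stats.ttest_rel(a, b)           # a[i] and b[i] treated as matched pairs

for name, res in [("ttest_1samp", one_sample),
                  ("ttest_ind", independent),
                  ("ttest_rel", paired)]:
    print("{}: t = {:0.3f}, p = {:0.4f}".format(name, res.statistic, res.pvalue))
```

Note that ttest_rel requires a and b to have the same length, since it works on the pairwise differences.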
We'll start with a one-sample t-test. To illustrate this we'll use a data set on brushtail possums that we used previously (see previous notebook).
Previous studies of brushtail possums in Australia have established that the mean tail length of adult possums is 37.86 cm. I am studying an isolated population of possums in the state of Victoria, and I am interested in whether the mean tail length of Victorian possums is shorter than that of possums in the rest of Australia.
Since I have an a priori reason to expect that Victorian possums have shorter tails, this is a "one-tailed" hypothesis test. Keep in mind that ttest_1samp reports a two-sided p-value by default.
In [4]:
possums = pd.read_table("http://roybatty.org/possum.txt")
# rename the pop column because 'pop' is also a pandas DataFrame method name
possums.rename(columns={'pop':'popn'}, inplace=True)
# get the victoria possums
vic = possums[possums.popn == 'Vic']
In [12]:
stats.ttest_1samp(vic.tailL, 37.86)
Out[12]:
Here's the same test as above, but showing how you can work with the named fields of the tuple returned by ttest_1samp.
In [15]:
vicT = stats.ttest_1samp(vic.tailL, 37.86)
print("The t-statistic for our test is: {:0.2f}".format(vicT.statistic))
print("The p-value for our test is: {:0.10f}".format(vicT.pvalue))
If you compare the p-value to the previous notebook, where we carried out a one-sample hypothesis test of the mean, you'll see the value we calculate here is slightly larger, reflecting the difference between the t and normal distributions. Using either distribution we have strong evidence for rejecting the null hypothesis.
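The one-sample t-statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom, and a one-tailed p-value for "shorter than" is the lower-tail area under the t distribution. The sketch below works this out by hand and compares it with ttest_1samp. It uses a synthetic array called tail as a stand-in for vic.tailL (an assumption, so the cell runs without the data file), and it assumes SciPy 1.6 or later, where ttest_1samp accepts the alternative argument.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for vic.tailL (illustration only, not the possum data).
rng = np.random.default_rng(1)
tail = rng.normal(loc=36.0, scale=2.0, size=40)
popmean = 37.86

# Manual calculation: t = (sample mean - popmean) / (s / sqrt(n))
n = tail.size
t_manual = (tail.mean() - popmean) / (tail.std(ddof=1) / np.sqrt(n))
p_less_manual = stats.t.cdf(t_manual, df=n - 1)  # lower-tail area

# Same test via scipy: two-sided by default, one-sided on request.
two_sided = stats.ttest_1samp(tail, popmean)
one_sided = stats.ttest_1samp(tail, popmean, alternative='less')

print("manual t = {:0.3f}, scipy t = {:0.3f}".format(t_manual, two_sided.statistic))
print("two-sided p = {:0.3g}".format(two_sided.pvalue))
print("one-sided p = {:0.3g} (manual: {:0.3g})".format(one_sided.pvalue, p_less_manual))
```

When the observed t-statistic falls in the direction of the alternative (negative here), the one-sided p-value is half the two-sided one.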
We'll use the book price example from your textbook (see section 5.2) to illustrate a paired t-test. The data are prices from the UCLA bookstore and Amazon.com for 73 textbooks used in classes at UCLA.
Our null and alternative hypotheses are:
$H_0$: the mean price of textbooks at the UCLA bookstore and at Amazon.com is the same
$H_A$: the mean price of textbooks at the UCLA bookstore and at Amazon.com is different
In [16]:
books = pd.read_table("https://github.com/Bio204-class/bio204-datasets/raw/master/textbooks.txt")
In [17]:
books.columns
Out[17]:
In [20]:
books.head()
Out[20]:
In [21]:
books.shape
Out[21]:
In [22]:
booksT = stats.ttest_rel(books.uclaNew, books.amazNew)
In [23]:
booksT
Out[23]:
Our calculated t-score is 7.65, with a corresponding p-value of ~$7 \times 10^{-11}$. Compare this to the calculation of the t-score for these data on page 230 of your textbook. We therefore have strong evidence to reject the null hypothesis.
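A paired t-test is equivalent to a one-sample t-test on the pairwise differences, tested against a mean of zero. The sketch below checks this on synthetic prices; the arrays store and online are made-up stand-ins for books.uclaNew and books.amazNew, not the textbook data.

```python
import numpy as np
import scipy.stats as stats

# Synthetic paired prices (illustration only, not the textbook data).
rng = np.random.default_rng(2)
store = rng.normal(loc=70.0, scale=30.0, size=73)
online = store - rng.normal(loc=12.0, scale=14.0, size=73)

# Paired test on the two columns...
paired = stats.ttest_rel(store, online)

# ...matches a one-sample test of the differences against zero.
diff = store - online
on_diff = stats.ttest_1samp(diff, 0.0)

print("ttest_rel:   t = {:0.4f}, p = {:0.3g}".format(paired.statistic, paired.pvalue))
print("ttest_1samp: t = {:0.4f}, p = {:0.3g}".format(on_diff.statistic, on_diff.pvalue))
```

This is why the pairing matters: the test depends only on the variability of the differences, not on the variability of the individual prices.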
To illustrate t-tests for independent samples we'll use the smoking and birth weight example from section 5.3 of your textbook.
This data set includes 150 cases of mothers and their newborns in North Carolina. As per the textbook (Diez et al. 2015), the null and alternative hypotheses we want to test are:
$H_0$: There is no difference in average birth weight for newborns from mothers who did and did not smoke. In statistical notation: $\mu_n - \mu_s = 0$, where $\mu_n$ is the mean birth weight for newborns of non-smoking mothers and $\mu_s$ is the mean for newborns of mothers who smoked.
$H_A$: There is some difference in average newborn weights from mothers who did and did not smoke ($\mu_n - \mu_s \neq 0$).
In [24]:
births = pd.read_table("https://github.com/Bio204-class/bio204-datasets/raw/master/births.txt")
In [25]:
births.head()
Out[25]:
In [26]:
births.shape
Out[26]:
Let's use the groupby and describe methods to generate some useful summary statistics on baby weight, grouped by whether the mother smoked or not.
In [35]:
births.groupby('smoke').weight.describe()
Out[35]:
We see that the mean birth weight for babies of non-smoking mothers is greater than that for smoking mothers (7.18 vs 6.78 lbs). However, there is substantial variation within both groups.
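To make the groupby and describe pattern concrete, here is a small sketch on a hand-made DataFrame. The frame births_demo and its five rows are hypothetical values chosen for illustration; only the column names smoke and weight mirror the real data set.

```python
import pandas as pd

# Tiny hypothetical data frame (illustration only, not the births data).
births_demo = pd.DataFrame({
    'smoke': ['nonsmoker', 'smoker', 'nonsmoker', 'smoker', 'nonsmoker'],
    'weight': [7.5, 6.5, 7.0, 6.9, 8.0],
})

# One row of summary statistics (count, mean, std, quartiles) per group.
summary = births_demo.groupby('smoke')['weight'].describe()
print(summary)

# Individual statistics can be pulled out by group label and column name.
print(summary.loc['nonsmoker', 'mean'], summary.loc['smoker', 'mean'])
```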
In [27]:
# subset data based on smoke column
nonsmokers = births[births.smoke == 'nonsmoker']
smokers = births[births.smoke == "smoker"]
In [28]:
birthT = stats.ttest_ind(nonsmokers.weight, smokers.weight)
In [29]:
birthT
Out[29]:
For this example, we fail to reject the null hypothesis of no difference in means at the significance level $\alpha=0.05$.
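One detail worth knowing: by default ttest_ind assumes the two groups have equal variances and pools them. Passing equal_var=False gives Welch's t-test, whose statistic uses the unpooled standard error $\sqrt{s_1^2/n_1 + s_2^2/n_2}$. The sketch below compares the two on synthetic weights; the arrays group_n and group_s are made-up stand-ins for nonsmokers.weight and smokers.weight, and whether the textbook's hand calculation matches one variant or the other is left for you to check.

```python
import numpy as np
import scipy.stats as stats

# Synthetic birth weights (illustration only, not the births data).
rng = np.random.default_rng(3)
group_n = rng.normal(loc=7.2, scale=1.4, size=100)
group_s = rng.normal(loc=6.8, scale=1.4, size=50)

pooled = stats.ttest_ind(group_n, group_s)                  # equal_var=True (default)
welch = stats.ttest_ind(group_n, group_s, equal_var=False)  # Welch's t-test

# Welch's statistic by hand, using the unpooled standard error.
se = np.sqrt(group_n.var(ddof=1) / group_n.size +
             group_s.var(ddof=1) / group_s.size)
t_manual = (group_n.mean() - group_s.mean()) / se

print("pooled: t = {:0.4f}, p = {:0.4f}".format(pooled.statistic, pooled.pvalue))
print("Welch:  t = {:0.4f}, p = {:0.4f}".format(welch.statistic, welch.pvalue))
print("manual Welch t = {:0.4f}".format(t_manual))
```

With similar group variances the two versions give nearly the same answer; they diverge when the variances and sample sizes both differ.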