In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [69]:
import pandas as pd
df = pd.read_csv('data/human_body_temperature.csv')
df.head()
Out[69]:
(1) The histogram and normal probability plot show that the distribution of body temperatures is approximately normal.
In [70]:
import numpy as np
import math
import scipy.stats as stats
import matplotlib.pyplot as plt
# Histogram of the temperature readings
plt.hist(df.temperature)
plt.show()
# Normal probability (Q-Q) plot of the readings against a normal distribution
stats.probplot(df.temperature, dist="norm", plot=plt)
plt.show()
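The visual check above can be complemented with a formal normality test. The following is only a minimal sketch: it uses synthetic readings (the `temps` array, with an assumed mean and spread) as a stand-in for `df.temperature`, and the choice of the Shapiro-Wilk and D'Agostino-Pearson tests is an illustration, not part of the original exercise.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for df.temperature: 130 synthetic readings.
# The mean and spread used here are assumptions for illustration only.
rng = np.random.default_rng(0)
temps = rng.normal(loc=98.25, scale=0.73, size=130)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
shapiro_stat, shapiro_p = stats.shapiro(temps)

# D'Agostino-Pearson test: combines sample skewness and kurtosis.
k2_stat, k2_p = stats.normaltest(temps)

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"D'Agostino-Pearson: K2 = {k2_stat:.3f}, p = {k2_p:.3f}")
```

A large p value from either test means the data give no evidence against normality; it does not prove that the distribution is normal.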
(2) The sample size is 130, which is large enough (n > 30) for the central limit theorem (CLT) to apply. In addition, 130 people is far less than 10% of the human population, so we can treat the observations as independent.
In [71]:
sample_size = df.temperature.count()
print('sample size is ' + str(sample_size))
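(3) We can use a one-sample z test, because we are comparing a single sample mean against a fixed value and the sample size is much larger than 30. Here $T$ denotes the true mean body temperature: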
$H_0: T = 98.6$
$H_A: T \neq 98.6$
The p value is 4.35e-08, which is much smaller than 0.05, so we reject the null hypothesis that the true mean human body temperature is 98.6.
When using a t test instead, the p value is 2.19e-07, which is larger than the p value from the z test because the t-distribution has heavier tails. This p value is still much smaller than 0.05, so the conclusion is the same: the true mean human body temperature is not 98.6.
In [72]:
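mean = np.mean(df.temperature)
# Note: np.std defaults to ddof=0 (population formula); pass ddof=1 for the sample standard deviation
se = np.std(df.temperature) / math.sqrt(sample_size)
z = (98.6 - mean) / se
# Two-sided p value from the standard normal distribution
p_z = 2 * stats.norm.sf(abs(z))
print('p value for z test is ' + str(p_z))
# Same statistic referred to a t-distribution with n - 1 degrees of freedom
dgf = sample_size - 1
p_t = 2 * stats.t.sf(abs(z), dgf)
print('p value for t test is ' + str(p_t))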
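The manual calculation above can be cross-checked against SciPy's built-in one-sample t test. This is a hedged sketch, not the notebook's own method: `one_sample_tests` is a hypothetical helper, the data are synthetic stand-ins for `df.temperature`, and the standard error here uses the sample standard deviation (`ddof=1`), which is what `scipy.stats.ttest_1samp` uses internally.

```python
import numpy as np
from scipy import stats

def one_sample_tests(x, mu0):
    """Return the test statistic and two-sided z and t p values (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Standard error based on the sample standard deviation (ddof=1).
    se = x.std(ddof=1) / np.sqrt(n)
    stat = (x.mean() - mu0) / se
    p_z = 2 * stats.norm.sf(abs(stat))        # normal reference distribution
    p_t = 2 * stats.t.sf(abs(stat), df=n - 1)  # t reference distribution
    return stat, p_z, p_t

# Synthetic stand-in for df.temperature; mean and spread are assumptions.
rng = np.random.default_rng(1)
temps = rng.normal(loc=98.25, scale=0.73, size=130)

stat, p_z, p_t = one_sample_tests(temps, 98.6)

# Reference implementation from SciPy for comparison.
t_ref, p_ref = stats.ttest_1samp(temps, popmean=98.6)

print(f"manual: stat = {stat:.4f}, p_z = {p_z:.3g}, p_t = {p_t:.3g}")
print(f"scipy : stat = {t_ref:.4f}, p_t = {p_ref:.3g}")
```

Because the t-distribution has heavier tails than the normal, `p_t` is never smaller than `p_z` for the same statistic, although with 129 degrees of freedom the two are close.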
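(4) The 95% confidence interval for the mean body temperature is [98.12, 98.37]. Note that this interval describes the uncertainty in the mean, not the spread of individual readings; a range for judging one person's temperature as "abnormal" would be based on the standard deviation rather than the standard error and would be considerably wider.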
In [73]:
ub = mean + 1.96*se
lb = mean - 1.96*se
print('Mean: ' + str(mean))
print('95 % Confidence Interval: [' + str(lb) + ', ' + str(ub) + ']')
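To illustrate the difference between an interval for the mean and a range for individual readings, here is a minimal sketch on synthetic data (the `temps` array and its parameters are assumptions, not the real dataset). The individual range uses the standard deviation instead of the standard error and only covers roughly 95% of readings if temperatures are approximately normal.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df.temperature; mean and spread are assumptions.
rng = np.random.default_rng(2)
temps = rng.normal(loc=98.25, scale=0.73, size=130)

n = temps.size
mean = temps.mean()
sd = temps.std(ddof=1)   # sample standard deviation
se = sd / np.sqrt(n)     # standard error of the mean

# Two-sided 95% critical value of the standard normal (about 1.96).
z_crit = stats.norm.ppf(0.975)

# Interval for the population mean: narrow, shrinks as n grows.
ci_mean = (mean - z_crit * se, mean + z_crit * se)

# Approximate range covering about 95% of individual readings.
range_individual = (mean - z_crit * sd, mean + z_crit * sd)

print(f"95% CI for the mean:       [{ci_mean[0]:.2f}, {ci_mean[1]:.2f}]")
print(f"Approx. 95% of individuals: [{range_individual[0]:.2f}, {range_individual[1]:.2f}]")
```

The individual range is wider than the confidence interval by a factor of the square root of the sample size, which is why the confidence interval alone would flag most healthy readings as abnormal.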
(5) We can use a two-sample z test, where $T_M$ and $T_F$ denote the true mean body temperatures of males and females:
$H_0: T_M = T_F$
$H_A: T_M \neq T_F$
The p value is 0.02, which is smaller than 0.05. This indicates that there is a statistically significant difference in mean body temperature between males and females.
In [74]:
male_temp = df[df.gender=='M'].temperature
female_temp = df[df.gender=='F'].temperature
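# Absolute difference between the two group means
mean_diff = abs(np.mean(male_temp) - np.mean(female_temp))
# Standard error of the difference (np.var defaults to ddof=0; pass ddof=1 for the sample variance)
se = math.sqrt(np.var(male_temp)/male_temp.count() + np.var(female_temp)/female_temp.count())
z = mean_diff/se
# Two-sided p value from the standard normal distribution
p_z = 2 * stats.norm.sf(z)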
print('mean for male is ' + str(np.mean(male_temp)))
print('mean for female is ' + str(np.mean(female_temp)))
print('p value for z test is ' + str(p_z))
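The two-sample calculation above can be compared with Welch's t test, which uses the same statistic but refers it to a t-distribution. This is a sketch under stated assumptions: the `male` and `female` arrays are synthetic, their group sizes and common spread are invented for illustration, and the variances use `ddof=1` to match `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the male and female readings.
# Group sizes and spread are assumptions for illustration only.
rng = np.random.default_rng(3)
male = rng.normal(loc=98.10, scale=0.7, size=65)
female = rng.normal(loc=98.39, scale=0.7, size=65)

# Standard error of the difference in means, using sample variances.
se = np.sqrt(male.var(ddof=1) / male.size + female.var(ddof=1) / female.size)
z = (male.mean() - female.mean()) / se
p_z = 2 * stats.norm.sf(abs(z))

# Welch's t test: same statistic, t reference distribution,
# no assumption of equal variances.
t_stat, p_t = stats.ttest_ind(male, female, equal_var=False)

print(f"z = {z:.4f}, p (normal) = {p_z:.4f}")
print(f"t = {t_stat:.4f}, p (Welch) = {p_t:.4f}")
```

With group sizes this large the two p values are nearly identical, so the choice between them rarely changes the conclusion.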
In summary, human body temperature approximately follows a normal distribution. The mean temperature measured in 1992 (98.25) is significantly different from the value reported in 1868 (98.6). In addition, there is a significant difference in mean body temperature between males (98.10) and females (98.39).