In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
df = pd.read_csv('data/human_body_temperature.csv')
df.head()
Out[1]:
In [2]:
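# Sort by temperature so the fitted normal curve is drawn from left to right
x = df.sort_values("temperature", axis=0)
t = x["temperature"]
# Normal density at each observation, using the sample mean and standard deviation
plot_fit = stats.norm.pdf(t, np.mean(t), np.std(t))
plt.plot(t, plot_fit, '-o')
# density=True replaces the normed=True argument, which newer matplotlib versions no longer accept
plt.hist(df.temperature, bins=20, density=True)
plt.ylabel('Density')
plt.xlabel('Temperature')
plt.show()
# D'Agostino-Pearson test; null hypothesis: the sample comes from a normal distribution
stats.normaltest(t)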
Out[2]:
To check whether the temperatures are normally distributed, it helps to visualize them first. We plot a normalized histogram of the values and overlay the normal density that has the same mean and standard deviation as the sample. There are a few outliers on the right side, but the histogram still follows the normal curve reasonably well.
We then run SciPy's normaltest function, whose null hypothesis is that the sample comes from a normal distribution, and obtain a p-value of about 0.25. At a significance level of 0.05 we fail to reject the null hypothesis, because the p-value is greater than 0.05. This does not prove normality, but it is consistent with what the plot shows.
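The decision rule described above can be sketched as a small helper. This is only an illustration: the data is a synthetic stand-in for the temperature column (the mean and spread are assumed values, not taken from the dataset), and `normality_decision` is a hypothetical name.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the temperature column (assumed values, not the real data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=98.25, scale=0.73, size=130)

def normality_decision(values, alpha=0.05):
    """Run scipy's normaltest and report whether normality is rejected at level alpha."""
    statistic, p_value = stats.normaltest(values)
    return statistic, p_value, p_value < alpha

statistic, p_value, rejected = normality_decision(sample)
print(f"statistic = {statistic:.3f}, p-value = {p_value:.3f}")
print("reject normality" if rejected else "fail to reject normality")
```

A p-value above alpha only means the test found no evidence against normality, which is why the wording is "fail to reject" rather than "accept".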
In [3]:
#Question 2:
no_of_samples=df["temperature"].count()
print(no_of_samples)
The sample size is n = 130. As a general rule of thumb, the CLT approximation is considered adequate when n > 30, so this sample is comfortably large.
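To see why a sample of this size is usually enough, here is a small simulation sketch. It assumes a deliberately skewed population (exponential with mean 1 and standard deviation 1), which is not the temperature data, and checks that means of samples of size 130 have the spread the CLT predicts. The number of repeats is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 130            # sample size from the notebook
n_repeats = 5000   # number of simulated samples (arbitrary choice)

# Exponential population with mean 1 and standard deviation 1: clearly not normal.
samples = rng.exponential(scale=1.0, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

# The CLT predicts the sample means cluster around 1 with spread 1 / sqrt(n).
predicted_se = 1.0 / np.sqrt(n)
observed_se = sample_means.std(ddof=1)
print(f"predicted standard error: {predicted_se:.4f}")
print(f"observed standard error:  {observed_se:.4f}")
```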
In [4]:
from statsmodels.stats.weightstats import ztest
from scipy.stats import ttest_ind
from scipy.stats import ttest_1samp
# One-sample tests of the null hypothesis that the population mean is 98.6
t_score = ttest_1samp(t, 98.6)
t_score_abs = abs(t_score[0])
t_score_p = t_score[1]  # p-values are already non-negative, so no abs() is needed
z_score = ztest(t, value=98.6)
z_score_abs = abs(z_score[0])
p_value = z_score[1]
print("The z score is given by: %F and the p-value is given by %6.9F" % (z_score_abs, p_value))
print("The t score is given by: %F and the p-value is given by %6.9F" % (t_score_abs, t_score_p))
Choosing one sample test vs two sample test:
The problem has a single sample whose mean we want to test against a hypothesized population mean, so a one-sample test is appropriate rather than a two-sample test.
T-test vs Z-test:
The t-test is the appropriate choice when the population standard deviation is unknown and has to be estimated from the sample, and the distinction matters most when n < 30. Here the population standard deviation is also unknown, but with n = 130 the t distribution (129 degrees of freedom) is nearly identical to the standard normal, so a z-test is a reasonable approximation. Both tests compare the sample mean against a fixed reference value, here 98.6; the t-test is not restricted to comparing the means of two samples.
The z-test p-value is 0.000000049, which is below the usual significance level of 0.05, so we reject the null hypothesis that the population mean is 98.6.
Trying the t-test: because both tests compare the sample mean to a reference value using the same standard error, the z score and the t score are identical. The p-values differ slightly because the t-test uses the t distribution with n - 1 degrees of freedom as its reference instead of the standard normal.
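The relationship between the two tests can be checked by computing the statistic by hand. The sketch below uses synthetic stand-in data (the mean and spread are assumptions, not the real measurements) and evaluates the same statistic against both reference distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=98.25, scale=0.73, size=130)  # assumed stand-in data
mu0 = 98.6                                            # hypothesized population mean

n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
statistic = (sample.mean() - mu0) / standard_error

# Same statistic, two reference distributions (two-sided p-values).
p_z = 2 * stats.norm.sf(abs(statistic))
p_t = 2 * stats.t.sf(abs(statistic), df=n - 1)
print(f"statistic = {statistic:.4f}")
print(f"z-test p-value = {p_z:.3e}")
print(f"t-test p-value = {p_t:.3e}")
```

The t-based p-value is the larger of the two because the t distribution has heavier tails, but at this sample size the gap is small.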
In [5]:
#Question 4:
#For a 95% confidence interval of the mean:
# Standard error of the mean (not a variance); ddof=1 gives the sample standard deviation
std_error_ = np.std(t, ddof=1) / np.sqrt(no_of_samples)
mean_ = np.mean(t)
confidence_interval = stats.norm.interval(0.95, loc=mean_, scale=std_error_)
print("The Confidence Interval Lies between %F and %F" % (confidence_interval[0], confidence_interval[1]))
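This interval describes where the population mean plausibly lies, not where an individual reading should fall. To flag a single temperature as abnormal, a wider range based on the standard deviation, roughly the mean plus or minus 1.96 standard deviations, is more appropriate.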
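A confidence interval for the mean is much narrower than the spread of individual readings. The following sketch contrasts the two on synthetic stand-in data (assumed mean and spread, not the real measurements); the range for individual readings additionally assumes the temperatures are roughly normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=98.25, scale=0.73, size=130)  # assumed stand-in data

n = sample.size
mean = sample.mean()
sd = sample.std(ddof=1)

# 95% confidence interval for the population mean (scale is the standard error).
ci_mean = stats.t.interval(0.95, n - 1, loc=mean, scale=sd / np.sqrt(n))

# Approximate 95% range for a single reading (scale is the standard deviation).
individual_range = stats.norm.interval(0.95, loc=mean, scale=sd)

print(f"CI for the mean:       {ci_mean[0]:.2f} to {ci_mean[1]:.2f}")
print(f"range for one reading: {individual_range[0]:.2f} to {individual_range[1]:.2f}")
```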
Question 5: Here we use a two-sample t-test because we want to compare the means of two independent groups, male and female.
In [6]:
temp_male=df.temperature[df.gender=='M']
female_temp=df.temperature[df.gender=='F']
ttest_ind(temp_male,female_temp)
Out[6]:
Taking the null hypothesis to be that the two groups have the same mean temperature, the observed p-value is less than the 0.05 significance level, so we reject the null hypothesis and conclude that mean body temperature differs between men and women.
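As a sketch of how this decision can be made explicit, the example below runs the pooled two-sample t-test alongside Welch's variant, which does not assume equal variances. The two groups are synthetic; their sizes, means, and spreads are invented for illustration and are not the values in the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Assumed stand-in groups; sizes, means, and spreads are invented for illustration.
group_a = rng.normal(loc=98.1, scale=0.7, size=65)
group_b = rng.normal(loc=98.4, scale=0.7, size=65)

pooled = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

alpha = 0.05
for name, result in [("pooled", pooled), ("Welch", welch)]:
    decision = "reject" if result.pvalue < alpha else "fail to reject"
    print(f"{name}: t = {result.statistic:.3f}, p = {result.pvalue:.4f} -> {decision} H0")
```

With equally sized groups the two statistics coincide and only the degrees of freedom differ, so the choice matters mainly when the group sizes or variances are unequal.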