In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [1]:
import pandas as pd
df = pd.read_csv('data/human_body_temperature.csv')
In [2]:
# Your work here.
In [37]:
# Load Matplotlib + Seaborn and SciPy libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import norm
from statsmodels.stats.weightstats import ztest
%matplotlib inline
In [4]:
df.head(5)
Out[4]:
Yes. The histogram of the sample data is roughly bell-shaped, which suggests that body temperature is approximately normally distributed.
In [5]:
ax = sns.distplot(df[['temperature']], rug=True, axlabel='Temperature (°F)')
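The visual check can be complemented with a formal normality test. The sketch below is only an illustration: it uses synthetic values (with arbitrary mean and spread) as a stand-in for `df['temperature']`, and it picks SciPy's D'Agostino-Pearson test (`stats.normaltest`), which is one of several reasonable choices and is not necessarily what the exercise intends.

```python
import numpy as np
from scipy import stats

# ASSUMPTION: synthetic stand-in for df['temperature']; the mean and
# standard deviation used here are arbitrary illustrative values.
rng = np.random.default_rng(0)
temps = rng.normal(loc=98.2, scale=0.7, size=130)

# D'Agostino-Pearson test. Null hypothesis: the data come from a normal distribution.
stat, p_normal = stats.normaltest(temps)
print("statistic = %.3f, p-value = %.3f" % (stat, p_normal))

ALPHA = 0.05
if p_normal < ALPHA:
    print("Reject normality at the 5% level.")
else:
    print("No evidence against normality at the 5% level.")
```

With the real data, `temps` would be replaced by `df['temperature']`. A large p-value does not prove normality; it only means the test found no evidence against it.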
In [6]:
print("Yes. We have *" + str(df['temperature'].size) + "* records in the sample data file.")
print("There is no connection or dependence between the measured temperature values, in other words, the observations are independent.")
In [7]:
# Summary statistics for the sample (count, mean, std, min, quartiles, max)
df['temperature'].describe()
Out[7]:
In [8]:
# Population mean temperature
POP_MEAN = 98.6
# Sample size, mean and standard deviation
sample_size = df['temperature'].count()
sample_mean = df['temperature'].mean()
sample_std = df['temperature'].std(axis=0)
In [9]:
print("Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print("Sample size: sample_size = " + str(sample_size))
print("Sample mean: sample_mean = "+ str(sample_mean))
print("Sample standard deviation: sample_std = "+ str(sample_std))
In [10]:
print("* Ho or Null hypothesis: Average body temperature *is* " + str(POP_MEAN)+" degrees F.")
print("* Ha or Alternative hypothesis: Average body temperature *is not* " + str(POP_MEAN)+" degrees F.")
In [11]:
t = ((sample_mean - POP_MEAN)/sample_std)*np.sqrt(sample_size)
print("t = " + str(t))
In [12]:
degree = sample_size - 1
print("degrees of freedom =" + str(degree))
In [57]:
p = 1 - stats.t.cdf(abs(t),df=degree)
print("p-value = %.10f" % p)
In [14]:
p2 = 2*p
print("p-value = %.10f (2 * p-value)" % p2)
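As a cross-check, the manual calculation can be compared with SciPy's built-in one-sample t-test. This is a minimal sketch on synthetic data (an assumption, standing in for `df['temperature']`). Note that pandas `.std()` uses `ddof=1` by default, while NumPy's `.std()` uses `ddof=0`, so `ddof=1` is passed explicitly here.

```python
import numpy as np
from scipy import stats

# ASSUMPTION: synthetic stand-in for df['temperature'] (arbitrary parameters).
rng = np.random.default_rng(1)
temps = rng.normal(loc=98.2, scale=0.7, size=130)
POP_MEAN = 98.6

# Manual t statistic and two-sided p-value, following the same steps as the cells above.
n = temps.size
t_manual = (temps.mean() - POP_MEAN) / (temps.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test in a single call (two-sided by default).
t_scipy, p_scipy = stats.ttest_1samp(temps, POP_MEAN)

print("manual: t = %.4f, p = %.3g" % (t_manual, p_manual))
print("scipy:  t = %.4f, p = %.3g" % (t_scipy, p_scipy))
```

Both lines should report the same statistic and p-value, which confirms that doubling the one-tailed area gives the two-sided result.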
In [47]:
ALFA = 0.05
print(". alpha = " + str(ALFA))
print(". p-value = %.10f" % p2)
In [16]:
print("----")
print(". Sample mean: sample_mean = "+ str(sample_mean))
print(". Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print(". Sample standard deviation (used as an estimate of the population value): sample_std = "+ str(sample_std))
print(". Sample size: sample_size = " + str(sample_size))
print("----")
In [35]:
z = ((sample_mean - POP_MEAN)/sample_std)*np.sqrt(sample_size)
print("Z value or z_score: z = " + str(z))
In [76]:
# Two-sided p-value: twice the upper-tail area beyond |z|
p_value_z = norm.sf(abs(z))*2
In [77]:
print("P-Value = %.15f" % p_value_z)
In [78]:
ALFA = 0.05
print(". alpha = " + str(ALFA))
print(". p-value = %.15f" % p_value_z)
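The two-sided p-value of a z statistic is twice the upper-tail area beyond |z|. The sketch below wraps that in a small helper, `two_sided_p_from_z`, a hypothetical name introduced only for this illustration, and checks it against the familiar critical value 1.96.

```python
from scipy.stats import norm

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic z."""
    # norm.sf is the survival function, 1 - cdf, i.e. the upper-tail area.
    return 2 * norm.sf(abs(z))

# |z| = 1.96 is the usual two-sided critical value at the 5% level.
print("p(z = 1.96)  = %.4f" % two_sided_p_from_z(1.96))
print("p(z = -1.96) = %.4f" % two_sided_p_from_z(-1.96))
print("p(z = 0)     = %.4f" % two_sided_p_from_z(0.0))
```

A small p-value (below the chosen significance level) is evidence against the null hypothesis. Subtracting this quantity from 1 would give the central area instead, which is not a p-value.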
In [21]:
# Draw a random sample of 10 records from the original dataset
df_sample10 = df.sample(n=10)
df_sample10['temperature'].count()
Out[21]:
In [22]:
ax = sns.distplot(df_sample10[['temperature']], rug=True, axlabel='Temperature (°F)')
In [23]:
sample10_size = df_sample10['temperature'].count()
sample10_mean = df_sample10['temperature'].mean()
sample10_std = df_sample10['temperature'].std(axis=0)
In [24]:
print("Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print("Sample-10 size: sample_size = " + str(sample10_size))
print("Sample-10 mean: sample_mean = "+ str(sample10_mean))
print("Sample-10 standard deviation: sample_std = "+ str(sample10_std))
In [25]:
t = ((sample10_mean - POP_MEAN)/sample10_std)*np.sqrt(sample10_size)
print("t = " + str(t))
In [26]:
degree = sample10_size - 1
print("degrees of freedom =" + str(degree))
In [67]:
p_value = 1 - stats.t.cdf(abs(t),df=degree)
# p-value considering two-tails
p_value = 2*p_value
print("p-value =" + str(p_value))
In [69]:
ALFA = 0.05
print(". alpha = " + str(ALFA))
print(". p-value = %.15f" % p_value)
In [66]:
z = ((sample10_mean - POP_MEAN)/sample10_std)*np.sqrt(sample10_size)
print("Z value or z_score: z = " + str(z))
In [74]:
# Two-sided p-value: twice the upper-tail area beyond |z|
p_value_z = norm.sf(abs(z))*2
print("P-Value = %.15f" % p_value_z)
In [75]:
ALFA = 0.05
print(". alpha = " + str(ALFA))
print(". p-value = %.15f" % p_value_z)
In [79]:
# Summary statistics for the sample (count, mean, std, min, quartiles, max)
df['temperature'].describe()
Out[79]:
In [110]:
# Sample mean and standard deviation of the temperature
temp_mean = df['temperature'].mean()
std = df['temperature'].std(axis=0)
print("One standard deviation (std) is %.3f degrees F." %std)
print("Three standard deviations (3 * std) is %.3f degrees F." % (3*std))
In [118]:
# Under a normal model, about 99.7% of values fall within 3 standard deviations of the mean
lim_low = temp_mean - (3*std)
lim_high = temp_mean + (3*std)
print("A body temperature outside the range covering about 99.7% of the population is: greater than "+ str(lim_high) + " or less than " + str(lim_low) + " degrees F.")
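The three-sigma range above describes individual temperatures. A related but different quantity is the 95% confidence interval for the mean temperature, built from the standard error and a t critical value. The following is a sketch on synthetic data (an assumption, standing in for `df['temperature']`), not a definitive answer to the exercise.

```python
import numpy as np
from scipy import stats

# ASSUMPTION: synthetic stand-in for df['temperature'] (arbitrary parameters).
rng = np.random.default_rng(2)
temps = rng.normal(loc=98.2, scale=0.7, size=130)

n = temps.size
mean = temps.mean()
sem = temps.std(ddof=1) / np.sqrt(n)    # standard error of the mean

# Critical value leaving 2.5% in each tail of the t distribution.
t_crit = stats.t.ppf(0.975, df=n - 1)
margin = t_crit * sem                   # margin of error
ci_low, ci_high = mean - margin, mean + margin

print("margin of error = %.3f degrees F" % margin)
print("95%% CI for the mean: [%.3f, %.3f] degrees F" % (ci_low, ci_high))
```

This interval is much narrower than the three-sigma range because it quantifies uncertainty about the average, not the spread of individual readings.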
In [28]:
# Female temperature (mean and standard deviation)
df_female = df.loc[df['gender'] == 'F']
ax = sns.distplot(df_female[['temperature']])
print("Female temperature: mean = %f | std = %f" % (df_female['temperature'].mean(), df_female['temperature'].std()))
In [29]:
# Male temperature (mean and standard deviation)
df_male = df.loc[df['gender'] == 'M']
ax = sns.distplot(df_male[['temperature']])
print("Male temperature: mean = %f | std = %f" % (df_male['temperature'].mean(), df_male['temperature'].std()))
In [30]:
# Plotting histogram based on gender (Female/Male)
grid = sns.FacetGrid(df, col="gender")
grid.map(plt.hist, "temperature", color="y")
Out[30]:
In [31]:
# Plotting Female/Male temperatures using Seaborn Pairplot
# Note: seaborn 0.9 renamed the `size` argument to `height`
sns.pairplot(df, hue='gender', height=2.5)
Out[31]:
What test did you use and why?
Two-tailed, two-sample t-test: we use this test when we want to check whether the difference between the means of two independent groups (female and male) is statistically significant.
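A minimal sketch of such a test with SciPy follows. The two arrays are synthetic stand-ins (an assumption) for `df_female['temperature']` and `df_male['temperature']`. Welch's variant (`equal_var=False`) is chosen here because it does not assume equal variances, which is not necessarily the variant the original analysis used.

```python
import numpy as np
from scipy import stats

# ASSUMPTION: synthetic stand-ins for the female and male temperature columns
# (arbitrary illustrative parameters and group sizes).
rng = np.random.default_rng(3)
female = rng.normal(loc=98.4, scale=0.7, size=65)
male = rng.normal(loc=98.1, scale=0.7, size=65)

# Two-sided Welch t-test for the difference between two independent means.
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False)
print("t = %.4f, p-value = %.4f" % (t_stat, p_value))

ALPHA = 0.05
if p_value < ALPHA:
    print("The difference between the group means is significant at the 5% level.")
else:
    print("No significant difference between the group means at the 5% level.")
```

With the real data, `female` and `male` would be replaced by the corresponding temperature columns.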
Write a story with your conclusion in the context of the original problem.
In [ ]:
[1] "What Statistical Analysis Should I Use? Statistical Analyses Using STATA". Last access: 12/25/2017. Link: https://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-stata/
[2] "T-Score vs. Z-Score: What’s the Difference?". Last access: 12/26/2017. Link: http://www.statisticshowto.com/when-to-use-a-t-score-vs-z-score/
[3] "Central limit theorem", Khan Academy. Last access: 12/26/2017. Link: https://www.khanacademy.org/math/ap-statistics/sampling-distribution-ap/sampling-distribution-mean/v/sampling-distribution-of-the-sample-mean
In [ ]: