In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
In [2]:
import pandas as pd
df = pd.read_csv('data/human_body_temperature.csv')
In [13]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [15]:
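# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws the same kind of figure
sns.histplot(df['temperature'], kde=True, stat='density')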
Out[15]:
From the histogram, the distribution looks approximately normal.
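A histogram alone is a weak check of normality. As a rough complement, the sketch below compares the share of observations within one and two standard deviations of the mean with the roughly 68% and 95% a normal distribution would give. It uses only the standard library and synthetic readings: the mean of 98.25 and standard deviation of 0.73 are illustrative assumptions, not values read from the CSV, and `coverage_within` is a hypothetical helper name.

```python
import random
import statistics

def coverage_within(values, k):
    """Fraction of values lying within k sample standard deviations of the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * std)
    return inside / len(values)

# Synthetic stand-in for df['temperature']; the parameters are assumptions.
rng = random.Random(0)
temps = [rng.gauss(98.25, 0.73) for _ in range(1000)]

within_1sd = coverage_within(temps, 1)
within_2sd = coverage_within(temps, 2)
print(f"within 1 sd: {within_1sd:.3f} (normal: 0.683)")
print(f"within 2 sd: {within_2sd:.3f} (normal: 0.954)")
```

With the real data, `temps` would be replaced by `df['temperature'].tolist()`; shares far from 0.683 and 0.954 would be a reason to doubt the normal approximation.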
In [4]:
df.shape
Out[4]:
In [5]:
df.head()
Out[5]:
In [6]:
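# Standard error of the mean; the name sample_stddev is kept because later cells use it
sample_stddev = df.temperature.std() / np.sqrt(len(df))
sample_mean = df.temperature.mean()
In [7]:
# Distance between the sample mean and the assumed population mean, in standard errors
z_statistic = (sample_mean - 98.6) / sample_stddev
(len(df), sample_mean, sample_stddev, z_statistic)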
Out[7]:
Because I'm comparing a sample mean to a fixed value (the assumed population mean of 98.6F), I'm using a one-sample test. With n = 130, well above 30, the z-statistic is appropriate. The sample mean lies 5.45 standard errors away from the assumed population mean, which gives very high confidence (>99.9%) that 98.6F is not the true population mean. To also run a two-sample test, I'm going to generate a second sample from the assumed population mean and the sample standard deviation.
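The one-sample z-test described above can be sketched with the standard library alone, which also turns the statistic into a two-sided p-value. The inputs are rounded, assumed values (mean 98.25F, standard deviation 0.73F, n = 130) rather than the exact numbers from the CSV, and `one_sample_z_test` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, sample_std, n, mu0):
    """Return (z, two-sided p-value) for H0: population mean == mu0."""
    std_error = sample_std / math.sqrt(n)       # standard error of the mean
    z = (sample_mean - mu0) / std_error
    p_value = 2 * NormalDist().cdf(-abs(z))     # area in both tails
    return z, p_value

# Assumed, rounded inputs for illustration only.
z, p = one_sample_z_test(98.25, 0.73, 130, 98.6)
print(f"z = {z:.2f}, p = {p:.2e}")
```

A p-value below 0.001 corresponds to the ">99.9%" confidence quoted in the text.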
In [8]:
other = pd.Series(np.random.normal(98.6, df.temperature.std(), 130))
other_mean = other.mean()
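# Standard error of the difference between the two sample means (each variance divided by its sample size once)
pooled_stddev = np.sqrt(df.temperature.var()/130 + other.var()/130)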
(other_mean, pooled_stddev, other_mean - sample_mean)
Out[8]:
I assume there is no difference between the two sample means (H0: sample_mean - other_mean = 0). I want to show, with 99% confidence, that H0 is false and that the two means really differ. For that, the absolute value of the z-statistic has to be >= 2.58 (the z-value for 0.995 in a two-sided test).
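The 2.58 threshold comes from the inverse of the standard normal CDF. A minimal sketch, using the standard library's `NormalDist` (the helper name `two_sided_critical_z` is hypothetical):

```python
from statistics import NormalDist

def two_sided_critical_z(confidence):
    """Critical |z| for a two-sided test at the given confidence level."""
    alpha = 1 - confidence
    # Half of alpha goes in each tail, so look up the 1 - alpha/2 quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.95, 0.99):
    print(f"{level:.0%}: |z| >= {two_sided_critical_z(level):.2f}")
```

For 99% confidence this looks up the 0.995 quantile, which is why the text refers to "the z-value for 0.995".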
In [9]:
z_statistic = (other_mean - sample_mean) / pooled_stddev
z_statistic
Out[9]:
In [10]:
(sample_mean - 1.96*sample_stddev, sample_mean + 1.96*sample_stddev, 1.96*sample_stddev)
Out[10]:
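The margin of error is 0.126F and the 95% confidence interval for the mean is (98.12F, 98.38F). This interval describes the uncertainty of the population mean, not the spread of individual readings, so it is too narrow to serve as a cutoff for one person's temperature. For that, the band should be built from the sample standard deviation (about 0.73F, that is 0.126/1.96 × √130) rather than the standard error, which puts abnormal temperatures roughly below 96.8F or above 99.7F.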
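The difference between an interval for the mean and a range for individual readings can be sketched as follows. Both helpers (`mean_confidence_interval`, `reference_range`) are hypothetical names, the normal approximation is assumed, and the inputs are rounded, assumed values (mean 98.25F, standard deviation 0.73F, n = 130).

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Confidence interval for the population mean (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_std / math.sqrt(n)   # uses the standard error
    return sample_mean - margin, sample_mean + margin

def reference_range(sample_mean, sample_std, confidence=0.95):
    """Range expected to contain that share of individual readings."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_std                  # uses the standard deviation
    return sample_mean - margin, sample_mean + margin

# Assumed, rounded inputs for illustration only.
ci_low, ci_high = mean_confidence_interval(98.25, 0.73, 130)
ref_low, ref_high = reference_range(98.25, 0.73)
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"95% range for individual readings: ({ref_low:.2f}, {ref_high:.2f})")
```

The only difference between the two functions is the division by √n, which is what makes the interval for the mean so much narrower than the range for a single reading.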
In [11]:
males = df.temperature[df.gender=='M']
females = df.temperature[df.gender=='F']
(males.size, females.size)
Out[11]:
In [12]:
males_mean = males.mean()
females_mean = females.mean()
males_std = males.std()
females_std = females.std()
(males_mean, females_mean)
Out[12]:
In [13]:
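# Standard error of the difference between two independent sample means
diff_stderr = np.sqrt(males_std**2 / males.size + females_std**2 / females.size)
(males_mean - females_mean) / diff_stderr
Out[13]:
I used a two-sided test with H0 that there is no difference between the means of the male and female samples. The z-statistic divides the difference of the two sample means by the standard error of that difference, sqrt(males_std^2/n_males + females_std^2/n_females). Dividing by the standard error of the overall mean instead gives -4.5, but that overstates the evidence: if the two groups are of similar size, the standard error of the difference is roughly twice as large, so the statistic is closer to -2.3. That exceeds 1.96 in absolute value but not 2.58, so H0 is rejected at the 95% level but not at the 99% level.
I conclude that there is evidence that males and females have different mean temperatures, which supports defining two standard body temperatures (male and female), although the confidence is nearer 95% than 99.9%.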
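A two-sample z-test for the difference of two means can be sketched with the standard library as below. The helper name `two_sample_z_test` is hypothetical, the two groups are synthetic (their means, spread, and sizes are assumptions, not the values in the CSV), and the unpooled standard error is one common choice rather than the only one.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for H0: mean(a) == mean(b)."""
    # Each group's variance is divided by its own sample size.
    std_error = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (fmean(a) - fmean(b)) / std_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Synthetic stand-ins for the male and female samples; parameters are assumptions.
rng = random.Random(1)
group_a = [rng.gauss(98.1, 0.7) for _ in range(65)]
group_b = [rng.gauss(98.4, 0.7) for _ in range(65)]

z, p = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With the real data, `group_a` and `group_b` would be replaced by `males.tolist()` and `females.tolist()`.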