In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
In :import pandas as pd
df = pd.read_csv('data/human_body_temperature.csv')
In :import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Out:<matplotlib.axes._subplots.AxesSubplot at 0x7f6b165315c0>
[Figure: plot of the temperature distribution (image not preserved)]
From the chart, the distribution appears approximately normal.
   temperature gender  heart_rate
0         99.3      F        68.0
1         98.4      F        81.0
2         97.8      M        73.0
3         99.2      F        66.0
4         98.0      F        73.0
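In :# Standard error of the mean: sample standard deviation / sqrt(n), with n = 130
sample_stddev = df.temperature.std() / np.sqrt(130)
sample_mean = df.temperature.mean()
In :# z-statistic against the assumed population mean of 98.6F
z_statistic = (sample_mean - 98.6) / sample_stddev
(len(df), sample_mean, sample_stddev, z_statistic)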
Out:(130, 98.24923076923078, 0.06430441683789101, -5.4548232923640789)
Because I'm comparing a sample mean to a fixed value (the assumed population mean of 98.6F), I'm using a one-sample test. With n = 130, well above 30, it is appropriate to use the z-statistic. The sample mean lies 5.45 standard errors below the assumed population mean, which gives very high confidence (>99.9%) that 98.6F is not the true population mean. To run a two-sample test, I'm going to generate a second sample from a normal distribution with the assumed population mean and the sample standard deviation.
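As a quick check on the ">99.9%" figure for the one-sample test, the two-sided p-value can be computed directly from the summary numbers printed above. This is a minimal sketch using only the standard library; the variable names `standard_error`, `hypothesized_mean`, and `p_value` are introduced here for illustration, and the test assumes the sample mean is approximately normally distributed.

```python
import math

# Summary figures copied from the notebook output above
n = 130
sample_mean = 98.24923076923078
standard_error = 0.06430441683789101  # sample std / sqrt(n)
hypothesized_mean = 98.6

# One-sample z-statistic
z = (sample_mean - hypothesized_mean) / standard_error

# Two-sided p-value under the standard normal:
# P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.4f}, two-sided p = {p_value:.2e}")
```

A p-value below 0.001 corresponds to the ">99.9%" confidence quoted in the text.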
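In :# Simulated second sample: 130 draws from N(98.6, sample std); no seed is set, so values vary per run
other = pd.Series(np.random.normal(98.6, df.temperature.std(), 130))
other_mean = other.mean()
# sample_stddev already holds the standard error (std / sqrt(130)), so dividing its
# square by 130 again makes the first term negligible
pooled_stddev = np.sqrt(sample_stddev * sample_stddev/130 + other.var()/130)
(other_mean, pooled_stddev, other_mean - sample_mean)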
Out:(98.53220304272719, 0.062307266060956767, 0.2829722734964122)
My null hypothesis is that the two sample means do not differ (H0: sample_mean - other_mean = 0). I want to reject H0 at the 99% confidence level, showing that the two means really differ. For that, the absolute value of the z-statistic has to be >= 2.58 (the z-value for 0.995 in a two-sided test).
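The 2.58 threshold can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8+); the names `confidence`, `alpha`, and `z_critical` are introduced here for illustration, and the two figures fed into the z-statistic are copied from the output above.

```python
from statistics import NormalDist

confidence = 0.99
alpha = 1 - confidence

# Two-sided test: put alpha/2 in each tail, so look up the 0.995 quantile
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_critical, 2))  # prints 2.58

# Difference of means and its denominator, copied from the output above
z_statistic = 0.2829722734964122 / 0.062307266060956767

# H0 is rejected at this level when |z| reaches the critical value
reject_h0 = abs(z_statistic) >= z_critical
print(reject_h0)
```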
In :z_statistic = (other_mean - sample_mean) / pooled_stddev
z_statistic
In :(sample_mean - 1.96*sample_stddev, sample_mean + 1.96*sample_stddev, 1.96*sample_stddev)
Out:(98.123194112228518, 98.375267426233037, 0.12603665700226638)
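The margin of error is 0.126F and the 95% confidence interval for the mean is (98.12F, 98.38F). Because this interval is built from the standard error, it describes where the population mean is likely to lie; judging an individual temperature as abnormal would call for a wider interval based on the sample standard deviation.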
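The interval above is computed from the standard error, so it is an interval for the mean. As a rough sketch, the same summary figures can be used to contrast it with an approximate 95% range for individual readings, which uses the sample standard deviation instead. The individual range assumes the temperatures are roughly normal; the names `mean_ci` and `individual_range` are introduced here for illustration.

```python
import math

# Summary figures copied from the notebook output above
n = 130
sample_mean = 98.24923076923078
standard_error = 0.06430441683789101  # sample std / sqrt(n)

# Recover the sample standard deviation from the standard error
sample_std = standard_error * math.sqrt(n)

z95 = 1.96  # two-sided 95% critical value of the standard normal

# 95% confidence interval for the population mean
mean_ci = (sample_mean - z95 * standard_error,
           sample_mean + z95 * standard_error)

# Approximate 95% range for individual readings (normality assumed)
individual_range = (sample_mean - z95 * sample_std,
                    sample_mean + z95 * sample_std)

print("CI for the mean:      (%.2f, %.2f)" % mean_ci)
print("Range for individuals: (%.2f, %.2f)" % individual_range)
```

The second range is wider by a factor of sqrt(n), since it reflects the spread of single measurements and not the uncertainty in the mean.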
In :males = df.temperature[df.gender=='M']
females = df.temperature[df.gender=='F']
(males.size, females.size)
In :males_mean = males.mean()
females_mean = females.mean()
males_std = males.std()
females_std = females.std()
(males_mean, females_mean)
In :(males_mean - females_mean) / sample_stddev
I used a two-sided test with H0 that the male and female sample means do not differ. The z-statistic for the difference of the two sample means is -4.5, which would give very high confidence (>99.9%) that males have a different mean temperature than females. Note that this statistic divides by the standard error of the whole-sample mean; the standard error of the difference between two group means is larger, so the confidence is overstated.
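For reference, a z-statistic for the difference of two independent sample means normally divides by the standard error of that difference, sqrt(s1^2/n1 + s2^2/n2). The sketch below uses only the standard library; `two_sample_z` is a hypothetical helper, and the two lists are made-up toy values, not the notebook's data.

```python
import math
import statistics


def two_sample_z(a, b):
    """z-statistic for the difference of two independent sample means."""
    # Standard error of the difference: sqrt(s_a^2 / n_a + s_b^2 / n_b),
    # using the sample variance (n - 1 denominator) of each group
    se_diff = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se_diff


# Toy values for illustration only (NOT the body temperature data)
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]

z_toy = two_sample_z(group_a, group_b)
print(z_toy)
```

In the notebook's terms, the call would be `two_sample_z(list(males), list(females))`. With two groups of equal size and similar spread, this denominator is about twice the standard error of the combined sample mean.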
I conclude that we should define two standard body temperatures (male and female) as there is a very high confidence that they are truly different.