In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
In [10]:
import pandas as pd
import re
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so that sp.stats is available
import seaborn as sns
sns.set_style("whitegrid")
import calendar
df = pd.read_csv('data/human_body_temperature.csv')
In [49]:
# 1. The distribution is roughly normal according to the visualization below,
# with the following qualifications:
# Let's see how much it departs from a normal distribution.
print('\nWith a p-value of 0.25, we fail to reject the null hypothesis that the sample comes from a normal distribution:')
print('\n{}'.format(sp.stats.normaltest(df['temperature'])))
# sp.stats.sem returns the standard error of the mean
print('\nStandard error of the mean: {}'.format(sp.stats.sem(df['temperature'])))
# ddof=1 gives the sample standard deviation (np.std defaults to ddof=0)
print('\nSample standard deviation: {}'.format(np.std(df['temperature'], ddof=1)))
print('\n{}'.format(sp.stats.describe(df['temperature'])))
sns.distplot(df['temperature']);
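The normality check above can be sketched on its own as follows. This is a minimal illustration on synthetic data, not the notebook's dataset: the array `sample`, its parameters, and the threshold `alpha` are all assumptions made for the example. It runs `normaltest` alongside the Shapiro-Wilk test, which tests the same null hypothesis with a different statistic.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['temperature']; the parameters are arbitrary
# and are NOT taken from the real dataset.
rng = np.random.RandomState(0)
sample = rng.normal(loc=98.0, scale=0.7, size=100)

# D'Agostino-Pearson test: the null hypothesis is that the sample is normal.
k2_stat, k2_p = stats.normaltest(sample)

# Shapiro-Wilk test: same null hypothesis, different test statistic.
w_stat, w_p = stats.shapiro(sample)

alpha = 0.05  # assumed significance level
for name, p in [("normaltest", k2_p), ("shapiro", w_p)]:
    verdict = "fail to reject" if p > alpha else "reject"
    print("{}: p = {:.3f} -> {} normality at alpha = {}".format(name, p, verdict, alpha))
```

Failing to reject is not proof of normality; it only means the sample shows no strong evidence against it.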
In [31]:
# 2. Is the sample size large? Are the observations independent?
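print('The sample size is: {}'.format(len(df)))
print('\nThe sample size is greater than 30, so by the central limit theorem (CLT) the sampling distribution of the mean should be approximately normal.')
# Independence cannot be checked from the data alone; it is assumed here that each row is a different person.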
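What the CLT says for samples of about 30 can be checked with a small simulation. The sketch below assumes a deliberately non-normal population (exponential with mean 1, chosen only for illustration) and shows that the means of repeated samples still cluster around the population mean with spread close to sigma / sqrt(n).

```python
import numpy as np

rng = np.random.RandomState(1)

n = 30            # size of each sample (illustrative)
n_repeats = 5000  # number of repeated samples (illustrative)

# Exponential population with mean 1 and standard deviation 1: clearly skewed.
# Draw n_repeats samples of size n and take the mean of each one.
sample_means = rng.exponential(scale=1.0, size=(n_repeats, n)).mean(axis=1)

# CLT prediction: the means are roughly normal with mean 1 and SD 1/sqrt(n).
predicted_sd = 1.0 / np.sqrt(n)
print("mean of sample means : {:.3f}".format(sample_means.mean()))
print("SD of sample means   : {:.3f}".format(sample_means.std(ddof=1)))
print("CLT prediction for SD: {:.3f}".format(predicted_sd))
```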
In [36]:
# 3. Is the true population mean really 98.6 degrees F?
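print('The sample mean, our point estimate of the population mean, is: {}'.format(df['temperature'].mean()))
# A one-sample t-test against 98.6 checks whether the difference is statistically significant.
print(sp.stats.ttest_1samp(df['temperature'], 98.6))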
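Whether a sample mean differs from 98.6 by more than chance is the kind of question a one-sample t-test addresses. The sketch below uses synthetic data (the array `sample` and its parameters are made up, not the notebook's data) and recomputes the statistic by hand to show what `ttest_1samp` calculates.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['temperature']; parameters are arbitrary.
rng = np.random.RandomState(2)
sample = rng.normal(loc=98.3, scale=0.7, size=100)

hypothesized_mean = 98.6

# Null hypothesis: the population mean equals hypothesized_mean.
t_stat, p_value = stats.ttest_1samp(sample, hypothesized_mean)

# The same statistic by hand: t = (xbar - mu0) / (s / sqrt(n)).
n = len(sample)
t_manual = (sample.mean() - hypothesized_mean) / (sample.std(ddof=1) / np.sqrt(n))

print("t = {:.3f} (manual: {:.3f}), two-sided p = {:.4g}".format(t_stat, t_manual, p_value))
```

A small p-value would indicate that a difference this large is unlikely if the true mean were 98.6.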
In [55]:
# 4. At what temperature should we consider someone's temperature to be "abnormal"?
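# Bayesian credible intervals (90% by default) for the mean, variance, and standard deviation:
print(sp.stats.bayes_mvs(df['temperature']))
# Interval expected to contain about 95% of individual temperatures. It is scaled by the
# sample standard deviation (ddof=1), not the standard error, so it describes individuals
# rather than the mean.
print('Abnormal body temperatures fall outside the following interval:\n{}'.format(
    sp.stats.t.interval(0.95, len(df) - 1, loc=np.mean(df['temperature']), scale=np.std(df['temperature'], ddof=1))))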
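The choice of `scale` in `t.interval` decides what the interval means. As a rough sketch on synthetic data (the array `sample` is made up for illustration), scaling by the standard error gives a confidence interval for the mean, while scaling by the standard deviation (here with the usual sqrt(1 + 1/n) factor, assuming normality) gives a prediction interval for a single new observation, which is the one that speaks to "abnormal" individual temperatures.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['temperature']; parameters are arbitrary.
rng = np.random.RandomState(3)
sample = rng.normal(loc=98.2, scale=0.7, size=100)

n = len(sample)
mean = sample.mean()
sd = sample.std(ddof=1)   # sample standard deviation
sem = stats.sem(sample)   # standard error of the mean, sd / sqrt(n)

# 95% confidence interval for the population mean.
ci_mean = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

# 95% prediction interval for one new observation (assumes normality).
pred_interval = stats.t.interval(0.95, n - 1, loc=mean, scale=sd * np.sqrt(1 + 1.0 / n))

print("95% CI for the mean           : ({:.2f}, {:.2f})".format(*ci_mean))
print("95% interval for an individual: ({:.2f}, {:.2f})".format(*pred_interval))
```

The interval for the mean narrows as the sample grows, while the interval for individuals stays about as wide as the population's natural spread.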
In [79]:
# 5. Is there a significant difference between males and females in normal temperature?
Male = df[df['gender'] == 'M']['temperature']
Female = df[df['gender'] == 'F']['temperature']
print('Mean of Male samples  : {}'.format(Male.mean()))
print('Mean of Female samples: {}'.format(Female.mean()))
# Normality test for each group
print('\nMale  : {}'.format(sp.stats.normaltest(Male)))
print('Female: {}'.format(sp.stats.normaltest(Female)))
# sp.stats.sem returns the standard error of the mean
print('\nStandard error of the mean (Male)  : {}'.format(sp.stats.sem(Male)))
print('Standard error of the mean (Female): {}'.format(sp.stats.sem(Female)))
print('\nMale  : {}'.format(sp.stats.describe(Male)))
print('\nFemale: {}'.format(sp.stats.describe(Female)))
# Label both groups and draw a legend so the two curves can be told apart
sns.distplot(Male, label='Male', color='blue')
sns.distplot(Female, label='Female', color='green')
plt.legend();
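A difference in group means can also be tested directly with a two-sample t-test. The sketch below is one possible approach, not necessarily the one this notebook intends: it applies Welch's t-test (`equal_var=False`) to two synthetic groups. The names `group_a` and `group_b` and all parameters are hypothetical and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Two synthetic groups standing in for the Male and Female series;
# means, spread, and sizes are arbitrary illustrative values.
rng = np.random.RandomState(4)
group_a = rng.normal(loc=98.1, scale=0.7, size=65)
group_b = rng.normal(loc=98.4, scale=0.7, size=65)

# Welch's t-test: null hypothesis is equal means, without assuming equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# The same statistic by hand: difference in means over its standard error.
se_a_sq = group_a.var(ddof=1) / len(group_a)
se_b_sq = group_b.var(ddof=1) / len(group_b)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se_a_sq + se_b_sq)

print("t = {:.3f} (manual: {:.3f}), two-sided p = {:.4g}".format(t_stat, t_manual, p_value))
```

Unlike a test on whole distributions, this one targets the means specifically, so the two kinds of test can disagree on the same data.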
In [61]:
from scipy.stats import ks_2samp
# Kolmogorov-Smirnov Statistic
ks_2samp(Male, Female)
# We CANNOT reject the null hypothesis that the two samples are drawn from the same distribution,
# because the p-value is high. (A high K-S statistic would count against the null hypothesis, not for it.)
# Thus the K-S test finds no significant difference between the male and female temperature distributions.
Out[61]: