In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown:
In :import pandas as pd
df = pd.read_csv('data/human_body_temperature.csv')
In :# Your work here.
In :# Load Matplotlib + Seaborn and SciPy libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
%matplotlib inline
   temperature gender  heart_rate
0         99.3      F        68.0
1         98.4      F        81.0
2         97.8      M        73.0
3         99.2      F        66.0
4         98.0      F        73.0
Yes. The curve plotted from the sample data is roughly bell-shaped and symmetric, so body temperature appears to be approximately normally distributed.
In :ax = sns.distplot(df[['temperature']], rug=True, axlabel='Temperature (o F)')
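The visual check above can be complemented by a formal normality test. The following is a minimal sketch using SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest`). The helper name `normality_report` is hypothetical, and the data here are simulated stand-ins (drawn with the mean and standard deviation that `describe()` reports in this notebook), not the real measurements.

```python
import numpy as np
from scipy import stats


def normality_report(values):
    """Return the D'Agostino-Pearson test statistic and p-value for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    statistic, p_value = stats.normaltest(values)
    return statistic, p_value


# Simulated stand-in data (NOT the real measurements): 130 draws from a normal
# distribution with roughly the sample mean and standard deviation.
rng = np.random.default_rng(0)
simulated = rng.normal(loc=98.249, scale=0.733, size=130)

statistic, p_value = normality_report(simulated)
print("statistic =", statistic, "p-value =", p_value)

# With the notebook's data the call would be:
# normality_report(df['temperature'])
```

A large p-value only means the test found no evidence against normality; it does not prove that the data are normal.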
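Yes. The sample contains 130 observations, which is above the usual n ≥ 30 rule of thumb. (df.size returns 390 because it counts every cell: 130 rows × 3 columns.)
There is no connection or dependence between the measured temperature values; in other words, the observations are independent.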
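As a quick illustration of how `df.size` relates to the number of observations, here is a small sketch built from the five preview rows shown earlier in this notebook. The variable names `preview`, `n_rows`, and `n_cells` are illustrative only.

```python
import pandas as pd

# The five rows shown in the notebook's preview of the dataset.
preview = pd.DataFrame({
    "temperature": [99.3, 98.4, 97.8, 99.2, 98.0],
    "gender": ["F", "F", "M", "F", "F"],
    "heart_rate": [68.0, 81.0, 73.0, 66.0, 73.0],
})

n_rows = len(preview)    # number of observations (rows)
n_cells = preview.size   # rows * columns, i.e. every cell in the frame
print(n_rows, n_cells, preview.shape)
```

Applied to the full dataset, the same logic gives 130 rows and 130 × 3 = 390 cells, so `len(df)` or `df.shape[0]` is the value to use as the sample size.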
In :# Sample (dataset) size
df['temperature'].describe()
Out:
count    130.000000
mean      98.249231
std        0.733183
min       96.300000
25%       97.800000
50%       98.300000
75%       98.700000
max      100.800000
Name: temperature, dtype: float64
In :# Population mean temperature
POP_MEAN = 98.6
# Sample size, mean and standard deviation
sample_size = df['temperature'].count()
sample_mean = df['temperature'].mean()
sample_std = df['temperature'].std(axis=0)
In :print("Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print("Sample size: sample_size = " + str(sample_size))
print("Sample mean: sample_mean = " + str(sample_mean))
print("Sample standard deviation: sample_std = " + str(sample_std))
Population mean temperature (given): POP_MEAN = 98.6
Sample size: sample_size = 130
Sample mean: sample_mean = 98.24923076923078
Sample standard deviation: sample_std = 0.7331831580389454
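Since the exercise also asks about confidence intervals, the summary statistics printed above are enough to sketch one for the mean. This is a minimal example, assuming a 95% confidence level (the level is an assumption, not something the notebook fixes) and a t-based interval.

```python
import math
from scipy import stats

# Summary statistics as printed in the notebook.
sample_size = 130
sample_mean = 98.24923076923078
sample_std = 0.7331831580389454

confidence = 0.95  # assumed confidence level

# Standard error of the mean and the two-sided t critical value.
standard_error = sample_std / math.sqrt(sample_size)
t_critical = stats.t.ppf(1 - (1 - confidence) / 2, df=sample_size - 1)

margin = t_critical * standard_error
ci_low, ci_high = sample_mean - margin, sample_mean + margin
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval comes out at about 98.12 to 98.38 °F, which does not contain the reference value of 98.6 °F.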
In :# t statistic
# t = ((sample_mean - reference_value) / std_deviation) * sqrt(sample_size)
# ...
In :# degrees of freedom
degree = sample_size - 1
In :# p-value (one tail; doubling it below gives the two-sided value only when t is negative)
# p = stats.t.cdf(t, df=degree)
In :# t-stats and p-value
# print("t = " + str(t))
# print("p = " + str(2*p))
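The commented steps above can be sketched end to end from the printed summary statistics. This is one possible way to carry out the one-sample t-test, not necessarily the original author's exact code. It uses `stats.t.sf(abs(t), ...)` so the two-sided p-value is correct regardless of the sign of `t`.

```python
import math
from scipy import stats

# Values printed earlier in the notebook.
POP_MEAN = 98.6
sample_size = 130
sample_mean = 98.24923076923078
sample_std = 0.7331831580389454

# Degrees of freedom for a one-sample t-test.
degree = sample_size - 1

# t = (sample mean - reference value) / standard error of the mean
t = (sample_mean - POP_MEAN) / (sample_std / math.sqrt(sample_size))

# Two-sided p-value: probability of |T| >= |t| under the null hypothesis.
p = 2 * stats.t.sf(abs(t), df=degree)

print("t = " + str(t))
print("p = " + str(p))
```

With the raw data loaded, `stats.ttest_1samp(df['temperature'], POP_MEAN)` should give the same statistic and p-value. Here t is about -5.45, and the p-value is far below 0.05.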
a) Would you use a one-sample or two-sample test? Why?
b) In this situation, is it appropriate to use the t or z statistic?
c) Now try using the other test. How is the result different? Why?
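For question (c), a minimal sketch of the comparison: with n = 130 the test statistic is the same number under both tests (the sample standard deviation stands in for the unknown population value), and only the reference distribution changes. The names `p_z` and `p_t` are illustrative.

```python
import math
from scipy import stats

# Values printed earlier in the notebook.
POP_MEAN = 98.6
sample_size = 130
sample_mean = 98.24923076923078
sample_std = 0.7331831580389454

# Same standardized statistic for both tests.
statistic = (sample_mean - POP_MEAN) / (sample_std / math.sqrt(sample_size))

# Two-sided p-values under the standard normal and under Student's t.
p_z = 2 * stats.norm.sf(abs(statistic))
p_t = 2 * stats.t.sf(abs(statistic), df=sample_size - 1)

print("statistic =", statistic)
print("p (z test) =", p_z)
print("p (t test) =", p_t)
```

The t distribution has heavier tails, so its p-value is somewhat larger, but with 129 degrees of freedom the two distributions are close and both tests lead to the same conclusion.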
In :# A random sample of 10 records from the original dataset
df_sample10 = df.sample(n=10)
In :ax = sns.distplot(df_sample10[['temperature']], rug=True, axlabel='Temperature (o F)')
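To see why the choice between t and z matters more for a sample of 10 than for the full sample of 130, the two-sided critical values can be compared directly. This is a small sketch assuming a 5% significance level (an assumption, not stated in the notebook).

```python
from scipy import stats

alpha = 0.05  # assumed significance level

# Two-sided critical value under the standard normal distribution.
z_critical = stats.norm.ppf(1 - alpha / 2)

# Two-sided critical values under Student's t for each sample size.
t_critical = {}
for n in (10, 130):
    t_critical[n] = stats.t.ppf(1 - alpha / 2, df=n - 1)

print("z critical value:", z_critical)
for n, value in t_critical.items():
    print(f"t critical value (n = {n}):", value)
```

For n = 10 the t critical value is noticeably larger than the z value (about 2.26 versus 1.96), so using z would overstate significance. For n = 130 the gap is small.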
What test did you use and why?
Write a story with your conclusion in the context of the original problem.
 "What Statistical Analysis Should I Use? Statistical Analyses Using STATA". Last access: 12/25/2017 - Link: https://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-stata/
[x] "Central limit theorem", Khan Academy. Last access: 12/26/2017. Link: https://www.khanacademy.org/math/ap-statistics/sampling-distribution-ap/sampling-distribution-mean/v/sampling-distribution-of-the-sample-mean
[y] Important: [https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/hypothesis-testing-and-p-values](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/hypothesis-testing-and-p-values)