In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.

Answer the following questions **in this notebook below and submit to your GitHub account**.

- Is the distribution of body temperatures normal?
  - Although this is not a requirement for the CLT to hold (read the CLT carefully), it gives us some peace of mind that the population may also be normally distributed if we assume that this sample is representative of the population.

- Is the sample size large? Are the observations independent?
  - Remember that this is a condition for the CLT, and hence the statistical tests we are using, to apply.

- Is the true population mean really 98.6 degrees F?
  - Would you use a one-sample or two-sample test? Why?
  - In this situation, is it appropriate to use the $t$ or $z$ statistic?
  - Now try using the other test. How is the result different? Why?

- Draw a small sample of size 10 from the data and repeat both tests.
  - Which one is the correct one to use?
  - What do you notice? What does this tell you about the difference in application of the $t$ and $z$ statistic?

- At what temperature should we consider someone's temperature to be "abnormal"?
  - Start by computing the margin of error and confidence interval.

- Is there a significant difference between males and females in normal temperature?
  - What test did you use and why?
  - Write a story with your conclusion in the context of the original problem.

You can include written notes in notebook cells using Markdown:

- In the control panel at the top, choose Cell > Cell Type > Markdown
- Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

- Information and data sources: http://www.amstat.org/publications/jse/datasets/normtemp.txt, http://www.amstat.org/publications/jse/jse_data_archive.htm

```
In [1]:
```
```python
import pandas as pd
df = pd.read_csv('data/human_body_temperature.csv')
```

```
In [2]:
```
```python
# Your work here.
```

```
In [3]:
```
```python
# Load Matplotlib, Seaborn, NumPy and SciPy
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
%matplotlib inline
```

```
In [4]:
```
```python
df.head(5)
```

```
Out[4]:
```

**Yes**, approximately. Judging by the shape of the curve plotted from the sample data, body temperature looks close to normally distributed.
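A visual check can be complemented with a formal test. The sketch below is one possible approach and not part of the original analysis: it applies SciPy's D'Agostino-Pearson test (`scipy.stats.normaltest`) to an array of values. The helper name `looks_normal`, the 0.05 threshold, and the synthetic data (used so the block runs without the CSV file; its mean and spread are arbitrary) are assumptions. In the notebook you would pass `df['temperature']` instead.

```python
import numpy as np
from scipy import stats

def looks_normal(values, alpha=0.05):
    """Run the D'Agostino-Pearson normality test.

    Null hypothesis: the values come from a normal distribution.
    Returns (statistic, p_value, fail_to_reject).
    """
    statistic, p_value = stats.normaltest(values)
    return statistic, p_value, bool(p_value > alpha)

# Synthetic stand-in for the temperature column (assumption, not the real data).
rng = np.random.default_rng(0)
synthetic_temps = rng.normal(loc=98.2, scale=0.7, size=130)

stat, p, ok = looks_normal(synthetic_temps)
print(f"statistic = {stat:.3f}, p-value = {p:.3f}, consistent with normal: {ok}")
```

A large p-value only means the test found no evidence against normality; it does not prove the data are normal.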

```
In [5]:
```
```python
ax = sns.histplot(data=df, x='temperature', kde=True)
sns.rugplot(data=df, x='temperature', ax=ax)
ax.set_xlabel('Temperature (°F)')
```


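**Yes**. The sample contains 130 observations, well above the common n ≥ 30 rule of thumb (*df.size* reports rows × columns, so the row count is the relevant figure).

There is no connection or dependence between the measured temperature values; in other words, the observations are independent.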
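To see why a large, independent sample matters for the CLT, the following sketch simulates the sampling distribution of the mean. It is an illustration under stated assumptions, not part of the exercise data: the population is a deliberately skewed exponential distribution, and the names `sample_mean_spread` and `draw_exponential` are hypothetical helpers.

```python
import numpy as np

def draw_exponential(rng, n):
    """Draw n independent values from a skewed population (mean 1, std 1)."""
    return rng.exponential(scale=1.0, size=n)

def sample_mean_spread(population_draw, n, repetitions=5000, seed=0):
    """Return the mean and standard deviation of many simulated sample means."""
    rng = np.random.default_rng(seed)
    means = np.array([population_draw(rng, n).mean() for _ in range(repetitions)])
    return means.mean(), means.std(ddof=1)

center, spread = sample_mean_spread(draw_exponential, n=130)
print(f"mean of sample means = {center:.3f} (population mean is 1.0)")
print(f"std of sample means  = {spread:.4f} (sigma / sqrt(n) = {1 / np.sqrt(130):.4f})")
```

Even though individual draws are far from normal, the sample means cluster around the population mean with a spread close to sigma / sqrt(n), which is what the tests in this notebook rely on.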

```
In [6]:
```
```python
# Sample (dataset) size
df['temperature'].describe()
```

```
Out[6]:
```

```
In [7]:
```
```python
# Population mean temperature
POP_MEAN = 98.6
# Sample size, mean and standard deviation
sample_size = df['temperature'].count()
sample_mean = df['temperature'].mean()
sample_std = df['temperature'].std(axis=0)
```

```
In [8]:
```
```python
print("Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print("Sample size: sample_size = " + str(sample_size))
print("Sample mean: sample_mean = " + str(sample_mean))
print("Sample standard deviation: sample_std = " + str(sample_std))
```


- *Ho or Null hypothesis*: Average body temperature **is** 98.6 degrees F
- *Ha or Alternative hypothesis*: Average body temperature **is not** 98.6 degrees F

- ??? How to validate these hypotheses?
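One way to validate these hypotheses is a one-sample t-test. The sketch below is a hedged illustration: it computes the statistic from the formula t = (sample_mean - reference_value) / std_deviation * sqrt(sample_size) and cross-checks it against `scipy.stats.ttest_1samp`. The synthetic `temps` array is an assumption standing in for `df['temperature']` so the block runs on its own; its mean and spread are arbitrary.

```python
import numpy as np
from scipy import stats

POP_MEAN = 98.6  # hypothesized population mean from the exercise

# Synthetic stand-in for the temperature column (assumption, not the real data).
rng = np.random.default_rng(42)
temps = rng.normal(loc=98.2, scale=0.7, size=130)

# Manual computation of the t statistic and the two-sided p-value.
n = temps.size
t_manual = (temps.mean() - POP_MEAN) / temps.std(ddof=1) * np.sqrt(n)
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test using SciPy's built-in function.
t_scipy, p_scipy = stats.ttest_1samp(temps, popmean=POP_MEAN)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

If the p-value falls below the chosen significance level (commonly 0.05), the null hypothesis is rejected in favor of the alternative.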

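```
In [9]:
```
```python
# t statistic for a one-sample test
# t = ((sample_mean - reference_value) / std_deviation) * sqrt(sample_size)
t = (sample_mean - POP_MEAN) / sample_std * np.sqrt(sample_size)
```

```
In [10]:
```
```python
# degrees of freedom
degree = sample_size - 1
```

```
In [12]:
```
```python
# two-sided p-value
p = 2 * stats.t.sf(abs(t), df=degree)
```

```
In [13]:
```
```python
# t statistic and p-value
print("t = " + str(t))
print("p = " + str(p))
```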

**a) Would you use a one-sample or two-sample test? Why?**

- ???

**b) In this situation, is it appropriate to use the t or z statistic?**

- ???

**c) Now try using the other test. How is the result different? Why?**

- ?
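To compare the two statistics, the sketch below runs a z-test next to the t-test on the same data. It is an illustration under assumptions: the helper `one_sample_z` is hypothetical, the sample standard deviation is used as a stand-in for the unknown population sigma, and `temps` is synthetic data replacing `df['temperature']`.

```python
import numpy as np
from scipy import stats

def one_sample_z(values, pop_mean):
    """z statistic and two-sided p-value, with the sample std standing in for sigma."""
    values = np.asarray(values, dtype=float)
    z = (values.mean() - pop_mean) / (values.std(ddof=1) / np.sqrt(values.size))
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Synthetic stand-in for the temperature column (assumption, not the real data).
rng = np.random.default_rng(7)
temps = rng.normal(loc=98.2, scale=0.7, size=130)

z, p_z = one_sample_z(temps, 98.6)
t, p_t = stats.ttest_1samp(temps, popmean=98.6)
print(f"z = {z:.4f}, p = {p_z:.3g}")
print(f"t = {t:.4f}, p = {p_t:.3g}")

# Two-sided 5% critical values: normal versus t with 129 and 9 degrees of freedom.
crit_z = stats.norm.ppf(0.975)
crit_t_129 = stats.t.ppf(0.975, df=129)
crit_t_9 = stats.t.ppf(0.975, df=9)
print(f"critical values: z = {crit_z:.3f}, t(129) = {crit_t_129:.3f}, t(9) = {crit_t_9:.3f}")
```

The statistic itself is identical in both tests; only the reference distribution changes. With 129 degrees of freedom the t critical value is close to the normal one, while with 9 degrees of freedom it is noticeably larger, which is why the choice matters most for small samples.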

```
In [54]:
```
```python
# A random sample of 10 records from the original dataset
df_sample10 = df.sample(n=10)
```

The histogram:

```
In [47]:
```
```python
ax = sns.histplot(data=df_sample10, x='temperature', kde=True)
sns.rugplot(data=df_sample10, x='temperature', ax=ax)
ax.set_xlabel('Temperature (°F)')
```


- ???

- ???
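For the question of when a temperature should count as "abnormal", the exercise suggests starting from the margin of error and confidence interval. The sketch below shows one way to compute them, plus a rough range for individual readings. The helpers `mean_confidence_interval` and `typical_range`, the 95% level, and the synthetic `temps` array are assumptions; the range for individual readings additionally assumes the data are approximately normal.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (low, high, margin_of_error) for the mean, using the t distribution."""
    values = np.asarray(values, dtype=float)
    n = values.size
    standard_error = values.std(ddof=1) / np.sqrt(n)
    t_critical = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_critical * standard_error
    return values.mean() - margin, values.mean() + margin, margin

def typical_range(values, confidence=0.95):
    """Return (low, high) covering roughly `confidence` of individual readings."""
    values = np.asarray(values, dtype=float)
    z_critical = stats.norm.ppf((1 + confidence) / 2)
    spread = z_critical * values.std(ddof=1)
    return values.mean() - spread, values.mean() + spread

# Synthetic stand-in for the temperature column (assumption, not the real data).
rng = np.random.default_rng(3)
temps = rng.normal(loc=98.2, scale=0.7, size=130)

ci_low, ci_high, margin = mean_confidence_interval(temps)
low, high = typical_range(temps)
print(f"margin of error = {margin:.3f}")
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"range for a single reading: ({low:.2f}, {high:.2f})")
```

The confidence interval describes uncertainty about the population mean, so it is narrow. Whether one person's reading is unusual is better judged against the wider range for individual readings.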

**What test did you use and why?**

- ???
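A two-sample test is one candidate for comparing males and females. The sketch below uses Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`), which does not assume equal variances; this is a suggested choice, not the notebook author's stated method. The two synthetic groups are assumptions standing in for the male and female temperatures. In the notebook they would come from splitting `df` on a gender column, whose name is not shown in this document.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (assumption, not the real data).
rng = np.random.default_rng(11)
group_a = rng.normal(loc=98.1, scale=0.7, size=65)
group_b = rng.normal(loc=98.4, scale=0.7, size=65)

# Welch's two-sample t-test: null hypothesis is that both group means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Manual check of the statistic: difference in means over its standard error.
standard_error = np.sqrt(
    group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size
)
t_manual = (group_a.mean() - group_b.mean()) / standard_error

print(f"t = {t_stat:.4f}, p = {p_value:.3g}")
```

A two-sample test fits here because two independent groups are compared with each other, rather than one sample against a fixed reference value.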

**Write a story with your conclusion in the context of the original problem.**

- ???

