# What is the True Normal Human Body Temperature?

#### Background

The mean normal body temperature was held to be 37\$^{\circ}\$C or 98.6\$^{\circ}\$F for more than 120 years after it was first conceptualized and reported by Carl Wunderlich in a famous 1868 book. But is this value statistically correct?

### Exercises

In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.

Answer the following questions in the notebook below and submit it to your GitHub account.

1. Is the distribution of body temperatures normal?
• Although this is not a requirement for the CLT to hold (read the CLT carefully), it gives us some peace of mind that the population may also be normally distributed if we assume that this sample is representative of the population.
2. Is the sample size large? Are the observations independent?
• Remember that this is a condition for the CLT, and hence the statistical tests we are using, to apply.
3. Is the true population mean really 98.6 degrees F?
• Would you use a one-sample or two-sample test? Why?
• In this situation, is it appropriate to use the \$t\$ or \$z\$ statistic?
• Now try using the other test. How is the result different? Why?
4. Draw a small sample of size 10 from the data and repeat both tests.
• Which one is the correct one to use?
• What do you notice? What does this tell you about the difference in application of the \$t\$ and \$z\$ statistic?
5. At what temperature should we consider someone's temperature to be "abnormal"?
• Start by computing the margin of error and confidence interval.
6. Is there a significant difference between males and females in normal temperature?
• What test did you use and why?
• Write a story with your conclusion in the context of the original problem.

You can include written notes in notebook cells using Markdown:

#### Resources

``````

In :

import pandas as pd

``````
``````

In :

``````
``````

In :

# Load Matplotlib + Seaborn and SciPy libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from scipy.stats import norm
from statsmodels.stats.weightstats import ztest
%matplotlib inline

``````
``````

In :

``````
``````

Out:

   temperature gender  heart_rate
0         99.3      F        68.0
1         98.4      F        81.0
2         97.8      M        73.0
3         99.2      F        66.0
4         98.0      F        73.0

``````

### 1. Is the distribution of body temperatures normal?

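Approximately, yes. The histogram and density curve of the sample data are roughly bell-shaped and symmetric, which is consistent with a normal distribution of body temperature. A visual check like this is informal and does not by itself prove normality.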
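
A formal test can complement the visual check. The sketch below runs two common normality tests from SciPy on synthetic data; the `temps` array is a hypothetical stand-in (an assumption, not the real dataset) and would be replaced by `df['temperature']` in the notebook.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df['temperature'] (assumption: not the real data).
rng = np.random.default_rng(0)
temps = rng.normal(loc=98.25, scale=0.73, size=130)

# D'Agostino-Pearson test: the null hypothesis is that the data are normal.
k2_stat, k2_p = stats.normaltest(temps)

# Shapiro-Wilk test: same null hypothesis, often used for small samples.
sw_stat, sw_p = stats.shapiro(temps)

print("normaltest: statistic = %.3f, p-value = %.3f" % (k2_stat, k2_p))
print("shapiro:    statistic = %.3f, p-value = %.3f" % (sw_stat, sw_p))
```

A large p-value means the test finds no evidence against normality; it does not prove that the data are normal.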

``````

In :

ax = sns.distplot(df[['temperature']], rug=True, axlabel='Temperature (o F)')

``````

### 2. Is the sample size large? Are the observations independent?

#### Sample size

``````

In :

print("Yes. We have *" + str(df['temperature'].size) + "* records in the sample data file.")
print("There is no connection or dependence between the measured temperature values, in other words, the observations are independent.")

``````
``````

Yes. We have *130* records in the sample data file.
There is no connection or dependence between the measured temperature values, in other words, the observations are independent.

``````
``````

In :

# Summary statistics of the temperature column (count = sample size)
df['temperature'].describe()

``````
``````

Out:

count    130.000000
mean      98.249231
std        0.733183
min       96.300000
25%       97.800000
50%       98.300000
75%       98.700000
max      100.800000
Name: temperature, dtype: float64

``````
``````

In :

# Population mean temperature
POP_MEAN = 98.6

# Sample size, mean and standard deviation
sample_size = df['temperature'].count()
sample_mean = df['temperature'].mean()
sample_std = df['temperature'].std(axis=0)

``````

### What we know about population and what we get from sample dataset

``````

In :

print("Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print("Sample size: sample_size = " + str(sample_size))
print("Sample mean: sample_mean = "+ str(sample_mean))
print("Sample standard deviation: sample_std = "+ str(sample_std))

``````
``````

Population mean temperature (given): POP_MEAN = 98.6
Sample size: sample_size = 130
Sample mean: sample_mean = 98.24923076923078
Sample standard deviation: sample_std = 0.7331831580389454

``````

### 3. Is the true population mean really 98.6 degrees F?

#### Hypothesis:

``````

In :

print("* Ho or Null hypothesis: Average body temperature *is* " + str(POP_MEAN)+" degrees F.")
print("* Ha or Alternative hypothesis: Average body temperature *is not* " + str(POP_MEAN)+" degrees F.")

``````
``````

* Ho or Null hypothesis: Average body temperature *is* 98.6 degrees F.
* Ha or Alternative hypothesis: Average body temperature *is not* 98.6 degrees F.

``````

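#### The test statistic is \$(\bar{x} - \mu_0) / (s / \sqrt{n})\$, where:

• \$\bar{x}\$ = sample mean
• \$\mu_0\$ = population mean under the null hypothesis
• \$s\$ = sample standard deviation
• \$n\$ = sample size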

### t test

#### t = ((sample_mean - population_mean)/ sample_std_deviation ) * sqrt(sample_size)

``````

In :

t = ((sample_mean - POP_MEAN)/sample_std)*np.sqrt(sample_size)

print("t = " + str(t))

``````
``````

t = -5.45482329236

``````

### degrees of freedom

``````

In :

degree = sample_size - 1

print("degrees of freedom =" + str(degree))

``````
``````

degrees of freedom =129

``````

### p-value

``````

In :

p = 1 - stats.t.cdf(abs(t),df=degree)

print("p-value = %.10f" % p)

``````
``````

p-value = 0.0000001205

``````

### Two-tailed p-value: double the one-tailed p-value

``````

In :

p2 = 2*p
print("p-value = %.10f (2 * p-value)" % p2)

``````
``````

p-value = 0.0000002411 (2 * p-value)

``````

### We assume that:

#### Significance level (alpha) = 0.05 (cutoff level)

``````

In :

ALFA = 0.05
print(". alfa = " + str(ALFA))
print(". p-value = %.10f" % p2)

``````
``````

. alfa = 0.05
. p-value = 0.0000002411

``````
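
As a cross-check on the manual steps above, the same t statistic and two-sided p-value can be sketched from the summary statistics alone. The values of `n`, `mean` and `std` are copied from the printed output; `stats.t.sf` is used in place of `1 - stats.t.cdf` only because it tends to be more accurate for very small tail probabilities.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the notebook output above.
n = 130
mean = 98.24923076923078
std = 0.7331831580389454
mu0 = 98.6  # hypothesized population mean

# One-sample t statistic: (sample mean - mu0) / standard error.
standard_error = std / np.sqrt(n)
t_stat = (mean - mu0) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

alpha = 0.05
print("t = %.4f, two-sided p-value = %.3e" % (t_stat, p_two_sided))
print("Reject Ho at alpha = %.2f: %s" % (alpha, p_two_sided < alpha))
```

Since the p-value is far below 0.05, the null hypothesis that the mean is 98.6 degrees F is rejected. With the raw data available, `stats.ttest_1samp(df['temperature'], 98.6)` should report the same statistic and p-value.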

### a) Would you use a one-sample or two-sample test? Why?

One-sample test, since we are comparing the mean of a single sample against a fixed reference value: 98.6 degrees F. A two-sample test would compare the means of two separate groups.

### b) In this situation, is it appropriate to use the t or z statistic?

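Since we do not know the population standard deviation, it is appropriate to use the t statistic. With a sample as large as this one (n = 130), the t and z statistics give nearly the same result.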
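
To illustrate why the choice matters less as the sample grows, the sketch below compares the two-sided 5% critical values of the t and z distributions. The sample sizes 10 and 130 match the ones used in this notebook; 30 is added only as a hypothetical intermediate point.

```python
from scipy import stats

# Two-sided 5% critical value of the standard normal distribution.
z_crit = stats.norm.ppf(0.975)

# Matching critical values of the t distribution for several sample sizes.
t_crits = {}
for n in (10, 30, 130):
    t_crits[n] = stats.t.ppf(0.975, df=n - 1)
    print("n = %3d: t critical = %.3f, z critical = %.3f" % (n, t_crits[n], z_crit))
```

The t critical value is always larger than the z critical value and approaches it as the degrees of freedom increase, so the t test is more conservative for small samples.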

### Assuming that population standard deviation = sample standard deviation, we have:

``````

In :

print("----")
print(". Sample mean: sample_mean = "+ str(sample_mean))
print(". Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print(". Population standard deviation: sample_std = "+ str(sample_std))
print(". Sample size: sample_size = " + str(sample_size))
print("----")

``````
``````

----
. Sample mean: sample_mean = 98.24923076923078
. Population mean temperature (given): POP_MEAN = 98.6
. Population standard deviation: sample_std = 0.7331831580389454
. Sample size: sample_size = 130
----

``````

### Z test

#### Z = ((sample_mean - population_mean)/ population_std_deviation ) * sqrt(sample_size)

Note: we are assuming that population standard deviation = sample standard deviation (sample_std)

``````

In :

z = ((sample_mean - POP_MEAN)/sample_std)*np.sqrt(sample_size)

print("Z value or z_score: z = " + str(z))

``````
``````

Z value or z_score: z = -5.45482329236

``````

### p-value

``````

In :

# Two-sided p-value: twice the upper-tail probability of |z|
p_value_z = norm.sf(abs(z))*2

``````
``````

In :

print("P-Value = %.15f" % p_value_z)

``````
``````

P-Value = 0.000000049021570

``````

### We (also) assume that:

#### Significance level (alpha) = 0.05 (cutoff level)

``````

In :

ALFA = 0.05
print(". alfa = " + str(ALFA))
print(". p-value = %.15f" % p_value_z)

``````
``````

. alfa = 0.05
. p-value = 0.000000049021570

``````

### 4. Draw a small sample of size 10 from the data and repeat both tests.

``````

In :

# Draw a random sample of 10 records from the original dataset
df_sample10 = df.sample(n=10)
df_sample10['temperature'].count()

``````
``````

Out:

10

``````

#### The histogram:

``````

In :

ax = sns.distplot(df_sample10[['temperature']], rug=True, axlabel='Temperature (o F)')

``````

#### Sample size, mean and standard deviation

``````

In :

sample10_size = df_sample10['temperature'].count()
sample10_mean = df_sample10['temperature'].mean()
sample10_std = df_sample10['temperature'].std(axis=0)

``````
``````

In :

print("Population mean temperature (given): POP_MEAN = " + str(POP_MEAN))
print("Sample-10 size: sample_size = " + str(sample10_size))
print("Sample-10 mean: sample_mean = "+ str(sample10_mean))
print("Sample-10 standard deviation: sample_std = "+ str(sample10_std))

``````
``````

Population mean temperature (given): POP_MEAN = 98.6
Sample-10 size: sample_size = 10
Sample-10 mean: sample_mean = 98.62999999999998
Sample-10 standard deviation: sample_std = 1.278931845981898

``````

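#### The test statistic is \$(\bar{x} - \mu_0) / (s / \sqrt{n})\$, where:

• \$\bar{x}\$ = sample mean
• \$\mu_0\$ = population mean under the null hypothesis
• \$s\$ = sample standard deviation
• \$n\$ = sample size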

## t test

#### t = ((sample_mean - population_mean)/ sample_std_deviation ) * sqrt(sample_size)

``````

In :

t = ((sample10_mean - POP_MEAN)/sample10_std)*np.sqrt(sample10_size)

print("t = " + str(t))

``````
``````

t = 0.074177783674

``````

#### degrees of freedom

``````

In :

degree = sample10_size - 1

print("degrees of freedom =" + str(degree))

``````
``````

degrees of freedom =9

``````

### p-value

``````

In :

p_value = 1 - stats.t.cdf(abs(t),df=degree)

# p-value considering two-tails
p_value = 2*p_value
print("p-value =" + str(p_value))

``````
``````

p-value =0.942491454639

``````

#### Significance level (alpha) = 0.05 (cutoff level)

``````

In :

ALFA = 0.05
print(". alfa = " + str(ALFA))
print(". p-value = %.15f" % p_value)

``````
``````

. alfa = 0.05
. p-value = 0.942491454638975

``````

### Z test

#### Z = ((sample_mean - population_mean)/ population_std_deviation ) * sqrt(sample_size)

Note: we are assuming that population standard deviation = sample standard deviation (sample10_std)

``````

In :

z = ((sample10_mean - POP_MEAN)/sample10_std)*np.sqrt(sample10_size)

print("Z value or z_score: z = " + str(z))

``````
``````

Z value or z_score: z = 0.074177783674

``````
``````

In :

# Two-sided p-value: twice the upper-tail probability of |z|
p_value_z = norm.sf(abs(z))*2
print("P-Value = %.15f" % p_value_z)

``````
``````

P-Value = 0.940868923201252

``````

#### Significance level (alpha) = 0.05 (cutoff level)

``````

In :

ALFA = 0.05
print(". alfa = " + str(ALFA))
print(". p-value = %.15f" % p_value_z)

``````
``````

. alfa = 0.05
. p-value = 0.940868923201252

``````
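
For a sample of only 10 observations, the t test is the correct one, because the standard deviation is estimated from very few points. The simulation below is a minimal sketch of what goes wrong otherwise: it draws many samples of size 10 for which the null hypothesis is true and counts how often each test wrongly rejects it. The population parameters `mu0` and `sigma` and the seed are hypothetical choices for illustration, not values taken from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, trials, alpha = 10, 20000, 0.05
mu0, sigma = 98.6, 0.73  # hypothetical population; Ho is true by construction

# Draw all samples at once: one row per simulated sample of size n.
samples = rng.normal(mu0, sigma, size=(trials, n))
means = samples.mean(axis=1)
stds = samples.std(axis=1, ddof=1)  # sample standard deviation

# Same statistic for both tests; only the reference distribution differs.
stat = (means - mu0) / (stds / np.sqrt(n))
p_t = 2 * stats.t.sf(np.abs(stat), df=n - 1)
p_z = 2 * stats.norm.sf(np.abs(stat))

t_rate = np.mean(p_t < alpha)
z_rate = np.mean(p_z < alpha)
print("False rejection rate, t test: %.3f" % t_rate)
print("False rejection rate, z test: %.3f" % z_rate)
```

The t test rejects at close to the nominal 5% rate, while the z test with an estimated standard deviation rejects noticeably more often, which is why the z statistic is not appropriate for small samples.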

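### 5. At what temperature should we consider someone's temperature to be "abnormal"?

We can consider "abnormal" any body temperature that falls outside the range covering about 99.7% of the population, assuming temperatures are normally distributed.

#### In other words, a temperature more than 3 standard deviations (std) away from the mean.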

From the original dataset we have:

``````

In :

# Summary statistics of the temperature column (count = sample size)
df['temperature'].describe()

``````
``````

Out:

count    130.000000
mean      98.249231
std        0.733183
min       96.300000
25%       97.800000
50%       98.300000
75%       98.700000
max      100.800000
Name: temperature, dtype: float64

``````
``````

In :

mean = df['temperature'].mean()
std = df['temperature'].std(axis=0)
print("One standard deviation (std) is %.3f degrees F." %std)
print("Three standard deviations (std) are %.3f degrees F." % (3*std))

``````
``````

One standard deviation (std) is 0.733 degrees F.
Three standard deviations (std) are 2.200 degrees F.

``````

### So, an "abnormal" body temperature is one below -3 std or above +3 std from the mean:

``````

In :

lim_low = mean - (3*std)
lim_high = mean + (3*std)
print("A body temperature different from 99.7% of the population is: greater than "+ str(lim_high) + " or less than " + str(lim_low) + " degrees F.")

``````
``````

A body temperature different from 99.7% of the population is: greater than 100.44878024334761 or less than 96.04968129511394 degrees F.

``````
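
The exercise also asks for the margin of error and a confidence interval. The sketch below computes a 95% confidence interval for the mean from the summary statistics printed earlier, plus an approximate 95% range for a single new reading (a prediction interval, which assumes normally distributed temperatures). The 95% level is an assumption chosen for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the notebook output above.
n = 130
mean = 98.24923076923078
std = 0.7331831580389454
confidence = 0.95  # assumed confidence level

# Critical t value for a two-sided interval.
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)

# Margin of error and confidence interval for the population mean.
margin_of_error = t_crit * std / np.sqrt(n)
ci_mean = (mean - margin_of_error, mean + margin_of_error)

# Approximate range for one new individual reading (prediction interval).
pred_margin = t_crit * std * np.sqrt(1 + 1 / n)
pred_interval = (mean - pred_margin, mean + pred_margin)

print("Margin of error: %.3f" % margin_of_error)
print("95%% CI for the mean: %.2f to %.2f" % ci_mean)
print("95%% range for one reading: %.2f to %.2f" % pred_interval)
```

The confidence interval describes the uncertainty in the mean, so it is narrow. Judging whether one person's temperature is unusual calls for the wider range for individual readings, which is closer in spirit to the 3-std rule used above.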

### 6. Is there a significant difference between males and females in normal temperature?

``````

In :

# Female temperature (mean and standard deviation)
df_female = df.loc[df['gender'] == 'F']
ax = sns.distplot(df_female[['temperature']])

print("Female temperature: mean = %f | std = %f" % (df_female['temperature'].mean(), df_female['temperature'].std()))

``````
``````

Female temperature: mean = 98.393846 | std = 0.743488

``````
``````

In :

# Male temperature (mean and standard deviation)
df_male = df.loc[df['gender'] == 'M']
ax = sns.distplot(df_male[['temperature']])

print("Male temperature: mean = %f | std = %f" % (df_male['temperature'].mean(), df_male['temperature'].std()))

``````
``````

Male temperature: mean = 98.104615 | std = 0.698756

``````
``````

In :

# Plotting histogram based on gender (Female/Male)
grid = sns.FacetGrid(df, col="gender")
grid.map(plt.hist, "temperature", color="y")

``````
``````

Out:

<seaborn.axisgrid.FacetGrid at 0x1a199cc080>

``````
``````

In :

# Plotting Female/Male temperatures using Seaborn Pairplot
sns.pairplot(df, hue='gender', height=2.5)  # 'size' was renamed to 'height' in newer seaborn releases

``````
``````

Out:

<seaborn.axisgrid.PairGrid at 0x1a195d85c0>

``````

What test did you use and why?

Two-sample t-test, two-tailed: we use this test to check whether the means of two independent groups (female and male) differ, without assuming in advance which group has the higher mean.
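
A minimal sketch of such a test, using only the group means and standard deviations printed above. The group sizes of 65 each are an assumption (the notebook does not print them), and `equal_var=False` selects Welch's variant, which does not assume equal variances.

```python
from scipy import stats

# Means and standard deviations copied from the printed output above.
# Group sizes (65 each) are an assumption, not shown in the notebook.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=98.393846, std1=0.743488, nobs1=65,  # female
    mean2=98.104615, std2=0.698756, nobs2=65,  # male
    equal_var=False,                           # Welch's t-test
)

alpha = 0.05
print("t = %.3f, two-sided p-value = %.4f" % (t_stat, p_value))
print("Significant difference at alpha = %.2f: %s" % (alpha, p_value < alpha))
```

With the raw data, `stats.ttest_ind(df_female['temperature'], df_male['temperature'], equal_var=False)` would run the same test directly on the two groups.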

Write a story with your conclusion in the context of the original problem.

``````

In [ ]:

``````

## References:

 "What Statistical Analysis Should I Use? Statistical Analyses Using STATA". Last access: 12/25/2017 - Link: https://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-stata/

 "T-Score vs. Z-Score: What’s the Difference?". Last access: 12/26/2017 - Link: http://www.statisticshowto.com/when-to-use-a-t-score-vs-z-score/
