1. Import the necessary packages to read in the data, plot, and create a linear regression model


In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

2. Read in the hanford.csv file


In [3]:
df = pd.read_csv('../data/hanford.csv')

In [4]:
df.head()


Out[4]:
      County  Exposure  Mortality
0   Umatilla      2.49      147.1
1     Morrow      2.57      130.1
2    Gilliam      3.41      129.9
3    Sherman      1.25      113.5
4      Wasco      1.62      137.5

3. Calculate the basic descriptive statistics on the data

Central Tendency:

  • Mean

In [5]:
df.mean(numeric_only=True)  # skip the non-numeric County column


Out[5]:
Exposure       4.617778
Mortality    157.344444
dtype: float64
  • Median

In [6]:
df.median(numeric_only=True)  # skip the non-numeric County column


Out[6]:
Exposure       3.41
Mortality    147.10
dtype: float64
  • Mode

In [7]:
df.mode()


Out[7]:
County Exposure Mortality

Spread:

  • Range

In [9]:
df['Exposure'].max() - df['Exposure'].min()


Out[9]:
10.390000000000001

In [10]:
df['Mortality'].max() - df['Mortality'].min()


Out[10]:
96.800000000000011
  • Interquartile Range

In [11]:
df['Exposure'].quantile(q=0.75)  - df['Exposure'].quantile(q=0.25)


Out[11]:
3.9199999999999999

In [12]:
df['Mortality'].quantile(q=0.75)  - df['Mortality'].quantile(q=0.25)


Out[12]:
47.800000000000011
  • Standard Deviation

In [13]:
df.std(numeric_only=True)  # sample standard deviation (ddof=1) of the numeric columns


Out[13]:
Exposure      3.491192
Mortality    34.791346
dtype: float64
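
As a compact alternative to the separate cells above, `describe()` returns most of these statistics in one table, and the range and interquartile range can be read off it. This is a minimal sketch that assumes only the five rows shown in Out[4], not the full hanford.csv file, so its numbers differ from the outputs above; the names `sample`, `summary`, `value_range`, and `iqr` are illustrative.

```python
import pandas as pd

# Illustrative subset: only the five rows displayed by df.head(),
# NOT the full hanford.csv data set.
sample = pd.DataFrame({
    "County": ["Umatilla", "Morrow", "Gilliam", "Sherman", "Wasco"],
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5],
})

# Restrict to the numeric columns so County is ignored.
numeric = sample[["Exposure", "Mortality"]]

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = numeric.describe()

# Range and interquartile range derived from the same table.
value_range = summary.loc["max"] - summary.loc["min"]
iqr = summary.loc["75%"] - summary.loc["25%"]

print(summary)
print(value_range)
print(iqr)
```

The median appears in the `50%` row of `summary`; the mode is not part of `describe()` for numeric columns and still needs `mode()`.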

4. Calculate the coefficient of correlation (r) and generate the scatter plot. Does there seem to be a correlation worthy of investigation?


In [14]:
df[['Exposure', 'Mortality']].corr()  # Pearson r between the two numeric columns


Out[14]:
           Exposure  Mortality
Exposure   1.000000   0.926345
Mortality  0.926345   1.000000

In [15]:
df.plot(kind = 'scatter', x = 'Exposure', y = 'Mortality')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d60e550>

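Yes. With r ≈ 0.93, exposure and mortality show a strong positive linear correlation, which makes the relationship worth investigating.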
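
To make the correlation coefficient less of a black box, the sketch below computes Pearson's r directly from its definition. It assumes only the five rows shown in Out[4] rather than the full file, so it will not reproduce 0.926345; the helper `pearson_r` is a hypothetical name, not part of pandas.

```python
import numpy as np

# Illustrative subset: the five rows displayed by df.head(), not the full data.
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62])
mortality = np.array([147.1, 130.1, 129.9, 113.5, 137.5])


def pearson_r(x, y):
    """Sum of cross-deviations divided by the root of the product of squared deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r = pearson_r(exposure, mortality)
print(r)
```

The result always lies between -1 and 1 and should agree with `np.corrcoef(exposure, mortality)[0, 1]`.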

5. Create a linear regression model based on the available data to predict the mortality rate given a level of exposure


In [16]:
lm = smf.ols(formula = 'Mortality~Exposure', data = df).fit()

In [18]:
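# Look up the coefficients by name rather than relying on their order.
b, m = lm.params['Intercept'], lm.params['Exposure']  # intercept and slope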

In [21]:
def predicted_mortality_rate(exposure):
    """Return the mortality rate predicted by the fitted line for a given exposure index."""
    y = m * exposure + b
    return y
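
For reference, the slope and intercept that ordinary least squares produces for a single predictor have a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below is an illustration only. It uses NumPy instead of statsmodels and only the five rows shown in Out[4], so its coefficients are not those of `lm`; `fit_line`, `b_sample`, and `m_sample` are hypothetical names.

```python
import numpy as np

# Illustrative subset: the five rows displayed by df.head(), not the full data.
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62])
mortality = np.array([147.1, 130.1, 129.9, 113.5, 137.5])


def fit_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    dx = x - x.mean()
    slope = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


b_sample, m_sample = fit_line(exposure, mortality)
print(b_sample, m_sample)
```

A quick cross-check is `np.polyfit(exposure, mortality, 1)`, which returns the same two numbers in the order slope, intercept.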

6. Plot the linear regression line on the scatter plot of values. Calculate the r^2 (coefficient of determination)


In [20]:
# Draw the scatter plot, then overlay the fitted line on the same axes.
ax = df.plot(kind = 'scatter', x = 'Exposure', y = 'Mortality')
ax.plot(df['Exposure'], m * df['Exposure'] + b, '-', color = 'red')


Out[20]:
[<matplotlib.lines.Line2D at 0x10d8dcc88>]
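
The prompt also asks for r^2, which the cells above do not compute. For the fitted statsmodels result it is available as `lm.rsquared`, and for a regression with one predictor it equals the square of r, so the full data should give roughly 0.926345^2 ≈ 0.858. The sketch below shows the underlying calculation, 1 - SS_res / SS_tot. It assumes NumPy and only the five rows shown in Out[4], so its value differs from the one for the full file.

```python
import numpy as np

# Illustrative subset: the five rows displayed by df.head(), not the full data.
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62])
mortality = np.array([147.1, 130.1, 129.9, 113.5, 137.5])

# Fit a degree-1 polynomial; polyfit returns (slope, intercept).
slope, intercept = np.polyfit(exposure, mortality, 1)
predicted = slope * exposure + intercept

# Residual sum of squares and total sum of squares around the mean.
ss_res = ((mortality - predicted) ** 2).sum()
ss_tot = ((mortality - mortality.mean()) ** 2).sum()

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```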

7. Predict the mortality rate (cancer deaths per 100,000 man-years) for an index of exposure of 10


In [22]:
predicted_mortality_rate(10)


Out[22]:
207.03019352841983
