1. Import the necessary packages to read in the data, plot, and create a linear regression model



In [8]:

    
import pandas as pd
import matplotlib.pyplot as plt
# display matplotlib plots inline in the notebook instead of in a pop-up window
%matplotlib inline
import statsmodels.formula.api as smf # package we'll be using for linear regression









    




2. Read in the hanford.csv file



In [9]:

    
df = pd.read_csv('../data/hanford.csv')



In [10]:

    
df.head()









    Out[10]:






  
    
      
     County  Exposure  Mortality
0  Umatilla      2.49      147.1
1    Morrow      2.57      130.1
2   Gilliam      3.41      129.9
3   Sherman      1.25      113.5
4     Wasco      1.62      137.5

3. Calculate the basic descriptive statistics on the data



In [11]:

    
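df[['Exposure', 'Mortality']].corr() # restrict to the numeric columns; County is text and recent pandas versions raise an error on it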









    Out[11]:






  
    
      
           Exposure  Mortality
Exposure   1.000000   0.926345
Mortality  0.926345   1.000000
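Step 3 asks for descriptive statistics, while the cell above returns the correlation matrix. A minimal sketch of the descriptive summary follows. It assumes only the five rows displayed by df.head(), typed in by hand, so the numbers describe that sample and not the full hanford.csv file; the name `sample` is introduced here for illustration.

```python
import pandas as pd

# Assumption: only the five rows shown by df.head(), not the full data set.
sample = pd.DataFrame({
    "County": ["Umatilla", "Morrow", "Gilliam", "Sherman", "Wasco"],
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5],
})

# describe() reports count, mean, std, min, quartiles and max
# for each numeric column (County is skipped because it is text).
stats = sample.describe()
print(stats)
```

In the notebook itself the same call would be df.describe() on the full DataFrame.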

4. Calculate the coefficient of correlation (r) and generate the scatter plot. Does there seem to be a correlation worthy of investigation?



In [12]:

    
lm = smf.ols(formula="Mortality~Exposure",data=df).fit() #notice the formula regresses Y on X (Y~X)



In [13]:

    
lm.params









    Out[13]:





Intercept    114.715631
Exposure       9.231456
dtype: float64



In [14]:

    
lm.summary() # R-squared is 0.858, a fit strong enough to be worth investigating









    



/Users/Monica/.virtualenvs/dataanalysis/lib/python3.5/site-packages/scipy/stats/stats.py:1535: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  "anyway, n=%i" % int(n))






    Out[14]:





OLS Regression Results

  Dep. Variable:         Mortality       R-squared:             0.858
  Model:                    OLS          Adj. R-squared:        0.838
  Method:              Least Squares     F-statistic:           42.34
  Date:              Thu, 28 Jul 2016    Prob (F-statistic):  0.000332
  Time:                  10:29:52        Log-Likelihood:      -35.397
  No. Observations:            9         AIC:                   74.79
  Df Residuals:                7         BIC:                   75.19
  Df Model:                    1
  Covariance Type:       nonrobust

               coef      std err       t       P>|t|  [95.0% Conf. Int.]
  Intercept    114.7156      8.046     14.258   0.000     95.691   133.741
  Exposure       9.2315      1.419      6.507   0.000      5.877    12.586

  Omnibus:         2.914    Durbin-Watson:         1.542
  Prob(Omnibus):   0.233    Jarque-Bera (JB):      0.915
  Skew:           -0.030    Prob(JB):              0.633
  Kurtosis:        1.439    Cond. No.               9.97
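Several entries in the summary can be reproduced by hand from the others, which is a useful sanity check on how to read the table. The sketch below uses only the rounded values printed above; the critical t value 2.364624 (97.5th percentile, 7 degrees of freedom) is a standard table value supplied here as an assumption, and the variable names are illustrative.

```python
# Values copied from the printed summary (rounded as displayed).
coef = 9.2315       # Exposure coefficient
std_err = 1.419     # its standard error
r_squared = 0.858
n_obs = 9           # No. Observations
n_predictors = 1    # Df Model

# Assumption: two-sided 95% critical value of Student's t with 7 df.
t_crit = 2.364624

# t statistic is the coefficient divided by its standard error.
t_stat = coef / std_err

# With a single predictor, the F-statistic equals the squared t statistic.
f_stat = 6.507 ** 2

# 95% confidence interval for the slope.
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(t_stat, f_stat, ci_lower, ci_upper, adj_r_squared)
```

Small differences from the printed table in the last digit come from working with rounded inputs.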



In [15]:

    
intercept, slope = lm.params



In [16]:

    
ax = df.plot(kind='scatter', x='Exposure', y='Mortality', alpha=0.5)
ax.set_title('Cancer Mortality Rates Based on Exposure')
ax.set_xlabel('Index of Exposure')
ax.set_ylabel('Cancer Mortality per 100,000 man-years')









    Out[16]:





<matplotlib.text.Text at 0x107890c50>
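Step 4 asks for r itself. In a simple regression with one predictor, r has the sign of the slope and a magnitude equal to the square root of R-squared, so it can be recovered from the fit as a cross-check on the 0.926345 reported by the correlation matrix. The sketch below hard-codes the two values printed earlier in this notebook.

```python
import math

r_squared = 0.858   # R-squared as printed by lm.summary() (rounded)
slope = 9.231456    # Exposure coefficient from lm.params

# r takes the sign of the slope and the magnitude sqrt(R^2).
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 3))
```

The result agrees with the correlation matrix to three decimals; the remaining gap comes from R-squared being rounded in the summary.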

5. Create a linear regression model based on the available data to predict the mortality rate given a level of exposure



In [17]:

    
df.plot(kind="scatter",x="Exposure",y="Mortality")
plt.plot(df["Exposure"],slope*df["Exposure"]+intercept,"-",color="red") #we create the best fit line from the values in the fit model









    Out[17]:





[<matplotlib.lines.Line2D at 0x1079f7c88>]

6. Plot the linear regression line on the scatter plot of values. Calculate the r^2 (coefficient of determination)



In [18]:

    
R_squared = lm.rsquared # coefficient of determination read from the fitted model (0.858 in lm.summary())
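For reference, the coefficient of determination can also be computed directly from residuals as 1 - SS_res / SS_tot. The sketch below does this with NumPy on the five rows displayed by df.head() only, so its value will not match the 0.858 obtained from all nine observations; the names `r2_sample` and `r_sample` are illustrative.

```python
import numpy as np

# Assumption: only the five rows shown by df.head(), not the full data set.
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62])
mortality = np.array([147.1, 130.1, 129.9, 113.5, 137.5])

# Least-squares line through the sample (degree-1 polynomial fit).
slope_s, intercept_s = np.polyfit(exposure, mortality, 1)
predicted = slope_s * exposure + intercept_s

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((mortality - predicted) ** 2)
ss_tot = np.sum((mortality - mortality.mean()) ** 2)
r2_sample = 1 - ss_res / ss_tot

# For a one-predictor fit this equals the squared Pearson correlation.
r_sample = np.corrcoef(exposure, mortality)[0, 1]
print(r2_sample, r_sample ** 2)
```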

7. Predict the mortality rate (cancer mortality per 100,000 man-years) given an index of exposure of 10



In [19]:

    
index_ex = 10
plt.plot(index_ex,slope*index_ex+intercept,"o",color="red") #mark the predicted mortality at exposure = 10; a single point needs a marker, not a line style









    Out[19]:





[<matplotlib.lines.Line2D at 0x107b009e8>]



In [20]:

    
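# y = mx + b, using the intercept rounded to four decimals as shown in lm.summary()
rounded_intercept = 114.7156 # separate name so the full-precision intercept from lm.params is not overwritten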



In [21]:

    
slope * 10 + 114.7156









    Out[21]:





207.03016270763393



In [22]:

    
def predicting_mortality_rate(exposure):
    return slope * exposure + intercept



In [23]:

    
predicting_mortality_rate(10)









    Out[23]:





207.03019352841989
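The two predictions above differ in the fifth decimal place because one uses the intercept rounded to 114.7156 and the other the full-precision value from lm.params. A short worked sketch of that arithmetic follows, with the coefficients typed in as printed by lm.params; the function name `predict_mortality` is illustrative.

```python
slope = 9.231456              # Exposure coefficient from lm.params
intercept = 114.715631        # Intercept from lm.params
rounded_intercept = 114.7156  # Intercept as rounded in lm.summary()

def predict_mortality(exposure, slope=slope, intercept=intercept):
    """Return the fitted mortality for a given exposure index (y = mx + b)."""
    return slope * exposure + intercept

full = predict_mortality(10)
rounded = predict_mortality(10, intercept=rounded_intercept)

# The gap between the two predictions is exactly the rounding error
# in the intercept.
print(full, rounded, full - rounded)
```

Reading coefficients from lm.params, as the function in this notebook does, avoids carrying rounding error into the prediction.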


