1. Import the necessary packages to read in the data, plot, and create a linear regression model



In [1]:

    
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt # package for doing plotting (necessary for adding the line)
import statsmodels.formula.api as smf

2. Read in the hanford.csv file



In [7]:

    
cd C:\Users\Harsha Devulapalli\Desktop\algorithms\class6









    



C:\Users\Harsha Devulapalli\Desktop\algorithms\class6



In [8]:

    
df=pd.read_csv("data/hanford.csv")

3. Calculate the basic descriptive statistics on the data



In [10]:

    
df.describe()

4. Calculate the coefficient of correlation (r) and generate the scatter plot. Does there seem to be a correlation worthy of investigation?



In [11]:

    
df.corr()









    Out[11]:






  
    
      
               Exposure  Mortality
    Exposure   1.000000   0.926345
    Mortality  0.926345   1.000000
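The matrix reports r ≈ 0.926 between Exposure and Mortality, a strong positive linear association, so the relationship looks worth investigating (with the caveat that the dataset has only 9 rows). As a rough sketch of what `df.corr()` computes for a pair of columns, here is the Pearson coefficient written out by hand. The function name `pearson_r` and the toy values are hypothetical illustrations, not the Hanford data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values, NOT the Hanford data
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(toy_x, toy_y))
```

A value near +1 or -1 indicates a nearly linear relationship; a value near 0 indicates little linear association.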



In [15]:

    
df.plot(kind='scatter',x='Exposure',y='Mortality')









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x233ec2e82e8>

5. Create a linear regression model based on the available data to predict the mortality rate given a level of exposure



In [16]:

    
lm = smf.ols(formula="Mortality~Exposure",data=df).fit()



In [18]:

    
lm.params









    Out[18]:





Intercept    114.715631
Exposure       9.231456
dtype: float64



In [19]:

    
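# Select the coefficients by label so the code does not depend on their order
intercept, slope = lm.params["Intercept"], lm.params["Exposure"]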
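As a sanity check on the fitted coefficients: for a one-predictor least-squares line, the slope equals r · (s_y / s_x) and the intercept equals ȳ − slope · x̄. The sketch below plugs in the values printed elsewhere in this notebook (`df.corr()`, `df.describe()`); the variable names are illustrative, and the inputs are rounded to six decimals, so the results should agree with `lm.params` only up to that rounding.

```python
# Values copied from the notebook's df.corr() and df.describe() outputs
r = 0.926345
mean_x, std_x = 4.617778, 3.491192       # Exposure
mean_y, std_y = 157.344444, 34.791346    # Mortality

# Simple linear regression identities
slope = r * std_y / std_x
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))
```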



6. Plot the linear regression line on the scatter plot of values. Calculate the r^2 (coefficient of determination)



In [22]:

    
df.plot(kind="scatter",x="Exposure",y="Mortality")
plt.plot(df["Exposure"],slope*df["Exposure"]+intercept,"-",color="red")









    Out[22]:





[<matplotlib.lines.Line2D at 0x233ec60e438>]



In [26]:

    
r = df.corr()['Exposure']['Mortality']
r*r









    Out[26]:





0.85811472686989476
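Squaring r works here because, for a regression with a single predictor and an intercept, the coefficient of determination equals r². The more general definition is R² = 1 − SS_res / SS_tot, which is what the fitted statsmodels result exposes as `lm.rsquared`. A minimal sketch of that definition follows, using hypothetical toy values (not the Hanford data) and illustrative helper names `fit_line` and `r_squared`.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Residual sum of squares around the fitted line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares around the mean of y
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical toy values, NOT the Hanford data
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(r_squared(toy_x, toy_y))
```

For the Hanford fit, the value of about 0.858 means the line accounts for roughly 86% of the variance in Mortality.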

7. Predict the mortality rate (Cancer per 100,000 man years) given an index of exposure = 10



In [23]:

    
def predictor(exposure):
    return intercept+float(exposure)*slope



In [24]:

    
predictor(10)









    Out[24]:





207.03019352841989
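An exposure index of 10 lies inside the observed range (1.25 to 11.64 according to `df.describe()`), so this prediction is an interpolation. A linear fit on 9 points says little about values far outside that range, so a predictor could also report whether the input is within the observed data. The sketch below hard-codes the coefficients and range printed in this notebook; the name `predict_mortality` and the returned flag are illustrative assumptions, not part of the original exercise.

```python
# Coefficients copied from the lm.params output; range from df.describe()
INTERCEPT = 114.715631
SLOPE = 9.231456
EXPOSURE_MIN, EXPOSURE_MAX = 1.25, 11.64

def predict_mortality(exposure):
    """Return (prediction, within_observed_range) for an exposure index."""
    exposure = float(exposure)
    in_range = EXPOSURE_MIN <= exposure <= EXPOSURE_MAX
    return INTERCEPT + SLOPE * exposure, in_range

prediction, in_range = predict_mortality(10)
print(round(prediction, 2), in_range)
```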



Output of df.describe() (step 3):

           Exposure   Mortality
    count   9.000000    9.000000
    mean    4.617778  157.344444
    std     3.491192   34.791346
    min     1.250000  113.500000
    25%     2.490000  130.100000
    50%     3.410000  147.100000
    75%     6.410000  177.900000
    max    11.640000  210.300000