Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model



In [21]:

    
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the `data/` folder



In [9]:

    
df = pd.read_csv("hanford.csv")

3. Calculate the basic descriptive statistics on the data



In [43]:

    
df.describe()
df.corr()









    Out[43]:

           Exposure  Mortality
Exposure   1.000000   0.926345
Mortality  0.926345   1.000000
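
A notebook cell displays only the value of its last expression, so in the cell above `df.describe()` is computed but only the `df.corr()` table appears. A minimal sketch that returns both tables follows. The helper name `summarize` is hypothetical, and it assumes `Exposure` and `Mortality` are the numeric columns of interest.

```python
import pandas as pd

def summarize(frame: pd.DataFrame):
    """Return (descriptive statistics, correlation matrix) for the numeric columns."""
    # Keep only numeric columns so text columns do not break the correlation.
    numeric = frame.select_dtypes(include="number")
    return numeric.describe(), numeric.corr()
```

In the notebook, `stats, corr = summarize(df)` followed by `display(stats)` and `display(corr)` would show both tables in one cell.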

4. Find a reasonable threshold to say exposure is high and recode the data



In [32]:

    
# One option: define "high exposure" as 1.5 x IQR, where IQR = Q3 - Q1 = 6.41 - 2.49 = 3.92
iqr = df['Exposure'].quantile(0.75) - df['Exposure'].quantile(0.25)
high_exposure = 1.5 * iqr  # 1.5 * 3.92 = 5.88



In [33]:

    
df['Exposure'].describe()









    Out[33]:





count     9.000000
mean      4.617778
std       3.491192
min       1.250000
25%       2.490000
50%       3.410000
75%       6.410000
max      11.640000
Name: Exposure, dtype: float64
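
Step 4 also asks for the data to be recoded. The sketch below computes a 1.5 x IQR cut-off from the quartiles and adds a 0/1 column marking rows above it. The names `iqr_threshold`, `recode_high`, and `High_Exposure` are hypothetical, and the small frame only reuses the five summary points printed above as a stand-in for the full `Exposure` column.

```python
import pandas as pd

def iqr_threshold(values: pd.Series, k: float = 1.5) -> float:
    """Return k times the interquartile range (Q3 - Q1) of the values."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return float(k * (q3 - q1))

def recode_high(frame: pd.DataFrame, column: str, threshold: float,
                new_column: str) -> pd.DataFrame:
    """Return a copy of frame with a 0/1 column marking rows above the threshold."""
    out = frame.copy()
    out[new_column] = (out[column] > threshold).astype(int)
    return out

# Stand-in data: min, Q1, median, Q3, max from the describe() output above.
summary_points = pd.DataFrame({"Exposure": [1.25, 2.49, 3.41, 6.41, 11.64]})
cutoff = iqr_threshold(summary_points["Exposure"])  # 1.5 * (6.41 - 2.49)
recoded = recode_high(summary_points, "Exposure", cutoff, "High_Exposure")
```

On the real data this would be `df = recode_high(df, "Exposure", iqr_threshold(df["Exposure"]), "High_Exposure")`. Note that the more common outlier fence, Q3 + 1.5 x IQR = 6.41 + 5.88 = 12.29, lies above the maximum exposure of 11.64 and would flag no county at all.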




5. Create a logistic regression model



In [40]:

    
# Note: smf.ols fits an ordinary least squares (linear) model, not a logistic one.
# The formula regresses Y on X (Y ~ X).
lm = smf.ols(formula="Mortality ~ Exposure", data=df).fit()



In [41]:

    
intercept, slope = lm.params



In [42]:

    
lm.params









    Out[42]:





Intercept    114.715631
Exposure       9.231456
dtype: float64
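
The cells above fit a straight line with `smf.ols`, which predicts a mortality rate rather than a high/low class. A logistic model needs a binary outcome, so the sketch below first recodes `Mortality` against a cut-off and then fits the `LogisticRegression` class imported in step 1. The helper name and the `mortality_cutoff` argument are assumptions; the notebook does not say what counts as high mortality.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_high_mortality_model(frame: pd.DataFrame,
                             mortality_cutoff: float) -> LogisticRegression:
    """Fit P(high mortality | Exposure), where high means Mortality > mortality_cutoff."""
    X = frame[["Exposure"]].to_numpy()                        # 2-D feature matrix
    y = (frame["Mortality"] > mortality_cutoff).astype(int)   # 1 = high, 0 = not high
    if y.nunique() < 2:
        # A classifier cannot be fitted when every row falls in the same class.
        raise ValueError("cut-off must leave both high and non-high rows")
    return LogisticRegression().fit(X, y.to_numpy())
```

One possible call is `model = fit_high_mortality_model(df, df["Mortality"].median())`, with the median as an assumed cut-off; `model.predict_proba([[50]])[0, 1]` then gives the estimated probability of high mortality at an exposure of 50. scikit-learn applies L2 regularization by default, which keeps the fit stable on a sample of only nine counties.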

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50



In [ ]:

    
predicted_mortality = slope * 50 + intercept  # y = mx + b, using the OLS parameters from lm.params
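
As a worked check of `y = mx + b`, the sketch below plugs in the parameters printed in `Out[42]`. The function names and the `high_cutoff` argument are hypothetical, since the notebook does not define a mortality cut-off. An exposure of 50 also lies well beyond the largest observed value (11.64), so the result is an extrapolation and should be read with caution.

```python
INTERCEPT = 114.715631  # Intercept from Out[42]
SLOPE = 9.231456        # Exposure coefficient from Out[42]

def predict_mortality(exposure: float) -> float:
    """Linear prediction y = m * x + b with the fitted OLS parameters."""
    return SLOPE * exposure + INTERCEPT

def is_high(exposure: float, high_cutoff: float) -> bool:
    """Return True when the predicted mortality exceeds the assumed cut-off."""
    return predict_mortality(exposure) > high_cutoff

predicted = predict_mortality(50)  # 9.231456 * 50 + 114.715631, about 576.29
```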


