Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [1]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [3]:
df=pd.read_csv("hanford.csv")

3. Calculate the basic descriptive statistics on the data


In [4]:
df.describe()


Out[4]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

In [5]:
df.corr()


Out[5]:
Exposure Mortality
Exposure 1.000000 0.926345
Mortality 0.926345 1.000000

4. Find a reasonable threshold to say exposure is high and recode the data


In [8]:
df['Mortality'].hist()


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x11715d2b0>

In [9]:
df['Mortality'].mean()


Out[9]:
157.34444444444446

In [13]:
# Use the medians as thresholds: 147.1 for Mortality and 3.41 for Exposure (the 50% row of df.describe())
df['Mort_high'] = df['Mortality'].apply(lambda x: 1 if x >= 147.1 else 0)
df['Expo_high'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)

In [14]:
def exposure_high(x):
    # Same rule as the Expo_high lambda: 1 if exposure is at or above the median (3.41), else 0
    if x >= 3.41:
        return 1
    else:
        return 0
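
The recoding above hard-codes the medians. A minimal sketch of the same step with the thresholds computed from the data is shown below. The values are copied from the table printed by the `df` cell so the block runs without hanford.csv; the names `mort_threshold` and `expo_threshold` are introduced here for illustration only.

```python
import pandas as pd

# Values copied from the notebook's printed table (assumption: they match hanford.csv)
df = pd.DataFrame({
    "County": ["Umatilla", "Morrow", "Gilliam", "Sherman", "Wasco",
               "HoodRiver", "Portland", "Columbia", "Clatsop"],
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})

# Compute the thresholds instead of typing them in
mort_threshold = df["Mortality"].median()
expo_threshold = df["Exposure"].median()

# A boolean comparison cast to int gives the same 0/1 coding as the lambdas
df["Mort_high"] = (df["Mortality"] >= mort_threshold).astype(int)
df["Expo_high"] = (df["Exposure"] >= expo_threshold).astype(int)

print(df)
```

Computing the thresholds this way keeps the recoding consistent if the data file changes.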

In [15]:
df


Out[15]:
County Exposure Mortality Mort_high Expo_high
0 Umatilla 2.49 147.1 1 0
1 Morrow 2.57 130.1 0 0
2 Gilliam 3.41 129.9 0 1
3 Sherman 1.25 113.5 0 0
4 Wasco 1.62 137.5 0 0
5 HoodRiver 3.83 162.3 1 1
6 Portland 11.64 207.5 1 1
7 Columbia 6.41 177.9 1 1
8 Clatsop 8.34 210.3 1 1

In [7]:
Q1=df['Exposure'].quantile(q=0.25)
Q1


Out[7]:
2.4900000000000002

In [ ]:
Q2=df['Exposure'].quantile(q=0.5)
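
If quartiles are being considered as alternative thresholds, all three can be requested in one call. This is only a sketch: the exposure values are copied from the table printed earlier, and `q1`, `q2`, `q3`, and `iqr` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Exposure values copied from the notebook's printed table
exposure = pd.Series(
    [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    name="Exposure",
)

# quantile() accepts a list of probabilities and returns one value per entry
q1, q2, q3 = exposure.quantile([0.25, 0.5, 0.75])

# Interquartile range, a simple measure of spread
iqr = q3 - q1

print(q1, q2, q3, iqr)
```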

5. Create a logistic regression model


In [16]:
# scikit-learn expects a 2-D feature table; the target is the recoded Mort_high column
x = df[['Exposure']]
y = df['Mort_high']
lm = LogisticRegression()

In [ ]:
lm = lm.fit(x, y)
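
As a minimal sketch of step 5, assuming the model should predict the recoded Mort_high label from the continuous Exposure column (the notebook does not state this explicitly), the fit and its parameters could be inspected as follows. The data frame is rebuilt from the table printed earlier so the block runs without hanford.csv, and `slope`, `intercept`, and `boundary` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Values copied from the notebook's printed table
df = pd.DataFrame({
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})
# Recode mortality as 1 (high) at or above the median, else 0
df["Mort_high"] = (df["Mortality"] >= df["Mortality"].median()).astype(int)

X = df[["Exposure"]]   # 2-D feature table, as scikit-learn expects
y = df["Mort_high"]    # binary target

lm = LogisticRegression()  # default settings, which include L2 regularization
lm.fit(X, y)

slope = lm.coef_[0][0]
intercept = lm.intercept_[0]

# Exposure at which the fitted probability of "high" equals 0.5
boundary = -intercept / slope

print(slope, intercept, boundary)
```

The model estimates P(high) = 1 / (1 + exp(-(intercept + slope * Exposure))), so `boundary` is the exposure level where the predicted class switches from 0 to 1.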

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50
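
One way to answer this step is sketched below, again assuming that Mort_high is predicted from Exposure with scikit-learn's default settings. An exposure of 50 lies far outside the observed range (the maximum is 11.64), so the result is an extrapolation from nine counties. The names `new_county`, `pred`, and `prob_high` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Values copied from the notebook's printed table
df = pd.DataFrame({
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})
df["Mort_high"] = (df["Mortality"] >= df["Mortality"].median()).astype(int)

lm = LogisticRegression()
lm.fit(df[["Exposure"]], df["Mort_high"])

# Use a DataFrame with the same column name as the training features
new_county = pd.DataFrame({"Exposure": [50]})

pred = lm.predict(new_county)[0]              # predicted class label (0 or 1)
prob_high = lm.predict_proba(new_county)[0][1]  # probability of class 1

print(pred, prob_high)
```

Because the fitted slope is positive and 50 sits well above the decision boundary, the predicted label is expected to be 1 (high).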

