Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [1]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression



2. Read in the hanford.csv file in the data/ folder


In [4]:
df = pd.read_csv('../data/hanford.csv')

3. Calculate the basic descriptive statistics on the data


In [6]:
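# Select the numeric columns explicitly; the County column holds strings
df[['Exposure', 'Mortality']].corr()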


Out[6]:
           Exposure  Mortality
Exposure   1.000000   0.926345
Mortality  0.926345   1.000000

In [13]:
df.describe()


Out[13]:
        Exposure   Mortality
count   9.000000    9.000000
mean    4.617778  157.344444
std     3.491192   34.791346
min     1.250000  113.500000
25%     2.490000  130.100000
50%     3.410000  147.100000
75%     6.410000  177.900000
max    11.640000  210.300000

4. Find a reasonable threshold to say exposure is high and recode the data


In [9]:
Q1 = df['Exposure'].quantile(q=0.25)
Q1


Out[9]:
2.4900000000000002

In [10]:
Q2 = df['Exposure'].quantile(q=0.5)
Q2


Out[10]:
3.4100000000000001

In [11]:
Q3 = df['Exposure'].quantile(q=0.75)
Q3


Out[11]:
6.4100000000000001
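The three quartiles can also be computed in a single call. The sketch below is a minimal, self-contained illustration: the `exposure` Series is a stand-in built from the nine Exposure values printed in the county table in step 4, so it runs without the CSV file.

```python
import pandas as pd

# Exposure values copied from the county table printed in step 4
exposure = pd.Series(
    [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    name="Exposure",
)

# Passing a list of quantiles returns a Series indexed by 0.25, 0.50, 0.75
quartiles = exposure.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

The median (the 0.5 entry) is the value used as the exposure cut-off in the recoding cell.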

In [14]:
df['Mortality'].hist(bins=5)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ba52eb8>

In [15]:
df['Mortality'].mean()


Out[15]:
157.34444444444446

In [17]:
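# Mortality cut-off: 157.1, just below the mean (157.34); no county falls between the two
df['Mort_high'] = df['Mortality'].apply(lambda x: 1 if x >= 157.1 else 0)
# Exposure cut-off: 3.41, the median (Q2)
df['Expo_high'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)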

In [18]:
df  # Mort_high and Expo_high now hold the binary high/low labels


Out[18]:
      County  Exposure  Mortality  Mort_high  Expo_high
0   Umatilla      2.49      147.1          0          0
1     Morrow      2.57      130.1          0          0
2    Gilliam      3.41      129.9          0          1
3    Sherman      1.25      113.5          0          0
4      Wasco      1.62      137.5          0          0
5  HoodRiver      3.83      162.3          1          1
6   Portland     11.64      207.5          1          1
7   Columbia      6.41      177.9          1          1
8    Clatsop      8.34      210.3          1          1
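The recoding can also be written as a vectorized comparison, with the cut-offs derived from the data instead of typed in by hand. This is a sketch under assumptions: `df_demo`, `mort_cut`, and `expo_cut` are illustrative names, the frame is rebuilt from the table above so the block runs on its own, and it uses the exact mean mortality (157.34) rather than 157.1, which yields the same labels for these nine counties.

```python
import pandas as pd

# Rebuild the table above so the example is self-contained
df_demo = pd.DataFrame({
    "County": ["Umatilla", "Morrow", "Gilliam", "Sherman", "Wasco",
               "HoodRiver", "Portland", "Columbia", "Clatsop"],
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})

# Cut-offs taken from the data: mean mortality and median exposure
mort_cut = df_demo["Mortality"].mean()
expo_cut = df_demo["Exposure"].median()

# A boolean comparison cast to int gives the same 0/1 coding as the lambda
df_demo["Mort_high"] = (df_demo["Mortality"] >= mort_cut).astype(int)
df_demo["Expo_high"] = (df_demo["Exposure"] >= expo_cut).astype(int)
print(df_demo)
```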

5. Create a logistic regression model


In [19]:
# Assumption: raw Exposure is the predictor (step 6 predicts from an exposure level); Mort_high is the target
x = df[['Exposure']].values  # 2-D array of shape (n_samples, 1), as scikit-learn expects
y = df['Mort_high']
lm = LogisticRegression()

In [ ]:
lm = lm.fit(x, y)

6. Predict whether the mortality rate (cancer per 100,000 man-years) will be high at an exposure level of 50


In [ ]:
lm.predict([[50]])  # predict expects a 2-D input: one row, one feature
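Steps 5 and 6 can be sketched end to end as follows. This is a minimal illustration, not necessarily the intended solution: it assumes raw Exposure as the single predictor and the Mort_high labels from step 4 as the target, uses scikit-learn's default `LogisticRegression` settings, and the names `exposure`, `mort_high`, `model`, and `new_exposure` are illustrative. The data are copied from the table in step 4 so the block runs without the CSV file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Exposure and Mort_high columns copied from the table in step 4
exposure = np.array(
    [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34]
).reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)
mort_high = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

# Fit a logistic regression with the library defaults
model = LogisticRegression()
model.fit(exposure, mort_high)

# Predict the class and the probability of "high" at an exposure of 50
new_exposure = np.array([[50.0]])
pred = model.predict(new_exposure)
proba_high = model.predict_proba(new_exposure)[0, 1]
print("predicted class:", pred[0])
print("P(high mortality):", proba_high)
```

An exposure of 50 lies far beyond the largest observed value (11.64), and the model is fitted on only nine counties, so the prediction is an extrapolation and should be read with caution.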
