Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [19]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [25]:
df = pd.read_csv('hanford.csv')
df.columns


Out[25]:
Index(['County', 'Exposure', 'Mortality'], dtype='object')

In [26]:
df.head(2)


Out[26]:
     County  Exposure  Mortality
0  Umatilla      2.49      147.1
1    Morrow      2.57      130.1

3. Calculate the basic descriptive statistics on the data


In [21]:
df.describe()


Out[21]:
        Exposure   Mortality
count   9.000000    9.000000
mean    4.617778  157.344444
std     3.491192   34.791346
min     1.250000  113.500000
25%     2.490000  130.100000
50%     3.410000  147.100000
75%     6.410000  177.900000
max    11.640000  210.300000

4. Find reasonable thresholds for calling exposure and mortality high, and recode the data


In [43]:
df['Mortality'].hist(bins=5)


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x109462400>

In [46]:
df['Mortality'].mean()


Out[46]:
157.34444444444446

In [50]:
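# Cutoffs are the medians (the 50% row of df.describe()): 147.1 for Mortality, 3.41 for Exposure
df['Mort_high'] = df['Mortality'].apply(lambda x: 1 if x >= 147.1 else 0)
df['Expo_high'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)
# lambda defines a small anonymous function inline, here applied to each value in the column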

In [51]:
df


Out[51]:
      County  Exposure  Mortality  Exposure_low  Mort_high  Expo_high
0   Umatilla      2.49      147.1             1          1          0
1     Morrow      2.57      130.1             1          0          0
2    Gilliam      3.41      129.9             1          0          1
3    Sherman      1.25      113.5             1          0          0
4      Wasco      1.62      137.5             1          0          0
5  HoodRiver      3.83      162.3             0          1          1
6   Portland     11.64      207.5             0          1          1
7   Columbia      6.41      177.9             0          1          1
8    Clatsop      8.34      210.3             0          1          1
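
The cutoffs 147.1 and 3.41 used in the recoding step coincide with the medians (the 50% row) reported by df.describe(). A minimal sketch, assuming the median is the intended threshold, derives them from the data instead of hard-coding them. The DataFrame is rebuilt from the table above so the block runs on its own; the names data, mort_cut, and expo_cut are illustrative and not part of the original notebook.

```python
import pandas as pd

# Values copied from the table above so the example is self-contained
data = pd.DataFrame({
    'County': ['Umatilla', 'Morrow', 'Gilliam', 'Sherman', 'Wasco',
               'HoodRiver', 'Portland', 'Columbia', 'Clatsop'],
    'Exposure': [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    'Mortality': [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})

# Assumption: "high" means at or above the column median
mort_cut = data['Mortality'].median()
expo_cut = data['Exposure'].median()

# The comparison yields booleans; astype(int) converts them to 0/1
data['Mort_high'] = (data['Mortality'] >= mort_cut).astype(int)
data['Expo_high'] = (data['Exposure'] >= expo_cut).astype(int)

print(data[['County', 'Mort_high', 'Expo_high']])
```

The vectorized comparison does the same job as the apply/lambda version, and the thresholds follow the data if rows are added or corrected.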

5. Create a logistic regression model


In [54]:
lm = LogisticRegression()

In [64]:
x = np.asarray(df[['Expo_high']])
y = np.asarray(df['Mort_high'])

In [65]:
lm = lm.fit(x,y)
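
Once the model is fit, it can help to look at what it learned. The sketch below rebuilds the two binary columns from the recoded table so it runs on its own, then reads the fitted slope and intercept and compares predict_proba with the logistic function computed by hand. The names model, slope, intercept, p_low, and p_high are illustrative, and the exact numbers depend on the scikit-learn version and its default regularization, so none are quoted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Binary columns copied from the recoded table (Expo_high, Mort_high)
x = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1]).reshape(-1, 1)  # shape (9, 1)
y = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(x, y)

# coef_ has shape (1, n_features); intercept_ has shape (1,)
slope = model.coef_[0, 0]
intercept = model.intercept_[0]

# predict_proba returns one row per sample: [P(class 0), P(class 1)]
proba = model.predict_proba(np.array([[0], [1]]))
p_low, p_high = proba[0, 1], proba[1, 1]

# The same probability from the logistic function 1 / (1 + exp(-(b + w*x)))
p_high_manual = 1.0 / (1.0 + np.exp(-(intercept + slope * 1)))

print(slope, intercept, p_low, p_high, p_high_manual)
```

A positive slope means that counties flagged as high exposure are assigned a higher probability of high mortality than counties flagged as low exposure.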

6. Predict whether the mortality rate (cancer deaths per 100,000 man-years) will be high at an exposure level of 50


In [66]:
# lm was fit on the binary Expo_high feature, so recode 50 with the 3.41 cutoff; predict needs a 2D array
lm.predict([[int(50 >= 3.41)]])


Out[66]:
array([1])
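
The model above was fit on the binary Expo_high column, so a raw exposure of 50 is not on the scale it was trained on. As an alternative, and not the approach taken in this notebook, the sketch below fits a separate model directly on the continuous Exposure values and predicts at 50. The names exposure, mort_high, raw_model, and new_point are illustrative. Note that 50 lies far beyond the largest observed exposure (11.64), so any prediction there is an extrapolation from nine counties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Exposure and recoded mortality copied from the table so the block is self-contained
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34])
mort_high = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])

# scikit-learn expects X with shape (n_samples, n_features)
X = exposure.reshape(-1, 1)
raw_model = LogisticRegression().fit(X, mort_high)

# A single new observation must also be 2D: one row, one column
new_point = np.array([[50.0]])
pred = raw_model.predict(new_point)
prob_high = raw_model.predict_proba(new_point)[0, 1]

print(pred, prob_high)
```

Passing a 2D array of shape (1, 1) also avoids the 1D-input deprecation that older scikit-learn releases warn about and newer ones reject.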