Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [15]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [16]:
df = pd.read_csv("../data/hanford.csv")

3. Calculate the basic descriptive statistics on the data


In [17]:
df.describe()


Out[17]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

4. Find reasonable thresholds to say exposure and mortality are high, and recode the data


In [5]:
df['Mortality'].hist(bins=5)


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4a3f8cf4a8>

In [6]:
df['Mortality'].mean()


Out[6]:
157.34444444444446

In [18]:
# Use the medians from df.describe() (the 50% row) as the cutoffs for "high"
df['Mort_high'] = df['Mortality'].apply(lambda x: 1 if x >= 147.1 else 0)
df['Expo_high'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)

In [20]:
def exposure_high(x):
    """Return 1 if exposure x is at or above the median (3.41), else 0."""
    if x >= 3.41:
        return 1
    else:
        return 0
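
The cutoffs 147.1 and 3.41 are the medians reported by df.describe(). As a minimal sketch, the same recoding can be computed from the data instead of being hard-coded. The frame is rebuilt inline from the nine rows displayed in this notebook so the snippet runs on its own; the names df_sketch, mort_cut, and expo_cut are introduced here for illustration only.

```python
import pandas as pd

# The nine rows of hanford.csv as displayed in this notebook
df_sketch = pd.DataFrame({
    "County": ["Umatilla", "Morrow", "Gilliam", "Sherman", "Wasco",
               "HoodRiver", "Portland", "Columbia", "Clatsop"],
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})

# Derive the cutoffs from the data (assumption: the median is the chosen threshold)
mort_cut = df_sketch["Mortality"].median()
expo_cut = df_sketch["Exposure"].median()

# Vectorized comparison, then cast True/False to 1/0
df_sketch["Mort_high"] = (df_sketch["Mortality"] >= mort_cut).astype(int)
df_sketch["Expo_high"] = (df_sketch["Exposure"] >= expo_cut).astype(int)

print(df_sketch)
```

Comparing whole columns avoids calling a Python function once per row, and the cutoffs follow the data if the file changes.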

In [19]:
df


Out[19]:
County Exposure Mortality Mort_high Expo_high
0 Umatilla 2.49 147.1 1 0
1 Morrow 2.57 130.1 0 0
2 Gilliam 3.41 129.9 0 1
3 Sherman 1.25 113.5 0 0
4 Wasco 1.62 137.5 0 0
5 HoodRiver 3.83 162.3 1 1
6 Portland 11.64 207.5 1 1
7 Columbia 6.41 177.9 1 1
8 Clatsop 8.34 210.3 1 1

5. Create a logistic regression model


In [10]:
lm = LogisticRegression()

In [25]:
x = np.asarray(df[['Exposure']])
y = np.asarray(df['Mort_high'])

In [26]:
lm = lm.fit(x,y)
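
A fitted binary logistic regression models P(high) = 1 / (1 + exp(-(b0 + b1 * exposure))). The sketch below, which assumes the nine Exposure values and Mort_high labels shown in the table above, reads b0 and b1 from the fitted model and checks the formula against predict_proba. The names model, b0, b1, and sigmoid are introduced here for illustration; the fitted values depend on the scikit-learn version and its default regularization, so none are quoted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Exposure and Mort_high columns as displayed in this notebook
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34])
mort_high = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])

# scikit-learn expects a 2D feature matrix: one row per sample, one column per feature
X = exposure.reshape(-1, 1)
model = LogisticRegression().fit(X, mort_high)

b0 = model.intercept_[0]   # intercept
b1 = model.coef_[0, 0]     # slope for the single Exposure feature


def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Probability of the "high" class computed by hand and by scikit-learn
manual = sigmoid(b0 + b1 * exposure)
from_sklearn = model.predict_proba(X)[:, 1]

print("intercept:", b0, "slope:", b1)
print("manual formula matches predict_proba:", np.allclose(manual, from_sklearn))
```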

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50


In [27]:
lm.predict([[50]])

Out[27]:
array([1])
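
An exposure level of 50 lies far outside the observed range (the maximum in the data is 11.64), so this prediction is an extrapolation. As a hedged sketch, the snippet below refits the same kind of model on the nine rows shown in this notebook and prints the predicted class and probability at a few exposure levels, passing them as a 2D array of shape (n_samples, 1). The names model and new_levels are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Exposure and Mort_high columns as displayed in this notebook
exposure = np.array([2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34])
mort_high = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(exposure.reshape(-1, 1), mort_high)

# One row per query; a single feature means a single column
new_levels = np.array([1.0, 3.41, 11.64, 50.0]).reshape(-1, 1)

labels = model.predict(new_levels)              # hard 0/1 class
probs = model.predict_proba(new_levels)[:, 1]   # P(Mort_high == 1)

for level, label, p in zip(new_levels.ravel(), labels, probs):
    print(f"exposure={level:6.2f}  predicted class={label}  P(high)={p:.3f}")
```

Looking at the probabilities alongside the class labels shows how confident the model is at each level, which the bare 0/1 output hides.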
