Apply logistic regression to categorize whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [21]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression
import statsmodels.formula.api as smf

2. Read in the hanford.csv file in the data/ folder


In [9]:
df = pd.read_csv("hanford.csv")

3. Calculate the basic descriptive statistics on the data


In [43]:
df.describe()  # not echoed: a notebook cell only displays its last expression
df.corr()


Out[43]:
           Exposure  Mortality
Exposure   1.000000   0.926345
Mortality  0.926345   1.000000

4. Find a reasonable threshold to say exposure is high and recode the data


In [32]:
# "High exposure" could be defined as 1.5 x IQR, where IQR = Q3 - Q1 = 6.41 - 2.49 = 3.92
high_exposure = 1.5 * (6.41 - 2.49)

In [33]:
df['Exposure'].describe()


Out[33]:
count     9.000000
mean      4.617778
std       3.491192
min       1.250000
25%       2.490000
50%       3.410000
75%       6.410000
max      11.640000
Name: Exposure, dtype: float64
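
Step 4 also asks for the data to be recoded, which no cell above does. A minimal sketch of one way to do it, assuming the 1.5 x IQR rule proposed in In [32] and a frame with an `Exposure` column. The helper `recode_exposure` and the `high_exposure` column name are hypothetical, and the small frame holds toy values, not the Hanford data.

```python
import pandas as pd

def recode_exposure(frame, column="Exposure", factor=1.5):
    """Return (threshold, copy of frame with a 0/1 high_exposure column)."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    threshold = factor * (q3 - q1)  # 1.5 x IQR, the rule proposed in this notebook
    recoded = frame.copy()
    # 1 where the value exceeds the threshold, 0 otherwise
    recoded["high_exposure"] = (recoded[column] > threshold).astype(int)
    return threshold, recoded

# Toy values for illustration only (not the Hanford data)
toy = pd.DataFrame({"Exposure": [1.0, 2.0, 3.0, 4.0, 9.0]})
threshold, toy_recoded = recode_exposure(toy)
```

On the real frame this would be called as `threshold, df = recode_exposure(df)`. With the quartiles reported in Out[33], 1.5 x (6.41 - 2.49) gives a threshold of 5.88.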

5. Create a logistic regression model


In [40]:
# Note: smf.ols fits an ordinary least-squares (linear) model, not a logistic one.
lm = smf.ols(formula="Mortality ~ Exposure", data=df).fit()  # the formula regresses Y on X (Y ~ X)

In [41]:
intercept, slope = lm.params

In [42]:
lm.params


Out[42]:
Intercept    114.715631
Exposure       9.231456
dtype: float64
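
The fit above uses `smf.ols`, which is a linear model rather than a logistic one. A minimal sketch of a logistic fit using scikit-learn's `LogisticRegression` (imported in step 1), assuming a binary target has already been defined. The helper `fit_logistic`, the `toy_labels` target, and the toy values are hypothetical and are not the Hanford data; the notebook itself never defines what counts as "high" mortality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_logistic(exposure, high_mortality):
    """Fit P(high mortality | exposure) with scikit-learn's default L2 penalty."""
    X = np.asarray(exposure, dtype=float).reshape(-1, 1)  # sklearn expects a 2-D feature array
    y = np.asarray(high_mortality, dtype=int)
    return LogisticRegression().fit(X, y)

# Toy values for illustration only (not the Hanford data)
toy_exposure = [1.0, 2.0, 3.0, 4.0, 7.0, 8.0, 9.0, 10.0]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_logistic(toy_exposure, toy_labels)

label_at_50 = model.predict([[50.0]])[0]          # predicted class (0 or 1)
prob_at_50 = model.predict_proba([[50.0]])[0, 1]  # estimated probability of class 1
```

scikit-learn regularizes by default, so its coefficients will differ from an unpenalized maximum-likelihood fit. With only nine counties, and a target that may be perfectly separated by exposure, that penalty is likely what keeps the estimates finite.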

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50


In [ ]:
# y = mx + b, with m = slope and b = intercept from the OLS fit above
slope * 50 + intercept
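
The line equation can be worked through with the coefficients reported in Out[42]. This is a sketch of the arithmetic only: it yields a predicted rate from the linear fit, not a high/low classification, and the names `predicted_mortality` and `is_extrapolation` are illustrative.

```python
# Coefficients as reported in Out[42]
intercept = 114.715631
slope = 9.231456

exposure = 50
predicted_mortality = slope * exposure + intercept  # y = mx + b

max_observed_exposure = 11.64  # maximum exposure from Out[33]
is_extrapolation = exposure > max_observed_exposure

print(round(predicted_mortality, 2), is_extrapolation)  # prints: 576.29 True
```

An exposure of 50 is more than four times the largest observed value (11.64), so this figure extrapolates the fitted line well beyond the data and should be read with caution.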
