Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [1]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

In [5]:
cd C:\Users\Harsha Devulapalli\Desktop\algorithms\class7


C:\Users\Harsha Devulapalli\Desktop\algorithms\class7

2. Read in the hanford.csv file in the data/ folder


In [6]:
df=pd.read_csv('data/hanford.csv')

In [13]:
len(df)


Out[13]:
9

3. Calculate the basic descriptive statistics on the data


In [7]:
df.describe()


Out[7]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

In [20]:
df['Mortality'].hist(bins=5)


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x226633b2160>

4. Find a reasonable threshold to say exposure is high and recode the data


In [23]:
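# Cutoffs are the medians (50% row) from df.describe(): 147.1 for Mortality, 3.41 for Exposure
df['Mort_high'] = df['Mortality'].apply(lambda x: 1 if x >= 147.1 else 0)
df['Exposure_high'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)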

In [25]:
df


Out[25]:
County Exposure Mortality Mort_high Exposure_high
0 Umatilla 2.49 147.1 1 0
1 Morrow 2.57 130.1 0 0
2 Gilliam 3.41 129.9 0 1
3 Sherman 1.25 113.5 0 0
4 Wasco 1.62 137.5 0 0
5 HoodRiver 3.83 162.3 1 1
6 Portland 11.64 207.5 1 1
7 Columbia 6.41 177.9 1 1
8 Clatsop 8.34 210.3 1 1
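
The cutoffs 147.1 and 3.41 match the medians (the 50% row) in the describe() output. As a sketch of an alternative, the thresholds can be computed instead of hard-coded, and the comparison can be vectorized. The DataFrame is rebuilt inline from the table above so the snippet runs on its own; the names `mort_cutoff` and `exp_cutoff` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Data copied from the table above so the example is self-contained
df = pd.DataFrame({
    'County': ['Umatilla', 'Morrow', 'Gilliam', 'Sherman', 'Wasco',
               'HoodRiver', 'Portland', 'Columbia', 'Clatsop'],
    'Exposure': [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    'Mortality': [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})

# Use the median of each column as the "high" cutoff (illustrative choice)
mort_cutoff = df['Mortality'].median()
exp_cutoff = df['Exposure'].median()

# Vectorized comparison yields booleans; astype(int) turns them into 0/1
df['Mort_high'] = (df['Mortality'] >= mort_cutoff).astype(int)
df['Exposure_high'] = (df['Exposure'] >= exp_cutoff).astype(int)

print(df)
```

With nine rows the median is the middle observation itself, so the `>=` comparison places that county in the "high" group, as in the table above.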

In [ ]:
def exposure_high(x):
    if x >= 3.41:
        return 1
    else:
        return 0
# Named-function equivalent of the lambda above; use it as df['Exposure'].apply(exposure_high)

5. Create a logistic regression model


In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
lm = LogisticRegression()

In [28]:
x = np.asarray(df[['Exposure_high']])
y = np.asarray(df['Mort_high'])

In [29]:
lm = lm.fit(x,y)
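
To see what the fitted model learned, the coefficient, the intercept, and the predicted probabilities for both values of the binary feature can be inspected. This is a minimal sketch: the arrays are typed in from the recoded table above so it runs standalone, and the exact numbers depend on the scikit-learn version and its default regularization, so none are shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Recoded columns from the table above
# Feature matrix must be 2D: nine rows, one column (Exposure_high)
x = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1]).reshape(-1, 1)
# Target vector (Mort_high)
y = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])

lm = LogisticRegression().fit(x, y)

coef = lm.coef_[0][0]        # weight on Exposure_high
intercept = lm.intercept_[0]

# Each row is [P(Mort_high=0), P(Mort_high=1)] for Exposure_high = 0 and 1
proba = lm.predict_proba([[0], [1]])

print('coefficient:', coef)
print('intercept:', intercept)
print(proba)
```

A positive coefficient means that counties flagged as high exposure are assigned a higher probability of high mortality than the others.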

6. Predict whether the mortality rate (cancer mortality per 100,000 man-years) will be high at an exposure level of 50


In [30]:
# predict() expects a 2D array of shape (n_samples, n_features)
# The model was fit on the binary Exposure_high column; 50 is above the 3.41 cutoff, so its recoded value is 1
lm.predict([[50]])

Out[30]:
array([1], dtype=int64)
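
Because the model was fit on the recoded 0/1 feature, a raw exposure value is most consistently handled by recoding it with the same cutoff before predicting. The sketch below assumes the 3.41 cutoff from step 4; `predict_from_exposure` and `EXPOSURE_CUTOFF` are hypothetical names introduced for illustration, and the training arrays are typed in from the recoded table so the snippet runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EXPOSURE_CUTOFF = 3.41  # assumed: same cutoff used to build Exposure_high

# Recoded columns from the table in step 4
x = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1]).reshape(-1, 1)  # Exposure_high
y = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])                 # Mort_high

lm = LogisticRegression().fit(x, y)


def predict_from_exposure(model, exposure, cutoff=EXPOSURE_CUTOFF):
    """Recode a raw exposure value to 0/1, then predict Mort_high."""
    exposure_high = 1 if exposure >= cutoff else 0
    # Wrap the single value in a nested list: one sample, one feature
    return int(model.predict([[exposure_high]])[0])


result = predict_from_exposure(lm, 50)
print(result)
```

Passing the nested list `[[exposure_high]]` gives predict() the two-dimensional input it expects.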