Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [51]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [52]:
df = pd.read_csv('data/hanford.csv')

In [53]:
df.head()


Out[53]:
County Exposure Mortality
0 Umatilla 2.49 147.1
1 Morrow 2.57 130.1
2 Gilliam 3.41 129.9
3 Sherman 1.25 113.5
4 Wasco 1.62 137.5

In [54]:
df['Mortality'] = df['Mortality'].astype(float)  # ensure a numeric dtype (it is already float64 per info() below)

3. Calculate the basic descriptive statistics on the data


In [55]:
df.describe()


Out[55]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

In [56]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
County       9 non-null object
Exposure     9 non-null float64
Mortality    9 non-null float64
dtypes: float64(2), object(1)
memory usage: 296.0+ bytes

4. Find a reasonable threshold to say exposure is high and recode the data

Step 1: prepare the feature by recoding Exposure as high (1) or low (0). A reasonable threshold is the 75th percentile of Exposure, 6.41, taken from the describe() output above.


In [57]:
def high_exposure(x):
    # 1 if exposure exceeds 6.41 (the 75th percentile of Exposure), else 0
    if x > 6.41:
        return 1
    else:
        return 0

In [58]:
df['Exposure_classification'] = df['Exposure'].apply(high_exposure)

In [59]:
df.head()


Out[59]:
County Exposure Mortality Exposure_classification
0 Umatilla 2.49 147.1 0
1 Morrow 2.57 130.1 0
2 Gilliam 3.41 129.9 0
3 Sherman 1.25 113.5 0
4 Wasco 1.62 137.5 0
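The 6.41 cutoff above is hard-coded from the describe() output. As a sketch, the same recode can be done in a vectorized way that derives the threshold from the data itself; `recode_high` is a hypothetical helper name, not part of the original notebook:

```python
import pandas as pd

def recode_high(series, q=0.75):
    # label values strictly above the q-th quantile as 1, everything else as 0
    return (series > series.quantile(q)).astype(int)

# equivalent to the apply-based recode above:
# df['Exposure_classification'] = recode_high(df['Exposure'])
```

With nine rows, `quantile(0.75)` lands exactly on the seventh sorted value (6.41), so this produces the same labels as the hard-coded threshold.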

5. Create a logistic regression model


In [61]:
lm = LogisticRegression()

In [73]:
x = np.asarray(df[['Mortality']])
y = np.asarray(df['Exposure_classification'])

In [74]:
x


Out[74]:
array([[ 147.1],
       [ 130.1],
       [ 129.9],
       [ 113.5],
       [ 137.5],
       [ 162.3],
       [ 207.5],
       [ 177.9],
       [ 210.3]])

In [75]:
y


Out[75]:
array([0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int64)


In [76]:
lm = lm.fit(x,y)

In [77]:
lm.score(x,y)


Out[77]:
0.77777777777777779
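The score of 0.778 is mean accuracy: the fraction of samples (7 of 9 here) where predict() matches the label. A small self-contained check on synthetic data (not the Hanford data) illustrating the equivalence:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic, well-separated single-feature data
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# score() is just the mean of elementwise prediction correctness
manual_accuracy = (clf.predict(X) == y).mean()
print(manual_accuracy == clf.score(X, y))  # prints True
```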

In [78]:
lm.coef_


Out[78]:
array([[-0.00122093]])

In [79]:
lm.intercept_


Out[79]:
array([-0.6709378])
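The fitted model turns these two numbers into a probability via the sigmoid of the linear decision function. A sketch of that calculation by hand, using the coefficient and intercept printed above:

```python
import numpy as np

# coefficient and intercept from the fitted model above
coef, intercept = -0.00122093, -0.6709378

def prob_high_exposure(mortality):
    # sigmoid of the linear decision function coef * x + intercept
    return 1.0 / (1.0 + np.exp(-(coef * mortality + intercept)))

prob_high_exposure(147.1)  # well below 0.5, so predicted class is 0
```

Because both the coefficient and the intercept are negative, this probability stays below 0.5 for every positive mortality value.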

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50


Note: the model above was fit with Mortality as its single feature and Exposure_classification as the target, so it predicts an exposure class from a mortality rate, the reverse of the question as posed. Also, predict expects a 2-D array of shape (n_samples, n_features), not a flat list. With the coefficient and intercept shown above (both negative), the decision function is negative for any positive input, so the model predicts class 0 everywhere:

In [80]:
lm.predict(np.array([[50.0]]))


Out[80]:
array([0], dtype=int64)

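To answer question 6 in the direction it is actually posed (exposure in, high-mortality class out), the feature and target would need to be swapped. A minimal sketch of that reverse model; `fit_mortality_model` is a hypothetical helper name, and the 177.9 cutoff is the 75th percentile of Mortality from the describe() output above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_mortality_model(df, cutoff=177.9):
    # recode mortality: 1 if above the cutoff (75th percentile, 177.9), else 0
    y = (df['Mortality'] > cutoff).astype(int)
    X = df[['Exposure']].values  # single feature: exposure
    return LogisticRegression().fit(X, y)

# usage with the notebook's df (read from data/hanford.csv above):
# model = fit_mortality_model(df)
# model.predict(np.array([[50.0]]))  # 0 = low, 1 = high mortality class
```

With only nine counties and an exposure level of 50 far outside the observed range (max 11.64), any prediction here is a long extrapolation and should be treated with caution.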
