Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [51]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [52]:
df = pd.read_csv('data/hanford.csv')

In [53]:
df.head()


Out[53]:
County Exposure Mortality
0 Umatilla 2.49 147.1
1 Morrow 2.57 130.1
2 Gilliam 3.41 129.9
3 Sherman 1.25 113.5
4 Wasco 1.62 137.5

In [54]:
df['Mortality'] = df['Mortality'].astype(float)  # ensure a numeric dtype (it is already float64 per info() below)

3. Calculate the basic descriptive statistics on the data


In [55]:
df.describe()


Out[55]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

In [56]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 3 columns):
County       9 non-null object
Exposure     9 non-null float64
Mortality    9 non-null float64
dtypes: float64(2), object(1)
memory usage: 296.0+ bytes

4. Find a reasonable threshold to say exposure is high and recode the data

Step 1: prepare the feature by recoding Exposure as high (1) or low (0). A reasonable threshold is the 75th percentile of Exposure, 6.41, taken from the describe() output above.


In [57]:
def high_exposure(x):
    # 1 if exposure exceeds 6.41 (the 75th percentile of Exposure), else 0
    if x > 6.41:
        return 1
    else:
        return 0

In [58]:
df['Exposure_classification'] = df['Exposure'].apply(high_exposure)

In [59]:
df.head()


Out[59]:
County Exposure Mortality Exposure_classification
0 Umatilla 2.49 147.1 0
1 Morrow 2.57 130.1 0
2 Gilliam 3.41 129.9 0
3 Sherman 1.25 113.5 0
4 Wasco 1.62 137.5 0
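The 6.41 cutoff above is hard-coded from the describe() output. As a sketch, the same recode can be done in a vectorized way that derives the threshold from the data itself; `recode_high` is a hypothetical helper name, not part of the original notebook:

```python
import pandas as pd

def recode_high(series, q=0.75):
    # label values strictly above the q-th quantile as 1, everything else as 0
    return (series > series.quantile(q)).astype(int)

# equivalent to the apply-based recode above:
# df['Exposure_classification'] = recode_high(df['Exposure'])
```

With nine rows, `quantile(0.75)` lands exactly on the seventh sorted value (6.41), so this produces the same labels as the hard-coded threshold.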

5. Create a logistic regression model


In [61]:
lm = LogisticRegression()

In [73]:
x = np.asarray(df[['Mortality']])
y = np.asarray(df['Exposure_classification'])

In [74]:
x


Out[74]:
array([[ 147.1],
       [ 130.1],
       [ 129.9],
       [ 113.5],
       [ 137.5],
       [ 162.3],
       [ 207.5],
       [ 177.9],
       [ 210.3]])

In [75]:
y


Out[75]:
array([0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int64)


In [76]:
lm = lm.fit(x,y)

In [77]:
lm.score(x,y)


Out[77]:
0.77777777777777779
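The score of 0.778 is mean accuracy: the fraction of samples (7 of 9 here) where predict() matches the label. A small self-contained check on synthetic data (not the Hanford data) illustrating the equivalence:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic, well-separated single-feature data
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# score() is just the mean of elementwise prediction correctness
manual_accuracy = (clf.predict(X) == y).mean()
print(manual_accuracy == clf.score(X, y))  # prints True
```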

In [78]:
lm.coef_


Out[78]:
array([[-0.00122093]])

In [79]:
lm.intercept_


Out[79]:
array([-0.6709378])
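The fitted model turns these two numbers into a probability via the sigmoid of the linear decision function. A sketch of that calculation by hand, using the coefficient and intercept printed above:

```python
import numpy as np

# coefficient and intercept from the fitted model above
coef, intercept = -0.00122093, -0.6709378

def prob_high_exposure(mortality):
    # sigmoid of the linear decision function coef * x + intercept
    return 1.0 / (1.0 + np.exp(-(coef * mortality + intercept)))

prob_high_exposure(147.1)  # well below 0.5, so predicted class is 0
```

Because both the coefficient and the intercept are negative, this probability stays below 0.5 for every positive mortality value.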

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50


Note: the model above was fit with Mortality as its single feature and Exposure_classification as the target, so it predicts an exposure class from a mortality rate, the reverse of the question as posed. Also, predict expects a 2-D array of shape (n_samples, n_features), not a flat list. With the coefficient and intercept shown above (both negative), the decision function is negative for any positive input, so the model predicts class 0 everywhere:

In [80]:
lm.predict(np.array([[50.0]]))


Out[80]:
array([0], dtype=int64)

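To answer question 6 in the direction it is actually posed (exposure in, high-mortality class out), the feature and target would need to be swapped. A minimal sketch of that reverse model; `fit_mortality_model` is a hypothetical helper name, and the 177.9 cutoff is the 75th percentile of Mortality from the describe() output above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_mortality_model(df, cutoff=177.9):
    # recode mortality: 1 if above the cutoff (75th percentile, 177.9), else 0
    y = (df['Mortality'] > cutoff).astype(int)
    X = df[['Exposure']].values  # single feature: exposure
    return LogisticRegression().fit(X, y)

# usage with the notebook's df (read from data/hanford.csv above):
# model = fit_mortality_model(df)
# model.predict(np.array([[50.0]]))  # 0 = low, 1 = high mortality class
```

With only nine counties and an exposure level of 50 far outside the observed range (max 11.64), any prediction here is a long extrapolation and should be treated with caution.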
