Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [51]:
import pandas as pd
%matplotlib inline
import numpy as np
import scipy
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [2]:
df=pd.read_csv('../data/hanford.csv')

3. Calculate the basic descriptive statistics on the data


In [3]:
df.describe()


Out[3]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

4. Find a reasonable threshold to say exposure is high and recode the data


In [79]:
# find first and third quartiles
first_qtr=df['Exposure'].quantile(q=0.25)
third_qtr=df['Exposure'].quantile(q=0.75)

In [80]:
# calculate interquartile range
iqr = third_qtr - first_qtr

In [81]:
# define the outlier fences (1.5 * IQR beyond each quartile)
lower = first_qtr - 1.5*iqr

In [82]:
upper = third_qtr + 1.5*iqr

In [83]:
print(lower)


-3.39

In [84]:
print(upper)  # is exposure "high" only above 12.29? That does not make sense, because no exposure value is that large


12.29
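
The two values printed above are the Tukey fences (1.5 times the interquartile range beyond each quartile), which are meant to flag outliers rather than to define a "high" group. A minimal sketch, assuming the nine Exposure values from hanford.csv as displayed in this notebook, checks whether any county falls outside them. The names `exposure` and `outliers` are illustrative only.

```python
import pandas as pd

# Exposure values copied from the hanford.csv data displayed in this notebook
exposure = pd.Series([2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34])

# quartiles and interquartile range
first_qtr = exposure.quantile(q=0.25)
third_qtr = exposure.quantile(q=0.75)
iqr = third_qtr - first_qtr

# Tukey fences: values beyond these are conventionally treated as outliers
lower = first_qtr - 1.5 * iqr
upper = third_qtr + 1.5 * iqr

# keep only the values that fall outside the fences
outliers = exposure[(exposure < lower) | (exposure > upper)]
print(round(lower, 2), round(upper, 2), len(outliers))
```

No exposure value lies outside the fences, so the upper fence cannot split the counties into a low and a high group.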

In [99]:
# the commented-out line below would store True/False instead of 1/0
# df['Exposure_High'] = df['Exposure'] >= 3.41
# use the median exposure (3.41) as the threshold instead of the outlier fence
df['Exposure_High'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)

In [100]:
df


Out[100]:
County Exposure Mortality Exposure_High
0 Umatilla 2.49 147.1 0
1 Morrow 2.57 130.1 0
2 Gilliam 3.41 129.9 1
3 Sherman 1.25 113.5 0
4 Wasco 1.62 137.5 0
5 HoodRiver 3.83 162.3 1
6 Portland 11.64 207.5 1
7 Columbia 6.41 177.9 1
8 Clatsop 8.34 210.3 1
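
The threshold 3.41 used above is the median of Exposure (the 50% row of the describe() output). As a sketch, the same recoding can be written without hard-coding the number, by computing the median and casting the boolean comparison to integers. The DataFrame `demo` is a hypothetical name, rebuilt from the table above so that the snippet runs on its own.

```python
import pandas as pd

# rebuild the relevant columns from the table above so the example is self-contained
demo = pd.DataFrame({
    'County': ['Umatilla', 'Morrow', 'Gilliam', 'Sherman', 'Wasco',
               'HoodRiver', 'Portland', 'Columbia', 'Clatsop'],
    'Exposure': [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
})

# median exposure as the cut-off between "low" and "high"
threshold = demo['Exposure'].median()

# the comparison yields True/False; astype(int) converts it to 1/0
demo['Exposure_High'] = (demo['Exposure'] >= threshold).astype(int)
print(demo)
```

Computing the threshold from the data keeps the recoding correct if the file is updated, at the cost of hiding the actual cut-off value from the reader.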

5. Create a logistic regression model


In [101]:
lm = LogisticRegression()

In [102]:
# scikit-learn expects a 2D feature array of shape (n_samples, n_features)
x = np.asarray(df['Mortality']).reshape(-1, 1)
y = np.asarray(df['Exposure_High'])

In [103]:
lm = lm.fit(x,y)

6. Predict whether the mortality rate (Cancer per 100,000 man years) will be high at an exposure level of 50


In [104]:
# predict also expects a 2D array, with one row per sample
# note: the model was fit with Mortality as its only feature and Exposure_High as the target,
# so the value passed here is read as a mortality rate
lm.predict([[50]])
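
The question in step 6 is phrased in terms of exposure, while the model above takes Mortality as its input. One possible reading, sketched here as an assumption rather than as this notebook's method, is to recode Mortality at its median, fit on Exposure, and predict at an exposure of 50. The data are copied from the table in step 4. The names `demo`, `model`, and `Mortality_High`, as well as the median cut-off, are choices made for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# data copied from the table shown in step 4
demo = pd.DataFrame({
    'Exposure': [2.49, 2.57, 3.41, 1.25, 1.62, 3.83, 11.64, 6.41, 8.34],
    'Mortality': [147.1, 130.1, 129.9, 113.5, 137.5, 162.3, 207.5, 177.9, 210.3],
})

# assumption: "high" mortality means at or above the median mortality
cutoff = demo['Mortality'].median()
demo['Mortality_High'] = (demo['Mortality'] >= cutoff).astype(int)

# double brackets keep X two-dimensional: shape (n_samples, 1)
X = demo[['Exposure']]
y = demo['Mortality_High']

model = LogisticRegression()
model.fit(X, y)

# predict the class for a single new sample with Exposure = 50
new_sample = pd.DataFrame({'Exposure': [50]})
prediction = model.predict(new_sample)
print(prediction)
```

An exposure of 50 is far beyond the observed range (the maximum is 11.64), so any prediction at that value is an extrapolation from nine counties.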
