Apply logistic regression to classify whether a county had a high mortality rate due to contamination

1. Import the necessary packages to read in the data, plot, and create a logistic regression model


In [4]:
import pandas as pd
%matplotlib inline
import numpy as np
from sklearn.linear_model import LogisticRegression

2. Read in the hanford.csv file in the data/ folder


In [5]:
df = pd.read_csv("../data/hanford.csv")
df.head()


Out[5]:
County Exposure Mortality
0 Umatilla 2.49 147.1
1 Morrow 2.57 130.1
2 Gilliam 3.41 129.9
3 Sherman 1.25 113.5
4 Wasco 1.62 137.5

3. Calculate the basic descriptive statistics on the data


In [7]:
df.describe()


Out[7]:
Exposure Mortality
count 9.000000 9.000000
mean 4.617778 157.344444
std 3.491192 34.791346
min 1.250000 113.500000
25% 2.490000 130.100000
50% 3.410000 147.100000
75% 6.410000 177.900000
max 11.640000 210.300000

In [8]:
df.corr()


Out[8]:
Exposure Mortality
Exposure 1.000000 0.926345
Mortality 0.926345 1.000000

4. Find a reasonable threshold to say exposure is high and recode the data


In [10]:
lm = LogisticRegression()

In [13]:
df.std()


Out[13]:
Exposure      3.491192
Mortality    34.791346
dtype: float64

In [14]:
q1 = df['Exposure'].quantile(q=0.25)
q1


Out[14]:
2.4900000000000002

In [16]:
q3 = df['Exposure'].quantile(q=0.75)
q3


Out[16]:
6.4100000000000001

In [17]:
df['Mortality'].hist(bins=5)


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1147ef7b8>

In [25]:
# Recode both columns as 0/1, using the medians (the 50% row of df.describe()) as thresholds
df['Mort_high'] = df['Mortality'].apply(lambda x: 1 if x >= 147.1 else 0)
df['Expo_high'] = df['Exposure'].apply(lambda x: 1 if x >= 3.41 else 0)
df.head()


Out[25]:
County Exposure Mortality Mort_high Expo_high
0 Umatilla 2.49 147.1 1 0
1 Morrow 2.57 130.1 0 0
2 Gilliam 3.41 129.9 0 1
3 Sherman 1.25 113.5 0 0
4 Wasco 1.62 137.5 0 0
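
The recoding above can also be written without `apply`. The following is a minimal sketch, assuming the medians reported by `df.describe()` (3.41 for Exposure, 147.1 for Mortality) are the chosen thresholds and using only the five rows shown by `df.head()` as sample data. The names `sample`, `EXPOSURE_THRESHOLD`, and `MORTALITY_THRESHOLD` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# First five rows of hanford.csv as shown by df.head(); NOT the full dataset.
sample = pd.DataFrame({
    "County": ["Umatilla", "Morrow", "Gilliam", "Sherman", "Wasco"],
    "Exposure": [2.49, 2.57, 3.41, 1.25, 1.62],
    "Mortality": [147.1, 130.1, 129.9, 113.5, 137.5],
})

# Assumed cut-offs: the 50% (median) values from df.describe().
EXPOSURE_THRESHOLD = 3.41
MORTALITY_THRESHOLD = 147.1

# A boolean comparison cast to int yields the same 0/1 coding as the lambdas.
sample["Expo_high"] = (sample["Exposure"] >= EXPOSURE_THRESHOLD).astype(int)
sample["Mort_high"] = (sample["Mortality"] >= MORTALITY_THRESHOLD).astype(int)

print(sample)
```

Keeping the thresholds in named constants makes it easier to try a different cut-off, such as the upper quartile computed earlier, without editing each comparison.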

In [ ]:
## Same as the lambda above!

def exposure_high(x):
    # Return 1 when exposure is at or above the 3.41 threshold, else 0
    if x >= 3.41:
        return 1
    else:
        return 0

5. Create a logistic regression model


In [22]:
lm = LogisticRegression()

In [38]:
x = np.asarray(df['Expo_high'])
y = np.asarray(df['Mort_high'])
x,y


Out[38]:
(array([0, 0, 1, 0, 0, 1, 1, 1, 1]), array([1, 0, 0, 0, 0, 1, 1, 1, 1]))

In [39]:
lm = lm.fit(x,y)


/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-0fc46e0387ae> in <module>()
----> 1 lm = lm.fit(x,y)

/usr/local/lib/python3.5/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
   1140 
   1141         X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64, 
-> 1142                          order="C")
   1143         check_classification_targets(y)
   1144         self.classes_ = np.unique(y)

/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    518         y = y.astype(np.float64)
    519 
--> 520     check_consistent_length(X, y)
    521 
    522     return X, y

/usr/local/lib/python3.5/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    174     if len(uniques) > 1:
    175         raise ValueError("Found arrays with inconsistent numbers of samples: "
--> 176                          "%s" % str(uniques))
    177 
    178 

ValueError: Found arrays with inconsistent numbers of samples: [1 9]
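
The `ValueError` arises because `fit` expects a two-dimensional feature array of shape `(n_samples, n_features)`, while `x` is one-dimensional and is therefore read as a single sample with nine features. One possible fix, following the suggestion in the deprecation warning, is to reshape the feature array. The sketch below reuses the arrays printed in Out[38]; the names `X` and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Recoded arrays copied from Out[38].
x = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1])  # Expo_high
y = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])  # Mort_high

# One column per feature, one row per county: shape (9, 1).
X = x.reshape(-1, 1)

model = LogisticRegression()
model.fit(X, y)

print(X.shape, model.coef_, model.intercept_)
```

Selecting the column with double brackets, as in `df[['Expo_high']]`, should give the same two-dimensional shape directly from the DataFrame.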

6. Predict whether the mortality rate (cancer per 100,000 man-years) will be high at an exposure level of 50
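
Because the model is trained on the recoded `Expo_high` indicator rather than on raw exposure values, an exposure of 50 first has to be recoded with the same threshold. The following is a hedged sketch, assuming the 3.41 cut-off used in step 4 and the arrays printed in Out[38]; `EXPOSURE_THRESHOLD`, `new_exposure`, `new_expo_high`, and `model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Recoded arrays copied from Out[38]; the feature is reshaped to (9, 1).
X = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1]).reshape(-1, 1)  # Expo_high
y = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1])                 # Mort_high
model = LogisticRegression().fit(X, y)

# Recode the new exposure level with the same (assumed) threshold.
EXPOSURE_THRESHOLD = 3.41
new_exposure = 50
new_expo_high = int(new_exposure >= EXPOSURE_THRESHOLD)

# predict returns the class label; predict_proba returns the class probabilities.
prediction = model.predict([[new_expo_high]])
probability = model.predict_proba([[new_expo_high]])[0, 1]

print(prediction[0], probability)
```

Four of the five high-exposure counties in the training arrays are coded as high mortality, so the fitted model should place an exposure of 50 in the high-mortality class. Note that with a binary feature the model cannot distinguish an exposure of 50 from any other value at or above the threshold.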


In [ ]:


In [ ]: