Let's make some predictions for Verbal Autopsies


In [1]:
import pandas as pd

In [2]:
# a major part of the _art_ of applied machine learning
# is "feature engineering", which means mapping the
# raw data we loaded in the previous notebook into a 
# representation that is *good*.  I'm not going to
# get into that now. Instead, here is what it might look like:

df = pd.read_csv('../3-data/phmrc_cleaned.csv')
df.head()


Out[2]:
id site gs_code34 gs_text34 gs_code46 gs_text46 gs_code55 gs_text55 gs_comorbid1 gs_comorbid2 ... s9999162 s9999163 s9999164 s9999165 s9999166 s9999167 s9999168 s9999169 s9999170 s9999171
0 A-21360001 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0
1 A-21360002 AP N17 Renal Failure N17 Renal Failure N17 Renal Failure NaN NaN ... 0 0 0 0 1 0 0 0 0 0
2 A-21360004 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
3 A-21360007 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
4 A-21360008 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 355 columns

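Before building anything, it can also help to get a feel for how the gold-standard causes are distributed. A quick, optional look (using pandas' value_counts):

# count how many records fall under each gold-standard cause
df.gs_text34.value_counts().head(10)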

In [3]:
# remember NumPy, the fundamental package for
# scientific computing with Python? 
# remember NumPy, the fundamental package for

import numpy as np

In [4]:
# we will use it to create an array of feature vectors
# and an array of the corresponding labels

# keep the columns whose names start with "s" followed by digits
# (the symptom indicators), plus age and sex, and fill missing
# values with 0
#X = np.array(df.filter(like='s1'))
X = np.array(df.filter(regex='^(s[0-9]+|age|sex)').fillna(0))
y = np.array(df.gs_text34)

In [5]:
# how much data are we dealing with here?

X.shape


Out[5]:
(7846, 341)

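A quick sanity check on the labels, too: there should be one label per row of X, and (presumably, given the column name gs_text34) around 34 distinct causes:

# one label per row of X; count the distinct causes
print(y.shape)
print(len(np.unique(y)))
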
Our first ML method for predicting cause-of-death:


In [6]:
# Here is how to train a charmingly self-deprecating
# ML method, Naive Bayes, to predict underlying CoD
# with sklearn

import sklearn.naive_bayes

In [7]:
clf = sklearn.naive_bayes.BernoulliNB()

In [8]:
clf.fit(X, y)


Out[8]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [9]:
# Let's see how it works for a single feature vector:

clf.predict(X[[0], :])


Out[9]:
array(['Stroke'], 
      dtype='|S31')

In [10]:
# And what was the true label for this example?

y[0]


Out[10]:
'Stroke'

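If you're curious how confident the model is, Naive Bayes can also report per-class probabilities. A minimal sketch (the columns of predict_proba line up with clf.classes_):

# posterior probabilities for the first record; the columns
# line up with clf.classes_
probs = clf.predict_proba(X[[0], :])
print(clf.classes_[probs.argmax()], probs.max())
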
In [11]:
# So let's see how well it predicts across all of the
# records it was trained on:

y_pred = clf.predict(X)

In [12]:
np.mean(y == y_pred)


Out[12]:
0.51389242926331891

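Keep in mind that this is accuracy on the very same records the model was fit to. To get a sense of how it does on examples it has not seen, one option (assuming a scikit-learn version that includes sklearn.model_selection) is a simple held-out split; the 25% test fraction and the seed below are arbitrary choices for this sketch:

import sklearn.model_selection

# hold out a quarter of the records for testing
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=12345)

nb = sklearn.naive_bayes.BernoulliNB()
nb.fit(X_train, y_train)
np.mean(y_test == nb.predict(X_test))
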
Want to try it again with a totally different ML method?


In [13]:
import sklearn.neighbors

In [14]:
clf = sklearn.neighbors.KNeighborsClassifier()

In [15]:
clf.fit(X, y)


Out[15]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [16]:
clf.predict(X[[2], :])


Out[16]:
array(['Other Infectious Diseases'], dtype=object)

In [17]:
y[2]


Out[17]:
'Stroke'

In [18]:
%time y_pred = clf.predict(X)


CPU times: user 1min 23s, sys: 17 ms, total: 1min 23s
Wall time: 1min 24s

In [19]:
np.mean(y == y_pred)


Out[19]:
0.54371654346163645

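That 84-second prediction time is no fluke: k-nearest-neighbors has essentially no training step, so every prediction compares the query against all 7,846 stored rows. Since the parameter listing above shows n_jobs, one way to speed this up on a multi-core machine is to parallelize the neighbor search:

# same model, but use every available core for the
# distance computations at prediction time
clf = sklearn.neighbors.KNeighborsClassifier(n_jobs=-1)
clf.fit(X, y)
%time y_pred = clf.predict(X)
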
One more time, with the second-coolest method in town:


In [20]:
import sklearn.ensemble

In [21]:
clf = sklearn.ensemble.GradientBoostingClassifier()

In [22]:
%time clf.fit(X, y)


CPU times: user 5min 28s, sys: 50 ms, total: 5min 28s
Wall time: 5min 28s
Out[22]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

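Those are all default settings. The knobs people usually reach for first are n_estimators (how many trees), max_depth (how deep each tree grows), and learning_rate. A purely illustrative sketch (the variable name and values are made up, and we won't use this model below):

# an alternative configuration: more, shallower trees with a
# smaller learning rate -- training would take correspondingly longer
tuned_clf = sklearn.ensemble.GradientBoostingClassifier(
    n_estimators=200, max_depth=2, learning_rate=0.05)
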
In [23]:
clf.predict(X[[1], :])


Out[23]:
array(['Other Infectious Diseases'], dtype=object)

In [24]:
y[1]


Out[24]:
'Renal Failure'

In [25]:
y_pred = clf.predict(X)

In [26]:
np.mean(y == y_pred)


Out[26]:
0.77899566658169772

Is that good?
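One way to start answering that is to look beyond a single accuracy number. For example (still on the training data), scikit-learn can report precision and recall separately for each cause:

import sklearn.metrics

# per-cause precision, recall, and F1 (on the training data)
print(sklearn.metrics.classification_report(y, y_pred))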