Let's make some predictions for Verbal Autopsies


In [1]:
import pandas as pd

In [2]:
# a major part of the _art_ of applied machine learning
# is "feature engineering", which means mapping the
# raw data we loaded in the previous notebook into a 
# representation that is *good*.  I'm not going to
# get into that now. Instead, here is what it might look like:

df = pd.read_csv('../3-data/phmrc_cleaned.csv')
df.head()


Out[2]:
id site gs_code34 gs_text34 gs_code46 gs_text46 gs_code55 gs_text55 gs_comorbid1 gs_comorbid2 ... s9999162 s9999163 s9999164 s9999165 s9999166 s9999167 s9999168 s9999169 s9999170 s9999171
0 A-21360001 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0
1 A-21360002 AP N17 Renal Failure N17 Renal Failure N17 Renal Failure NaN NaN ... 0 0 0 0 1 0 0 0 0 0
2 A-21360004 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
3 A-21360007 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
4 A-21360008 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 355 columns

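Before building anything, it can also help to get a feel for how the gold-standard causes are distributed. A quick, optional look (using pandas' value_counts):

# count how many records fall under each gold-standard cause
df.gs_text34.value_counts().head(10)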

In [3]:
# remember NumPy, the fundamental package for
# scientific computing with Python? 
# remember NumPy, the fundamental package for

import numpy as np

In [4]:
# we will use it to create an array of feature vectors
# and an array of the corresponding labels

# keep the columns whose names start with "s" followed by digits
# (the symptom indicators), plus age and sex, and fill missing
# values with 0
#X = np.array(df.filter(like='s1'))
X = np.array(df.filter(regex='^(s[0-9]+|age|sex)').fillna(0))
y = np.array(df.gs_text34)

In [5]:
# how much data are we dealing with here?

X.shape


Out[5]:
(7846, 341)

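A quick sanity check on the labels, too: there should be one label per row of X, and (presumably, given the column name gs_text34) around 34 distinct causes:

# one label per row of X; count the distinct causes
print(y.shape)
print(len(np.unique(y)))
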
Our first ML method for predicting cause-of-death:


In [6]:
# Here is how to train a charmingly self-deprecating
# ML method, Naive Bayes, to predict underlying CoD
# with sklearn

import sklearn.naive_bayes

In [7]:
clf = sklearn.naive_bayes.BernoulliNB()

In [8]:
clf.fit(X, y)


Out[8]:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [9]:
# Let's see how it works for a single feature vector:

clf.predict(X[[0], :])


Out[9]:
array(['Stroke'], 
      dtype='|S31')

In [10]:
# And what was the true label for this example?

y[0]


Out[10]:
'Stroke'

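If you're curious how confident the model is, Naive Bayes can also report per-class probabilities. A minimal sketch (the columns of predict_proba line up with clf.classes_):

# posterior probabilities for the first record; the columns
# line up with clf.classes_
probs = clf.predict_proba(X[[0], :])
print(clf.classes_[probs.argmax()], probs.max())
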
In [11]:
# So let's see how well it predicts across all of the
# records it was trained on:

y_pred = clf.predict(X)

In [12]:
np.mean(y == y_pred)


Out[12]:
0.51389242926331891

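Keep in mind that this is accuracy on the very same records the model was fit to. To get a sense of how it does on examples it has not seen, one option (assuming a scikit-learn version that includes sklearn.model_selection) is a simple held-out split; the 25% test fraction and the seed below are arbitrary choices for this sketch:

import sklearn.model_selection

# hold out a quarter of the records for testing
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, test_size=0.25, random_state=12345)

nb = sklearn.naive_bayes.BernoulliNB()
nb.fit(X_train, y_train)
np.mean(y_test == nb.predict(X_test))
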
Want to try it again with a totally different ML method?


In [13]:
import sklearn.neighbors

In [14]:
clf = sklearn.neighbors.KNeighborsClassifier()

In [15]:
clf.fit(X, y)


Out[15]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [16]:
clf.predict(X[[2], :])


Out[16]:
array(['Other Infectious Diseases'], dtype=object)

In [17]:
y[2]


Out[17]:
'Stroke'

In [18]:
%time y_pred = clf.predict(X)


CPU times: user 1min 23s, sys: 17 ms, total: 1min 23s
Wall time: 1min 24s

In [19]:
np.mean(y == y_pred)


Out[19]:
0.54371654346163645

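That 84-second prediction time is no fluke: k-nearest-neighbors has essentially no training step, so every prediction compares the query against all 7,846 stored rows. Since the parameter listing above shows n_jobs, one way to speed this up on a multi-core machine is to parallelize the neighbor search:

# same model, but use every available core for the
# distance computations at prediction time
clf = sklearn.neighbors.KNeighborsClassifier(n_jobs=-1)
clf.fit(X, y)
%time y_pred = clf.predict(X)
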
One more time, with the second-coolest method in town:


In [20]:
import sklearn.ensemble

In [21]:
clf = sklearn.ensemble.GradientBoostingClassifier()

In [22]:
%time clf.fit(X, y)


CPU times: user 5min 28s, sys: 50 ms, total: 5min 28s
Wall time: 5min 28s
Out[22]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

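Those are all default settings. The knobs people usually reach for first are n_estimators (how many trees), max_depth (how deep each tree grows), and learning_rate. A purely illustrative sketch (the variable name and values are made up, and we won't use this model below):

# an alternative configuration: more, shallower trees with a
# smaller learning rate -- training would take correspondingly longer
tuned_clf = sklearn.ensemble.GradientBoostingClassifier(
    n_estimators=200, max_depth=2, learning_rate=0.05)
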
In [23]:
clf.predict(X[[1], :])


Out[23]:
array(['Other Infectious Diseases'], dtype=object)

In [24]:
y[1]


Out[24]:
'Renal Failure'

In [25]:
y_pred = clf.predict(X)

In [26]:
np.mean(y == y_pred)


Out[26]:
0.77899566658169772

Is that good?
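One way to start answering that is to look beyond a single accuracy number. For example (still on the training data), scikit-learn can report precision and recall separately for each cause:

import sklearn.metrics

# per-cause precision, recall, and F1 (on the training data)
print(sklearn.metrics.classification_report(y, y_pred))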