Repeat setup from Notebook #2


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../3-data/phmrc_cleaned.csv')
df.head()


Out[2]:
id site gs_code34 gs_text34 gs_code46 gs_text46 gs_code55 gs_text55 gs_comorbid1 gs_comorbid2 ... s9999162 s9999163 s9999164 s9999165 s9999166 s9999167 s9999168 s9999169 s9999170 s9999171
0 A-21360001 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0
1 A-21360002 AP N17 Renal Failure N17 Renal Failure N17 Renal Failure NaN NaN ... 0 0 0 0 1 0 0 0 0 0
2 A-21360004 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
3 A-21360007 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
4 A-21360008 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 355 columns


In [3]:
import numpy as np

In [4]:
X = np.array(df.filter(regex='^(s[0-9]+|age|sex)').fillna(0))
y = np.array(df.gs_text34)
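As an aside, the `filter(regex=...)` call keeps only the symptom indicator columns (names like `s1234`) plus `age` and `sex`, dropping the id and gold-standard columns. A minimal sketch on a made-up stand-in frame (these toy column names are hypothetical, not from the PHMRC file):

```python
import pandas as pd

# Toy frame mimicking the real one's column naming scheme.
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'gs_text34': ['Stroke', 'Stroke'],
    's1': [1, 0],
    's23': [0, None],
    'age': [60, 71],
    'sex': [1, 2],
})

# The regex is anchored at the start, so 'gs_text34' does not match
# even though it contains 's' followed by digits.
kept = toy.filter(regex='^(s[0-9]+|age|sex)').fillna(0)
print(list(kept.columns))  # ['s1', 's23', 'age', 'sex']
```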

In [5]:
import sklearn.naive_bayes

In [6]:
clf = sklearn.naive_bayes.BernoulliNB()
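Why `BernoulliNB`? It models each feature as a present/absent indicator, which matches the mostly-binary symptom columns here. A tiny sketch on made-up data (the toy arrays are just for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Four toy observations with three 0/1 features each.
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])
y_toy = np.array(['A', 'A', 'B', 'B'])

clf_toy = BernoulliNB().fit(X_toy, y_toy)

# Feature 0 being present is only ever seen in class A,
# so this point is classified as 'A'.
print(clf_toy.predict([[1, 0, 1]]))  # ['A']
```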

New, cool thing: sklearn.model_selection

This is new to sklearn version 0.18 (the development version), and I think it is really nicely done.


In [7]:
import sklearn.model_selection

In [8]:
cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=123456)

In [9]:
for train, test in cv.split(X, y):
    clf.fit(X[train], y[train])
    y_pred = clf.predict(X[test])
    acc = np.mean(y_pred == y[test])
    print(acc)


0.47898089172
0.449681528662
0.452229299363
0.468789808917
0.462420382166
0.449681528662
0.451530612245
0.448979591837
0.474489795918
0.488520408163
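The same fit/predict/score loop is also packaged up as `sklearn.model_selection.cross_val_score`, which returns all the fold accuracies in one call. A sketch on synthetic data, since the PHMRC csv isn't bundled here (the random arrays are stand-ins for `X` and `y`):

```python
import numpy as np
import sklearn.model_selection
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data: 200 rows of binary features, random labels.
rng = np.random.RandomState(0)
X_syn = rng.binomial(1, 0.3, size=(200, 20))
y_syn = rng.choice(['A', 'B'], size=200)

cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=123456)
scores = sklearn.model_selection.cross_val_score(BernoulliNB(), X_syn, y_syn, cv=cv)
print(len(scores))  # 10 fold accuracies
```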

In [10]:
# refactor this into a function
def measure_acc(rep):
    cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=123456+rep)
    
    acc_list = []
    for train, test in cv.split(X, y):
        clf.fit(X[train], y[train])
        y_pred = clf.predict(X[test])
        acc = np.mean(y_pred == y[test])
        acc_list.append(acc)
    return acc_list
measure_acc(rep=0)


Out[10]:
[0.47898089171974523,
 0.44968152866242039,
 0.45222929936305734,
 0.46878980891719746,
 0.4624203821656051,
 0.44968152866242039,
 0.45153061224489793,
 0.44897959183673469,
 0.47448979591836737,
 0.48852040816326531]

In [11]:
%%time

# repeat it 10 times
acc_list = []
for rep in range(10):
    acc_list += measure_acc(rep)


CPU times: user 9min 44s, sys: 1min 12s, total: 10min 56s
Wall time: 21 s
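Later sklearn releases (0.19+, if I recall correctly) package this "repeat K-fold with different seeds" pattern directly as `RepeatedKFold`. A sketch on synthetic stand-in data:

```python
import numpy as np
import sklearn.model_selection
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for X and y.
rng = np.random.RandomState(1)
X_syn = rng.binomial(1, 0.3, size=(100, 10))
y_syn = rng.choice(['A', 'B'], size=100)

# 10 folds x 10 repeats = 100 train/test splits, each repeat reshuffled.
rcv = sklearn.model_selection.RepeatedKFold(n_splits=10, n_repeats=10,
                                            random_state=123456)
accs = [BernoulliNB().fit(X_syn[tr], y_syn[tr]).score(X_syn[te], y_syn[te])
        for tr, te in rcv.split(X_syn)]
print(len(accs))  # 100 accuracy values
```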

How well does it perform now?


In [12]:
pd.Series(acc_list).describe(percentiles=[.025, .975])


Out[12]:
count    100.000000
mean       0.461560
std        0.015649
min        0.420382
2.5%       0.431115
50%        0.461735
97.5%      0.487293
max        0.492347
dtype: float64
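The `2.5%` and `97.5%` rows are just empirical percentiles of the 100 fold accuracies, so the same interval can be pulled out with `np.percentile` directly (the short list here is a made-up stand-in for `acc_list`):

```python
import numpy as np

# Stand-in accuracies; with the real acc_list this reproduces
# the 2.5% / 97.5% rows from describe().
acc_demo = [0.42, 0.45, 0.46, 0.46, 0.47, 0.49]
lo, hi = np.percentile(acc_demo, [2.5, 97.5])
print(lo, hi)
```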

So the classifier is still right a bit less than half the time, but now we can say so precisely: mean accuracy 0.462, with a 95% interval of roughly (0.431, 0.487) across the 100 folds.