Repeat setup from Notebook #2


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../3-data/phmrc_cleaned.csv')
df.head()


Out[2]:
id site gs_code34 gs_text34 gs_code46 gs_text46 gs_code55 gs_text55 gs_comorbid1 gs_comorbid2 ... s9999162 s9999163 s9999164 s9999165 s9999166 s9999167 s9999168 s9999169 s9999170 s9999171
0 A-21360001 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0
1 A-21360002 AP N17 Renal Failure N17 Renal Failure N17 Renal Failure NaN NaN ... 0 0 0 0 1 0 0 0 0 0
2 A-21360004 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
3 A-21360007 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 1 0 0 0 0 0
4 A-21360008 AP I64 Stroke I64 Stroke I64 Stroke NaN NaN ... 0 0 0 0 0 0 0 0 0 0

5 rows × 355 columns


In [3]:
import numpy as np

In [4]:
X = np.array(df.filter(regex='^(s[0-9]+|age|sex)').fillna(0))
y = np.array(df.gs_text34)
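As an aside, the `filter(regex=...)` call keeps only the symptom indicator columns (names like `s1234`) plus `age` and `sex`, dropping the id and gold-standard columns. A minimal sketch on a made-up stand-in frame (these toy column names are hypothetical, not from the PHMRC file):

```python
import pandas as pd

# Toy frame mimicking the real one's column naming scheme.
toy = pd.DataFrame({
    'id': ['a', 'b'],
    'gs_text34': ['Stroke', 'Stroke'],
    's1': [1, 0],
    's23': [0, None],
    'age': [60, 71],
    'sex': [1, 2],
})

# The regex is anchored at the start, so 'gs_text34' does not match
# even though it contains 's' followed by digits.
kept = toy.filter(regex='^(s[0-9]+|age|sex)').fillna(0)
print(list(kept.columns))  # ['s1', 's23', 'age', 'sex']
```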

In [5]:
import sklearn.naive_bayes

In [6]:
clf = sklearn.naive_bayes.BernoulliNB()
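Why `BernoulliNB`? It models each feature as a present/absent indicator, which matches the mostly-binary symptom columns here. A tiny sketch on made-up data (the toy arrays are just for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Four toy observations with three 0/1 features each.
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])
y_toy = np.array(['A', 'A', 'B', 'B'])

clf_toy = BernoulliNB().fit(X_toy, y_toy)

# Feature 0 being present is only ever seen in class A,
# so this point is classified as 'A'.
print(clf_toy.predict([[1, 0, 1]]))  # ['A']
```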

New, cool thing: sklearn.model_selection

This is new to sklearn version 0.18 (the development version), and I think it is really nicely done.


In [7]:
import sklearn.model_selection

In [8]:
cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=123456)

In [9]:
for train, test in cv.split(X, y):
    clf.fit(X[train], y[train])
    y_pred = clf.predict(X[test])
    acc = np.mean(y_pred == y[test])
    print(acc)


0.47898089172
0.449681528662
0.452229299363
0.468789808917
0.462420382166
0.449681528662
0.451530612245
0.448979591837
0.474489795918
0.488520408163
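The same fit/predict/score loop is also packaged up as `sklearn.model_selection.cross_val_score`, which returns all the fold accuracies in one call. A sketch on synthetic data, since the PHMRC csv isn't bundled here (the random arrays are stand-ins for `X` and `y`):

```python
import numpy as np
import sklearn.model_selection
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data: 200 rows of binary features, random labels.
rng = np.random.RandomState(0)
X_syn = rng.binomial(1, 0.3, size=(200, 20))
y_syn = rng.choice(['A', 'B'], size=200)

cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=123456)
scores = sklearn.model_selection.cross_val_score(BernoulliNB(), X_syn, y_syn, cv=cv)
print(len(scores))  # 10 fold accuracies
```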

In [10]:
# refactor this into a function
def measure_acc(rep):
    cv = sklearn.model_selection.KFold(n_splits=10, shuffle=True, random_state=123456+rep)
    
    acc_list = []
    for train, test in cv.split(X, y):
        clf.fit(X[train], y[train])
        y_pred = clf.predict(X[test])
        acc = np.mean(y_pred == y[test])
        acc_list.append(acc)
    return acc_list
measure_acc(rep=0)


Out[10]:
[0.47898089171974523,
 0.44968152866242039,
 0.45222929936305734,
 0.46878980891719746,
 0.4624203821656051,
 0.44968152866242039,
 0.45153061224489793,
 0.44897959183673469,
 0.47448979591836737,
 0.48852040816326531]

In [11]:
%%time

# repeat it 10 times
acc_list = []
for rep in range(10):
    acc_list += measure_acc(rep)


CPU times: user 9min 44s, sys: 1min 12s, total: 10min 56s
Wall time: 21 s
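Later sklearn releases (0.19+, if I recall correctly) package this "repeat K-fold with different seeds" pattern directly as `RepeatedKFold`. A sketch on synthetic stand-in data:

```python
import numpy as np
import sklearn.model_selection
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for X and y.
rng = np.random.RandomState(1)
X_syn = rng.binomial(1, 0.3, size=(100, 10))
y_syn = rng.choice(['A', 'B'], size=100)

# 10 folds x 10 repeats = 100 train/test splits, each repeat reshuffled.
rcv = sklearn.model_selection.RepeatedKFold(n_splits=10, n_repeats=10,
                                            random_state=123456)
accs = [BernoulliNB().fit(X_syn[tr], y_syn[tr]).score(X_syn[te], y_syn[te])
        for tr, te in rcv.split(X_syn)]
print(len(accs))  # 100 accuracy values
```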

How well does it perform now?


In [12]:
pd.Series(acc_list).describe(percentiles=[.025, .975])


Out[12]:
count    100.000000
mean       0.461560
std        0.015649
min        0.420382
2.5%       0.431115
50%        0.461735
97.5%      0.487293
max        0.492347
dtype: float64
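The `2.5%` and `97.5%` rows are just empirical percentiles of the 100 fold accuracies, so the same interval can be pulled out with `np.percentile` directly (the short list here is a made-up stand-in for `acc_list`):

```python
import numpy as np

# Stand-in accuracies; with the real acc_list this reproduces
# the 2.5% / 97.5% rows from describe().
acc_demo = [0.42, 0.45, 0.46, 0.46, 0.47, 0.49]
lo, hi = np.percentile(acc_demo, [2.5, 97.5])
print(lo, hi)
```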

So the classifier is still right a bit less than half the time, but now we can say so precisely: mean accuracy 0.462, with a 95% interval of roughly (0.431, 0.487) across the 100 folds.