In [ ]:
!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_style('darkgrid')

In [ ]:
# set random seed, for reproducibility
np.random.seed(12345)

Load and clean PHMRC VA data:


In [ ]:
df = pd.read_csv('IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv')

What do you think we should do about that warning?


In [ ]:
# take a look at what is in the DataFrame

In [ ]:
# also load codebook (excel doc)
cb = pd.read_excel('IHME_PHMRC_VA_DATA_CODEBOOK_Y2013M09D11_0.xlsx')
cb

Every column with a name of the form a{NUMBER}_{something} is part of the signs and symptoms reported about an adult death:


In [ ]:
df.filter(regex='a\d_')

There is also a "bag of words" section from the processed results of the free-text response section:


In [ ]:
df.filter(like='word_')

Make the feature vectors by putting together everything from the adult module and everything from the free-response section.


In [ ]:
X = np.hstack((np.array(df.filter(regex='a\d_')), np.array(df.filter(like='word_'))))

Did I get that right?


In [ ]:
np.array(df.filter(regex='a\d_')).shape, np.array(df.filter(like='word_')).shape

In [ ]:
X.shape

Actually, that won't work without more data cleaning, because sklearn expects the feature vectors to be numbers, not answers like "Yes" and "Don't know".

So let's just use the open-response text, which has already been processed.


In [ ]:
X = np.array(df.filter(like='word_'))
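
For reference, if we did want to use the categorical symptom answers as well, a minimal sketch of one way to encode them might look like the cell below. The response strings, and the names answer_map and X_symptoms, are assumptions made up for this sketch; check what values actually appear in the a{NUMBER}_ columns before relying on it.


In [ ]:
# hypothetical sketch: encode the categorical answers as numbers
# (the response strings below are an assumption -- look at the data/codebook
#  to see what values actually appear in the a{NUMBER}_ columns)
answer_map = {'Yes': 1., 'No': 0., "Don't Know": 0.}
X_symptoms = np.array(df.filter(regex='a\d_').applymap(lambda v: answer_map.get(v, 0.)))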

The causes of death appear in the gs_text34 column (obviously...)


In [ ]:
df.gs_text34

For class purposes, we will simplify the prediction task: was the death due to stroke, diabetes, or something else?


In [ ]:
df['Cause'] = df.gs_text34.map({'Stroke':'Stroke', 'Diabetes':'Diabetes'}).fillna('Other')
y = np.array(df.Cause)

Did that work?


In [ ]:
np.unique(y)

Now, let's train one of our (by now) standard classifiers to predict the underlying cause of death from the restricted cause list:


In [ ]:
import sklearn.tree

clf = sklearn.tree.DecisionTreeClassifier()
clf.fit(X,y)

y_pred = clf.predict(X)
np.mean(y == y_pred)

Does that look good? Too good? We tested the tree on the very same data we trained it on, and an unpruned decision tree can essentially memorize its training data. If it looks too good to be true, it probably is...


In [ ]:
import sklearn.cross_validation

scores = sklearn.cross_validation.cross_val_score(clf, X, y)

print scores.mean(), scores.std()

In [ ]:
sns.distplot(scores, rug=True)

What have we done here?


In [ ]:
help(sklearn.cross_validation.cross_val_score)

If you read through that, you will find that we have done stratified K-fold cross-validation with $K = 3$. In other words, split the data into 3 equal parts (what if it's not divisible by three?) and then use two for training and the third for testing, in all three ways.

So how to do 10-fold CV as in the reading?


In [ ]:
%%time

n = len(y)
cv = sklearn.cross_validation.KFold(___________________________)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

In [ ]:
sns.distplot(scores, rug=True)

I think it is important to stratify the sampling, although our book does not make a big deal about that:


In [ ]:
n = len(y)
cv = sklearn.cross_validation.StratifiedKFold(n, n_folds=10, shuffle=True)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

Oops, what did I get wrong here? Can you fix it?


In [ ]:
cv = sklearn.cross_validation.StratifiedKFold(___________________________)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

In [ ]:
sns.distplot(scores, rug=True)

And I don't want you to ever rely on just 10 samples. How can you do ten repetitions of 10-fold cross validation?


In [ ]:
scores = []
for rep in range(10):
    print rep,
    cv = ________________________________
    scores += ________________________________

In [ ]:
np.mean(scores), np.std(scores)

In [ ]:
sns.distplot(np.hstack(scores), rug=True)

Although our books make a big deal about K-Fold cross-validation, and we have gone through all of the trouble to understand what it is doing, I am partial to what is sometimes called leave-group-out cross-validation, Monte Carlo cross-validation, or just plain old cross-validation: split the data into training and test sets randomly a bunch of times.
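
Here is a minimal sketch of that idea by hand, on ten fake examples: shuffle the indices and cut off a chunk for testing, over and over.


In [ ]:
# Monte Carlo / leave-group-out CV by hand: three random 70/30 splits of 10 indices
for rep in range(3):
    idx = np.random.permutation(10)
    train, test = idx[:7], idx[7:]
    print train, test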

Just to be perverse, in sklearn this is called ShuffleSplit cross-validation:


In [ ]:
%%time

n = len(y)
cv = sklearn.cross_validation.ShuffleSplit(________________________________)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

Of course, you can also do a stratified version of this:


In [ ]:
n = len(y)
cv = sklearn.cross_validation.StratifiedShuffleSplit(n, n_iter=10, test_size=.25)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

Oops, what did I get wrong again?


In [ ]:
cv = sklearn.cross_validation.StratifiedShuffleSplit(________________________________)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

One last way to do cross-validation, which (in my humble opinion) was not emphasized sufficiently in the reading: split based on what you are actually interested in. In the case of VA, I am very interested in how the approach will generalize to other parts of the world. So I could hold out based on site:


In [ ]:
df.site.value_counts()

In [ ]:
cv = sklearn.cross_validation.LeaveOneLabelOut(df.site)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()
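
To see what that split actually looks like, here is a tiny sketch on five fake deaths from three made-up sites (the 'site1'/'site2'/'site3' labels are just for illustration): each site takes one turn as the test set.


In [ ]:
# LeaveOneLabelOut on toy labels: one train/test split per distinct label
toy_sites = ['site1', 'site1', 'site2', 'site2', 'site3']
for train, test in sklearn.cross_validation.LeaveOneLabelOut(toy_sites):
    print train, test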

There is also a version that holds out $P$ labels, and lets you choose $P$:


In [ ]:
cv = sklearn.cross_validation.LeavePLabelOut(df.site, p=2)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

What did I not mention that whole time? Leave-one-out cross-validation. That is because it takes forever.


In [ ]:
%%time

n = len(y)
cv = sklearn.cross_validation.LeaveOneOut(n)
scores = sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)

print scores.mean(), scores.std()

Let's return to the stratified, repeated, 10-fold cross-validation approach:


In [ ]:
%%time

scores = []
for rep in range(10):
    print rep,
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True)
    scores += [sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)]
    
print

In [ ]:
np.mean(scores), np.std(scores)

Sure, we can have a couple of minutes of break time while this thing runs, but how do you think it is actually doing? Last week, we used a random prediction method as a baseline, and that is reasonable, but this week let's use something even simpler. Let's just predict a single class:


In [ ]:
y_single = 'Diabetes'
np.mean(y == y_single)

In [ ]:
y_single = 'Stroke'
np.mean(y == y_single)

In [ ]:
y_single = 'Other'
np.mean(y == y_single)

Uh-oh... we don't need any fancy classifier to get 80% accuracy, and we don't need to wait around for five minutes to see how well that fancy classifier does. We could just say "Other" and call it a day. For a really fair (yet still damning) comparison, let's run a stratified, repeated 10-fold cross-validation of the baseline classifier that always says "Other".

To do so, it will help to know a little more about how to use a sklearn.cross_validation object:


In [ ]:
cv = sklearn.cross_validation.KFold(10, n_folds=3, shuffle=False)
for train, test in cv:
    print train, test

We can use that to make a fair comparison of the decision tree to the baseline classifier that always predicts "Other":


In [ ]:
scores = []
for rep in range(10):
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True)
    for train, test in cv:
        scores.append(________________________________)
np.mean(scores)

To make this really, really fair, we should use the same random splits in each experiment...


In [ ]:
%%time

scores = []
for rep in range(10):
    print rep,
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=________________________________)
    scores += [sklearn.cross_validation.cross_val_score(clf, X, y, cv=cv)]
    
print
print np.mean(scores)

In [ ]:
scores = []
for rep in range(10):
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=________________________________)
    for train, test in cv:
        scores.append(np.mean(y[test] == 'Other'))
np.mean(scores)

But that is just practicing "defensive research", to prepare for the pickiest of critics...

What we really need to be thinking about now is what was called "cost-sensitive learning" in the Data Mining book.


In [ ]:
df.Cause.value_counts()

Make weights for the examples proportional to the inverse of the cause counts:


In [ ]:
weights = 1000. / df.Cause.value_counts()
weights

In [ ]:
sample_weight = np.array(weights[y])

What is in sample_weight?
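

(A quick way to check: there should be one weight per death, equal to the weight for that death's cause, so only three distinct values.)


In [ ]:
# peek at the sample weights
print sample_weight[:10]
print np.unique(sample_weight)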


In [ ]:
import sklearn.metrics
y_pred = ['Other' for i in range(len(y))]
sklearn.metrics.accuracy_score(y, y_pred, sample_weight=sample_weight)

Is that what you expected?

This is a better metric for our classification challenge, and we should use it when building the classifier and when measuring its performance:


In [ ]:
%%time

scores = []
for rep in range(10):
    print rep,
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456+rep)
    for train, test in cv:
        clf = sklearn.tree.DecisionTreeClassifier()
        clf.fit(X[train], y[train], sample_weight=________________________________)

        y_pred = clf.predict(X[test])
        scores.append(sklearn.metrics.accuracy_score(y[test], y_pred, sample_weight=________________________________))
                   
print
print np.mean(scores)

In [ ]:
scores = []
for rep in range(10):
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456+rep)
    for train, test in cv:
        y_pred = ['Other' for i in range(len(test))]
        scores.append(sklearn.metrics.accuracy_score(y[test], y_pred, sample_weight=________________________________))
np.mean(scores)

The moral is: ML methods can do something, but you have to be careful about how you evaluate them!

This relates to a subtle point that I think may be somewhat unique to population health metrics, where we are interested in the population-level mean of the predictions more than in the individual-level predictions themselves. I will see how far we make it through this exercise and decide if we have time to really dig into it.

In short, because we stratify the folds, the percent of test-set examples with underlying cause Diabetes is (nearly) equal to the percent of the training set with that underlying cause. So the cause distribution is not really out-of-sample, even though the actual examples are completely disjoint.
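
For reference, the function coded up below is the CSMF accuracy, where CSMF stands for cause-specific mortality fraction (the share of deaths assigned to each cause). For causes $j$ it is

$$\text{CSMF accuracy} = 1 - \frac{\sum_j \left|\text{CSMF}_j^\text{true} - \text{CSMF}_j^\text{pred}\right|}{2\left(1 - \min_j \text{CSMF}_j^\text{true}\right)},$$

so it equals 1 when the predicted cause fractions match the true ones exactly, and 0 for the worst possible set of predicted fractions.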


In [ ]:
def csmf_accuracy(y_true, y_pred):
    csmf_diff = 0
    csmf_true_min = 1.
    for j in ['Stroke', 'Diabetes', 'Other']:
        csmf_true_j = np.mean(np.array(y_true) == j)
        csmf_pred_j = np.mean(np.array(y_pred) == j)
        csmf_diff += np.abs(csmf_true_j - csmf_pred_j)
        
        if csmf_true_j < csmf_true_min:
            csmf_true_min = csmf_true_j
        #print csmf_true_j, csmf_pred_j, csmf_diff, csmf_true_min
    return 1 - csmf_diff / (2 * (1 - csmf_true_min))


# test this function
y_true = ['Stroke', 'Diabetes', 'Other']
y_pred = ['Diabetes', 'Other', 'Stroke']
assert np.allclose(csmf_accuracy(y_true, y_pred), 1)

y_true = ['Stroke', 'Stroke', 'Stroke']
y_pred = ['Diabetes', 'Other', 'Diabetes']
assert np.allclose(csmf_accuracy(y_true, y_pred), 0)

The baseline predictor is pretty good:


In [ ]:
scores = []
for rep in range(10):
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456+rep)
    for train, test in cv:
        y_pred = ['Other' for i in range(len(test))]
        scores.append(csmf_accuracy(y[test], y_pred))
np.mean(scores)

But because the validation environment is leaking information, a different silly predictor is even better. I call this "Random-From-Train":


In [ ]:
scores = []
for rep in range(10):
    cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456+rep)
    for train, test in cv:
        y_pred = np.random.choice(y[train], size=len(test), replace=True)
        scores.append(csmf_accuracy(y[test], y_pred))
np.mean(scores)

To fix it, you must resample the cause distribution of the test dataset. Super-hard extra bonus homework: figure out how to do this.


In [ ]:

Did I promise a little more for this week in the syllabus?

Probability prediction:


In [ ]:
clf.predict_proba(X[test[:10]])

Maybe more convincing with Naive Bayes:


In [ ]:
import sklearn.naive_bayes

In [ ]:
clf = sklearn.naive_bayes.MultinomialNB()
clf.fit(X[train], y[train])

In [ ]:
clf.predict_proba(X[test[:10]])

Use np.round to make this look nicer:


In [ ]:
np.round(________________________________, 2)

It is a bit hard to understand those probabilities. What do they mean?


In [ ]:
y[test[:10]]

Let us hope that the middle column is the probability of 'Other'. The column names must be stored in the clf instance somewhere:


In [ ]:
clf.classes_

Make that into a pretty DataFrame


In [ ]:
pd.DataFrame(clf.predict_proba(X[test[:10]]), columns=________________________________)

And what about numeric prediction? I'll give us a little something to show the mechanics, although it is a weird example. Can you figure out what g1_07a is? Hint: remember that cb is the codebook.


In [ ]:
def float_or_100(x):
    try:
        return float(x)
    except (ValueError, TypeError):
        # anything that can't be parsed as a number (e.g. a missing or non-numeric answer) gets clipped to 100
        return 100.
    
y = np.array(df.g1_07a.map(float_or_100))

Here is some MSE cross-validation:


In [ ]:
scores = []
cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456)

for train, test in cv:
    clf = sklearn.tree.DecisionTreeRegressor()
    clf.fit(X[train], y[train])

    y_pred = clf.predict(X[test])
    scores.append(sklearn.metrics.mean_squared_error(y[test], y_pred))
                   
print
print np.mean(scores)

I prefer root mean squared error, myself. I find it more interpretable:


In [ ]:
np.mean(_______________________)

There is also a mean absolute error in sklearn.


In [ ]:
sklearn.metrics.mean_absolute_error
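
A minimal usage sketch, reusing y_pred and test from the last fold above (it takes the same true/predicted arguments as mean_squared_error):


In [ ]:
# mean absolute error on the most recent test fold
sklearn.metrics.mean_absolute_error(y[test], y_pred)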

Can you make a custom error function that doesn't charge for the missing values we clipped to 100?


In [ ]:
def rmse_when_not_100(a, b):
    diff = a-b
    rows = _______________________
    return np.sqrt(np.mean(diff[rows]**2))

In [ ]:
scores = []
cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456)

for train, test in cv:
    clf = sklearn.tree.DecisionTreeRegressor()
    clf.fit(X[train], y[train])

    y_pred = clf.predict(X[test])
    scores.append(rmse_when_not_100(y[test], y_pred))
                   
print
print np.mean(scores)

Is this any good? How can you tell?


In [ ]:
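
One way to tell, sketched here under the assumption that you have filled in rmse_when_not_100 above: compare against a baseline that always predicts the training-set mean, scored with the same error function on the same splits.


In [ ]:
# hedged baseline sketch: constant prediction at the training-set mean
baseline_scores = []
cv = sklearn.cross_validation.StratifiedKFold(y, n_folds=10, shuffle=True, random_state=123456)
for train, test in cv:
    y_pred = np.ones(len(test)) * y[train].mean()
    baseline_scores.append(rmse_when_not_100(y[test], y_pred))

print np.mean(baseline_scores)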