In [110]:
!date
import numpy as np, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
%matplotlib inline
sns.set_context('talk')
sns.set_style('darkgrid')
In [111]:
# set random seed for reproducibility
np.random.seed(12345)
In [112]:
x_true = np.linspace(0,4,1000)
y_true = np.sin(x_true)
obs = np.random.choice(range(1000), size=100, replace=False)
x_obs = x_true[obs]
y_obs = np.random.normal(y_true[obs], .3)
plt.plot(x_true, y_true, label='True')
plt.plot(x_obs, y_obs, 's', label='Observed')
plt.legend(loc=(1,.1))
Out[112]:
And here are a few ways to fit this:
In [113]:
import sklearn.linear_model, sklearn.svm, sklearn.ensemble
In [114]:
m1 = sklearn.linear_model.LinearRegression()
m2 = sklearn.svm.SVR()
m3 = sklearn.ensemble.RandomForestRegressor()
In [115]:
X = x_obs[:,None]
y = y_obs
In [116]:
m1.fit(X,y)
Out[116]:
In [117]:
m2.fit(X,y)
Out[117]:
In [118]:
m3.fit(X,y)
Out[118]:
In [119]:
def plot_pred(m,label):
    X = x_true[:,None]
    y = m.predict(X)
    plt.plot(X, y, label=label)
What do you think the predictions from m1, m2, and m3 are going to look like?
In [120]:
plt.plot(x_true, y_true, label='True')
plt.plot(x_obs, y_obs, 's', label='Observed')
plot_pred(m1, 'Linear Regression')
plot_pred(m2, 'SVM')
plot_pred(m3, 'Random Forest')
plt.legend(loc=(1,.1))
Out[120]:
Add random noise as a predictor
In [121]:
X = np.hstack((X, np.random.normal(size=(100,1))))
In [122]:
m1.fit(X,y)
m2.fit(X,y)
m3.fit(X,y)
Out[122]:
In [123]:
def plot_pred_w_noise(m,label):
    X = x_true[:,None]
    X = np.hstack((X, np.random.normal(size=(1000,1))))
    y = m.predict(X)
    plt.plot(X[:,0], y, label=label, alpha=.75)
plt.plot(x_true, y_true, label='True')
plt.plot(x_obs, y_obs, 's', label='Observed')
plot_pred_w_noise(m1, 'Linear Regression')
plot_pred_w_noise(m2, 'SVM')
plot_pred_w_noise(m3, 'Random Forest')
plt.legend(loc=(1,.1))
Out[123]:
Work through one round of forward selection
In [124]:
X = x_obs[:,None]
y = y_obs
X = np.hstack((X, np.random.normal(size=(100,10))))
In [125]:
# how does forward selection determine which column of X to include first?
# choose the model to use
m = m1
# loop through columns, see which one is most informative
n,p = X.shape
for j in range(p):
    Xj = X[:,[j]]
    m.fit(Xj, y)
    # should we gauge information in-sample or out-of-sample?
    y_pred = m.predict(Xj)
    print 'j=',j,'in-sample MSE=',np.round(np.mean((y - y_pred)**2),2)
So the first thing to include is column zero! No surprise...
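As the comment above asks, we could also gauge the information out-of-sample. Here is a minimal sketch of the same loop scored with cross-validation instead; note that cross_val_score reports R^2 by default for regressors (an assumption about which score to use), so the numbers are on a different scale than the in-sample MSE above.
In [ ]:
# a sketch: the same forward-selection loop, but scored out-of-sample
import sklearn.cross_validation
for j in range(p):
    Xj = X[:,[j]]
    score_j = sklearn.cross_validation.cross_val_score(m, Xj, y)
    print 'j=', j, 'out-of-sample R^2=', np.round(np.mean(score_j), 2)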
Work through one round of backwards elimination
In [126]:
X = x_obs[:,None]
y = y_obs
X = np.hstack((X, np.random.normal(size=(100,10))))
In [127]:
# how does backwards elimination determine which column of X to remove first?
# choose the model to use
m = m1
# see how well the whole thing predicts. Let's use out-of-sample this time.
import sklearn.cross_validation
full_score = sklearn.cross_validation.cross_val_score(m, X, y)
# loop through columns, see which one we can most afford to remove
n,p = X.shape
for j in range(p):
    X_wo_j = X[:, range(0,j)+range(j+1,p)]
    score_wo_j = sklearn.cross_validation.cross_val_score(m, X_wo_j, y)
    print 'j=',j,'out-of-sample drop in R^2=', np.round(np.mean(full_score - score_wo_j),2)
In [128]:
# so first one to kick is j=8, although nothing seems important besides j=0
If you need to use this in your project, consider learning more about the backwards elimination method in sklearn with sklearn.feature_selection.RFE and .RFECV.
In [129]:
import sklearn.feature_selection
In [130]:
r = sklearn.feature_selection.RFE(sklearn.linear_model.LinearRegression(), n_features_to_select=2, step=1)
r.fit(X,y)
Out[130]:
In [131]:
# features included
r.support_
Out[131]:
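RFECV, mentioned above, goes one step further and chooses how many features to keep by cross-validation; a minimal sketch (the default scoring for a regressor is R^2):
In [ ]:
# a sketch of RFECV: recursive feature elimination, with the number of
# features to keep chosen by cross-validation
rcv = sklearn.feature_selection.RFECV(sklearn.linear_model.LinearRegression(), step=1, cv=5)
rcv.fit(X, y)
rcv.support_, rcv.n_features_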
In [132]:
# load va data from Week 4
df = pd.read_csv('/homes/abie/ML4HM/Week_4/IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv', low_memory=False)
We are going to look at questions about the cough symptom here: a2_32 and a2_33
In [133]:
# did deceased have cough?
df.a2_32.head()
Out[133]:
In [134]:
# tabulate that: did deceased have cough?
df.a2_32.value_counts()
Out[134]:
In [135]:
# how long did the cough last (in days)?
df.a2_33.head()
Out[135]:
In [136]:
sns.distplot(np.log(df.a2_33+1))
plt.xlabel('log(duration+1)')
plt.ylabel('probability density')
plt.axis(xmin=0)
Out[136]:
In [137]:
# it is getting annoying to remember that silly name...
df['cough'] = df.a2_32 == 'Yes'
df['duration'] = df.a2_33.map(float)
In [138]:
df[df.cough].duration.describe(percentiles=[.025,.25,.50,.75,.975])
Out[138]:
In [139]:
14400 / 30. / 12.
Out[139]:
Duration ranges from 0 to 14400 days.
Now: discretize the duration variable into $K$ equal-width bins, with $K = \sqrt{2766} \approx 53$.
In [140]:
np.sqrt(2766)
Out[140]:
In [141]:
pd.cut(df.duration, bins=53)
Out[141]:
In [142]:
# this is "fixed width" discretization, hence the name fw_disc:
df['fw_disc_duration'] = pd.cut(df.duration, bins=53)
And how do you make this discretized covariate something suitable for binary classifiers? One-Hot encoding, of course.
In [143]:
disc_duration = pd.cut(df.duration, bins=53)
In [144]:
# one hot encode them
for dur in disc_duration.unique():
    df['fw_disc_duration_%s'%dur] = (disc_duration == dur)
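If you prefer, pandas can do the same one-hot encoding in one line; a small sketch (the prefix argument only controls the new column names):
In [ ]:
# equivalent one-hot encoding with pandas: one indicator column per bin
fw_dummies = pd.get_dummies(disc_duration, prefix='fw_disc_duration')
fw_dummies.head()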
How about fixed frequency discretization?
In [145]:
bins = df.duration.order().iloc[(len(df.duration)*np.linspace(0,1,53,endpoint=False)).astype(int)]
bins = np.unique(bins)
df['ff_disc_duration'] = pd.cut(df.duration, bins)
for dur in df.ff_disc_duration.unique():
    df['duration_in_bin_%s'%dur] = (df.ff_disc_duration == dur)
Or Thermometer encoding:
In [146]:
for lb in bins:
    df['duration_in_bin_at_least_%.2f'%lb] = (df.duration >= lb)
Another approach uses only the rows where cough == True to get the distribution of duration.
In [147]:
ordered_durations = np.array(df[df.cough].duration.order())
ordered_durations[np.linspace(0,2766,52, endpoint=False).astype(int)]
Out[147]:
In [148]:
df['ff_disc_duration'] = [np.sum(df.duration[i] >= ordered_durations) for i in df.index]
What we ended up doing is looking at the mean duration stratified by cause:
In [149]:
df.groupby('gs_text34').duration.mean().order(ascending=False)
Out[149]:
In [150]:
sns.distplot(df.groupby('gs_text34').duration.mean())
Out[150]:
In [151]:
df['cough_duration_is_long'] = (df.duration > 60)
df['cough_duration_is_really_long'] = (df.duration > 90)
In [152]:
import sklearn.cross_decomposition
m4 = sklearn.cross_decomposition.PLSRegression()
So for our exercise, just two little examples of "semantic projections".
One from VA, wherein dates can be subtracted to make a more meaningful feature:
In [153]:
# year of birth
df.g1_01y.head()
Out[153]:
In [154]:
# year of death
df.g1_06y.head()
Out[154]:
Much more meaningful to have the age of the deceased (which is g1_07a in the database, but let's pretend it is not).
In [155]:
def float_or_nan(x):
    try:
        return float(x)
    except ValueError:
        return np.nan
In [156]:
age_at_death = df.g1_06y - df.g1_01y.map(float_or_nan)
age_at_death
Out[156]:
In [157]:
# for use later:
reported_age_at_death = df.g1_07a
Another time this comes up: time series, like Ebola case counts: http://institutefordiseasemodeling.github.io/EVD/
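For example, if case counts are reported as cumulative totals, differencing adjacent reports projects them onto the (often more meaningful) number of new cases per report; a sketch with made-up numbers:
In [ ]:
# hypothetical cumulative case counts; differencing recovers new cases per report
cum_cases = pd.Series([10, 14, 21, 30, 44, 70])
cum_cases.diff()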
In [158]:
sampled_ints = np.random.choice(range(int(1e7)), 50000)
len(sampled_ints)
Out[158]:
In [159]:
np.min(sampled_ints), 1./1000 * 1e7
Out[159]:
In [160]:
np.mean(sampled_ints), 1e7 /2
Out[160]:
In [161]:
np.unique(sampled_ints).shape
Out[161]:
Again, without replacement:
In [162]:
sampled_ints=np.random.choice(np.arange(1e7), size=50000, replace=False)
In [163]:
np.unique(sampled_ints).shape
Out[163]:
Speed is a good reason to do this sort of sampling. Let us revisit the noisy sine wave to see what I mean:
In [164]:
np.random.seed(12345)
x_true = np.linspace(0,4,1000)
y_true = np.sin(x_true)
x_obs = np.linspace(0,4,1000*1000)
y_obs = np.random.normal(np.sin(x_obs), .3)
plt.plot(x_obs, y_obs, ',', label='Observed', alpha=.5)
plt.plot(x_true, y_true, label='True', linewidth=10)
plt.legend(loc=(1,.1))
Out[164]:
In [165]:
m = sklearn.ensemble.RandomForestRegressor(max_depth=10)
%time m.fit(x_obs[:,None], y_obs)
Out[165]:
In [166]:
plt.plot(x_obs, y_obs, ',', label='Observed', alpha=.5)
plt.plot(x_true, y_true, label='True', linewidth=4)
plot_pred(m, 'Random Forest')
plt.legend(loc=(1,.1))
Out[166]:
In [167]:
import time
In [168]:
time.time()
Out[168]:
In [169]:
n = len(x_obs)
In [170]:
t = {}
for n_sample in [10,50,100,500,1000,5000,10*1000,50*1000,100*1000]:
    random_sample = np.random.choice(range(n), size=n_sample, replace=False)
    t[n_sample] = time.time()
    %time m.fit(x_obs[random_sample,None], y_obs[random_sample])
    t[n_sample] = time.time() - t[n_sample]
In [171]:
t = pd.Series(t)
plt.loglog(t.index, t, 's-')
plt.ylabel('Time (seconds)')
plt.xlabel('Rows of training data (n)')
Out[171]:
Now what you should also do is have a look at the prediction accuracy as a function of the number of samples, to decide how much data is worth your time. (Evaluate out-of-sample, of course!)
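Here is a minimal sketch of that check, reusing m, n, x_obs, and y_obs from the cells above: out-of-sample MSE as a function of the number of training rows, with a separately sampled hold-out set (ignoring the small chance of overlap between the two samples):
In [ ]:
# a sketch: out-of-sample MSE as a function of the number of training rows
mse = {}
holdout = np.random.choice(range(n), size=10*1000, replace=False)
for n_sample in [10, 100, 1000, 10*1000, 100*1000]:
    random_sample = np.random.choice(range(n), size=n_sample, replace=False)
    m.fit(x_obs[random_sample,None], y_obs[random_sample])
    y_pred = m.predict(x_obs[holdout,None])
    mse[n_sample] = np.mean((y_obs[holdout] - y_pred)**2)
pd.Series(mse)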
How long is the index of the age_at_death series we made in the semantic projection part of the exercise?
In [172]:
len(age_at_death.index)
Out[172]:
And how much of it is missing?
In [173]:
age_at_death.isnull().sum()
Out[173]:
We should fill it in somehow. What does the non-missing part look like?
In [174]:
age_at_death.describe()
Out[174]:
Negative ages are not good...
In [175]:
plt.plot(reported_age_at_death.map(float_or_nan), age_at_death, 'o')
plt.xlabel('Reported Age at Death (years)')
plt.ylabel('Reported year of death minus year of birth')
Out[175]:
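One simple way to fill it in, as a sketch: treat negative ages as missing, then fall back to the reported age at death where the year-subtraction version is unavailable (whether that is the right fix for this data is a judgment call):
In [ ]:
# a sketch of simple clean-up: negative ages become missing,
# then missing values are filled from the reported age at death
age_clean = age_at_death.copy()
age_clean[age_clean < 0] = np.nan
age_clean = age_clean.fillna(reported_age_at_death.map(float_or_nan))
age_clean.describe()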
Cleaning up a mess like this is something with which I suspect many of my students are familiar.
Hidden in this section is a very interesting approach, which DM calls "generating artificial data". This is a way to use any supervised learning algorithm to do unsupervised work. We will experiment with it on Nick's example data, and I hope he will push this further as part of his project:
In [176]:
df = pd.read_csv('/homes/abie/ML4HM/Week_9/Nick_project_data_example_ZWE_2010_DHS_cleaned.csv')
In [177]:
df.head()
Out[177]:
Start with the feature vectors:
In [178]:
X_real = np.array(df.iloc[:, 1:])
Label it all as real:
In [179]:
n,p = X_real.shape
In [180]:
y_real = np.array(['real']*n)
Then make fake data, which has the same univariate distributions, but none of the correlation structure:
In [181]:
X_fake = np.empty((n,p))
for j in range(p):
    X_fake[:,j] = np.random.choice(X_real[:,j], size=n, replace=False)
Label it all as fake:
In [182]:
y_fake = np.array(['fake']*n)
Stick them together and you have the labeled examples for supervised learning:
In [183]:
X = np.vstack((X_real, X_fake))
y = np.hstack((y_real, y_fake))
In [184]:
X_real.shape
Out[184]:
In [185]:
y_real.shape
Out[185]:
In [186]:
c1 = sklearn.linear_model.LogisticRegression()
c1.fit(X,y)
c1.score(X,y)
Out[186]:
In [187]:
c2 = sklearn.ensemble.RandomForestClassifier()
c2.fit(X,y)
c2.score(X,y)
Out[187]:
In [188]:
y_pred = c2.predict(X)
In [189]:
np.mean(y==y_pred)
Out[189]:
In [190]:
import sklearn.cross_validation
In [191]:
sklearn.cross_validation.cross_val_score(c2, X, y)
Out[191]:
So the question for Nick is, how does RF do it?
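One place to start looking is the forest's feature importances, matched back to the column names of Nick's data; a sketch:
In [ ]:
# one way to peek at what the random forest is using: its feature importances,
# matched back to the column names of the feature matrix
importances = pd.Series(c2.feature_importances_, index=df.columns[1:])
importances.order(ascending=False).head(10)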
In [192]:
# load va data from Week 4
df = pd.read_csv('/homes/abie/ML4HM/Week_4/IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv', low_memory=False)
In [193]:
df.gs_text34.value_counts()
Out[193]:
In "one-vs-all" approach, for each multiclass label, make a new problem of classifying as "is it this label?"
In [194]:
y = (df.gs_text34 == 'Asthma')
In [195]:
pd.Series(y).value_counts()
Out[195]:
In "one-vs-one approach, for each pair of labels, make a new problem with instances just for that pair:
In [196]:
rows = df.gs_text34.isin(['Stroke', 'Asthma'])
y = df[rows].gs_text34
In [197]:
y.value_counts()
Out[197]:
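sklearn can also build these reductions for you, via sklearn.multiclass; a sketch, assuming some feature matrix X_va for the VA records (which we have not constructed here):
In [ ]:
import sklearn.multiclass
# OneVsRestClassifier and OneVsOneClassifier automate the two reductions;
# X_va is a hypothetical feature matrix for the VA records (not built above)
ovr = sklearn.multiclass.OneVsRestClassifier(sklearn.linear_model.LogisticRegression())
ovo = sklearn.multiclass.OneVsOneClassifier(sklearn.linear_model.LogisticRegression())
# ovr.fit(X_va, df.gs_text34); ovo.fit(X_va, df.gs_text34)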
In [198]:
X = np.vstack((X_real, X_fake))
y = np.hstack((y_real, y_fake))
In [199]:
train = np.random.choice(range(len(X)), size=len(X)/2, replace=False)
test = list(set(range(len(X))) - set(train))
c = sklearn.ensemble.RandomForestClassifier()
c.fit(X[train], y[train])
Out[199]:
We can make probability predictions for most classifiers in sklearn:
In [200]:
c.predict_proba(X[test[0]])
Out[200]:
Did that tend towards the right answer?
In [201]:
y[test[0]]
Out[201]:
Can't tell from that, actually...
In [202]:
c.classes_
Out[202]:
So yes... How does it do, overall, on things it says have an 80% chance of being real?
In [203]:
y_pr_pred = c.predict_proba(X[test])
How many things like this are there?
In [204]:
np.sum(y_pr_pred[:,1]==.8)
Out[204]:
In [205]:
# in general, better to have a little wiggle room
np.sum((y_pr_pred[:,1] >= .79)&(y_pr_pred[:,1] < .81))
Out[205]:
In [206]:
rows = np.where(y_pr_pred[:,1]==.8)
In [207]:
test = np.array(test)
np.mean(y[test[rows]] == 'real')
Out[207]:
So this is a bit conservative! What about for some low probability examples?
In [208]:
np.sum(y_pr_pred[:,1]==0)
Out[208]:
In [209]:
rows = np.where(y_pr_pred[:,1]==0)
In [210]:
np.mean(y[test[rows]] == 'real')
Out[210]:
Perhaps with more trees, things would be better. But Random Forests are not really designed for probability prediction, so if that is what we want, perhaps we need to work on it a little ourselves. Got any ideas for how to do so?
In [211]:
import sklearn.linear_model, patsy
In [212]:
cal = sklearn.linear_model.LogisticRegression(C=1e4)
X_pr_pred = patsy.dmatrix('bs(x,10)', {'x':y_pr_pred[:,1]})
cal.fit(X_pr_pred,y[test])
Out[212]:
In [213]:
y_cal_pr = cal.predict_proba(X_pr_pred)
In [214]:
# in general, better to have a little wiggle room
np.sum((y_cal_pr[:,1] >= .79)&(y_cal_pr[:,1] < .81))
Out[214]:
In [215]:
rows = (y_cal_pr[:,1] >= .79)&(y_cal_pr[:,1] < .81)
In [216]:
np.mean(y[test[rows]] == 'real')
Out[216]:
Quite a bit closer... how to check this more rigorously is something I can tell you more about sometime.
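One way to check more rigorously, as a sketch: group the calibrated probabilities into bins and compare each bin's mean predicted probability with the observed fraction of 'real' rows (a reliability curve), reusing y_cal_pr, y, and test from above:
In [ ]:
# a sketch of a reliability (calibration) curve: within each probability bin,
# compare the mean predicted probability to the observed fraction of 'real' rows
cal_df = pd.DataFrame({'predicted': y_cal_pr[:,1],
                       'observed': (y[test] == 'real').astype(float)})
cal_df['bin'] = np.round(cal_df.predicted, 1)
calibration = cal_df.groupby('bin').mean()
plt.plot(calibration.predicted, calibration.observed, 's-', label='calibrated RF')
plt.plot([0,1], [0,1], ':', label='perfectly calibrated')
plt.xlabel('mean predicted Pr[real]')
plt.ylabel('observed fraction real')
plt.legend(loc='upper left')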