Updating this to match what was found when trying to mimic their test. Weighting improved the performance; now I'm going to try some simple feature selection on the entire feature set and see if performance can be improved further.


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

In [2]:
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x7fe98c649438>

Loading the data

As before, loading the data. But now there's more data to load.


In [3]:
cd ..


/home/gavin/repositories/hail-seizure

In [4]:
import train
import json
import imp

In [5]:
settings = json.load(open('SETTINGS.json', 'r'))

In [6]:
settings['FEATURES']


Out[6]:
['ica_feat_var_',
 'ica_feat_cov_',
 'ica_feat_corrcoef_',
 'ica_feat_pib_',
 'ica_feat_xcorr_',
 'ica_feat_psd_',
 'ica_feat_psd_logf_',
 'ica_feat_coher_',
 'ica_feat_coher_logf_',
 'raw_feat_var_',
 'raw_feat_cov_',
 'raw_feat_corrcoef_',
 'raw_feat_pib_',
 'raw_feat_xcorr_',
 'raw_feat_psd_',
 'raw_feat_psd_logf_']

In [9]:
data = train.utils.get_data(settings['FEATURES'],settings)

In [10]:
!free -m


             total       used       free     shared    buffers     cached
Mem:         11933      11699        233        440       1399       2352
-/+ buffers/cache:       7948       3984
Swap:        12287         50      12237

Developing training script

It's not generally a good idea to run anything long-running in an IPython notebook: the notebook can freeze, or lose work if it gets disconnected. Going to develop a script here that can be run locally or on salmon.


In [11]:
# getting a set of the subjects involved
subjects = set(list(data.values())[0].keys())
print(subjects)


{'Dog_4', 'Patient_1', 'Dog_3', 'Patient_2', 'Dog_2', 'Dog_1', 'Dog_5'}

In [12]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
import sklearn.metrics
from train import utils

Feature selection

We want to do some simple feature selection: even with a massive amount of RAM available, there's no point in using features that are obviously useless. The first suggestion is a variance threshold, removing features with low variance.


In [13]:
X,y,cv,segments = utils.build_training(list(subjects)[0],list(data.keys()),data)

Something must have changed in the data since the last time I ran this script, because this used to work:


In [14]:
h=plt.hist(np.log10(np.var(X,axis=0)))


-c:1: RuntimeWarning: divide by zero encountered in log10
/usr/lib/python3.4/site-packages/numpy/core/function_base.py:97: RuntimeWarning: invalid value encountered in multiply
  y = _nx.arange(0, num, dtype=dtype) * step + start
/usr/lib/python3.4/site-packages/numpy/core/function_base.py:97: RuntimeWarning: invalid value encountered in add
  y = _nx.arange(0, num, dtype=dtype) * step + start
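
The divide-by-zero warnings suggest that some feature columns now have exactly zero variance. A quick check for that (not run here; it just assumes X is the training matrix built above) would be:


In [ ]:
variances = np.var(X, axis=0)
# count columns that are identically constant across segments
print("zero-variance features:", np.sum(variances == 0), "of", len(variances))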

However, I don't really like this approach much, as low variance doesn't mean there's no information there. After all, variance scales with any multiplicative constant applied to the feature.
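
To illustrate with a toy sketch (random data, nothing from the real feature set): rescaling a feature by 1000 multiplies its variance by a factor of a million without adding any information, so a fixed variance threshold is really a statement about units.


In [ ]:
x = np.random.randn(1000)
# same information, variance scaled by 1000**2
print(np.var(x), np.var(1000 * x))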

A better approach is scikit-learn's SelectKBest, which can score features with $\chi^2$ or ANOVA F-values. Can't use $\chi^2$ here, as it demands non-negative features. Selecting down to the 50 best features with the ANOVA F-value and attempting to plot them in 2D with PCA:
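
As a quick illustration of the $\chi^2$ restriction (hypothetical toy arrays, not the real features): chi2 refuses negative inputs, while f_classif computes ANOVA F-values regardless of sign.


In [ ]:
from sklearn.feature_selection import chi2, f_classif

# toy data: two features, two classes, with negative values
Xtoy = np.array([[-1.0, 0.2], [-0.9, 0.1], [1.1, 0.3], [1.0, 0.2]])
ytoy = np.array([0, 0, 1, 1])

try:
    chi2(Xtoy, ytoy)
except ValueError as e:
    print("chi2 fails:", e)          # chi2 requires non-negative features

F, pvals = f_classif(Xtoy, ytoy)     # ANOVA F-values handle negative features
print(F, pvals)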


In [13]:
import sklearn.feature_selection

In [30]:
Xbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif, k=50).fit_transform(X,y)

In [14]:
import sklearn.decomposition

In [40]:
pca = sklearn.decomposition.PCA(n_components=2)
scaler = sklearn.preprocessing.StandardScaler()
twodX = pca.fit_transform(scaler.fit_transform(Xbest))
plt.scatter(twodX[:,0][y==1],twodX[:,1][y==1],c='blue')
plt.scatter(twodX[:,0][y==0],twodX[:,1][y==0],c='red')


Out[40]:
<matplotlib.collections.PathCollection at 0x7fe929030518>

Looking good, now doing the same with the magical t-SNE:


In [41]:
import sklearn.manifold

In [42]:
tsne = sklearn.manifold.TSNE()
twodX = tsne.fit_transform(scaler.fit_transform(Xbest))
plt.scatter(twodX[:,0][y==1],twodX[:,1][y==1],c='blue')
plt.scatter(twodX[:,0][y==0],twodX[:,1][y==0],c='red')


Out[42]:
<matplotlib.collections.PathCollection at 0x7fe928042d30>

Also looking good. So all we need to do is add the selection step to our model, turn up n_estimators, and we should get a better prediction.


In [15]:
selection = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif,k='all')
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('sel',selection),('scl',scaler),('clf',forest)])

Testing this model on this single subject:


In [16]:
import imp

In [23]:
imp.reload(utils)


Out[23]:
<module 'python.utils' from '/home/gavin/repositories/hail-seizure/python/utils.py'>

In [24]:
def subjpredictions(subject,model,data):
    """Cross-validated predictions, true labels and class-balancing weights for one subject."""
    X,y,cv,segments = utils.build_training(subject,list(data.keys()),data)
    predictions = []
    labels = []
    allweights = []
    for train,test in cv:
        # calculate weights
        weight = len(y[train])/sum(y[train])
        weights = np.array([weight if i == 1 else 1 for i in y[train]])
        model.fit(X[train],y[train],clf__sample_weight=weights)
        predictions.append(model.predict_proba(X[test]))
        weight = len(y[test])/sum(y[test])
        weights = np.array([weight if i == 1 else 1 for i in y[test]])
        allweights.append(weights)
        labels.append(y[test])
    predictions = np.vstack(predictions)[:,1]
    labels = np.hstack(labels)
    weights = np.hstack(allweights)
    return predictions,labels,weights
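
For reference, the weighting here up-weights the rare preictal class by the inverse of its prevalence, so the preictal segments end up carrying about as much total weight as the interictal ones. A minimal sketch with made-up labels:


In [ ]:
# hypothetical labels: 2 preictal segments out of 10
ytoy = np.array([0]*8 + [1]*2)
weight = len(ytoy)/sum(ytoy)                                 # 10/2 = 5
wtoy = np.array([weight if i == 1 else 1 for i in ytoy])
# total weight on the preictal class vs the interictal class
print((wtoy*ytoy).sum(), (wtoy*(1 - ytoy)).sum())            # 10.0 vs 8.0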

In [25]:
%%time
p,l,w = subjpredictions(list(subjects)[0],model,data)


CPU times: user 26.8 s, sys: 1min 17s, total: 1min 44s
Wall time: 1min 44s

In [26]:
sklearn.metrics.roc_auc_score(l,p)


Out[26]:
0.72120888746926837

In [27]:
sklearn.metrics.roc_auc_score(l,p,sample_weight=w)


Out[27]:
0.72106667463046414

In [28]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p,sample_weight=w)
plt.plot(fpr,tpr)


Out[28]:
[<matplotlib.lines.Line2D at 0x7fe9170e0b00>]

It certainly works a bit better than the classifier I was working with before. What if we increase the number of estimators to deal with the much larger number of features?


In [29]:
model.set_params(clf__n_estimators=5000,clf__max_depth=None)


Out[29]:
Pipeline(steps=[('sel', SelectKBest(k='all', score_func=<function f_classif at 0x7fe9835af268>)), ('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=5000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0))])

In [39]:
%%time
p,l,w = subjpredictions(list(subjects)[0],model,data)


CPU times: user 27min 43s, sys: 53.3 s, total: 28min 36s
Wall time: 28min 35s

In [42]:
sklearn.metrics.roc_auc_score(l,p)


Out[42]:
0.78421005450372916

In [43]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
plt.plot(fpr,tpr)


Out[43]:
[<matplotlib.lines.Line2D at 0x7fe916f66b00>]

Actually, it looks like I could just run this in the notebook and it would probably be fine. Will write the script after doing this.


In [38]:
imp.reload(utils)


Out[38]:
<module 'python.utils' from '/home/gavin/repositories/hail-seizure/python/utils.py'>

In [34]:
features = list(data.keys())

In [35]:
model.set_params(clf__n_estimators=5000,clf__max_depth=None)


Out[35]:
Pipeline(steps=[('sel', SelectKBest(k='all', score_func=<function f_classif at 0x7fe9835af268>)), ('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=5000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0))])

In [36]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y,cv,segments = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction


---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-36-62d772764485> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'predictiondict = {}\nfor subj in subjects:\n    # training step\n    X,y,cv,segments = utils.build_training(subj,features,data)\n    # weights\n    weight = len(y)/sum(y)\n    weights = np.array([weight if i == 1 else 1 for i in y])\n    model.fit(X,y,clf__sample_weight=weights)\n    # prediction step\n    X,segments = utils.build_test(subj,features,data)\n    predictions = model.predict_proba(X)\n    for segment,prediction in zip(segments,predictions):\n        predictiondict[segment] = prediction')

/usr/lib/python3.4/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2160             magic_arg_s = self.var_expand(line, stack_depth)
   2161             with self.builtin_trap:
-> 2162                 result = fn(magic_arg_s, cell)
   2163             return result
   2164 

/usr/lib/python3.4/site-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)

/usr/lib/python3.4/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

/usr/lib/python3.4/site-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1127         else:
   1128             st = clock2()
-> 1129             exec(code, glob, local_ns)
   1130             end = clock2()
   1131             out = None

<timed exec> in <module>()

/home/gavin/repositories/hail-seizure/python/utils.py in build_training(subject, features, data, flagpseudo)
    208                 try:
    209                     Xf = np.vstack([Xf, np.ndarray.flatten(\
--> 210                             data[feature][subject][ictal][segment].T)])
    211                 except ValueError:
    212                     Xf = np.ndarray.flatten(\

/usr/lib/python3.4/site-packages/numpy/core/shape_base.py in vstack(tup)
    226 
    227     """
--> 228     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    229 
    230 def hstack(tup):

KeyboardInterrupt: 

In [37]:
import csv

In [38]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])

Resubmitted and got 0.49, so reducing the depth doesn't seem to help with the generalisation problems. It's also odd that the score would be that low.

Checking if there's anything obviously weird with the output file:


In [39]:
!head output/protosubmission.csv

Nope, looks ok.

There are three warnings above, so this problem is only occurring on three of the subjects. Could run each subject individually and try to find one where the ROC completely falls apart.


In [ ]:
%%time
for s in subjects:
    p,l,w = subjpredictions(s,model,data)
    print("subject %s"%s, sklearn.metrics.roc_auc_score(l,p,sample_weight=w))

A simple solution is apparently to run a variance threshold before the feature selection (as it says on the wiki, really).


In [62]:
thresh = sklearn.feature_selection.VarianceThreshold()
selection = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif,k=3000)
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('thr',thresh),('sel',selection),('scl',scaler),('clf',forest)])

In [63]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction


CPU times: user 2min 14s, sys: 47 s, total: 3min 1s
Wall time: 4min 58s

In [64]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])
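
As a final sanity check (a sketch only, assuming the file was written as above), the submission can be read back to confirm the header and that every prediction is a probability in [0, 1]:


In [ ]:
with open("output/protosubmission.csv") as f:
    rows = list(csv.reader(f))
print(rows[0])                                     # expect ['clip', 'preictal']
probs = np.array([float(r[1]) for r in rows[1:]])
print(len(probs), probs.min(), probs.max())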