The idea in this notebook is to reduce the dimensionality of the datasets by transforming individual features using classifiers. Once we've done this it will be possible to combine the subject specific datasets into a single global dataset. This might run the risk of overfitting, but it is also a nice way to create a global classifier.

Loading the data and initialisation

Same initialisation steps as in other notebooks:


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x7ff9dc955470>

In [2]:
cd ..


/home/gavin/repositories/hail-seizure

In [3]:
import train
import json
import imp

In [4]:
settings = json.load(open('SETTINGS.json', 'r'))

In [5]:
data = train.get_data(settings['FEATURES'][:3])

In [6]:
!free -m


             total       used       free     shared    buffers     cached
Mem:         11933       5066       6866        385        368       2533
-/+ buffers/cache:       2164       9768
Swap:        12287         41      12246

Random forest supervised classification

For each feature and each subject we want to train a random forest and use it to transform the data. We also want to appropriately weight the samples due to the unbalanced classes.

Since I'm a big fan of dictionaries it seems like it would be easy to do this with a dictionary iterating over subjects and features and saving predictions.


In [7]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
from train import utils

In [8]:
imp.reload(utils)


Out[8]:
<module 'python.utils' from '/home/gavin/repositories/hail-seizure/python/utils.py'>

Below code copied and modified from random forest submission notes:


In [221]:
features = settings['FEATURES'][:3]

In [10]:
subjects = settings['SUBJECTS']

In [11]:
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('scl',scaler),('clf',forest)])

In [166]:
import sklearn.feature_extraction

In [179]:
oneofk = sklearn.preprocessing.OneHotEncoder(sparse=False)

In [180]:
x = np.arange(10)[np.newaxis]

In [181]:
oneofk.fit_transform(x)


Out[181]:
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])

In [240]:
%%time
predictiondict = {}
for feature in features:
    print("Processing {0}".format(feature))
    for i,subj in enumerate(subjects):
        # training step
        X,y,cv,segments = utils.build_training(subj,[feature],data)
        X = scaler.fit_transform(X)
        predictions = np.mean(X,axis=1)
        for segment,prediction in zip(segments,predictions):
            try:
                predictiondict[segment][feature] = [prediction]
            except:
                predictiondict[segment] = {}
                predictiondict[segment][feature] = [prediction]
            # add subject 1-of-k vector
            subjvector = np.zeros(len(subjects))
            subjvector[i] = 1
            predictiondict[segment]['subject'] = list(subjvector)


Processing ica_feat_var_
Processing ica_feat_cov_
Processing ica_feat_corrcoef_
CPU times: user 847 ms, sys: 30 ms, total: 877 ms
Wall time: 873 ms

Next, creating the full training set:


In [241]:
segments = list(predictiondict.keys())

In [242]:
predictiondict[segments[0]].keys()


Out[242]:
dict_keys(['ica_feat_cov_', 'ica_feat_var_', 'ica_feat_corrcoef_', 'subject'])

In [243]:
X = np.array([])[np.newaxis]
train,test = [],[]
for i,segment in enumerate(segments):
    row = []
    for feature in features+['subject']:
        row += predictiondict[segment][feature]
    try:
        X = np.vstack([X,np.array(row)[np.newaxis]])
    except:
        X = np.array(row)[np.newaxis]

In [244]:
X


Out[244]:
array([[-0.49685521,  0.00605736,  0.01370309, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.51705194,  0.04152316,  0.05378171, ...,  0.        ,
         0.        ,  0.        ],
       [-0.11685621,  0.04234982,  0.04947848, ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [-0.35475441, -0.02805797, -0.01568938, ...,  0.        ,
         0.        ,  0.        ],
       [-0.05627469,  0.0089621 ,  0.0347518 , ...,  0.        ,
         0.        ,  0.        ],
       [-0.19796733,  0.04504965,  0.06480923, ...,  0.        ,
         0.        ,  0.        ]])

In [245]:
y = [1 if 'preictal' in segment else 0 for segment in segments]

In [246]:
y = np.array(y)

In [247]:
len(y)


Out[247]:
4067

In [248]:
len(X)


Out[248]:
4067

In [249]:
len(segments)


Out[249]:
4067

In [250]:
cv = sklearn.cross_validation.StratifiedShuffleSplit(y)

In [255]:
weight = len(y)/sum(y)

In [256]:
weights = [weight if i == 1 else 1 for i in y]

In [258]:
for train,test in cv:
    forest.fit(X[train],y[train],sample_weight=weight)
    predictions = forest.predict_proba(X[test])
    score = sklearn.metrics.roc_auc_score(y[test],predictions[:,1])
    print(score)


0.653801945181
0.690848806366
0.573430592396
0.679442970822
0.703227232538
0.681520778073
0.724845269673
0.661582670203
0.772944297082
0.705702917772

In [261]:
forest.fit(X,y,sample_weight=weights)


Out[261]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [264]:
predictiondict = {}
for feature in features:
    print("Processing {0}".format(feature))
    for i,subj in enumerate(subjects):
        X,segments = utils.build_test(subj,[feature],data)
        X = scaler.fit_transform(X)
        predictions = np.mean(X,axis=1)
        for segment,prediction in zip(segments,predictions):
            try:
                predictiondict[segment][feature] = [prediction]
            except:
                predictiondict[segment] = {}
                predictiondict[segment][feature] = [prediction]
            # add subject 1-of-k vector
            subjvector = np.zeros(len(subjects))
            subjvector[i] = 1
            predictiondict[segment]['subject'] = list(subjvector)


Processing ica_feat_var_
Processing ica_feat_cov_
Processing ica_feat_corrcoef_

In [265]:
segments = list(predictiondict.keys())

In [266]:
X = np.array([])[np.newaxis]
for i,segment in enumerate(segments):
    row = []
    for feature in features+['subject']:
        row += predictiondict[segment][feature]
    try:
        X = np.vstack([X,np.array(row)[np.newaxis]])
    except:
        X = np.array(row)[np.newaxis]

In [267]:
import csv

In [268]:
predictiondict = {}
for segment,fvector in zip(segments,X):
    predictiondict[segment] = forest.predict_proba(fvector)

In [269]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1][-1]])

In [270]:
!head output/protosubmission.csv












In [271]:
!wc -l output/protosubmission.csv


3936 output/protosubmission.csv

In [272]:
!wc -l output/sampleSubmission.csv


3935 output/sampleSubmission.csv

Wrong length, but I submitted it anyway and it got 0.53141. After adding 1-of-k encoded subjects and weightings got 0.56016. So, that should hold when I add more features, or do the above in a smarter way.

Saving this operation as a script

We will probably want to be able to do this again, so we should save this operation as a function in utils.py. Will do this once it works.