The idea in this notebook is to reduce the dimensionality of the data by transforming each individual feature with a classifier. Once we've done this it will be possible to combine the subject-specific datasets into a single global dataset. This runs the risk of overfitting, but it is also a nice way to create a global classifier.
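In other words, this is a form of stacking: each high-dimensional feature collapses to the probability a per-feature classifier assigns to the preictal class, and those probabilities become the columns of the global dataset. A minimal sketch of the idea with toy arrays (everything here is made up and separate from the real pipeline below):

import numpy as np
import sklearn.ensemble

rng = np.random.RandomState(0)
X_feat_a = rng.randn(40, 50)    # hypothetical wide feature matrix
X_feat_b = rng.randn(40, 200)   # another feature with a different dimensionality
y_toy = rng.randint(0, 2, 40)   # toy labels

meta_columns = []
for X_feat in (X_feat_a, X_feat_b):
    clf = sklearn.ensemble.RandomForestClassifier(n_estimators=10, random_state=0)
    clf.fit(X_feat, y_toy)
    # keep only the predicted probability of class 1: one column per feature
    meta_columns.append(clf.predict_proba(X_feat)[:, 1])

X_global = np.column_stack(meta_columns)   # shape (40, 2)

Predicting on the same segments the forests were trained on is exactly where the overfitting risk comes from, which is why the real loop below keeps track of cross-validation splits.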
Same initialisation steps as in other notebooks:
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()
In [2]:
cd ..
In [3]:
import train
import json
import imp
In [4]:
settings = json.load(open('SETTINGS.json', 'r'))
In [6]:
data = train.get_data(settings['FEATURES'][:3])
In [7]:
!free -m
For each feature and each subject we want to train a random forest and use it to transform the data. We also want to weight the samples appropriately, since the classes are unbalanced.
Since I'm a big fan of dictionaries, the easiest way to do this seems to be to iterate over subjects and features and save the predictions in a dictionary.
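The weighting is just inverse class frequency for the preictal class. A quick sketch of what that calculation does on some toy labels (not the real data):

import numpy as np

# toy labels: 8 interictal (0) segments and 2 preictal (1) segments
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# each preictal sample gets the inverse of its class frequency as a weight,
# so the rare class is upweighted roughly in proportion to its scarcity
weight = len(y_toy) / sum(y_toy)                     # 10 / 2 = 5.0
weights = np.array([weight if label == 1 else 1 for label in y_toy])
# weights -> [1, 1, 1, 1, 1, 1, 1, 1, 5, 5]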
In [8]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
from train import utils
In [9]:
imp.reload(utils)
Out[9]:
The code below is copied and modified from the random forest submission notes:
In [12]:
features = settings['FEATURES'][:3]
In [11]:
subjects = settings['SUBJECTS']
In [13]:
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('scl',scaler),('clf',forest)])
In [36]:
%%time
predictiondict = {}
for feature in features:
    print("Processing {0}".format(feature))
    for subj in subjects:
        # training step: build this subject's matrix for this single feature
        X,y,cv,segments = utils.build_training(subj,[feature],data)
        for i, (train, test) in enumerate(cv):
            # note: train/test shadow the imported train module, which is no longer needed
            # upweight the rare preictal class by its inverse frequency
            weight = len(y[train])/sum(y[train])
            weights = np.array([weight if label == 1 else 1 for label in y[train]])
            model.fit(X[train],y[train],clf__sample_weight=weights)
            # predict probabilities for every segment of this subject/feature
            predictions = model.predict_proba(X)
            # record, per segment and feature, which fold and split each one came from
            for name,split in [('train',train),('test',test)]:
                for segment,prediction in zip(segments[split],predictions[split]):
                    if segment not in predictiondict:
                        predictiondict[segment] = {}
                    predictiondict[segment][feature] = {}
                    predictiondict[segment][feature][i] = (name,prediction)
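After this loop each segment maps to a nested dictionary keyed by feature and then by cross-validation fold, holding a (split name, predicted probability pair) tuple. Because the per-feature dictionary is recreated on every fold, only the last fold's entry survives, which is why indexing with the single remaining key works below. Roughly (segment and feature names invented, values made up):

# predictiondict['Dog_1_preictal_segment_0001.mat'] == {
#     'some_feature':    {2: ('train', array([ 0.23,  0.77]))},
#     'another_feature': {2: ('test',  array([ 0.81,  0.19]))},
# }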
Next, we create the full training set for a single train/test iteration:
In [37]:
segments = list(predictiondict.keys())
In [38]:
predictiondict[segments[0]].keys()
Out[38]:
In [40]:
rows = []
train,test = [],[]
for i,segment in enumerate(segments):
    row = []
    for feature in features:
        # only one fold key survives per (segment, feature), so take it
        cv = list(predictiondict[segment][feature].keys())
        # each entry is a (split name, probability pair); keep P(preictal)
        row.append(predictiondict[segment][feature][cv[0]][-1][-1])
    # assign the segment to train or test according to its stored split label
    name = predictiondict[segment][feature][cv[0]][0]
    if name == 'train':
        train.append(i)
    elif name == 'test':
        test.append(i)
    else:
        print("segment {0} does not have a valid label: {1}".format(i,name))
    rows.append(row)
X = np.array(rows)
In [41]:
X
Out[41]:
In [42]:
y = [1 if 'preictal' in segment else 0 for segment in segments]
In [43]:
y = np.array(y)
In [44]:
len(y)
Out[44]:
In [45]:
len(X)
Out[45]:
In [46]:
len(segments)
Out[46]:
In [48]:
import sklearn.metrics  # not imported above; needed for roc_auc_score
model.fit(X[train],y[train])
predictions = model.predict_proba(X[test])
score = sklearn.metrics.roc_auc_score(y[test],predictions[:,1])
print(score)
In [52]:
sum(predictions)
Out[52]: