The aim here is to analyse the features we have at the moment using clustering, to figure out which are likely to be useful: hand-picking features based on what we see in unsupervised clustering. The idea is to avoid the overfitting that supervised feature selection can introduce.
The dimensionality reduction here is largely based on the manifold learning examples page in the scikit-learn docs.
Starting with a single subject, looking at different features:
In [54]:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams['axes.grid'] = True
plt.set_cmap('brg')
In [55]:
import random
rnd = random.Random()
rnd.seed(0)
In [4]:
cd ..
In [5]:
from python import utils
In [7]:
with open("settings/forestselection_gavin.json") as f:
    settings = utils.json.load(f)
In [8]:
data_dicts = []
# load each feature's data separately by restricting settings['FEATURES']
# to one feature at a time (iterating over a copy of the list)
for feature in settings['FEATURES'][:]:
    settings['FEATURES'] = [feature]
    data_dicts.append(utils.get_data(settings))
In [10]:
# refresh settings
with open("settings/forestselection_gavin.json") as f:
    settings = utils.json.load(f)
In [11]:
# first, need to load metadata
with open("segmentMetadata.json") as f:
    metadata = utils.json.load(f)
In [12]:
settings['SUBJECTS']
Out[12]:
Choosing Dog_2:
In [13]:
feature_sets = {}
hour_class_vectors = {}
for data, feature in zip(data_dicts, settings['FEATURES'][:]):
    settings['FEATURES'] = [feature]
    da = utils.DataAssembler(settings, data, metadata)
    X_train, y = da.build_training(settings['SUBJECTS'][-1])
    X_test = da.build_test(settings['SUBJECTS'][-1])
    # stack training and test together, labelling test segments as class 2
    feature_sets[feature] = (np.vstack([X_train, X_test]),
                             np.hstack([y, np.array([2]*X_test.shape[0])]))
    # build a parallel vector of hour IDs, using 8 to mark test segments
    hrclsvector = []
    for segment in np.hstack([da.training_segments, da.test_segments]):
        if 'test' in segment:
            hrclsvector.append(8)
        else:
            segment = segment.split('.')[0]
            hrclsvector.append(metadata[segment]['hourID'])
    hour_class_vectors[feature] = hrclsvector
In [15]:
# refresh settings
with open("settings/forestselection_gavin.json") as f:
    settings = utils.json.load(f)
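As a quick sanity check (an illustrative sketch, not one of the original cells), we can inspect the shape of one assembled matrix and its label counts; the 0/1 labels come from build_training, and 2 marks the stacked-on test segments:

X, y = feature_sets[settings['FEATURES'][0]]
print(X.shape)                           # (n_segments, n_features)
print(np.unique(y, return_counts=True))  # labels 0, 1 (training) and 2 (test)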
To find out which method is likely to be the best to use, we should try the different ones available in scikit-learn. There is a good page on manifold learning in the docs, so I am going to try all of the options described there.
Presumably I need to scale all of my feature matrices before I begin; I will use the StandardScaler for this.
In [16]:
import sklearn.preprocessing
In [17]:
scaler = sklearn.preprocessing.StandardScaler()
In [18]:
for feature in settings['FEATURES']:
    X = scaler.fit_transform(feature_sets[feature][0])
    feature_sets[feature] = [X, feature_sets[feature][1]]
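Each method below repeats the same pattern: fit a reducer on each feature's scaled matrix and collect the 2D projections. For reference, this could be wrapped in a small helper (a hypothetical sketch, not what was actually run in this notebook):

def project_all(reducer):
    # fit and project each feature's scaled matrix with the given reducer
    projected = {}
    for feature in settings['FEATURES']:
        projected[feature] = reducer.fit_transform(feature_sets[feature][0])
    return projected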
In [19]:
import sklearn.random_projection
In [20]:
rp = sklearn.random_projection.SparseRandomProjection(n_components=2, random_state=42)
In [21]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = rp.fit_transform(feature_sets[feature][0])
In [22]:
import itertools
In [50]:
sbpltinds = [(0,0),(0,1),(0,2),(1,0),(1,1)]
In [121]:
def plotfeatures(projected):
    f, axarr = plt.subplots(2, 3)
    for feature, (i, j) in zip(settings['FEATURES'], sbpltinds):
        axarr[i, j].set_title(feature)
        scatter = [[], [str(c) for c in set(feature_sets[feature][1])]]
        # plot each class in its own colour: 0 red, 1 blue, 2 (test) yellow
        cdict = {0: 'r', 1: 'b', 2: 'y'}
        for c in set(feature_sets[feature][1]):
            p_plot = projected[feature][feature_sets[feature][1] == c]
            scatter[0].append(axarr[i, j].scatter(p_plot[:, 0], p_plot[:, 1],
                                                  c=cdict[int(c)], alpha=0.5))
        axarr[i, j].get_xaxis().set_visible(False)
        axarr[i, j].get_yaxis().set_visible(False)
        axarr[i, j].legend(scatter[0], scatter[1], loc=3)
    # only five features, so drop the unused sixth subplot
    f.delaxes(axarr[1, 2])
In [57]:
plotfeatures(projected)
In [58]:
def plotfeatures_hrs(projected):
    f, axarr = plt.subplots(2, 3)
    for feature, (i, j) in zip(settings['FEATURES'], sbpltinds):
        axarr[i, j].set_title(feature)
        p_plot = projected[feature]
        # colour each point by the hour ID of its segment (8 for test)
        axarr[i, j].scatter(p_plot[:, 0], p_plot[:, 1],
                            c=hour_class_vectors[feature], alpha=0.5)
        axarr[i, j].get_xaxis().set_visible(False)
        axarr[i, j].get_yaxis().set_visible(False)
    f.delaxes(axarr[1, 2])
In [59]:
plotfeatures_hrs(projected)
In [60]:
import sklearn.decomposition
In [61]:
pca = sklearn.decomposition.PCA(n_components=2)
In [62]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = pca.fit_transform(feature_sets[feature][0])
In [63]:
plotfeatures(projected)
In [64]:
plotfeatures_hrs(projected)
Isomap performs non-linear dimensionality reduction through Isometric Mapping, and there is a Wikipedia page describing it.
In [65]:
import sklearn.manifold
In [66]:
# just using the default value from the example code
n_neighbours = 30
In [67]:
isomap = sklearn.manifold.Isomap(n_neighbours,n_components=2)
In [68]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = isomap.fit_transform(feature_sets[feature][0])
In [69]:
plotfeatures(projected)
In [70]:
plotfeatures_hrs(projected)
Locally Linear Embedding has its own homepage and is described on the Wikipedia page for nonlinear dimensionality reduction.
In [126]:
lle = sklearn.manifold.LocallyLinearEmbedding(n_neighbours,n_components=2,random_state=42)
In [127]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = lle.fit_transform(feature_sets[feature][0])
In [128]:
plotfeatures(projected)
plt.savefig("/home/gavin/Images/hslleclass.png")
In [129]:
plotfeatures_hrs(projected)
plt.savefig("/home/gavin/Images/hsllehour.png")
In [75]:
mds = sklearn.manifold.MDS(n_components=2,verbose=2,max_iter=100,n_jobs=-1)
In [76]:
#%%time
#projected = {}
#for feature in settings['FEATURES']:
#    projected[feature] = mds.fit_transform(feature_sets[feature][0])
This took too long; I will come back to it.
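One option for coming back to it would be to fit MDS on a random subsample of rows, since its cost grows roughly quadratically with the number of samples. A sketch (not run here; the class and hour vectors would need the same subsetting before plotting, and n_sub is an assumed subsample size):

n_sub = 500  # assumed subsample size
projected = {}
for feature in settings['FEATURES']:
    X = feature_sets[feature][0]
    idx = rnd.sample(range(X.shape[0]), min(n_sub, X.shape[0]))
    projected[feature] = mds.fit_transform(X[idx])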
Random Trees Embedding uses a random forest to obtain a low-dimensional representation of the data. Essentially, the forest is used to create a sparse, high-dimensional binary representation, which is then reduced to two dimensions with truncated SVD. As the scikit-learn docs describe it:
An unsupervised transformation of a dataset to a high-dimensional sparse representation. A datapoint is coded according to which leaf of each tree it is sorted into. Using a one-hot encoding of the leaves, this leads to a binary coding with as many ones as trees in the forest.
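As a sanity check of that description (a sketch on hypothetical stand-in data, not one of the original cells), every row of the hashed representation should contain exactly one nonzero entry per tree:

import numpy as np
import sklearn.ensemble
X_demo = np.random.rand(10, 4)  # hypothetical stand-in data
h = sklearn.ensemble.RandomTreesEmbedding(n_estimators=50, random_state=0)
X_coded = h.fit_transform(X_demo)  # sparse binary matrix, one column per leaf
assert (X_coded.sum(axis=1) == 50).all()  # each sample lands in one leaf per tree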
In [116]:
import sklearn.ensemble
In [130]:
hasher = sklearn.ensemble.RandomTreesEmbedding(n_estimators=1000,random_state=7,max_depth=5)
pca = sklearn.decomposition.TruncatedSVD(n_components=2)
In [131]:
projected = {}
for feature in settings['FEATURES']:
    X_transformed = hasher.fit_transform(feature_sets[feature][0])
    projected[feature] = pca.fit_transform(X_transformed)
In [132]:
plotfeatures(projected)
plt.savefig("/home/gavin/Images/hstrfclass.png")
In [133]:
plotfeatures_hrs(projected)
plt.savefig("/home/gavin/Images/hstrfhour.png")
Here's the section of the manifold learning page on Spectral Embedding.
In [134]:
embedder = sklearn.manifold.SpectralEmbedding(n_components=2,random_state=7,eigen_solver="arpack")
In [135]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = embedder.fit_transform(feature_sets[feature][0])
In [136]:
plotfeatures(projected)
plt.savefig("/home/gavin/Images/hsspectralclass.png")
In [137]:
plotfeatures_hrs(projected)
plt.savefig("/home/gavin/Images/hsspectralhour.png")
The section of the manifold learning page on t-SNE is actually very good.
In [86]:
tsne = sklearn.manifold.TSNE(n_components=2,random_state=7)
In [87]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = tsne.fit_transform(feature_sets[feature][0])
In [88]:
plotfeatures(projected)
In [89]:
plotfeatures_hrs(projected)