The aim here is to analyse the features we have at the moment using clustering, to figure out which are likely to be useful: hand-picking features based on what we see in unsupervised clustering. The idea is to avoid the overfitting that supervised feature selection can introduce.
The dimensionality reduction here is largely based on the manifold learning examples page in the scikit-learn docs.
Starting with a single subject, looking at different features:
In [54]:
import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams['axes.grid'] = True
plt.set_cmap('brg')
In [55]:
import random
rnd = random.Random()
rnd.seed(0)
In [4]:
cd ..
In [5]:
from python import utils
In [7]:
with open("settings/forestselection_gavin.json") as f:
    settings = utils.json.load(f)
In [8]:
data_dicts = []
# load each feature's data separately by restricting settings['FEATURES']
# to one feature at a time (iterating over a copy of the list)
for feature in settings['FEATURES'][:]:
    settings['FEATURES'] = [feature]
    data_dicts.append(utils.get_data(settings))
In [10]:
# refresh settings
with open("settings/forestselection_gavin.json") as f:
    settings = utils.json.load(f)
In [11]:
# first, need to load metadata
with open("segmentMetadata.json") as f:
    metadata = utils.json.load(f)
In [12]:
settings['SUBJECTS']
Out[12]:
Choosing Dog_2:
In [13]:
feature_sets = {}
hour_class_vectors = {}
for data, feature in zip(data_dicts, settings['FEATURES'][:]):
    settings['FEATURES'] = [feature]
    da = utils.DataAssembler(settings, data, metadata)
    X_train, y = da.build_training(settings['SUBJECTS'][-1])
    X_test = da.build_test(settings['SUBJECTS'][-1])
    # stack training and test together, labelling test segments as class 2
    feature_sets[feature] = (np.vstack([X_train, X_test]),
                             np.hstack([y, np.array([2]*X_test.shape[0])]))
    # build a parallel vector of hour IDs, using 8 to mark test segments
    hrclsvector = []
    for segment in np.hstack([da.training_segments, da.test_segments]):
        if 'test' in segment:
            hrclsvector.append(8)
        else:
            segment = segment.split('.')[0]
            hrclsvector.append(metadata[segment]['hourID'])
    hour_class_vectors[feature] = hrclsvector
In [15]:
# refresh settings
with open("settings/forestselection_gavin.json") as f:
    settings = utils.json.load(f)
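As a quick sanity check (an illustrative sketch, not one of the original cells), we can inspect the shape of one assembled matrix and its label counts; the 0/1 labels come from build_training, and 2 marks the stacked-on test segments:

X, y = feature_sets[settings['FEATURES'][0]]
print(X.shape)                           # (n_segments, n_features)
print(np.unique(y, return_counts=True))  # labels 0, 1 (training) and 2 (test)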
To find out which method is likely to be the best to use, we should try the different ones available in scikit-learn. There is a good page on manifold learning in the docs, so I am going to try all of the options described there.
Presumably I need to scale all of my feature matrices before I begin; I will use the StandardScaler for this.
In [16]:
import sklearn.preprocessing
In [17]:
scaler = sklearn.preprocessing.StandardScaler()
In [18]:
for feature in settings['FEATURES']:
    X = scaler.fit_transform(feature_sets[feature][0])
    feature_sets[feature] = [X, feature_sets[feature][1]]
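Each method below repeats the same pattern: fit a reducer on each feature's scaled matrix and collect the 2D projections. For reference, this could be wrapped in a small helper (a hypothetical sketch, not what was actually run in this notebook):

def project_all(reducer):
    # fit and project each feature's scaled matrix with the given reducer
    projected = {}
    for feature in settings['FEATURES']:
        projected[feature] = reducer.fit_transform(feature_sets[feature][0])
    return projected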
In [19]:
import sklearn.random_projection
In [20]:
rp = sklearn.random_projection.SparseRandomProjection(n_components=2, random_state=42)
In [21]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = rp.fit_transform(feature_sets[feature][0])
In [22]:
import itertools
In [50]:
sbpltinds = [(0,0),(0,1),(0,2),(1,0),(1,1)]
In [121]:
def plotfeatures(projected):
    f, axarr = plt.subplots(2, 3)
    for feature, (i, j) in zip(settings['FEATURES'], sbpltinds):
        axarr[i, j].set_title(feature)
        scatter = [[], [str(c) for c in set(feature_sets[feature][1])]]
        # plot each class in its own colour: 0 red, 1 blue, 2 (test) yellow
        cdict = {0: 'r', 1: 'b', 2: 'y'}
        for c in set(feature_sets[feature][1]):
            p_plot = projected[feature][feature_sets[feature][1] == c]
            scatter[0].append(axarr[i, j].scatter(p_plot[:, 0], p_plot[:, 1],
                                                  c=cdict[int(c)], alpha=0.5))
        axarr[i, j].get_xaxis().set_visible(False)
        axarr[i, j].get_yaxis().set_visible(False)
        axarr[i, j].legend(scatter[0], scatter[1], loc=3)
    # only five features, so drop the unused sixth subplot
    f.delaxes(axarr[1, 2])
In [57]:
plotfeatures(projected)
In [58]:
def plotfeatures_hrs(projected):
    f, axarr = plt.subplots(2, 3)
    for feature, (i, j) in zip(settings['FEATURES'], sbpltinds):
        axarr[i, j].set_title(feature)
        p_plot = projected[feature]
        # colour each point by the hour ID of its segment (8 for test)
        axarr[i, j].scatter(p_plot[:, 0], p_plot[:, 1],
                            c=hour_class_vectors[feature], alpha=0.5)
        axarr[i, j].get_xaxis().set_visible(False)
        axarr[i, j].get_yaxis().set_visible(False)
    f.delaxes(axarr[1, 2])
In [59]:
plotfeatures_hrs(projected)
In [60]:
import sklearn.decomposition
In [61]:
pca = sklearn.decomposition.PCA(n_components=2)
In [62]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = pca.fit_transform(feature_sets[feature][0])
In [63]:
plotfeatures(projected)
In [64]:
plotfeatures_hrs(projected)
Isomap performs non-linear dimensionality reduction through Isometric Mapping, and there is a Wikipedia page describing it.
In [65]:
import sklearn.manifold
In [66]:
# just using the default value from the example code
n_neighbours = 30
In [67]:
isomap = sklearn.manifold.Isomap(n_neighbours,n_components=2)
In [68]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = isomap.fit_transform(feature_sets[feature][0])
In [69]:
plotfeatures(projected)
In [70]:
plotfeatures_hrs(projected)
Locally Linear Embedding has its own homepage and is described on the Wikipedia page for nonlinear dimensionality reduction.
In [126]:
lle = sklearn.manifold.LocallyLinearEmbedding(n_neighbours,n_components=2,random_state=42)
In [127]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = lle.fit_transform(feature_sets[feature][0])
In [128]:
plotfeatures(projected)
plt.savefig("/home/gavin/Images/hslleclass.png")
In [129]:
plotfeatures_hrs(projected)
plt.savefig("/home/gavin/Images/hsllehour.png")
In [75]:
mds = sklearn.manifold.MDS(n_components=2,verbose=2,max_iter=100,n_jobs=-1)
In [76]:
#%%time
#projected = {}
#for feature in settings['FEATURES']:
#    projected[feature] = mds.fit_transform(feature_sets[feature][0])
This took too long; I will come back to it.
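One option for coming back to it would be to fit MDS on a random subsample of rows, since its cost grows roughly quadratically with the number of samples. A sketch (not run here; the class and hour vectors would need the same subsetting before plotting, and n_sub is an assumed subsample size):

n_sub = 500  # assumed subsample size
projected = {}
for feature in settings['FEATURES']:
    X = feature_sets[feature][0]
    idx = rnd.sample(range(X.shape[0]), min(n_sub, X.shape[0]))
    projected[feature] = mds.fit_transform(X[idx])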
Random Trees Embedding uses a random forest to obtain a low-dimensional representation of the data. Essentially, the forest is used to create a sparse, high-dimensional binary representation, which is then reduced to two dimensions with truncated SVD. As the scikit-learn docs describe it:
An unsupervised transformation of a dataset to a high-dimensional sparse representation. A datapoint is coded according to which leaf of each tree it is sorted into. Using a one-hot encoding of the leaves, this leads to a binary coding with as many ones as trees in the forest.
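As a sanity check of that description (a sketch on hypothetical stand-in data, not one of the original cells), every row of the hashed representation should contain exactly one nonzero entry per tree:

import numpy as np
import sklearn.ensemble
X_demo = np.random.rand(10, 4)  # hypothetical stand-in data
h = sklearn.ensemble.RandomTreesEmbedding(n_estimators=50, random_state=0)
X_coded = h.fit_transform(X_demo)  # sparse binary matrix, one column per leaf
assert (X_coded.sum(axis=1) == 50).all()  # each sample lands in one leaf per tree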
In [116]:
import sklearn.ensemble
In [130]:
hasher = sklearn.ensemble.RandomTreesEmbedding(n_estimators=1000,random_state=7,max_depth=5)
pca = sklearn.decomposition.TruncatedSVD(n_components=2)
In [131]:
projected = {}
for feature in settings['FEATURES']:
    X_transformed = hasher.fit_transform(feature_sets[feature][0])
    projected[feature] = pca.fit_transform(X_transformed)
In [132]:
plotfeatures(projected)
plt.savefig("/home/gavin/Images/hstrfclass.png")
In [133]:
plotfeatures_hrs(projected)
plt.savefig("/home/gavin/Images/hstrfhour.png")
Here's the section of the manifold learning page on Spectral Embedding.
In [134]:
embedder = sklearn.manifold.SpectralEmbedding(n_components=2,random_state=7,eigen_solver="arpack")
In [135]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = embedder.fit_transform(feature_sets[feature][0])
In [136]:
plotfeatures(projected)
plt.savefig("/home/gavin/Images/hsspectralclass.png")
In [137]:
plotfeatures_hrs(projected)
plt.savefig("/home/gavin/Images/hsspectralhour.png")
The section of the manifold learning page on t-SNE is actually very good.
In [86]:
tsne = sklearn.manifold.TSNE(n_components=2,random_state=7)
In [87]:
projected = {}
for feature in settings['FEATURES']:
    projected[feature] = tsne.fit_transform(feature_sets[feature][0])
In [88]:
plotfeatures(projected)
In [89]:
plotfeatures_hrs(projected)