This might be an interesting type of classifier to try. The plan is essentially in the title: transform the data with a totally random trees embedding into a sparse binary representation. After doing this, we could go through the embedded features and test for mutual information, removing features that are strongly dependent on one another to preserve the independence assumption of Naive Bayes. Then, run Naive Bayes to classify the data.
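For reference, here is a minimal sketch of how that mutual-information pruning step could look. This is not code from the notebook: the helper name, the threshold, and the assumption of a small dense 0/1 feature array are all illustrative, and a greedy pairwise pass like this would need subsampling to be feasible on a full embedding.

import numpy as np
import sklearn.metrics

def drop_dependent_columns(X_bin, threshold=0.1):
    """Greedily keep a binary column only if its mutual information
    with every column kept so far is below the threshold."""
    kept = []
    for j in range(X_bin.shape[1]):
        if all(sklearn.metrics.mutual_info_score(X_bin[:, k], X_bin[:, j]) < threshold
               for k in kept):
            kept.append(j)
    return kept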

Unfortunately, this can't easily be plugged into the existing code, so it's probably best to prototype it quickly to see whether it's worth pursuing. Going to use the same features as the current best-performing classifier, the SVC. Specifically, just the one feature that seems to contribute the most to its performance: mvar_csp.


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 12
plt.rcParams['axes.grid'] = True
plt.set_cmap('brg')



Loading the global training set

The code below loads the settings, data, and segment metadata, then assembles them into a tiled composite training set that is easier to work with.


In [2]:
cd ..


/home/gavin/repositories/hail-seizure

In [3]:
from python import utils

In [8]:
with open("probablygood.gavin.json") as f:
    settings = utils.json.load(f)

In [9]:
settings['FEATURES'] = [feature for feature in settings['FEATURES'] if 'mvar' in feature]

In [11]:
data = utils.get_data(settings)

In [12]:
with open("segmentMetadata.json") as f:
    meta = utils.json.load(f)

In [13]:
da = utils.DataAssembler(settings,data,meta)

In [14]:
X,y = da.composite_tiled_training()

In [15]:
X.shape


Out[15]:
(7454, 24600)

Putting together the pipeline

First, we fill the missing data with means, then apply a standard scaler. After that, we can apply the totally random trees embedding; at that point it will be interesting to look at the mutual information of the resulting binary features.

Finally, we can classify the embedded data with Naive Bayes.


In [18]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble

In [21]:
imputer = sklearn.preprocessing.Imputer()
scaler = sklearn.preprocessing.StandardScaler()
hasher = sklearn.ensemble.RandomTreesEmbedding(n_estimators=3000,random_state=7,max_depth=5,n_jobs=-1)
pipe = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('hsh',hasher)])

In [22]:
%time pipe.fit(X,y)


Out[22]:
Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('hsh', RandomTreesEmbedding(max_depth=5, max_leaf_nodes=None, min_density=None,
           min_samples_leaf=1, min_samples_split=2, n_estimators=3000,
           n_jobs=1, random_state=7, sparse_output=True, verbose=0))])

In [23]:
X_hashed = pipe.transform(X)

In [26]:
X_hashed.shape


Out[26]:
(7454, 63231)
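That column count is plausible: the embedding produces one binary indicator per leaf, and 3000 trees of depth at most 5 can have at most 3000 * 2**5 = 96,000 leaves, so 63,231 columns means the trees average roughly 21 leaves each. A quick check, sketched here rather than run in the original, assuming the fitted pipeline above:

n_leaves = sum((est.tree_.children_left == -1).sum()
               for est in pipe.named_steps['hsh'].estimators_)
print(n_leaves)  # should match X_hashed.shape[1], i.e. 63231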

In [24]:
import sklearn.metrics

In [64]:
%%time
# Mutual information between each embedded binary feature and the labels,
# printing progress roughly every 1% of the columns.
scores = []
for i in range(X_hashed.shape[1]):
    scores.append(sklearn.metrics.mutual_info_score(y,list(X_hashed[:,i].todense().flat)))
    if i>0:
        if i%int(X_hashed.shape[1]/100) == 0:
            print(i)


632
1264
...
63200
CPU times: user 34min 38s, sys: 237 ms, total: 34min 38s
Wall time: 34min 36s
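These scores measure each embedded column's dependence on the labels rather than on the other columns, so they are not quite the redundancy check described in the plan, but they could be used to keep only the most informative columns. A minimal sketch, assuming the scores list computed above; the cut-off k is illustrative:

scores = np.asarray(scores)
k = 5000  # illustrative number of columns to keep
top_k = np.argsort(scores)[-k:]
X_selected = X_hashed[:, top_k]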

In [66]:
h=plt.hist(scores,log=True)



In [56]:
import sklearn.naive_bayes

In [57]:
nb = sklearn.naive_bayes.BernoulliNB()

In [58]:
pipe = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('hsh',hasher),('cls',nb)])

In [61]:
cv = utils.Sequence_CV(da.composite_training_segments,meta)

In [63]:
%%time 
# Per-fold AUC for the full imputer -> scaler -> embedding -> Bernoulli Naive Bayes pipeline.
for train,test in cv:
    pipe.fit(X[train],y[train])
    prds = pipe.predict_proba(X[test])
    print(sklearn.metrics.roc_auc_score(y[test],prds[:,1]))


0.596453302652
0.664414776902
0.618333286127
0.608484229462
0.594321714882
0.597103873542
0.618142108576
0.545518998143
0.636693822854
0.623119719408
CPU times: user 2min 33s, sys: 2min 6s, total: 4min 39s
Wall time: 4min 39s
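For reference, the ten fold AUCs printed above average to roughly 0.61.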