This might be an interesting type of classifier to try. The plan is essentially in the title: transform the data with a totally random trees embedding into a sparse binary representation, then go through the embedded features and test for mutual information, removing features that are strongly dependent on each other to preserve the independence assumption of Naive Bayes. Finally, run Naive Bayes to classify the data.
Unfortunately, this can't easily be plugged into the existing code, so it's probably best to prototype it quickly to see if it will be interesting.
Going to use the same features as the current best-performing classifier, the SVC.
Specifically, just the one feature that seems to contribute the most to its performance: mvar_csp
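Before running anything on the real features, here is a rough sketch of the idea on synthetic data, just to show the shape of the approach; the dataset, parameters and split below are made up, and only the embed-then-Naive-Bayes structure mirrors the plan above.

# Illustrative sketch only (synthetic data, made-up parameters): embed with
# totally random trees, then classify the binary leaf indicators with
# Bernoulli Naive Bayes.
import sklearn.datasets
import sklearn.ensemble
import sklearn.naive_bayes
import sklearn.pipeline
import sklearn.metrics

X_toy, y_toy = sklearn.datasets.make_classification(n_samples=500, n_features=20,
                                                    random_state=0)

sketch = sklearn.pipeline.Pipeline([
    # each sample lands in one leaf per tree -> sparse binary features
    ('hsh', sklearn.ensemble.RandomTreesEmbedding(n_estimators=50, max_depth=3,
                                                  random_state=0)),
    # Bernoulli NB is the natural fit for 0/1 indicator features
    ('cls', sklearn.naive_bayes.BernoulliNB()),
])

split = 350
sketch.fit(X_toy[:split], y_toy[:split])
probs = sketch.predict_proba(X_toy[split:])[:, 1]
print(sklearn.metrics.roc_auc_score(y_toy[split:], probs))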
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 12
plt.rcParams['axes.grid'] = True
plt.set_cmap('brg')
In [2]:
cd ..
In [3]:
from python import utils
In [8]:
with open("probablygood.gavin.json") as f:
settings = utils.json.load(f)
In [9]:
settings['FEATURES'] = [feature for feature in settings['FEATURES'] if 'mvar' in feature]
In [11]:
data = utils.get_data(settings)
In [12]:
with open("segmentMetadata.json") as f:
meta = utils.json.load(f)
In [13]:
da = utils.DataAssembler(settings,data,meta)
In [14]:
X,y = da.composite_tiled_training()
In [15]:
X.shape
Out[15]:
First, we will fill the missing data with feature means, then apply a standard scaler. After that, we can apply the totally random trees embedding and look at the mutual information of the resulting features.
After the totally random trees embedding we can perform classification with Naive Bayes.
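It is worth being clear about what the embedding produces: each tree routes a sample to exactly one leaf, and the transform returns a sparse one-hot indicator over all leaves of all trees, so the column count is roughly the number of estimators times the leaves per tree. A tiny illustration, with toy data and illustrative parameters only:

# Toy illustration of the RandomTreesEmbedding output format: a sparse matrix
# of 0/1 leaf indicators, one block of columns per tree (numbers are made up).
import numpy as np
import sklearn.ensemble

rng = np.random.RandomState(0)
X_toy = rng.randn(10, 4)

emb = sklearn.ensemble.RandomTreesEmbedding(n_estimators=5, max_depth=2,
                                            random_state=0)
X_emb = emb.fit_transform(X_toy)

print(X_emb.shape)          # (10, total leaves across the 5 trees)
print(X_emb.toarray()[:2])  # binary rows with one active leaf per tree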
In [18]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
In [21]:
imputer = sklearn.preprocessing.Imputer()
scaler = sklearn.preprocessing.StandardScaler()
hasher = sklearn.ensemble.RandomTreesEmbedding(n_estimators=3000,random_state=7,max_depth=5,n_jobs=-1)
pipe = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('hsh',hasher)])
In [22]:
%time pipe.fit(X,y)
Out[22]:
In [23]:
X_hashed = pipe.transform(X)
In [26]:
X_hashed.shape
Out[26]:
In [24]:
import sklearn.metrics
In [64]:
%%time
scores = []
for i in range(X_hashed.shape[1]):
    scores.append(sklearn.metrics.mutual_info_score(y, list(X_hashed[:, i].todense().flat)))
    if i > 0:
        if i % int(X_hashed.shape[1]/100) == 0:
            print(i)
In [66]:
h=plt.hist(scores,log=True)
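The scores above measure mutual information between each embedded feature and the label. The pruning step from the plan, dropping features that carry a lot of shared information so the Naive Bayes independence assumption is less badly violated, needs feature-to-feature mutual information instead, and isn't implemented here yet. A rough, untested sketch of how it could work, greedy and restricted to a random subset of columns because a full pairwise pass over thousands of columns would be slow; the function name, threshold and subset size are all hypothetical:

# Hypothetical sketch of the redundancy-pruning step: greedily keep a feature
# only if its mutual information with every already-kept feature is below a
# threshold. Restricted to a random subset of columns purely to keep the
# pairwise computation tractable; threshold and subset size are guesses.
import numpy as np
import sklearn.metrics

def prune_dependent_features(X_bin, n_candidates=200, threshold=0.05, seed=7):
    rng = np.random.RandomState(seed)
    candidates = rng.choice(X_bin.shape[1], size=n_candidates, replace=False)
    kept = []
    for j in candidates:
        col_j = np.asarray(X_bin[:, j].todense()).ravel()
        redundant = False
        for k in kept:
            col_k = np.asarray(X_bin[:, k].todense()).ravel()
            if sklearn.metrics.mutual_info_score(col_j, col_k) > threshold:
                redundant = True
                break
        if not redundant:
            kept.append(j)
    return kept

# kept_columns = prune_dependent_features(X_hashed)
# X_pruned = X_hashed[:, kept_columns]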
In [56]:
import sklearn.naive_bayes
In [57]:
nb = sklearn.naive_bayes.BernoulliNB()
In [58]:
pipe = sklearn.pipeline.Pipeline([('imp',imputer),('scl',scaler),('hsh',hasher),('cls',nb)])
In [61]:
cv = utils.Sequence_CV(da.composite_training_segments,meta)
In [63]:
%%time
for train, test in cv:
    pipe.fit(X[train], y[train])
    prds = pipe.predict_proba(X[test])
    print(sklearn.metrics.roc_auc_score(y[test], prds[:, 1]))
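If the pruning sketch above turned out to help, one way to wire it into the cross-validation would be a small transformer that fixes its kept columns at fit time and sits between the embedding and the classifier. Again purely a sketch, reusing the hypothetical prune_dependent_features from above and not yet run:

# Hypothetical sketch: wrap the pruning as a transformer so it can be
# cross-validated as part of the pipeline (reuses prune_dependent_features).
import sklearn.base

class DependencyPruner(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, n_candidates=200, threshold=0.05, seed=7):
        self.n_candidates = n_candidates
        self.threshold = threshold
        self.seed = seed

    def fit(self, X, y=None):
        # choose the columns to keep once, on the training fold only
        self.kept_ = prune_dependent_features(X, self.n_candidates,
                                              self.threshold, self.seed)
        return self

    def transform(self, X):
        return X[:, self.kept_]

pruned_pipe = sklearn.pipeline.Pipeline([('imp', imputer), ('scl', scaler),
                                         ('hsh', hasher),
                                         ('prn', DependencyPruner()),
                                         ('cls', nb)])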