Training the Classifier

The classification task is as follows: given a sequence of 500 raw events, identify which genome the read came from.
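
As a rough sketch of the data layout the later cells assume (the values below are illustrative, not taken from the actual dataset):

import numpy as np

# Hypothetical example: X holds one row of 500 raw event values per read,
# y holds one integer label per read, one class per source genome
# (three classes here, matching the softmax output layer further down).
X = np.random.randn(4, 500)   # 4 example reads, 500 events each
y = np.array([0, 2, 1, 0])    # integer genome labels
print(X.shape, y.shape)       # (4, 500) (4,)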

Warning: This example is not yet polished to the point that it is easy to run. Patience please!


In [2]:
import porekit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import random
import h5py
from sklearn.preprocessing import OneHotEncoder

In [3]:
enc = OneHotEncoder()
def transform_y(y):
    # One-hot encode integer class labels; OneHotEncoder expects a 2D array.
    y = y.reshape(len(y), 1)
    return enc.fit_transform(y).toarray()
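
For reference, a quick illustration of what transform_y produces (not a cell from the original notebook):

# Illustrative only: integer class labels become one-hot rows.
transform_y(np.array([0, 1, 2, 1]))
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]])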

In [4]:
def transform_x(x, mean, std):
    # Add a channel axis and standardize the events with the given mean/std.
    n, m = x.shape
    x.shape = (n, m, 1)
    x = (x - mean) / std
    return x

The data has been saved to an HDF5 file.


In [5]:
h5f = h5py.File('gclassify_11.h5', 'r')
training_X = h5f['training/X'][:]
# Normalization statistics come from the training set only and are
# reused for the validation set below.
mean, std = training_X.mean(), training_X.std()
training_X = transform_x(training_X, mean, std)

training_y = transform_y(h5f['training/y'][:])
training_yc = h5f['training/y'][:]
validation_X = transform_x(h5f['validation/X'][:], mean, std)
validation_y = transform_y(h5f['validation/y'][:])
validation_yc = h5f['validation/y'][:]
h5f.close()
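
A quick shape check after loading can catch mismatches early (a sketch; the actual dataset sizes are not shown here):

# Exact sizes depend on the hdf5 file.
print(training_X.shape, training_y.shape)      # e.g. (n_train, 500, 1) (n_train, 3)
print(validation_X.shape, validation_y.shape)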

Keras is a high-level deep learning library; this notebook uses it with the Theano backend.


In [6]:
from keras.models import Sequential
from keras.optimizers import SGD
from keras.layers.core import Dense, Activation, Dropout,  Flatten
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.regularizers import l2, activity_l2


Using Theano backend.
/home/andi/anaconda3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")

The model is relatively "shallow" as far as deep learning goes: a single 1D convolution, max pooling, and two dense layers:


In [7]:
model = Sequential()
model.add(Convolution1D(nb_filter=32,
                        filter_length=3,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1,
                        input_shape=(500,1),
                        #W_regularizer= l2(0.01),
                       ))
model.add(Dropout(0.2))
model.add(MaxPooling1D(pool_length=4))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(output_dim=100, init="glorot_uniform"))
model.add(Activation("relu"))
model.add(Dense(output_dim=3, init="glorot_uniform"))
model.add(Activation("softmax"))

In [ ]:
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, momentum=0.001, nesterov=True))

In [ ]:
for i in range(120):
    model.fit(training_X, training_y, nb_epoch=1, batch_size=64)
    # Estimate training accuracy on a random subsample of 1000 examples.
    shuffle = np.random.choice(np.arange(len(training_X)), 1000, replace=False)
    y = model.predict(training_X[shuffle])
    accuracy_training = np.sum(np.isclose(y.argmax(axis=1), training_yc[shuffle])) / len(y)
    # Accuracy on the full validation set.
    y = model.predict(validation_X)
    accuracy_validation = np.sum(np.isclose(y.argmax(axis=1), validation_yc)) / len(y)
    print("%.2f %.2f" % (accuracy_training, accuracy_validation))


Epoch 1/1
65280/90000 [====================>.........] - ETA: 30s - loss: 0.3974

The model doesn't seem to overfit very much: accuracy on the training data is higher than on the validation data, but the validation accuracy does not degrade as training continues.

My most recent run achieved 97% training and 71% validation accuracy.
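
To make the overfitting behaviour easier to judge, the per-epoch accuracies could be collected in the training loop and plotted afterwards; a minimal sketch using the matplotlib import from the top of the notebook (the history lists are a hypothetical addition, not part of the original loop):

# Hypothetical: assumes the training loop appended each epoch's accuracies
# to history_train and history_val instead of only printing them.
history_train, history_val = [], []
plt.plot(history_train, label="training accuracy")
plt.plot(history_val, label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()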

