Text Classification

This text classification example:

  • trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis
  • uses Verta's Python client to log observations and artifacts

Set Up Environment

This notebook has been tested with the following package versions:
(you may need to change pip to pip3, depending on your own Python environment)


In [1]:
# Python 3.6
!pip install verta
!pip install matplotlib==3.1.1
!pip install tensorflow==2.0.0-beta1
!pip install tensorflow-hub==0.5.0
!pip install tensorflow-datasets==1.0.2

Set Up Verta


In [2]:
HOST = 'app.verta.ai'

PROJECT_NAME = 'Text-Classification'
EXPERIMENT_NAME = 'RNN'

In [3]:
# import os
# os.environ['VERTA_EMAIL'] = 
# os.environ['VERTA_DEV_KEY'] =

In [4]:
from verta import Client
from verta.utils import ModelAPI

client = Client(HOST, use_git=False)

proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)
run = client.set_experiment_run()

Imports


In [5]:
from __future__ import absolute_import, division, print_function, unicode_literals
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import json

import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow as tf

Create a helper function to plot graphs:


In [6]:
def plot_graphs(history, string, run, plot_title):
    # plot the training and validation curves for the given metric
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    # log the figure to the Verta run as an image, then display it
    run.log_image(plot_title, plt)
    plt.show()

Set up the input pipeline

The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.

Download the dataset using TFDS. The dataset comes with an inbuilt subword tokenizer.


In [7]:
# loading the dataset

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

As this is a subword tokenizer, it can be passed any string, and the tokenizer will tokenize it.


In [8]:
tokenizer = info.features['text'].encoder
print('Vocabulary size: {}'.format(tokenizer.vocab_size))

In [9]:
sample_string = 'The latest Marvel movie - Endgame was amazing!'

tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string

The tokenizer encodes the string by breaking it into subwords if the word is not in its dictionary.


In [10]:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))

In [11]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

# shuffle the training data and pad each batch to the length of its longest sequence
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)

test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)

Create the model

Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
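As a quick aside (a minimal sketch, not part of the original notebook; the tiny vocabulary and layer below are illustrative only), the index lookup performed by an embedding layer returns the same vector as multiplying a one-hot encoding by the embedding matrix, just without materializing the one-hot vector:

import tensorflow as tf

# illustrative only: a tiny embedding with a 10-word vocabulary and 4-dim vectors
tiny_embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=4)

word_index = tf.constant([3])
lookup = tiny_embedding(word_index)                     # direct index lookup (also builds the layer)

one_hot = tf.one_hot(word_index, depth=10)              # shape (1, 10)
matmul = tf.matmul(one_hot, tiny_embedding.embeddings)  # equivalent result, but wasteful

print(tf.reduce_max(tf.abs(lookup - matmul)).numpy())   # 0.0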

A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input—and then to the next.

The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the output. This helps the RNN to learn long range dependencies.
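For reference (again a sketch, not from the original notebook; the dummy batch shape is made up), the default merge_mode='concat' means the forward and backward outputs of a 64-unit LSTM are concatenated into a 128-dimensional vector per example:

import tensorflow as tf

# illustrative shape check: (batch, timesteps, features) in, (batch, 2 * units) out
bidi = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
dummy_batch = tf.random.uniform((2, 20, 8))
print(bidi(dummy_batch).shape)  # (2, 128)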


In [12]:
hyperparams = {
    'num_epochs': 10,
    'optimizer': 'adam',
    'loss': 'binary_crossentropy',
    'vocab_size': tokenizer.vocab_size,
    'metrics': 'accuracy'
}

# logging hyperparameters

run.log_hyperparameters(hyperparams)

In [13]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Compile the Keras model to configure the training process:


In [14]:
model.compile(loss=hyperparams['loss'],
              optimizer=hyperparams['optimizer'],
              metrics=[hyperparams['metrics']])

Train the model


In [15]:
# called at the end of each epoch - logs training and validation loss/accuracy as observations for the run

class LossAndErrorLoggingCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print('The average loss for epoch {} is {:7.2f}, accuracy is {:7.2f}.'.format(epoch, logs['loss'], logs['accuracy']))
        run.log_observation("train_loss", float(logs['loss']))
        run.log_observation("train_acc", float(logs['accuracy']))
        run.log_observation("val_loss", float(logs['val_loss']))
        run.log_observation("val_acc", float(logs['val_accuracy']))

In [16]:
history = model.fit(train_dataset,
                    epochs=hyperparams['num_epochs'],
                    validation_data=test_dataset,
                    callbacks=[LossAndErrorLoggingCallback()])

Testing and Prediction


In [17]:
test_loss, test_acc = model.evaluate(test_dataset)

# logging metrics

run.log_metric('test_loss', float(test_loss))
run.log_metric('test_accuracy', float(test_acc))

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

The above model does not mask the padding applied to the sequences. This can lead to skew if the model is trained on padded sequences and tested on un-padded sequences. Ideally the model would learn to ignore the padding, but as you can see below it does have a small effect on the output.
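One common way to address this (shown here only as a hedged sketch; it is not what this notebook trains) is to let the embedding layer emit a mask via mask_zero=True, so that downstream layers skip the padded zero entries:

# sketch only: same architecture as above, but with built-in masking of padded zeros
masked_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])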

If the prediction is >= 0.5, the review is classified as positive; otherwise, it is negative.


In [18]:
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

In [19]:
def sample_predict(sentence, pad):
    tokenized_sample_pred_text = tokenizer.encode(sentence)
    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)

    predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))
    return predictions

In [20]:
# predict on a sample text without padding
sample_pred_text = 'Spiderman: Far From Home did not disappoint! I loved it!'
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

# predict on the same text with padding
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

In [21]:
# plotting graphs to see variation in accuracy and loss
plot_graphs(history, 'accuracy', run, 'epochs_vs_acc')
plot_graphs(history, 'loss', run, 'epochs_vs_loss')

Saving Models


In [22]:
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

run.log_artifact('model_summary_json', 'model.json')
run.log_artifact('model_weights', 'model.h5')
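To sanity-check the saved artifacts (a minimal sketch, not part of the original run), the architecture can be rebuilt from model.json and the weights restored from model.h5 with standard Keras calls:

# sketch: reload the architecture from JSON and the weights from HDF5
with open("model.json", "r") as json_file:
    loaded_model = tf.keras.models.model_from_json(json_file.read())
loaded_model.load_weights("model.h5")
loaded_model.compile(loss=hyperparams['loss'],
                     optimizer=hyperparams['optimizer'],
                     metrics=[hyperparams['metrics']])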