This text classification example trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis, logging the workflow to Verta along the way.
This notebook has been tested with the following package versions (you may need to change pip to pip3, depending on your own Python environment):
In [1]:
# Python 3.6
!pip install verta
!pip install matplotlib==3.1.1
!pip install tensorflow==2.0.0-beta1
!pip install tensorflow-hub==0.5.0
!pip install tensorflow-datasets==1.0.2
In [2]:
HOST = 'app.verta.ai'
PROJECT_NAME = 'Text-Classification'
EXPERIMENT_NAME = 'RNN'
In [3]:
# Uncomment and fill in the lines below to set your Verta credentials,
# if they are not already present in your environment.
# import os
# os.environ['VERTA_EMAIL'] =
# os.environ['VERTA_DEV_KEY'] =
In [4]:
from verta import Client
from verta.utils import ModelAPI
client = Client(HOST, use_git=False)
proj = client.set_project(PROJECT_NAME)
expt = client.set_experiment(EXPERIMENT_NAME)
run = client.set_experiment_run()
In [5]:
from __future__ import absolute_import, division, print_function, unicode_literals
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
import json
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
import tensorflow as tf
Create a helper function to plot graphs:
In [6]:
def plot_graphs(history, string, run, plot_title):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    run.log_image(plot_title, plt)  # log the figure to the Verta run
    plt.show()
The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.
Download the dataset using TFDS. The dataset comes with an inbuilt subword tokenizer.
In [7]:
# loading the dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
Because this is a subword tokenizer, any string can be passed to it and the tokenizer will tokenize it.
In [8]:
tokenizer = info.features['text'].encoder
print ('Vocabulary size: {}'.format(tokenizer.vocab_size))
In [9]:
sample_string = 'The latest Marvel movie - Endgame was amazing!'
tokenized_string = tokenizer.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))
original_string = tokenizer.decode(tokenized_string)
print ('The original string: {}'.format(original_string))
assert original_string == sample_string
The tokenizer encodes the string by breaking it into subwords if the word is not in its dictionary.
In [10]:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))
In [11]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.
This index lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
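As a quick illustration of that equivalence (a minimal sketch, not part of the original workflow; the toy vocabulary size and word index are arbitrary), an embedding lookup returns the same vector as multiplying a one-hot row by the embedding's weight matrix:
import numpy as np
import tensorflow as tf

# toy embedding layer: 8-word vocabulary, 4-dimensional vectors
embedding = tf.keras.layers.Embedding(8, 4)
word_index = 3

looked_up = embedding(tf.constant([word_index])).numpy()  # direct index lookup

one_hot = np.zeros((1, 8), dtype=np.float32)
one_hot[0, word_index] = 1.0
via_matmul = one_hot @ embedding.get_weights()[0]          # one-hot x weight matrix

assert np.allclose(looked_up, via_matmul)                  # same vector, cheaper lookup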
A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the output from one timestep into their input at the next timestep.
The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backward through the RNN layer and then concatenates the outputs. This helps the RNN learn long-range dependencies.
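For intuition (a small sketch outside the notebook's workflow; the dummy input shape is arbitrary), wrapping an LSTM in Bidirectional concatenates the forward and backward outputs, so the output width doubles:
import tensorflow as tf

lstm = tf.keras.layers.LSTM(64)
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))

dummy = tf.random.uniform((1, 10, 32))   # (batch, timesteps, features)
print(lstm(dummy).shape)                 # (1, 64)   forward only
print(bi_lstm(dummy).shape)              # (1, 128)  forward + backward, concatenated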
In [12]:
hyperparams = {
    'num_epochs': 10,
    'optimizer': 'adam',
    'loss': 'binary_crossentropy',
    'vocab_size': tokenizer.vocab_size,
    'metrics': 'accuracy'
}
# logging hyperparameters
run.log_hyperparameters(hyperparams)
In [13]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Compile the Keras model to configure the training process:
In [14]:
model.compile(loss=hyperparams['loss'],
              optimizer=hyperparams['optimizer'],
              metrics=[hyperparams['metrics']])
In [15]:
# called at the end of each epoch - logging loss, accuracy as observations for the run
class LossAndErrorLoggingCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print('The average loss for epoch {} is {:7.2f}, accuracy is {:7.2f}.'.format(
            epoch, logs['loss'], logs['accuracy']))
        run.log_observation("train_loss", float(logs['loss']))
        run.log_observation("train_acc", float(logs['accuracy']))
        run.log_observation("val_loss", float(logs['val_loss']))
        run.log_observation("val_acc", float(logs['val_accuracy']))
In [16]:
history = model.fit(train_dataset,
                    epochs=hyperparams['num_epochs'],
                    validation_data=test_dataset,
                    callbacks=[LossAndErrorLoggingCallback()])
In [17]:
test_loss, test_acc = model.evaluate(test_dataset)
# logging metrics
run.log_metric('test_loss', float(test_loss))
run.log_metric('test_accuracy', float(test_acc))
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
The above model does not mask the padding applied to the sequences. This can lead to skew if we train on padded sequences and test on unpadded sequences. Ideally the model would learn to ignore the padding, but as you can see below, the padding does have a small effect on the output.
If the prediction is >= 0.5, the sentiment is positive; otherwise it is negative.
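For reference, that decision rule can be written as a one-line helper (hypothetical; to_sentiment is not defined or used elsewhere in this notebook):
def to_sentiment(prediction):
    # apply the 0.5 threshold described above to the sigmoid output
    return 'positive' if prediction >= 0.5 else 'negative'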
In [18]:
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec
In [19]:
def sample_predict(sentence, pad):
    tokenized_sample_pred_text = tokenizer.encode(sentence)
    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)
    predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))
    return predictions
In [20]:
# predict on a sample text without padding
sample_pred_text = 'Spiderman: Far From Home did not disappoint! I loved it!'
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

# predict on the same text with padding
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
In [21]:
# plotting graphs to see variation in accuracy and loss
plot_graphs(history, 'accuracy', run, 'epochs_vs_acc')
plot_graphs(history, 'loss', run, 'epochs_vs_loss')
In [22]:
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")
# logging the serialized model and weights as artifacts of the run
run.log_artifact('model_summary_json', 'model.json')
run.log_artifact('model_weights', 'model.h5')
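As a sanity check (a minimal sketch, assuming the files written above are still on disk; retrieving them back from the Verta run is not shown here), the serialized architecture and weights can be restored with Keras:
import tensorflow as tf

# rebuild the architecture from the JSON description, then load the weights
with open("model.json") as json_file:
    restored = tf.keras.models.model_from_json(json_file.read())
restored.load_weights("model.h5")
restored.compile(loss=hyperparams['loss'],
                 optimizer=hyperparams['optimizer'],
                 metrics=[hyperparams['metrics']])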