RNN Text Generation - Advanced


Intro

This notebook builds on the RNN with Keras - Text Generation intro notebook and explores more advanced techniques, configurations and optimizations for the text generation task. We want to treat text generation as generally as possible, with the ability to use multiple models of different nature (e.g. narrative, scientific, code, poetry, conversational) for a variety of uses (e.g. writing hints, chatbots, "QA").

We make heavy use of the code and utilities already present in the repository containing this notebook.


In [ ]:
# Basic libraries import
import numpy as np
import pandas as pd
import pickle
import spacy
nlp = spacy.load('en')
import itertools
import collections

# Keras and Tensorflow
import keras.backend as K
from keras.preprocessing import sequence
from keras.models import model_from_json
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, BatchNormalization
from keras.layers.core import Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers import LSTM, TimeDistributed, RepeatVector, Input

import tensorflow as tf

# Reference local code
import os
import sys
from os.path import join
from pathlib import Path
sys.path.append(join(os.getcwd(), 'src'))

%load_ext autoreload
%autoreload 2
from utils import preprocessing
from model.textGenModel import TextGenModel

In [ ]:
training_data_folder = join(str(Path.home()), "Documents/datasets/")
model_data_folder = join(str(Path.home()), "Documents/models/tf_rnn_{}/")

Data Loading and Preprocessing

So far I have identified two distinct types of task. They are defined by our requirements on the generated text, which in turn determine how the content is pre- and post-processed.

Note that this distinction might not follow more formal standards used by the community for this task; I simply found it useful given my current requirements and understanding.

Q&A (or Sentence Level)

Here we provide a string and expect back one that relates to it semantically but is also self-contained: it makes sense by itself and possibly follows common rules of sentence structure (e.g. start and end tokens).

This exactly reflects a question answering scenario, or a more general conversational one (consider a chatbot for example).

Continuous

Here we provide a seed string and expect a response of arbitrary length that simply follows the "narrative flow" of the seed. The response does not need to be self-contained.

Implications: no start/end tokens, start/end is mostly defined by language syntax (e.g. upper case, punctuation).
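
A minimal sketch of how the two task types could translate into preprocessing. The token strings mirror the constants used in this notebook; the helper names prepare_sentence and prepare_continuous are hypothetical and not part of the repository.

```python
START_TOKEN = "SENTENCE_START"
END_TOKEN = "SENTENCE_END"
PADDING_TOKEN = "PADDING"


def prepare_sentence(tokens, max_len):
    """Sentence level: wrap one sentence in start/end tokens, then truncate or pad to max_len."""
    wrapped = [START_TOKEN] + tokens[:max_len - 2] + [END_TOKEN]
    return wrapped + [PADDING_TOKEN] * (max_len - len(wrapped))


def prepare_continuous(tokens, max_len):
    """Continuous: cut the token stream into fixed-length chunks, with no boundary tokens."""
    n_chunks = len(tokens) // max_len  # a trailing partial chunk is dropped
    return [tokens[i * max_len:(i + 1) * max_len] for i in range(n_chunks)]


demo_sentence = prepare_sentence(["how", "are", "you", "?"], max_len=8)
demo_chunks = prepare_continuous(
    "in the beginning god created the heaven and the earth".split(), max_len=4)
print(demo_sentence)
print(demo_chunks)
```

In the sentence-level case the model can learn where a response ends; in the continuous case the chunk boundaries are arbitrary and carry no meaning.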


In [ ]:
# constant tokens and params for our models
START_TOKEN = "SENTENCE_START"
END_TOKEN = "SENTENCE_END"
UNKNOWN_TOKEN = "UNKNOWN_TOKEN"
PADDING_TOKEN = "PADDING"

Sentence Level Text

[TODO]

Continuous Text


In [ ]:
vocabulary_size = 8000
sent_max_len = 20
corpus_name = 'bible'
model_data_folder = model_data_folder.format(corpus_name)

In [ ]:
with open(join(training_data_folder, "{}.txt".format(corpus_name)),
         'r', encoding='utf-8') as f:
    corpus_text = f.read()

In [ ]:
# tokenize text 
#tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
corpus_tokens = [token.orth_ for token in nlp(corpus_text)]
print('Example tokenized excerpt: {}'.format(corpus_tokens[10:20]))

In [ ]:
# get mappings and update vocabulary size
index_to_word, word_to_index = preprocessing.get_words_mappings(
                                        [corpus_tokens], # a list of sentences is expected
                                        vocabulary_size)
vocabulary_size = len(index_to_word)
print("Vocabulary size = " + str(vocabulary_size))

In [ ]:
# convert tokens to indexes (and replacing unknown words)
train_data = [word_to_index.get(w, word_to_index[UNKNOWN_TOKEN]) 
              for w in corpus_tokens]
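
The implementation of preprocessing.get_words_mappings is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it keeps the most frequent words and reserves two extra entries for padding and unknown words (the function name build_word_mappings and the index layout are assumptions):

```python
import collections

UNKNOWN_TOKEN = "UNKNOWN_TOKEN"
PADDING_TOKEN = "PADDING"


def build_word_mappings(sentences, vocabulary_size):
    """Keep the most frequent words; reserve index 0 for padding and 1 for unknown words."""
    counts = collections.Counter(w for sent in sentences for w in sent)
    most_common = [w for w, _ in counts.most_common(vocabulary_size)]
    index_to_word = [PADDING_TOKEN, UNKNOWN_TOKEN] + most_common
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    return index_to_word, word_to_index


demo_tokens = "the cat sat on the cat the mat".split()
demo_index_to_word, demo_word_to_index = build_word_mappings([demo_tokens], vocabulary_size=2)
# words outside the vocabulary fall back to the unknown-token index
demo_encoded = [demo_word_to_index.get(w, demo_word_to_index[UNKNOWN_TOKEN])
                for w in demo_tokens]
print(demo_index_to_word)
print(demo_encoded)
```

Keeping padding at index 0 fits the mask_zero=True option of the Keras Embedding layer, which treats index 0 as padding.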

In [ ]:
# create training data
# for the "continuous" case we split the token stream into fixed-length chunks;
# each output chunk is the input chunk shifted by one token (the successive words)
n_seqs = (len(train_data) - 1) // sent_max_len # keep one extra token for the last shifted output
X_train = np.array([train_data[i:i+sent_max_len] 
                    for i in range(0, n_seqs*sent_max_len, sent_max_len)])
Y_train = np.array([train_data[i:i+sent_max_len] 
                    for i in range(1, n_seqs*sent_max_len + 1, sent_max_len)])
#X_train = np.expand_dims(X_train, -1)
Y_train = np.expand_dims(Y_train, -1) # add a trailing axis, as required by the TimeDistributed output

print("Example train input sentence: {}".format(X_train[0]))
print("and related output = {}".format(Y_train[0]))
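
To make the one-step shift concrete, here is a toy version on a short list of integers standing in for word indexes. The helper name make_shifted_pairs is hypothetical.

```python
def make_shifted_pairs(token_ids, seq_len):
    """Split a token stream into (input, target) chunks; each target is its input shifted by one."""
    n_seqs = (len(token_ids) - 1) // seq_len  # leave one token for the last target
    starts = range(0, n_seqs * seq_len, seq_len)
    inputs = [token_ids[i:i + seq_len] for i in starts]
    targets = [token_ids[i + 1:i + 1 + seq_len] for i in starts]
    return inputs, targets


demo_inputs, demo_targets = make_shifted_pairs(list(range(10, 21)), seq_len=5)
print(demo_inputs)
print(demo_targets)
```

Every target position holds the token that follows the corresponding input position, so the model is trained to predict the next word at each time step.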

In [ ]:
# check shapes: X is (samples, sentence length), Y is (samples, sentence length, 1)
print(X_train.shape)
print(Y_train.shape)

In [ ]:
# export word indexes
index_to_word_path = join(model_data_folder,
                          '{}_idxs_vocab_{}.txt'.format(corpus_name, vocabulary_size))
with open(index_to_word_path, "wb") as f:
    pickle.dump(index_to_word, f)

Load Word-Embeddings

Notice that with the following approach we consider only the vocabulary of our training dataset. An improvement would be to reinstantiate the model for testing with the entirety of the original word embeddings, so that we can also handle input words not present in our training vocabulary.


In [ ]:
embedding_size = 100
embeddings = preprocessing.load_embeddings(join(training_data_folder, "glove", "glove.6B.100d.txt"),
                                           word_to_index.keys())
embeddings_matrix = preprocessing.get_embeddings_matrix(embeddings,
                                                        word_to_index,
                                                        embedding_size)

In [ ]:
embeddings_matrix.shape
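
The two preprocessing helpers used above are not shown here. A minimal sketch of what they might do, assuming GloVe's plain-text format (one word per line followed by space-separated floats) and a small random initialization for words without a pretrained vector. The function names load_glove_subset and build_embeddings_matrix are hypothetical, and plain lists stand in for NumPy arrays.

```python
import io
import random


def load_glove_subset(lines, vocabulary):
    """Parse GloVe-style lines ("word v1 v2 ..."), keeping only words in vocabulary."""
    vocabulary = set(vocabulary)
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if parts[0] in vocabulary:
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
    return embeddings


def build_embeddings_matrix(embeddings, word_to_index, embedding_size, seed=0):
    """One row per vocabulary index; words without a pretrained vector keep small random values."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_size)]
              for _ in range(len(word_to_index))]
    for word, idx in word_to_index.items():
        if word in embeddings:
            matrix[idx] = embeddings[word]
    return matrix


# an in-memory stand-in for a GloVe file
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\ndog 0.7 0.8 0.9\n")
demo_word_to_index = {"PADDING": 0, "UNKNOWN_TOKEN": 1, "the": 2, "cat": 3}
demo_embeddings = load_glove_subset(fake_glove, demo_word_to_index.keys())
demo_matrix = build_embeddings_matrix(demo_embeddings, demo_word_to_index, embedding_size=3)
print(len(demo_matrix), len(demo_matrix[0]))
```

Filtering while reading keeps memory usage proportional to the training vocabulary rather than to the full pretrained file.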

Model Training

We first consider a basic LSTM with an embedding input layer. Initially we will learn the embedding from scratch, and consider a many-to-many model (that is, sequence-in and sequence-out).

Possible improvements and stuff to try:

  • try deep LSTM
  • include attention

[TODO] Add considerations about stateful
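
One consideration for the stateful case, sketched under the assumption that the training chunks are stored in corpus order: with stateful=True, Keras carries the state of sample i in one batch over to sample i of the next batch, so consecutive chunks should sit in the same slot of successive batches and shuffling must be off. The helper stateful_chunk_order is hypothetical.

```python
def stateful_chunk_order(n_chunks, batch_size):
    """Order chunk indexes so that slot s of consecutive batches holds consecutive chunks."""
    n_batches = n_chunks // batch_size  # leftover chunks that cannot fill a slot are dropped
    return [slot * n_batches + batch
            for batch in range(n_batches)
            for slot in range(batch_size)]


demo_order = stateful_chunk_order(n_chunks=10, batch_size=3)
demo_batches = [demo_order[i:i + 3] for i in range(0, len(demo_order), 3)]
print(demo_batches)
```

In the printed batches, the first slot sees chunks 0, 1, 2 in sequence, the second slot 3, 4, 5, and so on. With NumPy arrays the same order could be applied as X_train[order] and Y_train[order] before calling fit with shuffle=False, and the state would typically be cleared with model.reset_states() between passes over the corpus.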


In [ ]:
# model parameters
hidden_size = 512
#embedding_size = 128 # already defined when loading word-embeddings
batch_size = 64
stateful = False

In [ ]:
# LSTM with embedding layer
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, 
                    #batch_input_shape=(batch_size, sent_max_len), # needed in case of stateful model
                    mask_zero=True,
                    weights=[embeddings_matrix], # use of pretrained embeddings
                    trainable=False))
model.add(BatchNormalization())
#model.add(TimeDistributed(Flatten())) # not needed if proper batch_input_shape specified before
model.add(LSTM(hidden_size, 
               return_sequences=True, 
               stateful=stateful, # if stateful model, remember to avoid batches shuffling during training
              )) # activation='relu' easily leads to loss=nan, so the default (tanh) is kept
model.add(TimeDistributed(Dense(vocabulary_size, activation='softmax')))

In [ ]:
model.summary()

In [ ]:
# compile model
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam')
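
The sparse variant of the loss takes integer word indexes as targets instead of one-hot vectors, which is why Y_train holds indexes. A toy version of the per-step computation follows; it is a hand-written illustration, not the Keras implementation, and the function name is hypothetical.

```python
import math


def sparse_crossentropy(step_probs, target_ids):
    """Mean over time steps of -log(probability assigned to the target word)."""
    losses = [-math.log(probs[target]) for probs, target in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)


# two time steps over a vocabulary of two words
demo_probs = [[0.5, 0.5], [0.25, 0.75]]
demo_targets_ids = [0, 1]
demo_loss = sparse_crossentropy(demo_probs, demo_targets_ids)
print(round(demo_loss, 4))
```

The loss drops toward zero as the softmax output puts more probability on the correct next word at every step.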

In [ ]:
# keep track of trained epoch (for when we rerun a cell)
trained_epochs = 32

In [ ]:
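model.layers[0].trainable = True # fine-tune the embeddings from now on
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam') # recompile so the change takes effect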

In [ ]:
num_epoch = 1
model.fit(X_train[:100000], # for stateful we need 
          Y_train[:100000], # number of samples divisible by the batch size
          epochs=num_epoch, 
          batch_size=batch_size, 
          shuffle=not stateful # don't shuffle if stateful 
         )
trained_epochs += num_epoch

In [ ]:
K.set_value(model.optimizer.lr, 1e-4) # update the existing variable; plain assignment does not affect the compiled model

In [ ]:
num_epoch = 10
model.fit(X_train, Y_train, 
          epochs=num_epoch, 
          batch_size=batch_size, 
          shuffle=not stateful # don't shuffle if stateful 
         )
trained_epochs += num_epoch

Encoder-Decoder Model

Encoder-decoder model. Sometimes also called Sequence-to-Sequence, although that term can also refer purely to the type of correspondence between input and output.

[TODO] Basic description of the model

Possible improvements and stuff to try:

  • include attention

In [ ]:
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size,
                    mask_zero=True,
                    weights=[embeddings_matrix], # use of pretrained embeddings
                    trainable=False))
model.add(BatchNormalization())
model.add(LSTM(hidden_size, return_sequences=False))
model.add(RepeatVector(sent_max_len)) # repeat the encoder's final output once per decoder time step
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocabulary_size, activation='softmax')))
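
The model above combines return_sequences=False on the first LSTM with RepeatVector. A toy illustration of how the two differ, using a hand-written stand-in for a recurrent layer (toy_rnn and repeat_vector are hypothetical helpers, not Keras code):

```python
def toy_rnn(inputs, return_sequences):
    """Stand-in for a recurrent layer with a single hidden unit."""
    state, outputs = 0.0, []
    for x in inputs:
        state = 0.5 * state + x  # the new state depends on the old state and the input
        outputs.append([state])
    # return_sequences=True: one output per time step; False: only the last one
    return outputs if return_sequences else outputs[-1]


def repeat_vector(vector, n):
    """Mimic RepeatVector: copy one summary vector n times along a new time axis."""
    return [list(vector) for _ in range(n)]


demo_steps = [1.0, 2.0, 4.0]
per_step = toy_rnn(demo_steps, return_sequences=True)
last_only = toy_rnn(demo_steps, return_sequences=False)
repeated = repeat_vector(last_only, 3)
print(per_step)
print(repeated)
```

Both results have one vector per time step, but return_sequences=True gives a different vector at each step, while RepeatVector feeds the decoder the same summary of the whole input at every step.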

In [ ]:
model.summary()

In [ ]:
# compile model
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam')

In [ ]:
# keep track of trained epoch (for when we rerun a cell)
trained_epochs = 11

In [ ]:
num_epoch = 1
n_train = len(X_train) - len(X_train)%batch_size # for stateful we need a number of samples divisible by the batch size
model.fit(X_train[:n_train],
          Y_train[:n_train],
          epochs=num_epoch, 
          batch_size=batch_size, 
          shuffle=not stateful # don't shuffle if stateful 
         )
trained_epochs += num_epoch

In [ ]:
# Encoder
#encoder_inputs = Embedding(vocabulary_size, embedding_size, 
#                    mask_zero=True)
encoder_inputs = Input(shape=(None,))
x = Embedding(vocabulary_size, embedding_size)(encoder_inputs)
encoder = LSTM(hidden_size, 
               return_state=True,
               return_sequences=True
              )
#model.add(BatchNormalization())
encoder_outputs = encoder(x)
encoder_states = [encoder_outputs[1], encoder_outputs[2]]

# Decoder
decoder_inputs = Input(shape=(None,))
x = Embedding(vocabulary_size, embedding_size)(decoder_inputs)
decoder = LSTM(hidden_size, 
               return_sequences=True
              )
decoder_outputs = decoder(x, initial_state=encoder_states)
decoder_outputs = TimeDistributed(Dense(vocabulary_size, activation='softmax'))(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [ ]:
model.summary()
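
This two-input model is not trained in the notebook. One common way to prepare data for it is teacher forcing, where the decoder receives the target sequence shifted right by one position and starting with a start token. A minimal sketch, with a hypothetical helper name and made-up token indexes:

```python
def make_decoder_pairs(target_ids, start_id):
    """Decoder input is the target shifted right by one, beginning with the start token."""
    decoder_input = [start_id] + target_ids[:-1]
    decoder_target = list(target_ids)
    return decoder_input, decoder_target


# hypothetical indexes: 2 = start token, 3 = end token
dec_in, dec_target = make_decoder_pairs([7, 8, 9, 3], start_id=2)
print(dec_in)
print(dec_target)
```

At each step the decoder then sees the correct previous word and is asked to predict the next one. Training would pass both inputs as a list, along the lines of model.fit([encoder_in, decoder_in], decoder_target), after compiling the model.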

Evaluation

Informal evaluation. Use the trained model to generate N new sentences and print them out.


In [ ]:
#import logging
#logger = logging.getLogger()
#logger.setLevel(logging.DEBUG)

# instantiate generation class on our data
text_gen = TextGenModel(model, index_to_word, word_to_index, sent_max_len=sent_max_len,
                        temperature=1.0,
                        use_embeddings=True)
# generate N new sentences
n_sents = 10
print("Epoch {}".format(trained_epochs))
for _ in range(n_sents):
    res = text_gen.pretty_print_sentence(text_gen.get_sentence(15))
    print(res)
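
The internals of TextGenModel are not shown here. As an illustration of what a temperature parameter commonly does when sampling the next word, here is a sketch that assumes the usual approach of rescaling log-probabilities; the function names are hypothetical.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: divide log-probabilities by the temperature, then renormalise."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, temperature=1.0, rng=random):
    """Draw one word index from the temperature-adjusted distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


demo_dist = [0.7, 0.2, 0.1]
sharp = apply_temperature(demo_dist, 0.5)  # lower temperature: more conservative
flat = apply_temperature(demo_dist, 2.0)   # higher temperature: more diverse
picked = sample_index(demo_dist, temperature=1.0, rng=random.Random(0))
print(sharp, flat, picked)
```

A temperature of 1.0 leaves the model's distribution unchanged, values below 1.0 favour the most likely words, and values above 1.0 make rarer words more likely.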

Export Model

This section covers both the basic Keras export and the export for TensorFlow Serving.

Keras Basic Export

Export the model architecture and weights so that the model can be easily reinstantiated for testing or further training.

Notice that word-to-index data is required if we want to reuse the model, but we covered that step already during the preprocessing phase.

Note also that we could enable automatic checkpoint saving in the training loop.


In [ ]:
checkpoints_dir = join(model_data_folder, 'checkpoints')
if not os.path.exists(checkpoints_dir):
    os.makedirs(checkpoints_dir)

# export model (architecture)
model_path = join(checkpoints_dir, 
                  "base_voc_{}.json".format(vocabulary_size))
model_json = model.to_json()
with open(model_path, "w") as f:
    f.write(model_json)

# export model weights
weights_path = join(checkpoints_dir, 
                    "glove_voc_{}_epoch_{}.hdf5".format(vocabulary_size, 
                                                      trained_epochs))
model.save_weights(weights_path)

Keras Import


In [ ]:
K.clear_session() # reset the Keras state first; calling it after set_session would discard the new session
sess = tf.Session()
#sess.run(tf.global_variables_initializer())
K.set_session(sess)

In [ ]:
#K.set_learning_phase(False) # see https://github.com/fchollet/keras/issues/2310

# Load previously saved model
with open(join(model_data_folder, 'checkpoints', 'base_voc_8002.json'), 'r') as f:
    model = model_from_json(f.read())
# Load weights into model
model.load_weights(join(model_data_folder, 'checkpoints', 'base_voc_8002_epoch_23.hdf5'))

TF Serving Export

Export model for tf-serving relying on SavedModelBuilder. The exported model can then be easily "passed" to a tf-serving server instance for consumption.


In [ ]:
version = "4"
export_dir = join(model_data_folder, version)

In [ ]:
print(model.input)
print(model.output)

In [ ]:
# Init builder
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)

# Define signature (set of inputs and outputs for the graph)
prediction_signature = (
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={'inputs': tf.saved_model.utils.build_tensor_info(model.input)},
        outputs={'outputs': tf.saved_model.utils.build_tensor_info(model.output)},
        method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME
    )
)

# Add meta-graph (dataflow graph, variables, assets, and signatures) 
# to the builder
#with K.get_session() as sess: #avoid because it causes the session to be closed
sess = K.get_session()
K.set_learning_phase(False) # see https://github.com/fchollet/keras/issues/2310
builder.add_meta_graph_and_variables(
    sess=sess,
    tags=[tf.saved_model.tag_constants.SERVING],
    signature_def_map={
        'predict' : prediction_signature
    }
    #legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
)

# Finally save builder
builder.save()

TF Serving Client

Showcase of a basic client that can use the models served by a tf-serving server.


In [ ]:
from grpc.beta import implementations

# reference local copy of Tensorflow Serving API Files
sys.path.append(os.path.join(os.getcwd(), os.pardir, 'ext_libs'))
import lib.predict_pb2 as predict_pb2
import lib.prediction_service_pb2 as prediction_service_pb2

In [ ]:
host='127.0.0.1'
port=9000
channel = implementations.insecure_channel(host, int(port))
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

# build request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'rnn' # model name, as given to bazel script
request.model_spec.signature_name = 'predict' # key used in signature_def_map at export time

# define inputs
x = X_train[0]
x_tensor = tf.contrib.util.make_tensor_proto(x, dtype=tf.float32, shape=(1, sent_max_len))
request.inputs['inputs'].CopyFrom(x_tensor)

# call prediction on the server
result = stub.Predict(request, timeout=10.0)

In [ ]:
# get the output from server response
outputs = result.outputs['outputs']

In [ ]:
# extract response-tensor shape
tensor_shape = outputs.tensor_shape
tensor_shape = [dim.size for dim in tensor_shape.dim]

In [ ]:
# reshape the flat list of floats to the given shape
res_tensor = np.array(outputs.float_val).reshape(tensor_shape)

In [ ]:
res_tensor.shape
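
The response tensor holds one probability distribution over the vocabulary per time step. A small sketch of turning such a flat response into word indexes by taking the most likely word at each step; the helper names and toy values are hypothetical, and plain lists stand in for NumPy arrays.

```python
def reshape_flat(values, shape):
    """Reshape a flat list into nested lists for a (batch, steps, vocab) shape."""
    batch, steps, vocab = shape
    if len(values) != batch * steps * vocab:
        raise ValueError("number of values does not match the shape")
    it = iter(values)
    return [[[next(it) for _ in range(vocab)] for _ in range(steps)]
            for _ in range(batch)]


def argmax_per_step(step_probs):
    """Index of the most likely word at each time step."""
    return [max(range(len(probs)), key=probs.__getitem__) for probs in step_probs]


# toy response: 1 sample, 3 time steps, vocabulary of 2 words
demo_flat = [0.1, 0.9, 0.8, 0.2, 0.3, 0.7]
demo_tensor = reshape_flat(demo_flat, (1, 3, 2))
demo_word_ids = argmax_per_step(demo_tensor[0])
print(demo_word_ids)
```

With the NumPy array built above, the same result would come from res_tensor.argmax(axis=-1), and the indexes can then be mapped back to words through index_to_word.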

Text Generation

TF Serving Client

We rely on the implementation of the tensorflow serving client as provided in this repository. It is just a refactoring of the previous code for better modularity and reuse.


In [ ]:
from model.servingClient import ServingClient
from model.textGenerator import TextGenerator

In [ ]:
# instantiate text generator object 
# with info about which model to use and related configs
text_gen = TextGenerator('/Users/amartinelli/Documents/models/models.ini',
             'darwin_tf', 
            'standard_config')

In [ ]:
# generate text
text_gen.generate(10)