Intro

This notebook provides an overview of text generation with recurrent neural networks (RNNs) using Keras. We are going to build and train different models able to generate new pieces of text for different contexts (e.g. motivational quotes, jokes, proverbs, narrative, conversations, Q/A).

The code will guide you through all the steps necessary for this task, and it is accompanied by technical descriptions as well as external references to go deeper into the subject. See also the README.md file contained in this repository.

Some of the datasets used come directly from this repository, or from here for fortune cookies galore. I will also try to include the trained model data in my repository. You can easily adapt the code to any new kind of dataset you want to experiment with. If you have any doubts or suggestions, feel free to contact me directly, and be sure to share your results if you play with the code on new data.


In [ ]:
# Basic libraries import
import numpy as np
import pandas as pd
import seaborn as sns
import pickle
import nltk
import itertools

from keras.preprocessing import sequence
from keras.models import model_from_json

# Plotting
%matplotlib notebook

sns.set_context("paper")

# Add system path local modules
import os
import sys
sys.path.append(os.path.join(os.getcwd(), 'src'))

%load_ext autoreload
%autoreload 2
from model.textGenModel import TextGenModel

Data loading and preprocessing

First we are going to load our dataset and preprocess it so that it can be fed to the model. Notice that we are working at word level; moving to character level would require some adjustments to the process and to the overall model. These are the common steps you should follow to create your training data from your original dataset (a compact preview on a toy sentence follows the list):

  • sentence segmentation (if you don't already have separate individual sentences)
  • sentence tokenization (from sentence to list of words)
  • add start and end tokens
  • generate word indexes (index-to-word and word-to-index mappings)
  • pad sequences (pad or truncate sentences to fixed length)
  • one-hot encode (if you are not going to use an embedding layer in the Keras model)
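
As a compact preview, here is a minimal sketch of these steps applied to a single toy sentence (names and values are purely illustrative; the real pipeline is built cell by cell below):

In [ ]:
# illustrative preview only: the same steps on one toy sentence
toy = "the cat sat on the mat"
toy_tokens = ["SENTENCE_START"] + nltk.word_tokenize(toy.lower()) + ["SENTENCE_END"]
toy_vocab = {w: i for i, w in enumerate(["PADDING"] + sorted(set(toy_tokens)))}  # index 0 reserved for padding
toy_indexed = [toy_vocab[w] for w in toy_tokens]
toy_padded = sequence.pad_sequences([toy_indexed], maxlen=10, padding='post', truncating='post')
print(toy_tokens)    # tokenized sentence wrapped in start/end tokens
print(toy_indexed)   # words mapped to integer indexes
print(toy_padded)    # padded/truncated to a fixed length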

In [ ]:
# load dataset with pickle
corpus_name = "short_oneliners"
with open("resources/short_oneliners.pickle", 'rb') as f: #binary mode (b) is required for pickle
    dataset = pickle.load(f, encoding='utf-8') #our dataset is simply a list of string
print('Loaded {} sentences\nExample: "{}"'.format(len(dataset), dataset[0]))
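
If you want to experiment with your own data, the expected format is simply a Python list of strings. A minimal sketch for building a compatible pickle from a plain-text file with one sentence per line (the file names here are hypothetical):

In [ ]:
# hypothetical example: build a dataset pickle from a plain-text file (one sentence per line)
#with open("resources/my_corpus.txt", encoding='utf-8') as f:
#    my_dataset = [line.strip() for line in f if line.strip()]
#with open("resources/my_corpus.pickle", 'wb') as f:
#    pickle.dump(my_dataset, f)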

In [ ]:
# constant token and params for our models
START_TOKEN = "SENTENCE_START"
END_TOKEN = "SENTENCE_END"
UNKNOWN_TOKEN = "UNKNOWN_TOKEN"
PADDING_TOKEN = "PADDING"

vocabulary_size = 5000
sent_max_len = 20

In [ ]:
# word tokenization for each sentence, adding start and end tokens
sentences = [[START_TOKEN] + nltk.word_tokenize(entry.lower()) + [END_TOKEN] for entry in dataset]
print('Example: {}'.format(sentences[0]))

In [ ]:
# creates index_to_word and word_to_index mappings, given the data and a max vocabulary size
def get_words_mappings(tokenized_sentences, vocabulary_size):
    # rely on nltk to quickly get the most common words, then limit the vocabulary to the specified size
    frequency = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    vocab = frequency.most_common(vocabulary_size)
    index_to_word = [x[0] for x in vocab]
    # Add padding for index 0
    index_to_word.insert(0, PADDING_TOKEN)
    # Append unknown token (with index = vocabulary size + 1)
    index_to_word.append(UNKNOWN_TOKEN)
    word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
    return index_to_word, word_to_index

In [ ]:
# get mappings and update vocabulary size
index_to_word, word_to_index = get_words_mappings(sentences, vocabulary_size)
vocabulary_size = len(index_to_word)
print("Vocabulary size = " + str(vocabulary_size))

In [ ]:
# Generate training data by converting tokenized sentences to indexes (replacing out-of-vocabulary words with the unknown token)
train_size = min(len(sentences), 100000)
train_data = [[word_to_index.get(w, word_to_index[UNKNOWN_TOKEN]) for w in sent] for sent in sentences[:train_size]]

In [ ]:
# pad sentences to a fixed length (pad with 0s if shorter, truncate if longer)
train_data = sequence.pad_sequences(train_data, maxlen=sent_max_len, dtype='int32', padding='post', truncating='post')

In [ ]:
# quick and dirty way to one-hot encode our training data, not needed if using embeddings
#X_train = np.asarray([np.eye(vocabulary_size)[idx_sentence[:-1]] for idx_sentence in train_data])
#y_train = np.asarray([np.eye(vocabulary_size)[idx_sentence[1:]] for idx_sentence in train_data])

In [ ]:
# create training data for the RNN:
# input is the sentence without its last word, target is the sentence without its first word (next-word prediction)
X_train = train_data[:,:-1]
y_train = train_data[:,1:]
#X_train = X_train.reshape([X_train.shape[0], X_train.shape[1], 1])
y_train = y_train.reshape([y_train.shape[0], y_train.shape[1], 1]) # extra trailing dimension required by the TimeDistributed layer with sparse targets
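
To make the shift concrete: for an indexed sentence [s, w1, w2, ..., e, 0, ...], the input is everything but the last position and the target is everything but the first, so at each timestep the network is trained to predict the next word. A quick illustrative check on the first training sentence:

In [ ]:
# illustrative: inspect the input/target shift on the first training sentence
print(train_data[0])       # full padded sentence
print(X_train[0])          # same sentence without its last position (model input)
print(y_train[0].ravel())  # same sentence without its first position (per-timestep targets)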

In [ ]:
# check the expected shapes: (samples, sentence length) for X, (samples, sentence length, 1) for y
print(X_train.shape)
print(y_train.shape)

Model training and evaluation

We are going to define an RNN model architecture, train it on our data, and eventually save the results for future use.


In [ ]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.layers.core import Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers import LSTM, TimeDistributed

In [ ]:
# Define model and parameters
hidden_size = 512
embedding_size = 128

In [ ]:
# model with embedding
model = Sequential()
model.add(Embedding(vocabulary_size, embedding_size, mask_zero=True))
# (a variant with batch normalization is sketched in the next cell)
model.add(TimeDistributed(Flatten()))
model.add(LSTM(hidden_size, return_sequences=True, activation='relu'))
model.add(TimeDistributed(Dense(vocabulary_size, activation='softmax')))
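
Following the batch-normalization note above, a minimal sketch of where such a layer could be inserted (an untested variant kept commented out, not the model used for the saved weights):

In [ ]:
# possible variant with batch normalization after the recurrent layer (illustrative, untested)
#model = Sequential()
#model.add(Embedding(vocabulary_size, embedding_size, mask_zero=True))
#model.add(TimeDistributed(Flatten()))
#model.add(LSTM(hidden_size, return_sequences=True, activation='relu'))
#model.add(BatchNormalization())
#model.add(TimeDistributed(Dense(vocabulary_size, activation='softmax')))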

In [ ]:
# basic single-layer model
#model = Sequential()
#model.add(LSTM(hidden_size, input_shape=(None, vocabulary_size), return_sequences=True))
#model.add(TimeDistributed(Dense(vocabulary_size, activation='softmax')))

In [ ]:
# open question: should the embedding be wrapped in a TimeDistributed layer? (kept for reference, not used)
#model.add(TimeDistributed(
#        Embedding(vocabulary_size, output_dim=hidden_size, mask_zero=True, input_length=sent_max_len-1),
#          input_shape=(sent_max_len-1, 1), input_dtype='int32'))

In [ ]:
model.summary()

In [ ]:
# recompile also if you just want to keep training a model loaded from disk
loss = 'sparse_categorical_crossentropy'
optimizer = 'adam'
model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
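
Note that sparse_categorical_crossentropy expects integer word indexes as targets, which is why y_train was given an extra trailing dimension above. If you instead use the one-hot encoded targets commented out earlier, you would switch to plain categorical_crossentropy:

In [ ]:
# only if training on one-hot encoded targets instead of integer indexes
#model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])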

In [ ]:
# Train model
# you might want to train several times for a few epochs each, observing how the loss and metrics vary,
# and possibly tweaking the batch size and learning rate
num_epoch = 2
batch_size = 32
model.fit(X_train, y_train, epochs=num_epoch, batch_size=batch_size, verbose=1)
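
If you train for more epochs in a single run, it can be convenient to snapshot the weights as training progresses; a minimal sketch using Keras' ModelCheckpoint callback (the file path pattern is just an example):

In [ ]:
# optional: save the weights after each epoch while training (illustrative)
#from keras.callbacks import ModelCheckpoint
#checkpoint = ModelCheckpoint("resources/models/{}_epoch_{{epoch:02d}}.hdf5".format(corpus_name),
#                             save_weights_only=True)
#model.fit(X_train, y_train, epochs=num_epoch, batch_size=batch_size, verbose=1, callbacks=[checkpoint])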

In [ ]:
# export model (architecture)
model_path = "resources/models/{}_vocab_{}.json".format(corpus_name, vocabulary_size)
model_json = model.to_json()
with open(model_path, "w") as f:
    f.write(model_json)

# export model weights
weights_path = "resources/models/{}_epoch_{}.hdf5".format(corpus_name, 40)
model.save_weights(weights_path)

# export word indexes
index_to_word_path = 'resources/models/{}_idxs_vocab{}.txt'.format(corpus_name, vocabulary_size)
with open(index_to_word_path, "wb") as f:
    pickle.dump(index_to_word, f)

Text Generation

For the generation part I am relying on a generic utility class (TextGenModel) included in this repo.

Main operations for this step are:

  • load the previously trained model
  • instantiate the class with the model and its configuration (e.g. temperature, max sentence length)
  • generate new text with the instantiated class
  • prettify the generated text

For the text generation task a seed sentence can be provided, along with additional requirements on text length. The class then internally takes care of predicting word after word until its stopping criteria are met.
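
To give an idea of what happens under the hood, here is an illustrative simplification of temperature-based, word-by-word sampling (this is not the actual TextGenModel implementation):

In [ ]:
# illustrative sketch of temperature sampling for the next word (not the real TextGenModel code)
def sample_next_word(model, current_indexes, temperature=1.0):
    # predicted probability distribution over the vocabulary for the next word
    probs = model.predict(np.asarray([current_indexes]))[0][-1]
    # reweight by temperature: < 1.0 is more conservative, > 1.0 is more adventurous
    probs = np.log(probs + 1e-8) / temperature
    probs = np.exp(probs) / np.sum(np.exp(probs))
    return np.random.choice(len(probs), p=probs)

Starting from the start-token index, such a function would be called in a loop, appending each sampled index to the sentence until the end token (or a maximum length) is reached.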


In [ ]:
model_path = "resources/models/jokes_vocab_5002.json"
weights_path = "resources/models/jokes_epoch_20.hdf5"
# Load previously saved model
with open(model_path, 'r') as f:
    model = model_from_json(f.read())
# Load weights into model
model.load_weights(weights_path)
# Load word indexes
with open(index_to_word_path, 'rb') as f:
    index_to_word = pickle.load(f, encoding='utf-8')
# rebuild the reverse mapping so it stays consistent with the loaded indexes
word_to_index = dict((w, i) for i, w in enumerate(index_to_word))

In [ ]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# instantiate generation class on our data
text_gen = TextGenModel(model, index_to_word, word_to_index,
                        sent_max_len=sent_max_len, temperature=1.0, use_embeddings=True)
# generate N new sentences
n_sents = 10
original_sentences = [entry.lower() for entry in dataset]  # used to flag sentences copied verbatim from the training data
for _ in range(n_sents):
    res = text_gen.pretty_print_sentence(text_gen.get_sentence(15), skip_last=True)
    if res in original_sentences:
        print("* {} SAME".format(res))
    else:
        print(res)

Conclusion

I am trying to polish the code in this repository and generalize it even further, so that there is a clear separation between text generation and model training. The idea is that for the latter I can keep experimenting with different techniques and tools and generate different models, while with the former I can provide a reusable text-generation interface for a multitude of use cases.

I would be especially interested to see more widespread and creative uses of text generation: from purely artistic tasks to personal-productivity ones (e.g. text suggestion and checking), as well as a bit more automation for all other relevant scenarios.

