Take a look at this great article for an introduction to recurrent neural networks and LSTMs in particular.
In this tutorial we will show how to train a recurrent neural network on the challenging task of language modeling. The goal of the problem is to fit a probabilistic model which assigns probabilities to sentences. It does so by predicting the next word in a text given a history of previous words. For this purpose we will use the Penn Tree Bank (PTB) dataset, which is a popular benchmark for measuring the quality of these models, whilst being small and relatively fast to train.
Language modeling is key to many interesting problems such as speech recognition, machine translation, and image captioning. It is also fun -- take a look here.
For the purpose of this tutorial, we will reproduce the results from Zaremba et al., 2014 (pdf), which achieves very good quality on the PTB dataset.
File | Purpose |
---|---|
ptb_word_lm.py | The code to train a language model on the PTB dataset. |
reader.py | The code to read the dataset. |
The data required for this tutorial is in the data/ directory of the PTB dataset from Tomas Mikolov's webpage: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
The dataset is already preprocessed and contains overall 10000 different words, including the end-of-sentence marker and a special symbol (`<unk>`) for rare words.
The core of the model consists of an LSTM cell that processes one word at a time and computes probabilities of the possible continuations of the sentence. The memory state of the network is initialized with a vector of zeros and gets updated after reading each word. Also, for computational reasons, we will process data in mini-batches of size `batch_size`.
The basic pseudocode looks as follows:
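In spirit, the loop looks like the following runnable NumPy sketch (a toy LSTM cell with illustrative names and sizes; the tutorial's actual code builds this loop with TensorFlow's LSTM cell instead):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step: read a batch of word vectors, update the state (h, c)."""
    z = np.concatenate([x, h], axis=1) @ W + b
    # Split the pre-activations into the four gates.
    i, f, g, o = np.split(z, 4, axis=1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
input_size, lstm_size, batch_size = 8, 16, 4
W = rng.standard_normal((input_size + lstm_size, 4 * lstm_size)) * 0.1
b = np.zeros(4 * lstm_size)

# The memory state starts as zeros and is updated after reading each word.
h = np.zeros((batch_size, lstm_size))
c = np.zeros((batch_size, lstm_size))
for _ in range(5):  # one iteration per mini-batch of words
    x = rng.standard_normal((batch_size, input_size))  # stand-in word vectors
    h, c = lstm_step(x, h, c, W, b)
    # In the real model, h would feed a softmax over the vocabulary and the
    # loss of the predicted next words would be accumulated here.
print(h.shape)
```

The structure is what matters: one cell invocation per word, with the state threaded through the loop.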
In order to make the learning process tractable, it is common practice to truncate the gradients for backpropagation to a fixed number (`num_steps`) of unrolled steps. This is easy to implement by feeding inputs of length `num_steps` at a time and doing a backward pass after each such block.
A simplified version of the code for the graph creation for truncated backpropagation:
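Structurally, the graph unrolls the recurrence for exactly `num_steps` steps and exposes the final state so it can be fed back in as the next chunk's initial state. A minimal sketch of that unrolling, using a plain tanh recurrence in NumPy rather than the tutorial's TensorFlow graph (all names and sizes here are illustrative):

```python
import numpy as np

num_steps, batch_size, hidden = 3, 2, 5
rng = np.random.default_rng(1)
Wx = rng.standard_normal((hidden, hidden)) * 0.1
Wh = rng.standard_normal((hidden, hidden)) * 0.1

def run_chunk(inputs, initial_state):
    """Unroll the recurrence for exactly num_steps inputs."""
    state = initial_state
    outputs = []
    for t in range(num_steps):
        state = np.tanh(inputs[:, t] @ Wx + state @ Wh)
        outputs.append(state)
    # Gradients (not shown) would flow only through these num_steps steps;
    # the returned final state carries the memory forward to the next chunk.
    return outputs, state

initial = np.zeros((batch_size, hidden))
chunk = rng.standard_normal((batch_size, num_steps, hidden))
outputs, final_state = run_chunk(chunk, initial)
print(len(outputs), final_state.shape)
```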
And this is how to implement an iteration over the whole dataset:
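The real loop runs the TensorFlow graph once per chunk, feeding the previous run's final state back in as the initial state. The data flow amounts to the following runnable stand-in (names, sizes, and the stand-in loss are illustrative only):

```python
import numpy as np

# A toy "dataset": one long stream of word vectors per batch row,
# consumed in consecutive chunks of num_steps words each.
num_steps, batch_size, hidden = 3, 2, 5
rng = np.random.default_rng(2)
Wx = rng.standard_normal((hidden, hidden)) * 0.1
Wh = rng.standard_normal((hidden, hidden)) * 0.1
stream = rng.standard_normal((batch_size, 12, hidden))

state = np.zeros((batch_size, hidden))  # persists across chunks
total_loss = 0.0
for start in range(0, stream.shape[1], num_steps):
    chunk = stream[:, start:start + num_steps]
    for t in range(chunk.shape[1]):
        state = np.tanh(chunk[:, t] @ Wx + state @ Wh)
    # In the real model, running each chunk would also return its loss;
    # this is only a positive stand-in value to show the accumulation.
    total_loss += float(np.mean(state ** 2))
print(state.shape)
```

The key point is that the state is never reset between chunks: only the gradient computation is truncated at chunk boundaries.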
The word IDs will be embedded into a dense representation (see the Vector Representations Tutorial) before feeding to the LSTM. This allows the model to efficiently represent the knowledge about particular words. It is also easy to write:
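An embedding lookup is just row indexing into a matrix of shape `[vocabulary_size, embedding_size]` (in TensorFlow this is `tf.nn.embedding_lookup`). A small NumPy sketch, with illustrative sizes and IDs:

```python
import numpy as np

vocabulary_size, embedding_size = 10000, 4
rng = np.random.default_rng(3)
# Randomly initialized; the model learns these vectors during training.
embedding_matrix = rng.standard_normal((vocabulary_size, embedding_size)) * 0.1

word_ids = np.array([[14, 9, 250],
                     [3, 3, 71]])  # shape [batch_size, num_steps]
# The lookup is plain row indexing:
word_embeddings = embedding_matrix[word_ids]
print(word_embeddings.shape)  # (2, 3, 4)
```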
The embedding matrix will be initialized randomly and the model will learn to differentiate the meaning of words just by looking at the data.
We want to minimize the average negative log probability of the target words:
$$ loss = -\frac{1}{N}\sum_{i=1}^{N} \ln p_{target_i} $$

It is not very difficult to implement, but the function `sequence_loss_by_example` is already available, so we can just use it here.
The typical measure reported in the papers is average per-word perplexity (often just called perplexity), which is equal to
$$ e^{-\frac{1}{N}\sum_{i=1}^{N} \ln p_{target_i}} = e^{loss} $$

and we will monitor its value throughout the training process.
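As a quick sanity check of the formula: a model that assigns every target word the uniform probability 1/V has average negative log probability ln V, so its perplexity is exactly V, the vocabulary size. A short NumPy verification (the sizes are arbitrary):

```python
import numpy as np

V = 10000
# p_{target_i} for 50 hypothetical target words, all predicted uniformly.
target_probs = np.full(50, 1.0 / V)
loss = -np.mean(np.log(target_probs))
perplexity = np.exp(loss)
print(round(perplexity))  # 10000
```

Intuitively, perplexity is the effective number of words the model is choosing between at each step; anything below V means the model has learned something.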
We are assuming you have already installed TensorFlow via the pip package, have cloned the tensorflow git repository, and are in the root of the git tree. (If building from source, build the tensorflow/models/rnn/ptb:ptb_word_lm target using bazel.)
Next:

```shell
cd tensorflow/models/rnn/ptb
python ptb_word_lm.py --data_path=/tmp/simple-examples/data/ --model small
```
There are 3 supported model configurations in the tutorial code: "small", "medium" and "large". The difference between them is in the size of the LSTMs and the set of hyperparameters used for training.
The larger the model, the better results it should get. The small model should be able to reach perplexity below 120 on the test set and the large one below 80, though it might take several hours to train.