Deep Learning for Language

A tutorial prepared by Maluuba

Prepared by Justin Harris, Tavian Barnes, and Adam Atkinson for the ImplementAI Hackathon run by the McGill A.I. Society

About this tutorial

This notebook will teach the audience how deep learning is used for natural language processing (NLP) tasks, including:

  • Named Entity Recognition
  • Part of Speech Tagging & Syntactic Parsing
  • Language Modelling
  • Natural Language Generation
  • Intent Classification
  • Translation
  • Question-Answering
  • Dialogue

We introduce the following concepts necessary to build a "deep NLP" system:

  • Word embeddings
  • Recurrent neural networks
  • Deep architectures for NLP

We also motivate and demonstrate these ideas with a sample model, inspired by Maluuba's own work on a new task: generating questions from source texts. See our recent research here: (1), (2).

For the purposes of this tutorial, some knowledge of machine learning, neural networks, and natural language processing will be helpful.

About Maluuba

Maluuba develops artificial intelligence that understands language. Our mission is to build a literate machine.

We research problems and develop solutions related to:

  • Machine reading comprehension
  • Dialogue systems
  • Reinforcement learning

Contents

  1. Deep NLP Pipeline
  2. Word Embeddings
  3. Recurrent Neural Networks
  4. Question Generation
  5. Architectures and Advanced RNNs
  6. References
  7. Resources

Deep NLP Pipeline

This pipeline is for supervised learning approaches (i.e. all training data has labels).

1. Pre-process the corpus

This means cleaning your data and getting it into the format you want. You'll strip characters, fold case, tokenize, and maybe lemmatize.

2. Prepare embeddings

These can be computed yourself, learned as a part of your model, or you can use pre-trained ones (e.g. GoogleNews word2vec vectors, Stanford GloVe vectors from Wikipedia or Common Crawl).
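
As a sketch of what "preparing embeddings" can look like in practice, the snippet below loads pre-trained GloVe vectors from a plain-text file into a Python dictionary. The file name "glove.6B.100d.txt" is an assumption; any of the files from the GloVe project page would work the same way.


In [ ]:
import numpy as np

# Each line of a GloVe text file is a word followed by its vector components.
word_vectors = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word_vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(len(word_vectors), word_vectors["language"].shape)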

3. Define input and output representations

You need to define:

  1. How input is encoded for your model.
  2. How output is encoded for your model.
  3. How output is decoded from your model into something that is meaningful to humans. This is typically done by taking the class or word that maximizes the predicted probability, or by working with the probability distribution itself.

You'll need to consider things like sequence lengths, input/output vector or tensor representations, padding, and out-of-vocabulary (OOV) words.

Make sure that the inputs and outputs are neural network friendly. This may involve a one-hot vector encoding, standardizing values, scaling values, or centering values around zero.

You may use things like a softmax layer or beam search in the decoding step.
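
Below is a minimal sketch of these encoding choices: mapping tokens to vocabulary indices, handling OOV words, padding to a fixed length, and one-hot encoding. The tiny vocabulary is made up purely for illustration.


In [ ]:
import numpy as np

vocabulary = ["<PAD>", "<UNK>", "the", "cat", "sat"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Map a tokenized sentence to indices, using <UNK> for out-of-vocabulary words.
tokens = ["the", "dog", "sat"]
indices = [word_to_index.get(token, word_to_index["<UNK>"]) for token in tokens]

# Pad to a fixed length so sentences can be batched together.
MAX_LENGTH = 5
padded = indices + [word_to_index["<PAD>"]] * (MAX_LENGTH - len(indices))

# One-hot encode: each row has a single 1 in the column of the corresponding word.
one_hot = np.eye(len(vocabulary))[padded]
print(padded)         # [2, 1, 4, 0, 0]
print(one_hot.shape)  # (5, 5)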

4. Construct the model

  1. Input: the X values, features, or co-variates of your formatted data.
  2. (Optional) Embedding layer: if your input isn't word vectors but needs them, then this layer will transform your input using some word embeddings.
  3. Your neural model: any combination of differentiable units and modules.
  4. Output: the Y values, predictions, or targets corresponding to your formatted data.

5. Train the model

Use backpropagation with your favourite loss functions on your model, using labelled training data, to learn model parameters.

6. Use the model

Use your model to make predictions on unseen data. This will involve encoding your input and decoding the output to a friendly format.

Word Embeddings

Problem: How do we represent words numerically?

  • Can we use the index to a word in the vocabulary?
  • If the vocabulary is large, how do we save space?
  • How do we decode this representation back into text?
  • Can we capture any semantic information or context in this representation?

Solution: Word embeddings!

  • Use a neural network to predict a word given the words around it (continuous bag of words/CBOW), or predict the words around a given word (skipgram).
  • Use one-hot encodings of the words as input and output, keep the hidden layer smaller than the length of the vocabulary/one-hot vectors, then take the weights of the neural network as the word vectors!
  • Think of word vectors as columns in a matrix. Then the data is transformed by multiplying the one-hot vector by this matrix (see the sketch below).
  • Train the neural network over a lot of text (a large corpus).
  • Word embeddings are compact and dense, and they also embed semantic information.

(Source: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)
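
A small numpy sketch of the lookup described above: multiplying the embedding matrix by a one-hot vector simply selects that word's vector. The matrix here is random, purely for illustration.


In [ ]:
import numpy as np

EMBEDDING_SIZE, VOCAB_SIZE = 3, 5
np.random.seed(0)
# Word vectors stored as columns of the embedding matrix, as described above.
embedding_matrix = np.random.rand(EMBEDDING_SIZE, VOCAB_SIZE)

# One-hot vector for the word at index 2 of the vocabulary.
one_hot = np.zeros(VOCAB_SIZE)
one_hot[2] = 1.0

# Multiplying by the one-hot vector selects column 2.
word_vector = embedding_matrix @ one_hot
assert np.allclose(word_vector, embedding_matrix[:, 2])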

Recurrent Neural Networks

Language is sequential, so we need models that can work with data in this way.

Recurrent neural networks (RNNs) work across time slices of an input while maintaining some state over time.

A recurrent module takes input at a timestep $t$ and uses the input and state at previous timestep $t-1$ to compute

  1. the output at timestep $t$.
  2. the updated state at timestep $t$.

A recurrent module can be any type of stateful differentiable neural model.

You can think of an RNN as a really narrow but deep feedforward/vanilla neural network or multilayer perceptron (MLP). We train an RNN by unrolling the computation in time and backpropagating over the sequence.
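
As a sketch, a single step of a vanilla RNN can be written as follows. The weight names and shapes are illustrative, not taken from any library.


In [ ]:
import numpy as np

INPUT_SIZE, STATE_SIZE = 4, 3
np.random.seed(0)
W_xh = np.random.randn(STATE_SIZE, INPUT_SIZE)  # input-to-state weights
W_hh = np.random.randn(STATE_SIZE, STATE_SIZE)  # state-to-state weights
b = np.zeros(STATE_SIZE)

def rnn_step(x_t, h_prev):
    """Compute the new state at timestep t, which is also the output."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Unrolling: the same weights are reused at every timestep of the sequence.
h = np.zeros(STATE_SIZE)
for x_t in np.random.randn(6, INPUT_SIZE):  # a sequence of 6 timesteps
    h = rnn_step(x_t, h)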

Problems

RNNs have had these issues in the past:

1) Long sequences => a really deep unrolled network => computationally intensive.

  • Solution: don't backpropagate through the entire sequence, use truncated backpropagation through time.

2) RNNs are bad at remembering things over longer time spans.

  • Solution: control the way state is updated or forgotten using gated units like the GRU and LSTM.

3) Gradient signals between timesteps get smaller and become zero (vanishing gradients).

  • Solution: neural units that prevent this, like GRU and LSTM.

4) Gradient signals between timesteps grow exponentially (exploding gradients).

  • Solution: clip the gradients so they never get larger than a certain value (see the sketch after this list).
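
In TensorFlow, gradient clipping can be done by computing the gradients explicitly, clipping them, and then applying them. A minimal sketch follows; the toy loss and the clip norm of 5.0 are arbitrary choices for illustration.


In [ ]:
import tensorflow as tf

# A toy scalar loss so the sketch is self-contained; in practice use your model's loss.
toy_weights = tf.get_variable("toy_weights", shape=[10])
toy_loss = tf.reduce_mean(tf.square(toy_weights))

optimizer = tf.train.AdamOptimizer()
gradients, variables = zip(*optimizer.compute_gradients(toy_loss))
clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=5.0)
train_op = optimizer.apply_gradients(list(zip(clipped_gradients, variables)))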

LSTM

  • Long Short-Term Memory.
  • Has two forms of memory: "hidden" state and cell state
    • Hidden state is also the output for the unit.
    • Cell state is the longer-term memory from which the hidden state is computed.
  • Has three gates (the update equations are given below):
    • forget gate determines what part of the previous cell state to discard.
    • input gate determines how the current input is written into the cell state.
    • output gate determines what part of the cell state becomes the new hidden state (the output).
  • More parameters make the LSTM more expressive but also more computationally intensive.

(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
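
For reference, the standard LSTM update (in the notation of the post cited above, with $\odot$ denoting elementwise multiplication) is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$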

GRU

  • Gated Recurrent Unit.
  • GRU is a simplification of LSTM.
  • One form of memory: hidden state.
  • Has two gates which produce the output and new hidden state:
    • reset gate determines how much of the previous state is used when combining it with the input.
    • update gate determines how much of the previous state is kept versus replaced by the new candidate state.
  • Fewer parameters make it less computationally intensive, but the unit has less capacity and is less expressive (compare the update equations below with the LSTM's).

(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
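
For comparison, the standard GRU update (same notation as above) is:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Note there is no separate cell state and one fewer set of weights, which is where the parameter savings come from.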

Question Generation

Architectures and Advanced RNNs

Input-Output Cardinality

The architecture you choose depends on how many inputs and outputs your task has and how they are distributed across time.

One-to-one: Take the whole sequence at once and predict one value for the entire sequence.

One-to-many: Take the whole sequence at once and predict multiple subsequent values.

Many-to-many: Take the sequence one word at a time and predict something for each word (e.g. tagging), or map an input sequence to an output sequence of a different length; the latter is commonly known as sequence-to-sequence (seq2seq).

Many-to-one: Take the sequence one word at a time, predict something for each word, then combine the outputs for each time step into a single output. This is often achieved by pooling over the outputs at each timestep.

(Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Encoder-Decoder

  • Many neural models for language use an intermediate representation for the data.
  • Two separate models called an encoder and a decoder write to and read from this format.
  • The encoder and decoder are stacked on top of each other, and the decoder reads from the intermediate representation after the encoder has finished writing to it.

(Source: https://talbaumel.github.io/attention/)

Attention

  • Gives the neural network the ability to "focus" on part of the input.
  • An attentive decoder learns a weighted combination of all the time-distributed features (the intermediate representation) produced by the layer below (i.e. the encoder).
  • The weights are produced by learned parameters and determine how strongly the encoder's features at each time step contribute to each decoding step (one standard formulation is given below).

(Source: https://distill.pub/2016/augmented-rnns/)
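
One common formulation: at decoder step $t$, score each encoder output $h_i$ against the decoder state $s_{t-1}$, normalize the scores with a softmax to get the attention weights, and take the weighted sum of the encoder outputs as the context vector:

$$\alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_j \exp(\mathrm{score}(s_{t-1}, h_j))} \qquad c_t = \sum_i \alpha_{t,i} h_i$$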

Pointer Networks

  • We interpret the token index that maximizes the attention at a time step as a pointer into the input sequence.
  • To extract a span, we look at the attention at two consecutive time steps - the start and end of a segment - meaning the decoder only unrolls twice.

(Source: https://medium.com/@devnag/pointer-networks-in-tensorflow-with-sample-code-14645063f264)

Example Model

Here is our model to generate questions from a document. It combines many of the concepts above into a single model. This model is a simplification of a few ideas from some of our recent papers on question generation.

Let's start with detecting potential answers in documents. The following is a simple entity recognition model.

It's possible to train a word embedding from scratch, but it's usually better to start with a pre-trained word embedding. We'll use Stanford's GloVe embeddings.

(https://nlp.stanford.edu/projects/glove/)


In [ ]:
import tensorflow as tf

from qgen.embedding import glove

embedding = tf.get_variable("embedding", initializer=glove)

EMBEDDING_DIMENS = glove.shape[1]

Now let's build the part of the model that finds potential answers in the document. We use a bidirectional recurrent network to predict, for each word in the document, whether it's part of an answer.


In [ ]:
from tensorflow.contrib.rnn import GRUCell

document_tokens = tf.placeholder(tf.int32, shape=[None, None], name="document_tokens")
document_lengths = tf.placeholder(tf.int32, shape=[None], name="document_lengths")

document_emb = tf.nn.embedding_lookup(embedding, document_tokens)

forward_cell = GRUCell(EMBEDDING_DIMENS)
backward_cell = GRUCell(EMBEDDING_DIMENS)

answer_outputs, _ = tf.nn.bidirectional_dynamic_rnn(
    forward_cell, backward_cell, document_emb, document_lengths, dtype=tf.float32,
    scope="answer_rnn")
answer_outputs = tf.concat(answer_outputs, 2)

answer_tags = tf.layers.dense(inputs=answer_outputs, units=2)

When training the model, we'll feed it the expected answers from the training set, and compute how far the model's predictions are from them. The optimizer will try to minimize this value, known as the loss.


In [ ]:
import tensorflow.contrib.seq2seq as seq2seq

answer_labels = tf.placeholder(tf.int32, shape=[None, None], name="answer_labels")

answer_mask = tf.sequence_mask(document_lengths, dtype=tf.float32)
answer_loss = seq2seq.sequence_loss(
    logits=answer_tags, targets=answer_labels, weights=answer_mask, name="answer_loss")

To generate questions from the answers, we'll use a sequence-to-sequence model that transforms an answer into a question. We can feed it the recurrent states from the answer-finding RNN above to give each answer some context from the rest of the document.

First, the encoder:


In [ ]:
encoder_input_mask = tf.placeholder(
    tf.float32, shape=[None, None, None], name="encoder_input_mask")
encoder_inputs = tf.matmul(encoder_input_mask, answer_outputs, name="encoder_inputs")
encoder_lengths = tf.placeholder(tf.int32, shape=[None], name="encoder_lengths")

encoder_cell = GRUCell(forward_cell.state_size + backward_cell.state_size)

_, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs, encoder_lengths, dtype=tf.float32, scope="encoder_rnn")

Now, the decoder. It takes the last state of the encoder as input, and begins generating question words one at a time.


In [ ]:
from tensorflow.python.layers.core import Dense

decoder_inputs = tf.placeholder(tf.int32, shape=[None, None], name="decoder_inputs")
decoder_lengths = tf.placeholder(tf.int32, shape=[None], name="decoder_lengths")

decoder_emb = tf.nn.embedding_lookup(embedding, decoder_inputs)
helper = seq2seq.TrainingHelper(decoder_emb, decoder_lengths)

projection = Dense(embedding.shape[0], use_bias=False)

decoder_cell = GRUCell(encoder_cell.state_size)

decoder = seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection)

decoder_outputs, _, _ = seq2seq.dynamic_decode(decoder, scope="decoder")
decoder_outputs = decoder_outputs.rnn_output

decoder_labels = tf.placeholder(tf.int32, shape=[None, None], name="decoder_labels")
question_mask = tf.sequence_mask(decoder_lengths, dtype=tf.float32)
question_loss = seq2seq.sequence_loss(
    logits=decoder_outputs, targets=decoder_labels, weights=question_mask,
    name="question_loss")

loss = tf.add(answer_loss, question_loss, name="loss")

To train both parts of the model at once, we add both losses (from the questions and answers) together, and tell our optimizer to minimize the sum. Neural network training proceeds by iterating over the data in batches, feeding each one to the network and letting the optimizer tune the weights by some variant of Stochastic Gradient Descent.


In [ ]:
from qgen.data import training_data

optimizer = tf.train.AdamOptimizer().minimize(loss)

saver = tf.train.Saver()
session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())

EPOCHS = 5

for epoch in range(1, EPOCHS + 1):
    print("Epoch {0}".format(epoch))
    for batch in training_data():
        _, loss_value = session.run([optimizer, loss], {
            document_tokens: batch["document_tokens"],
            document_lengths: batch["document_lengths"],
            answer_labels: batch["answer_labels"],
            encoder_input_mask: batch["answer_masks"],
            encoder_lengths: batch["answer_lengths"],
            decoder_inputs: batch["question_input_tokens"],
            decoder_labels: batch["question_output_tokens"],
            decoder_lengths: batch["question_lengths"],
        })
        print("Loss: {0}".format(loss_value))
    saver.save(session, "model", epoch)

Now that we have a trained model, let's use it to generate some new questions and answers! First, we predict some answers. Notice how we're only feeding the document itself to the model.


In [ ]:
import numpy as np

from qgen.data import test_data, collapse_documents

saver = tf.train.Saver()
session = tf.InteractiveSession()
saver.restore(session, "model-5")

batch = next(test_data())
batch = collapse_documents(batch)

answers = session.run(answer_tags, {
    document_tokens: batch["document_tokens"],
    document_lengths: batch["document_lengths"],
})
answers = np.argmax(answers, 2)

Now that we have some answers, we can use the sequence-to-sequence model to generate some questions that (hopefully) have the predicted answers. To make new predictions with the decoder, we'll have to change its implementation slightly, to wire up its inputs from its previous outputs instead of the training data.


In [ ]:
import itertools

from qgen.data import expand_answers
from qgen.embedding import look_up_token, UNKNOWN_TOKEN, START_TOKEN, END_TOKEN

batch = expand_answers(batch, answers)

helper = seq2seq.GreedyEmbeddingHelper(embedding, tf.fill([batch["size"]], START_TOKEN), END_TOKEN)
decoder = seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection)
decoder_outputs, _, _ = seq2seq.dynamic_decode(decoder, maximum_iterations=16)
decoder_outputs = decoder_outputs.rnn_output

questions = session.run(decoder_outputs, {
    document_tokens: batch["document_tokens"],
    document_lengths: batch["document_lengths"],
    answer_labels: batch["answer_labels"],
    encoder_input_mask: batch["answer_masks"],
    encoder_lengths: batch["answer_lengths"],
})
questions[:,:,UNKNOWN_TOKEN] = 0
questions = np.argmax(questions, 2)

for i in range(batch["size"]):
    question = itertools.takewhile(lambda t: t != END_TOKEN, questions[i])
    print("Question: " + " ".join(look_up_token(token) for token in question))
    print("Answer: " + batch["answer_text"][i])
    print()