Prepared by Justin Harris, Tavian Barnes, and Adam Atkinson for the ImplementAI Hackathon run by the McGill A.I. Society
This notebook will teach you how deep learning is used for natural language processing (NLP) tasks, including:
We introduce the following concepts necessary to build a "deep NLP" system:
We also motivate and demonstrate these ideas with a sample model, inspired by Maluuba's own work on a new task: generating questions from source texts. See our recent research here: (1), (2).
For the purposes of this tutorial, some knowledge of machine learning, neural networks, and natural language processing will be helpful.
Maluuba develops artificial intelligence that understands language. Our mission is to build a literate machine.
We research problems and develop solutions related to:
This pipeline is for supervised learning approaches (i.e. all training data has labels).
This means cleaning your data and getting it into the format you want. You'll strip characters, fold case, tokenize, and maybe lemmatize.
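For example, a minimal preprocessing function (the regular expression and whitespace tokenization below are just illustrative choices) could look like this:
In [ ]:
import re

def preprocess(text):
    """Minimal preprocessing sketch: fold case, strip unwanted characters, tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return text.split()

preprocess("Maluuba develops A.I. that understands language!")
# ['maluuba', 'develops', 'a', 'i', 'that', 'understands', 'language']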
These can be computed yourself, learned as a part of your model, or you can use pre-trained ones (e.g. GoogleNews word2vec vectors, Stanford GloVe vectors from Wikipedia or Common Crawl).
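As a sketch, pre-trained GloVe vectors ship as a plain-text file with one word and its vector per line; a minimal loader (the file name below is just an assumption about which download you have) might be:
In [ ]:
import numpy as np

def load_glove(path):
    """Load pre-trained GloVe vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

glove_vectors = load_glove("glove.6B.100d.txt")  # hypothetical local copy of a GloVe download
glove_vectors["language"].shape  # (100,)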
You need to define:
You'll need to consider things like sequence lengths, input/output vector or tensor representations, padding, and out-of-vocabulary (OOV) words.
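To make this concrete, here is a sketch of padding a batch of token sequences to a common length and mapping out-of-vocabulary words to a special UNK index (the toy vocabulary below is made up):
In [ ]:
import numpy as np

PAD, UNK = 0, 1
vocab = {"<pad>": PAD, "<unk>": UNK, "what": 2, "is": 3, "maluuba": 4}

def encode_batch(sentences, max_length):
    """Map words to indices (UNK for out-of-vocabulary words) and pad every row to max_length."""
    batch = np.full((len(sentences), max_length), PAD, dtype=np.int32)
    lengths = np.zeros(len(sentences), dtype=np.int32)
    for i, sentence in enumerate(sentences):
        tokens = [vocab.get(word, UNK) for word in sentence[:max_length]]
        batch[i, :len(tokens)] = tokens
        lengths[i] = len(tokens)
    return batch, lengths

encode_batch([["what", "is", "maluuba"], ["maluuba", "researches", "nlp"]], max_length=4)
# (array([[2, 3, 4, 0], [4, 1, 1, 0]], dtype=int32), array([3, 3], dtype=int32))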
Make sure that the inputs and outputs are neural network friendly. This may involve a one-hot vector encoding, standardizing values, scaling values, or centering values around zero.
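For instance, a one-hot encoding of a word index (vocabulary size chosen arbitrarily) is simply:
In [ ]:
import numpy as np

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vector = np.zeros(size, dtype=np.float32)
    vector[index] = 1.0
    return vector

one_hot(2, 5)  # array([0., 0., 1., 0., 0.], dtype=float32)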
You may use things like a softmax layer or beam search in the decoding step.
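As a sketch of the decoding step, a softmax turns the model's raw scores over the vocabulary into probabilities, and greedy decoding simply takes the most probable word at each step (beam search would instead keep the top few partial sequences):
In [ ]:
import numpy as np

def softmax(scores):
    """Turn a vector of raw scores (logits) into a probability distribution."""
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / exps.sum()

vocabulary = ["what", "is", "maluuba"]
logits = np.array([2.0, 1.0, 0.1])          # made-up scores for one decoding step
probabilities = softmax(logits)
vocabulary[int(np.argmax(probabilities))]   # greedy decoding picks 'what'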
Use backpropagation with your favourite loss functions on your model, using labelled training data, to learn model parameters.
Use your model to make predictions on unseen data. This will involve encoding your input and decoding the output to a friendly format.
Problem: How do we represent words numerically?
Solution: Word embeddings!
Language is sequential, so we need models that can work with data in this way.
Recurrent neural networks (RNNs) work across time slices of an input while maintaining some state over time.
A recurrent module takes the input $x_t$ at timestep $t$ and combines it with the state $h_{t-1}$ from the previous timestep to compute a new state $h_t$ (and, optionally, an output): $h_t = f(x_t, h_{t-1})$.
A recurrent module can be any type of stateful differentiable neural model.
You can think of an RNN as a very narrow but deep feedforward/vanilla neural network or multilayer perceptron (MLP). We train an RNN by unrolling the computation in time and backpropagating over the sequence (backpropagation through time).
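As a rough sketch (the weight shapes and nonlinearity below are arbitrary choices, not the exact cells we use later), a vanilla RNN step and its unrolling over a sequence look like this:
In [ ]:
import numpy as np

hidden_size, input_size = 8, 4
W_x = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_h = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden (recurrent) weights
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One timestep: combine the current input with the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# "Unroll" the RNN over a sequence of 5 inputs; the same weights are reused at every timestep.
sequence = [np.random.randn(input_size) for _ in range(5)]
h = np.zeros(hidden_size)
for x_t in sequence:
    h = rnn_step(x_t, h)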
RNNs have had these issues in the past:
1) Long sequences => a very deep unrolled network => computationally intensive.
2) RNNs are bad at remembering things over longer time spans.
3) Gradient signals between timesteps shrink exponentially and effectively vanish (vanishing gradients).
4) Gradient signals between timesteps grow exponentially (exploding gradients).
The setup you choose depends on how many inputs and outputs there are and how they are distributed across time.
One-to-one: Take the whole sequence at once and predict one value for the entire sequence.
One-to-many: Take the whole sequence at once and predict multiple subsequent values.
Many-to-many: Take the sequence one word at a time and predict something for each word. This is commonly known as sequence-to-sequence (seq2seq).
Many-to-one: Take the sequence one word at a time, predict something for each word, then combine the per-timestep outputs into a single output. This is often achieved by pooling over the outputs at each timestep, as sketched below.
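For example, pooling the per-timestep outputs of an RNN into a single vector (shapes chosen arbitrarily) could look like:
In [ ]:
import numpy as np

# Per-timestep RNN outputs for one sequence: [timesteps, hidden_size].
outputs = np.random.randn(10, 8)

# Many-to-one: collapse the time dimension with mean (or max) pooling.
mean_pooled = outputs.mean(axis=0)  # shape: (8,)
max_pooled = outputs.max(axis=0)    # shape: (8,)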
Here is our model to generate questions from a document. It uses many of the concepts above to demonstrate how they combine into a single model. This model is a simplification of a few ideas from some of our recent papers on question generation.
Let's start with detecting potential answers in documents. The following is a simple entity recognition model.
It's possible to train a word embedding from scratch, but it's usually better to start with a pre-trained word embedding. We'll use Stanford's GloVe embeddings.
In [ ]:
import tensorflow as tf
from qgen.embedding import glove

# Wrap the pre-trained GloVe matrix (one row per vocabulary word) in a TensorFlow variable.
embedding = tf.get_variable("embedding", initializer=glove)
EMBEDDING_DIMENS = glove.shape[1]
Now let's build the part of the model that finds potential answers in the document. We use a bidirectional recurrent network to predict, for each word in the document, whether it's part of an answer.
In [ ]:
from tensorflow.contrib.rnn import GRUCell

# Document words as vocabulary indices: [batch_size, max_document_length].
document_tokens = tf.placeholder(tf.int32, shape=[None, None], name="document_tokens")
# True (unpadded) length of each document in the batch.
document_lengths = tf.placeholder(tf.int32, shape=[None], name="document_lengths")

# Look up the GloVe vector for each word.
document_emb = tf.nn.embedding_lookup(embedding, document_tokens)

# Run a GRU over the document in both directions and concatenate the two outputs.
forward_cell = GRUCell(EMBEDDING_DIMENS)
backward_cell = GRUCell(EMBEDDING_DIMENS)
answer_outputs, _ = tf.nn.bidirectional_dynamic_rnn(
    forward_cell, backward_cell, document_emb, document_lengths, dtype=tf.float32,
    scope="answer_rnn")
answer_outputs = tf.concat(answer_outputs, 2)

# For each word, predict one of two tags: part of an answer or not.
answer_tags = tf.layers.dense(inputs=answer_outputs, units=2)
When training the model, we'll feed it the expected answers from the training set and compute how far the model's predictions are from them. The optimizer will try to minimize this value, known as the loss.
In [ ]:
import tensorflow.contrib.seq2seq as seq2seq

# Ground-truth tags marking which document words belong to an answer.
answer_labels = tf.placeholder(tf.int32, shape=[None, None], name="answer_labels")
# Mask so that padded positions don't contribute to the loss.
answer_mask = tf.sequence_mask(document_lengths, dtype=tf.float32)
answer_loss = seq2seq.sequence_loss(
    logits=answer_tags, targets=answer_labels, weights=answer_mask, name="answer_loss")
To generate questions from the answers, we'll use a sequence-to-sequence model that transforms an answer into a question. We can feed it the recurrent states from the answer-finding RNN above to give each answer some context from the rest of the document.
First, the encoder:
In [ ]:
# Mask that, when multiplied with answer_outputs, selects each answer's word states
# from the document RNN above.
encoder_input_mask = tf.placeholder(
    tf.float32, shape=[None, None, None], name="encoder_input_mask")
encoder_inputs = tf.matmul(encoder_input_mask, answer_outputs, name="encoder_inputs")
encoder_lengths = tf.placeholder(tf.int32, shape=[None], name="encoder_lengths")

# Encode each answer (with its document context) into a single state vector.
encoder_cell = GRUCell(forward_cell.state_size + backward_cell.state_size)
_, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs, encoder_lengths, dtype=tf.float32, scope="encoder_rnn")
Now, the decoder. It takes the last state of the encoder as input, and begins generating question words one at a time.
In [ ]:
from tensorflow.python.layers.core import Dense

# Question words fed to the decoder during training (teacher forcing).
decoder_inputs = tf.placeholder(tf.int32, shape=[None, None], name="decoder_inputs")
decoder_lengths = tf.placeholder(tf.int32, shape=[None], name="decoder_lengths")
decoder_emb = tf.nn.embedding_lookup(embedding, decoder_inputs)
helper = seq2seq.TrainingHelper(decoder_emb, decoder_lengths)

# Project each decoder output onto the vocabulary to get a score for every word.
projection = Dense(embedding.shape[0], use_bias=False)

decoder_cell = GRUCell(encoder_cell.state_size)
decoder = seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection)
decoder_outputs, _, _ = seq2seq.dynamic_decode(decoder, scope="decoder")
decoder_outputs = decoder_outputs.rnn_output

# Question words the decoder is expected to produce, with padding masked out of the loss.
decoder_labels = tf.placeholder(tf.int32, shape=[None, None], name="decoder_labels")
question_mask = tf.sequence_mask(decoder_lengths, dtype=tf.float32)
question_loss = seq2seq.sequence_loss(
    logits=decoder_outputs, targets=decoder_labels, weights=question_mask,
    name="question_loss")

# Total loss: train the answer tagger and the question generator jointly.
loss = tf.add(answer_loss, question_loss, name="loss")
To train both parts of the model at once, we add both losses (from the questions and answers) together, and tell our optimizer to minimize the sum. Neural network training proceeds by iterating over the data in batches, feeding each one to the network and letting the optimizer tune the weights by some variant of Stochastic Gradient Descent.
In [ ]:
from qgen.data import training_data
optimizer = tf.train.AdamOptimizer().minimize(loss)
saver = tf.train.Saver()
session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())
EPOCHS = 5
for epoch in range(1, EPOCHS + 1):
print("Epoch {0}".format(epoch))
for batch in training_data():
_, loss_value = session.run([optimizer, loss], {
document_tokens: batch["document_tokens"],
document_lengths: batch["document_lengths"],
answer_labels: batch["answer_labels"],
encoder_input_mask: batch["answer_masks"],
encoder_lengths: batch["answer_lengths"],
decoder_inputs: batch["question_input_tokens"],
decoder_labels: batch["question_output_tokens"],
decoder_lengths: batch["question_lengths"],
})
print("Loss: {0}".format(loss_value))
saver.save(session, "model", epoch)
Now that we have a trained model, let's use it to generate some new questions and answers! First, we predict some answers. Notice how we're only feeding the document itself to the model.
In [ ]:
import numpy as np
from qgen.data import test_data, collapse_documents
saver = tf.train.Saver()
session = tf.InteractiveSession()
saver.restore(session, "model-5")
batch = next(test_data())
batch = collapse_documents(batch)
answers = session.run(answer_tags, {
document_tokens: batch["document_tokens"],
document_lengths: batch["document_lengths"],
})
answers = np.argmax(answers, 2)
Now that we have some answers, we can use the sequence-to-sequence model to generate some questions that (hopefully) have the predicted answers. To make new predictions with the decoder, we have to change its setup slightly so that each predicted word is fed back in as the next input instead of coming from the training data.
In [ ]:
import itertools
from qgen.data import expand_answers
from qgen.embedding import look_up_token, UNKNOWN_TOKEN, START_TOKEN, END_TOKEN
batch = expand_answers(batch, answers)
helper = seq2seq.GreedyEmbeddingHelper(embedding, tf.fill([batch["size"]], START_TOKEN), END_TOKEN)
decoder = seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection)
decoder_outputs, _, _ = seq2seq.dynamic_decode(decoder, maximum_iterations=16)
decoder_outputs = decoder_outputs.rnn_output
questions = session.run(decoder_outputs, {
document_tokens: batch["document_tokens"],
document_lengths: batch["document_lengths"],
answer_labels: batch["answer_labels"],
encoder_input_mask: batch["answer_masks"],
encoder_lengths: batch["answer_lengths"],
})
questions[:,:,UNKNOWN_TOKEN] = 0
questions = np.argmax(questions, 2)
for i in range(batch["size"]):
question = itertools.takewhile(lambda t: t != END_TOKEN, questions[i])
print("Question: " + " ".join(look_up_token(token) for token in question))
print("Answer: " + batch["answer_text"][i])
print()