In this assignment, you will use a recurrent neural network to solve the Named Entity Recognition (NER) problem. NER is a common task in natural language processing systems. It serves to extract entities such as persons, organizations, and locations from text. In this assignment you will experiment with recognizing named entities in Twitter messages.
For example, suppose we want to extract persons' and organizations' names from the text. Then for the input text:
Ian Goodfellow works for Google Brain
a NER model needs to provide the following sequence of tags:
B-PER I-PER O O B-ORG I-ORG
where the B- and I- prefixes stand for the beginning and the inside of an entity, while O stands for no entity (out of tag). Markup with this prefix scheme is called BIO markup. It is introduced to distinguish between consecutive entities of the same type: for example, two adjacent person names "John Smith Mary Jones" would be tagged B-PER I-PER B-PER I-PER, where the second B-PER marks the start of a new entity.
Our solution will be based on neural networks, in particular on Bi-Directional Long Short-Term Memory networks (Bi-LSTMs).
For this task you will need the TensorFlow and NumPy libraries.
If you have never worked with TensorFlow, you will probably need to read some tutorials while working on this assignment; e.g., this one could be a good starting point.
In [ ]:
import sys
sys.path.append("..")
from common.download_utils import download_week2_resources
download_week2_resources()
We will work with a corpus which contains tweets with NE tags. Every line of the file contains a pair of a token (word or punctuation symbol) and a tag, separated by whitespace. Different tweets are separated by an empty line.
The function read_data reads a corpus from file_path and returns two lists: one with tokens and one with the corresponding tags. You need to complete this function by adding code that replaces each user's nickname with the <USR> token and each URL with the <URL> token. You can assume that a URL is any string that starts with http:// or https://, and that a nickname is any string that starts with the @ symbol.
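A minimal sketch of this replacement rule (the helper name normalize_token is hypothetical; you may inline the logic into read_data instead):

def normalize_token(token):
    # Hypothetical helper: replace URLs and nicknames as described above.
    if token.startswith('http://') or token.startswith('https://'):
        return '<URL>'
    if token.startswith('@'):
        return '<USR>'
    return token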
In [ ]:
def read_data(file_path):
    tokens = []
    tags = []

    tweet_tokens = []
    tweet_tags = []
    for line in open(file_path, encoding='utf-8'):
        line = line.strip()
        if not line:
            if tweet_tokens:
                tokens.append(tweet_tokens)
                tags.append(tweet_tags)
            tweet_tokens = []
            tweet_tags = []
        else:
            token, tag = line.split()
            # Replace all urls with <URL> token
            # Replace all users with <USR> token
            ######################################
            ######### YOUR CODE HERE #############
            ######################################
            tweet_tokens.append(token)
            tweet_tags.append(tag)

    return tokens, tags
And now we can load three separate parts of the dataset:
In [ ]:
train_tokens, train_tags = read_data('data/train.txt')
validation_tokens, validation_tags = read_data('data/validation.txt')
test_tokens, test_tags = read_data('data/test.txt')
You should always understand what kind of data you deal with. For this purpose, you can print the data by running the following cell:
In [ ]:
for i in range(3):
    for token, tag in zip(train_tokens[i], train_tags[i]):
        print('%s\t%s' % (token, tag))
    print()
To train a neural network, we will use two mappings:
- {token}$\to${token id}: addresses the row in the embeddings matrix for the current token;
- {tag}$\to${tag id}: used for one-hot ground truth vectors when computing the loss at the output of the network.

Now you need to implement the function build_dict which will return a {token or tag}$\to${index} mapping and the reverse {index}$\to${token or tag} list.
In [ ]:
from collections import defaultdict
In [ ]:
def build_dict(tokens_or_tags, special_tokens):
    """
        tokens_or_tags: a list of lists of tokens or tags
        special_tokens: some special tokens
    """
    # Create a dictionary with default value 0
    tok2idx = defaultdict(lambda: 0)
    idx2tok = []

    # Create mappings from tokens to indices and vice versa.
    # Add special tokens to dictionaries.
    # The first special token must have index 0.
    ######################################
    ######### YOUR CODE HERE #############
    ######################################

    return tok2idx, idx2tok
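One possible sketch of the dictionary-building logic, assuming the special tokens occupy the first indices (so the first special token gets index 0):

def build_dict(tokens_or_tags, special_tokens):
    tok2idx = defaultdict(lambda: 0)
    idx2tok = []
    # Special tokens first, so the first special token receives index 0.
    for token in special_tokens:
        tok2idx[token] = len(idx2tok)
        idx2tok.append(token)
    # Then all remaining tokens (or tags) in order of first appearance.
    for sequence in tokens_or_tags:
        for token in sequence:
            if token not in tok2idx:
                tok2idx[token] = len(idx2tok)
                idx2tok.append(token)
    return tok2idx, idx2tok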
After implementing the function build_dict you can make dictionaries for tokens and tags. Special tokens in our case will be:
- the <UNK> token for out-of-vocabulary tokens;
- the <PAD> token for padding sentences to the same length when we create batches of sentences.
In [ ]:
special_tokens = ['<UNK>', '<PAD>']
special_tags = ['O']
# Create dictionaries
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)
The next helper functions will perform the mapping between tokens or tags and their indices for whole sentences.
In [ ]:
def words2idxs(tokens_list):
    return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
    return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
    return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
    return [idx2tag[idx] for idx in idxs]
Neural networks are usually trained with batches, which means that weight updates of the network are based on several sequences at once. The tricky part is that all sequences within a batch need to have the same length, so we will pad them with a special <PAD> token. It is also good practice to provide the RNN with sequence lengths, so it can skip computations for the padded parts. We provide the batching function batches_generator for you, ready to use, to save time.
In [ ]:
def batches_generator(batch_size, tokens, tags,
                      shuffle=True, allow_smaller_last_batch=True):
    """Generates padded batches of tokens and tags."""
    n_samples = len(tokens)
    if shuffle:
        order = np.random.permutation(n_samples)
    else:
        order = np.arange(n_samples)

    n_batches = n_samples // batch_size
    if allow_smaller_last_batch and n_samples % batch_size:
        n_batches += 1

    for k in range(n_batches):
        batch_start = k * batch_size
        batch_end = min((k + 1) * batch_size, n_samples)
        current_batch_size = batch_end - batch_start
        x_list = []
        y_list = []
        max_len_token = 0
        for idx in order[batch_start: batch_end]:
            x_list.append(words2idxs(tokens[idx]))
            y_list.append(tags2idxs(tags[idx]))
            max_len_token = max(max_len_token, len(tags[idx]))

        # Fill in the data into numpy nd-arrays filled with padding indices.
        x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['<PAD>']
        y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
        lengths = np.zeros(current_batch_size, dtype=np.int32)
        for n in range(current_batch_size):
            utt_len = len(x_list[n])
            x[n, :utt_len] = x_list[n]
            lengths[n] = utt_len
            y[n, :utt_len] = y_list[n]
        yield x, y, lengths
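As a quick sanity check (a sketch, assuming numpy has been imported below and the dictionaries above have been built), you can inspect a single batch:

x, y, lengths = next(batches_generator(4, train_tokens, train_tags))
print(x.shape, y.shape, lengths)  # x and y: (4, max_len_in_batch); lengths: true lengths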
This is the most important part of the assignment. Here we will specify the network architecture based on TensorFlow building blocks. It's fun and easy, like building with Lego! We will create an LSTM network which produces a probability distribution over tags for each token in a sentence. To take both the left and right contexts of a token into account, we will use a Bi-Directional LSTM (Bi-LSTM). A dense layer will be used on top to perform tag classification.
In [ ]:
import tensorflow as tf
import numpy as np
In [ ]:
class BiLSTMModel():
    pass
First, we need to create placeholders to specify what data we are going to feed into the network during execution time. For this task we will need the following placeholders (their names match the code below):
- input_batch: sequences of word indices (shape [batch_size, sequence_len]);
- ground_truth_tags: sequences of tag indices (shape [batch_size, sequence_len]);
- lengths: lengths of the unpadded sequences (shape [batch_size]);
- dropout_ph: dropout keep probability, with a default value of 1.0;
- learning_rate_ph: the learning rate, which we will change during training.

Notice that we use None in the shape declarations, which means that data of any size can be fed.

You need to complete the function declare_placeholders. A sketch of the two missing placeholders is given after the cell below.
In [ ]:
def declare_placeholders(self):
    """Specifies placeholders for the model."""

    # Placeholders for input and ground truth output.
    self.input_batch = tf.placeholder(dtype=tf.int32, shape=[None, None], name='input_batch')
    self.ground_truth_tags = ######### YOUR CODE HERE #############

    # Placeholder for lengths of the sequences.
    self.lengths = tf.placeholder(dtype=tf.int32, shape=[None], name='lengths')

    # Placeholder for a dropout keep probability. If we don't feed
    # a value for this placeholder, it will be equal to 1.0.
    self.dropout_ph = tf.placeholder_with_default(1.0, shape=[])

    # Placeholder for a learning rate (float32).
    self.learning_rate_ph = ######### YOUR CODE HERE #############
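A possible sketch of the two missing placeholders (not the only valid answer): ground_truth_tags mirrors the shape of input_batch, and the learning rate is a scalar float32.

# A sketch; the names follow the comments in the cell above.
self.ground_truth_tags = tf.placeholder(dtype=tf.int32, shape=[None, None], name='ground_truth_tags')
self.learning_rate_ph = tf.placeholder(dtype=tf.float32, shape=[], name='learning_rate_ph')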
In [ ]:
BiLSTMModel.__declare_placeholders = classmethod(declare_placeholders)
Now, let us specify the layers of the neural network. First, we need to perform some preparatory steps:
- Create an embeddings matrix with tf.Variable, using the random initial value prepared in the code below.
- Create forward and backward LSTM cells with dropout (tf.nn.rnn_cell.DropoutWrapper).

After that, you can build the computation graph that transforms an input_batch:
- Look up embeddings for the input_batch in the embedding matrix (tf.nn.embedding_lookup).
- Pass the embeddings through a Bidirectional Dynamic RNN (tf.nn.bidirectional_dynamic_rnn) with the specified forward and backward cells, and concatenate the two outputs.
- Apply a dense layer on top to compute the logits for tag classification.

Fill in the code below; a sketch of the blanks is given after the cell. In case you need to debug something, the easiest way is to check that the tensor shapes at each step match the expected ones.
In [ ]:
def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
    """Specifies bi-LSTM architecture and computes logits for inputs."""

    # Create embedding variable (tf.Variable).
    initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
    embedding_matrix_variable = ######### YOUR CODE HERE #############

    # Create RNN cells (for example, tf.nn.rnn_cell.BasicLSTMCell) with n_hidden_rnn number of units
    # and dropout (tf.nn.rnn_cell.DropoutWrapper), initializing all *_keep_prob with dropout placeholder.
    forward_cell = ######### YOUR CODE HERE #############
    backward_cell = ######### YOUR CODE HERE #############

    # Look up embeddings for self.input_batch (tf.nn.embedding_lookup).
    # Shape: [batch_size, sequence_len, embedding_dim].
    embeddings = ######### YOUR CODE HERE #############

    # Pass them through Bidirectional Dynamic RNN (tf.nn.bidirectional_dynamic_rnn).
    # Shape: [batch_size, sequence_len, 2 * n_hidden_rnn].
    (rnn_output_fw, rnn_output_bw), _ = ######### YOUR CODE HERE #############
    rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)

    # Dense layer on top.
    # Shape: [batch_size, sequence_len, n_tags].
    self.logits = tf.layers.dense(rnn_output, n_tags, activation=None)
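A possible sketch of the blanks in build_layers, using the TF 1.x API named in the comments (not the only valid answer):

# Hedged sketch: embedding variable, dropout-wrapped LSTM cells,
# embedding lookup, and the bidirectional dynamic RNN.
embedding_matrix_variable = tf.Variable(initial_embedding_matrix, name='embedding_matrix', dtype=tf.float32)

forward_cell = tf.nn.rnn_cell.DropoutWrapper(
    tf.nn.rnn_cell.BasicLSTMCell(n_hidden_rnn),
    input_keep_prob=self.dropout_ph,
    output_keep_prob=self.dropout_ph,
    state_keep_prob=self.dropout_ph)
backward_cell = tf.nn.rnn_cell.DropoutWrapper(
    tf.nn.rnn_cell.BasicLSTMCell(n_hidden_rnn),
    input_keep_prob=self.dropout_ph,
    output_keep_prob=self.dropout_ph,
    state_keep_prob=self.dropout_ph)

embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch)

(rnn_output_fw, rnn_output_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    forward_cell, backward_cell, embeddings,
    sequence_length=self.lengths, dtype=tf.float32)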
In [ ]:
BiLSTMModel.__build_layers = classmethod(build_layers)
In [ ]:
def compute_predictions(self):
    """Transforms logits to probabilities and finds the most probable tags."""

    # Create softmax (tf.nn.softmax) function
    softmax_output = ######### YOUR CODE HERE #############

    # Use argmax (tf.argmax) to get the most probable tags
    self.predictions = ######### YOUR CODE HERE #############
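A possible sketch for this cell:

softmax_output = tf.nn.softmax(self.logits)
self.predictions = tf.argmax(softmax_output, axis=-1)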
In [ ]:
BiLSTMModel.__compute_predictions = classmethod(compute_predictions)
During training we do not need the predictions of the network, but we do need a loss function. We will use cross-entropy loss, efficiently implemented in TF as cross entropy with logits. Note that it should be applied to the logits of the model (not to the softmax probabilities!). Also note that we do not want to take into account loss terms coming from <PAD> tokens, so we need to mask them out before computing the mean. A sketch is given after the cell below.
In [ ]:
def compute_loss(self, n_tags, PAD_index):
    """Computes masked cross-entropy loss with logits."""

    # Create cross entropy function (tf.nn.softmax_cross_entropy_with_logits).
    ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
    loss_tensor = ######### YOUR CODE HERE #############

    # Create loss function which doesn't operate with <PAD> tokens (tf.reduce_mean).
    # The mask is 1.0 for real tokens and 0.0 for <PAD> positions of the input.
    mask = tf.cast(tf.not_equal(self.input_batch, PAD_index), tf.float32)
    self.loss = ######### YOUR CODE HERE #############
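A possible sketch of the loss computation; here the masked mean is written as a masked sum divided by the number of real (non-<PAD>) tokens:

loss_tensor = tf.nn.softmax_cross_entropy_with_logits(
    labels=ground_truth_tags_one_hot, logits=self.logits)
# Average the loss only over positions that are not <PAD>.
self.loss = tf.reduce_sum(loss_tensor * mask) / tf.reduce_sum(mask)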
In [ ]:
BiLSTMModel.__compute_loss = classmethod(compute_loss)
The last thing to specify is how we want to optimize the loss. We suggest that you use the Adam optimizer with a learning rate taken from the corresponding placeholder. You will also need to apply clipping to mitigate exploding gradients. This can easily be done with the clip_by_norm function; a sketch follows the cell below.
In [ ]:
def perform_optimization(self):
    """Specifies the optimizer and train_op for the model."""

    # Create an optimizer (tf.train.AdamOptimizer).
    self.optimizer = ######### YOUR CODE HERE #############
    self.grads_and_vars = self.optimizer.compute_gradients(self.loss)

    # Gradient clipping (tf.clip_by_norm) for self.grads_and_vars.
    clip_norm = 1.0
    self.grads_and_vars = ######### YOUR CODE HERE #############

    self.train_op = self.optimizer.apply_gradients(self.grads_and_vars)
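A possible sketch for the two blanks above:

# Adam driven by the learning-rate placeholder.
self.optimizer = tf.train.AdamOptimizer(self.learning_rate_ph)
# Clip each gradient to the given norm, keeping the variables unchanged.
self.grads_and_vars = [(tf.clip_by_norm(grad, clip_norm), var)
                       for grad, var in self.grads_and_vars]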
In [ ]:
BiLSTMModel.__perform_optimization = classmethod(perform_optimization)
Congratulations! You have specified all the parts of your network. You may have noticed that we haven't dealt with any real data yet, so what you have written so far is just a recipe for how the network should function. Now we will put all of these pieces into the constructor of our Bi-LSTM class, to use it in the next section.
In [ ]:
def init_model(self, vocabulary_size, n_tags, embedding_dim, n_hidden_rnn, PAD_index):
    self.__declare_placeholders()
    self.__build_layers(vocabulary_size, embedding_dim, n_hidden_rnn, n_tags)
    self.__compute_predictions()
    self.__compute_loss(n_tags, PAD_index)
    self.__perform_optimization()
In [ ]:
BiLSTMModel.__init__ = classmethod(init_model)
Session.run is the point at which the computations in the graph we have defined are actually executed. To train the network, we need to compute self.train_op, which was declared in perform_optimization. To predict tags, we just need to compute self.predictions. In either case, we need to feed actual data through the placeholders that we defined before.
In [ ]:
def train_on_batch(self, session, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability):
    feed_dict = {self.input_batch: x_batch,
                 self.ground_truth_tags: y_batch,
                 self.learning_rate_ph: learning_rate,
                 self.dropout_ph: dropout_keep_probability,
                 self.lengths: lengths}
    session.run(self.train_op, feed_dict=feed_dict)
In [ ]:
BiLSTMModel.train_on_batch = classmethod(train_on_batch)
Implement the function predict_for_batch by initializing feed_dict with the input x_batch and lengths, and running the session for self.predictions. A possible sketch is given after the cell below.
In [ ]:
def predict_for_batch(self, session, x_batch, lengths):
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    return predictions
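One possible sketch of predict_for_batch:

def predict_for_batch(self, session, x_batch, lengths):
    # Feed only the inputs and lengths; dropout_ph falls back to its
    # default value of 1.0, so no dropout is applied at prediction time.
    feed_dict = {self.input_batch: x_batch,
                 self.lengths: lengths}
    predictions = session.run(self.predictions, feed_dict=feed_dict)
    return predictions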
In [ ]:
BiLSTMModel.predict_for_batch = classmethod(predict_for_batch)
We have finished with the necessary methods of our BiLSTMModel and are almost ready to start experimenting.

To simplify the evaluation process, we provide two functions for you:
- predict_tags: gets predictions from the model and transforms indices to tokens and tags;
- eval_conll: computes precision, recall, and F1 for the results using the CoNLL shared task script.
In [ ]:
from evaluation import precision_recall_f1
In [ ]:
def predict_tags(model, session, token_idxs_batch, lengths):
    """Performs predictions and transforms indices to tokens and tags."""
    tag_idxs_batch = model.predict_for_batch(session, token_idxs_batch, lengths)

    tags_batch, tokens_batch = [], []
    for tag_idxs, token_idxs in zip(tag_idxs_batch, token_idxs_batch):
        tags, tokens = [], []
        for tag_idx, token_idx in zip(tag_idxs, token_idxs):
            if token_idx != token2idx['<PAD>']:
                tags.append(idx2tag[tag_idx])
                tokens.append(idx2token[token_idx])
        tags_batch.append(tags)
        tokens_batch.append(tokens)
    return tags_batch, tokens_batch


def eval_conll(model, session, tokens, tags, short_report=True):
    """Computes NER quality measures using CONLL shared task script."""
    y_true, y_pred = [], []
    for x_batch, y_batch, lengths in batches_generator(1, tokens, tags):
        tags_batch, tokens_batch = predict_tags(model, session, x_batch, lengths)
        if len(x_batch[0]) != len(tags_batch[0]):
            raise Exception("Incorrect length of prediction for the input, "
                            "expected length: %i, got: %i" % (len(x_batch[0]), len(tags_batch[0])))
        ground_truth_tags = [idx2tag[tag_idx] for tag_idx in y_batch[0]]

        # We extend every prediction and ground truth sequence with 'O' tag
        # to indicate a possible end of entity.
        y_true.extend(ground_truth_tags + ['O'])
        y_pred.extend(tags_batch[0] + ['O'])

    results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
    return results
Create a BiLSTMModel with the following parameters (they correspond to the arguments of init_model above):
- vocabulary_size: number of tokens in the vocabulary;
- n_tags: number of tags;
- embedding_dim: dimension of the embeddings;
- n_hidden_rnn: size of the hidden layers of the RNN;
- PAD_index: the index of the padding token (<PAD>).

Then set the hyperparameters batch_size, n_epochs, learning_rate, learning_rate_decay, and dropout_keep_probability. Feel free to conduct more experiments to tune the hyperparameters and earn extra points for the assignment. An illustrative sketch follows the cell below.
In [ ]:
tf.reset_default_graph()
model = ######### YOUR CODE HERE #############
batch_size = ######### YOUR CODE HERE #############
n_epochs = ######### YOUR CODE HERE #############
learning_rate = ######### YOUR CODE HERE #############
learning_rate_decay = ######### YOUR CODE HERE #############
dropout_keep_probability = ######### YOUR CODE HERE #############
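For illustration only, a filled-in version of the cell above. The constructor arguments follow init_model; the numeric values are placeholders to start from, not graded recommendations:

# Illustrative values only -- tune them yourself.
model = BiLSTMModel(vocabulary_size=len(token2idx),
                    n_tags=len(tag2idx),
                    embedding_dim=200,
                    n_hidden_rnn=200,
                    PAD_index=token2idx['<PAD>'])
batch_size = 32
n_epochs = 4
learning_rate = 0.005
learning_rate_decay = 1.41
dropout_keep_probability = 0.5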
Finally, we are ready to run the training!
In [ ]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print('Start training... \n')
for epoch in range(n_epochs):
    # For each epoch evaluate the model on train and validation data
    print('-' * 20 + ' Epoch {} '.format(epoch+1) + 'of {} '.format(n_epochs) + '-' * 20)
    print('Train data evaluation:')
    eval_conll(model, sess, train_tokens, train_tags, short_report=True)
    print('Validation data evaluation:')
    eval_conll(model, sess, validation_tokens, validation_tags, short_report=True)

    # Train the model
    for x_batch, y_batch, lengths in batches_generator(batch_size, train_tokens, train_tags):
        model.train_on_batch(sess, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability)

    # Decaying the learning rate
    learning_rate = learning_rate / learning_rate_decay

print('...training finished.')
Now let us see the full quality reports for the final model on the train, validation, and test sets. As a hint that you have implemented everything correctly, you might expect an F-score of about 40% on the validation set.

The output of the cell below (as well as the output of all the other cells) should be present in the notebook for peer-to-peer review!
In [ ]:
print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(model, sess, train_tokens, train_tags, short_report=False)
print('-' * 20 + ' Validation set quality: ' + '-' * 20)
validation_results = ######### YOUR CODE HERE #############
print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = ######### YOUR CODE HERE #############
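Following the pattern of the train-set call above, the remaining two calls could look like this:

validation_results = eval_conll(model, sess, validation_tokens, validation_tags, short_report=False)
test_results = eval_conll(model, sess, test_tokens, test_tags, short_report=False)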
Could we say that our model is state of the art and that the results are acceptable for the task? We definitely can. Nowadays, Bi-LSTMs are among the state-of-the-art approaches for solving the NER problem, and they outperform other classical methods. Despite the fact that we used a small training corpus (in comparison with the usual corpora sizes in deep learning), our results are quite good. In addition, this task has many possible named entity types, and for some of them we have only a few dozen training examples, which is definitely a small amount. Nevertheless, the implemented model outperforms classical CRFs for this task. Even better results could be obtained by combining several types of methods; e.g., see this paper if you are interested.