This notebook introduces how AllenNLP handles one of the key aspects of applying deep learning techniques to textual data: learning distributed representations of words and sentences.
Recently, there has been an explosion of techniques for representing words and sentences in NLP, including pre-trained word vectors, character-level CNN encodings and sub-word token representations (e.g. byte encodings). Even more complex learned representations of higher-level linguistic features, such as POS tags, named entities and dependency paths, have also proven successful for a wide variety of NLP tasks.
In order to handle this breadth of methods for representing words as vectors, AllenNLP introduces three key abstractions:
TokenIndexers, which generate indexed tensors representing sentences in different ways. See the Data Pipeline notebook for more info.
TokenEmbedders, which transform indexed tensors into embedded representations. At its most basic, this is just a standard Embedding layer you'd find in any neural network library. However, they can be more complex - for instance, AllenNLP has a token_characters_encoder which applies a CNN to character level representations.
TextFieldEmbedders, which wrap a set of TokenEmbedders. At its most basic, a TextFieldEmbedder applies the TokenEmbedders it is passed and concatenates their output.

Using this hierarchy, you can easily compose different representations of a sentence in modular ways. For instance, in the BiDAF model, we use this to concatenate a character-level CNN encoding of the words in the sentence with pre-trained word embeddings. You can also specify this entirely from a JSON file, making experimentation with different representations extremely easy.
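Before diving into the real AllenNLP classes, here is a hypothetical toy sketch (plain Python lists instead of tensors, and a crude mean-pool instead of a CNN) of how the hierarchy composes: two "token embedders" each produce one vector per word, and a "text field embedder" concatenates their outputs. None of these names are AllenNLP APIs; they exist only to illustrate the idea.

```python
import random

random.seed(0)

def make_embedding_table(vocab_size, dim):
    """A lookup table mapping a token index to a fixed random vector."""
    return [[random.random() for _ in range(dim)] for _ in range(vocab_size)]

def embed_single_ids(table, indices):
    """Toy TokenEmbedder #1: look each word index up in an embedding table."""
    return [table[i] for i in indices]

def embed_characters(char_table, char_indices):
    """Toy TokenEmbedder #2: embed each character of a word, then pool
    them (here with a simple mean) into one vector per word. The real
    TokenCharactersEncoder applies a CNN instead of a mean."""
    word_vectors = []
    for word in char_indices:
        char_vecs = [char_table[c] for c in word]
        dim = len(char_vecs[0])
        pooled = [sum(v[d] for v in char_vecs) / len(char_vecs) for d in range(dim)]
        word_vectors.append(pooled)
    return word_vectors

def text_field_embed(embedded_outputs):
    """Toy TextFieldEmbedder: concatenate the per-word output of every
    token embedder into one vector per word."""
    return [sum(vectors, []) for vectors in zip(*embedded_outputs)]

word_table = make_embedding_table(vocab_size=10, dim=3)
char_table = make_embedding_table(vocab_size=30, dim=4)
words = [1, 4, 7]                 # one sentence: three word indices
chars = [[2, 5], [9, 9, 1], [3]]  # character indices for the same three words
combined = text_field_embed([
    embed_single_ids(word_table, words),
    embed_characters(char_table, chars),
])
print(len(combined), len(combined[0]))  # 3 words, each 3 + 4 = 7 dims
```

The key point is the last function: because each token embedder produces one vector per word, concatenation is always well-defined, no matter how each representation was computed.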
In [1]:
# This cell just makes sure the library paths are correct.
# You need to run this cell before you run the rest of this
# tutorial, but you can ignore the contents!
import os
import sys
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
In [2]:
from allennlp.data.fields import TextField
from allennlp.data import Instance
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.data.tokenizers import Token
words = ["All", "the", "cool", "kids", "use", "character", "embeddings", "."]
sentence1 = TextField([Token(x) for x in words],
                      token_indexers={"tokens": SingleIdTokenIndexer(namespace="tokens"),
                                      "characters": TokenCharactersIndexer(namespace="token_characters")})
words2 = ["I", "prefer", "word2vec", "though", "..."]
sentence2 = TextField([Token(x) for x in words2],
                      token_indexers={"tokens": SingleIdTokenIndexer(namespace="tokens"),
                                      "characters": TokenCharactersIndexer(namespace="token_characters")})
instance1 = Instance({"sentence": sentence1})
instance2 = Instance({"sentence": sentence2})
Now we need to create a small vocabulary from our sentences. Note that because we have used both a
SingleIdTokenIndexer and a TokenCharactersIndexer, when we call Vocabulary.from_dataset, the created Vocabulary will have two namespaces, corresponding to the namespaces of the token indexers in our TextFields.
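To make the namespace idea concrete, here is a stripped-down, hypothetical sketch of a namespaced vocabulary: each namespace keeps its own independent token-to-index mapping, seeded with padding and out-of-vocabulary entries (mirroring AllenNLP's defaults). This is toy code, not the real Vocabulary class.

```python
from collections import defaultdict

class NamespacedVocabulary:
    def __init__(self):
        # Each namespace starts with padding and OOV tokens, as AllenNLP does.
        self._token_to_index = defaultdict(
            lambda: {"@@PADDING@@": 0, "@@UNKNOWN@@": 1})

    def add_token(self, token, namespace):
        mapping = self._token_to_index[namespace]
        if token not in mapping:
            mapping[token] = len(mapping)
        return mapping[token]

    def get_vocab_size(self, namespace):
        return len(self._token_to_index[namespace])

vocab = NamespacedVocabulary()
for word in ["All", "the", "cool", "kids"]:
    vocab.add_token(word, "tokens")          # whole words go in one namespace
    for char in word:
        vocab.add_token(char, "token_characters")  # characters go in another

print(vocab.get_vocab_size("tokens"))            # 6  (4 words + 2 special tokens)
print(vocab.get_vocab_size("token_characters"))  # 13 (11 unique chars + 2 special)
```

Because the two mappings never share indices, a word id and a character id with the same integer value are unambiguous: the namespace tells each embedder which table to look the index up in.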
In [3]:
from allennlp.data import Vocabulary, Dataset
# Make a Dataset from the two instances we created above,
# then build a Vocabulary from it.
dataset = Dataset([instance1, instance2])
vocab = Vocabulary.from_dataset(dataset)
print("This is the token vocabulary we created: \n")
print(vocab.get_index_to_token_vocabulary("tokens"))
print("This is the character vocabulary we created: \n")
print(vocab.get_index_to_token_vocabulary("token_characters"))
dataset.index_instances(vocab)
In [4]:
from allennlp.modules.token_embedders import Embedding, TokenCharactersEncoder
from allennlp.modules.seq2vec_encoders import CnnEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
# We're going to embed both the words and the characters, so we create
# embeddings with respect to the vocabulary size of each of the relevant namespaces
# in the vocabulary.
word_embedding = Embedding(num_embeddings=vocab.get_vocab_size("tokens"), embedding_dim=10)
char_embedding = Embedding(num_embeddings=vocab.get_vocab_size("token_characters"), embedding_dim=5)
character_cnn = CnnEncoder(embedding_dim=5, num_filters=2, output_dim=8)
# The TokenCharactersEncoder will embed an integer character tensor of shape
# (batch_size, max_sentence_length, max_word_length) into a 4D tensor with an
# additional embedding dimension, representing the vector for each character,
# and then apply the character_cnn defined above across the character dimension
# of each word, resulting in a tensor of shape:
# (batch_size, max_sentence_length, output_dim).
token_character_encoder = TokenCharactersEncoder(embedding=char_embedding, encoder=character_cnn)
# Notice that this dictionary has the same keys as the token_indexers
# dictionary we used when creating our TextFields.
# This is how the text_field_embedder knows which function to apply to which array.
# There should be a 1-1 mapping between TokenIndexers and TokenEmbedders in your model.
text_field_embedder = BasicTextFieldEmbedder({"tokens": word_embedding, "characters": token_character_encoder})
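As a sanity check on the dimensions we just chose (10-dim word embeddings, 5-dim character embeddings, a character CNN with output_dim=8), here is a small shape walkthrough. Plain tuples stand in for tensor shapes; the batch and length values are illustrative placeholders, not values from the dataset above.

```python
batch_size, max_sentence_length, max_word_length = 2, 8, 10

# The character indexer produces an integer tensor of this shape:
char_indices = (batch_size, max_sentence_length, max_word_length)

# The character Embedding adds a trailing embedding dimension of 5:
embedded_chars = char_indices + (5,)

# The CnnEncoder pools over the character dimension, leaving one
# 8-dim vector per word:
char_encoded = (batch_size, max_sentence_length, 8)

# The word Embedding maps (batch, sentence) word indices to 10-dim vectors:
word_embedded = (batch_size, max_sentence_length, 10)

# BasicTextFieldEmbedder concatenates along the last dimension:
combined = (batch_size, max_sentence_length, 10 + 8)
print(combined)  # (2, 8, 18)
```

The final embedding size of 18 is exactly what we should see printed after running the text_field_embedder below.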
Now that we've created all the parts we need for concatenated word and character CNN embeddings, let's apply our text_field_embedder and see what happens.
In [5]:
# Convert the indexed dataset into PyTorch Variables.
tensors = dataset.as_tensor_dict(dataset.get_padding_lengths())
print("Torch tensors for passing to a model: \n\n", tensors)
print("\n\n")
# tensors is a nested dictionary, first keyed by the
# name we gave our instances (in most cases you'd have more
# than one field in an instance) and then by the key of each
# token indexer we passed to TextField.
# This will contain two tensors: one representing each
# word as an index and one representing each _character_
# in each word as an index.
text_field_variables = tensors["sentence"]
# This will have shape: (batch_size, sentence_length, word_embedding_dim + character_cnn_output_dim)
embedded_text = text_field_embedder(text_field_variables)
dimensions = list(embedded_text.size())
print("Post embedding with our TextFieldEmbedder: ")
print("Batch Size: ", dimensions[0])
print("Sentence Length: ", dimensions[1])
print("Embedding Size: ", dimensions[2])
Here, we manually created the different TokenEmbedders we wanted to use in our TextFieldEmbedder. However, all of these modules can be built using their from_params methods, so you can have a TextFieldEmbedder in your model which is fixed (it encodes some sentence which is an input to your model), while varying the TokenIndexers and TokenEmbedders it uses just by changing a JSON file.
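For illustration, a JSON configuration equivalent to the embedders built above might look roughly like the sketch below. The exact keys and registered type names ("embedding", "character_encoding", "cnn") follow the pre-1.0 AllenNLP configuration style and may differ across versions, so treat this as an assumed shape rather than a copy-paste-ready config.

```json
{
  "text_field_embedder": {
    "tokens": {
      "type": "embedding",
      "embedding_dim": 10
    },
    "token_characters": {
      "type": "character_encoding",
      "embedding": { "embedding_dim": 5 },
      "encoder": {
        "type": "cnn",
        "embedding_dim": 5,
        "num_filters": 2,
        "output_dim": 8
      }
    }
  }
}
```

Note how the top-level keys ("tokens", "token_characters") mirror the keys of the token_indexers dictionary: swapping a representation in or out is just a matter of editing this block, with no model code changes.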