Sparse and dense representations for text data

Before we can start training, we need to prepare our input data in a form that our model can understand.


In [ ]:
import tensorflow as tf
import numpy as np
import pandas as pd
%matplotlib inline

Since we're dealing with text, we need to turn the characters into numbers in order to perform our calculations on them. We do this in two steps: first we get the sparse (one-hot encoded) representation of each character and then we learn a dense representation (so-called embeddings) as part of our model training.

Sparse representation: one-hot encoding

Our sparse representation will consist of sparse vectors of dimension n_chars, which in our case is 129 (128 ASCII chars + 1 end-of-sequence char). The feature vector for a single character will thus be of the form:

$\qquad x(\text{char})\ =\ (0, 0, 1, 0, \dots, 0)$

Or equivalently in components,

$\qquad x_i(\text{char})\ =\ \left\{\begin{matrix}1&\text{if } i = h(\text{char})\\0&\text{otherwise}\end{matrix}\right.$

where $h$ is a function that maps a character to an integer (e.g. a hash function). In our case, we use the built-in function ord:

In [1]: ord('H')
Out[1]: 72
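
For concreteness, here is what the explicit vector for 'H' would look like (a small numpy sketch; as we'll see next, we never actually need to build it):

n_chars = 129
x = np.zeros(n_chars)
x[ord('H')] = 1  # switch on component 72, i.e. ord('H')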

As it turns out, we don't actually need to construct the vector $x(\text{char})$ as displayed above. If you think about it, the only information that we need about $x$ is which component is switched on. In other words, the only information we need is $h(\text{char})$, in our case ord(char). So the most efficient representation of our sparse feature vectors turns out to be incredibly simple: a single integer per character. For instance, the sparse representation of the phrase "Hello, world!" is simply:

In [1]: x = [ord(char) for char in "Hello, world!"]
In [2]: x
Out[2]: [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]

Actually, we also need to append an end-of-sequence (EOS) character to tell our model when to stop generating text. Let's set index 0 aside for the EOS character (shifting all other characters up by one); we then encode our phrase as follows:

In [1]: x = [ord(char) + 1 for char in "Hello, world!"] + [0]
In [2]: x
Out[2]: [73, 102, 109, 109, 112, 45, 33, 120, 112, 115, 109, 101, 34, 0]
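
Note that this encoding is easy to invert: drop the EOS index 0 and subtract 1 before mapping back through chr. As a quick sanity check:

In [3]: ''.join(chr(i - 1) for i in x if i != 0)
Out[3]: 'Hello, world!'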

Going from a list of indices to one-hot encoded vectors in TensorFlow is super easy using tf.one_hot:

n_chars = 129
x_indices = tf.constant([73, 102, 109, 109, 112])
x_one_hot = tf.one_hot(x_indices, n_chars)  # shape = (5, 129)
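
If you want to inspect the result, you can evaluate the tensor in a session (a quick check in the TensorFlow 1.x graph mode used throughout this notebook):

with tf.Session() as s:
    print(s.run(x_one_hot).shape)  # (5, 129)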

Dense representation: embeddings

If we only had a few distinct input characters, we could use the one-hot encoded representation directly as our input. In reality, though, text consists of a large number of distinct characters (in our case 129), which makes the sparse representation infeasible, or at best highly inefficient, to use directly.

Moreover, the sparse representation has no notion of proximity between characters such as 'a' and 'A', or, more subtly, 'i' and 'y'.

A trick that we often use is to translate the high-dimensional sparse feature vectors into low-dimensional dense vectors. These dense vectors are called embeddings. Because the embeddings are low-dimensional, our model needs to learn far fewer weights. Of course, the model does need to learn the embeddings themselves, but this is a trade-off that tends to pay off. One of the interesting properties of embeddings is that the embeddings for 'a' and 'A' typically end up very similar, which means that the rest of our network can focus on learning more abstract relations between characters.

Another point of view is that learning embeddings is kind of like having an automated pre-processing step built into the model. Doing the pre-processing in such an end-to-end setting lets it be optimized for the task that we're actually interested in.

An embedding matrix in TensorFlow must have the shape (n_chars, emb_dim), where n_chars is the number of characters (or tokens) and emb_dim is the dimensionality of the dense embedding vector space. We typically initialize the embedding matrix randomly, e.g.

n_chars = 129
emb_dim = 10
emb = tf.Variable(tf.random_uniform([n_chars, emb_dim]))

Then, in order to get the relevant embeddings we could use the one-hot encoded (sparse) representation x_one_hot (see above) as a mask:

x_dense = tf.matmul(x_one_hot, emb)

There's a more efficient way of doing this, though. For this we use TensorFlow's embedding lookup function:

x_dense = tf.nn.embedding_lookup(emb, x_indices)

The reason why this is more efficient is that we avoid constructing x_one_hot explicitly (x_indices is enough).
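
As a quick sanity check (a sketch, assuming the x_indices, x_one_hot and emb tensors defined above live in the same graph), both routes produce the same dense vectors:

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    dense_via_matmul = s.run(tf.matmul(x_one_hot, emb))
    dense_via_lookup = s.run(tf.nn.embedding_lookup(emb, x_indices))
    np.testing.assert_allclose(dense_via_matmul, dense_via_lookup)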

In the training process, our model will learn an appropriate embedding matrix emb alongside the rest of the model parameters.
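
For example, because emb was created as a tf.Variable, it is registered as a trainable variable, so any optimizer we attach later will update it together with the rest of the weights:

# the embedding matrix appears among the trainable variables
print([v.name for v in tf.trainable_variables()])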

(Figure: a visual representation of the character embeddings and of the mini-batched dense input tensor.)

We have supplied a simple encoder in the utils module, which implements the procedure explained above (plus some more):


In [4]:
from utils import SentenceEncoder

sents = ["Hello, world!", "Hi again!", "Bye bye now."]
encoder = SentenceEncoder(sents, batch_size=2)


for batch in encoder:
    seq = batch[0]
    print(encoder.decode(seq))
    print(seq)
    print()


['Bye bye now.', 'Hi again!']
[[ 67 122 102  33  99 122 102  33 111 112 120  47   0]
 [ 73 106  33  98 104  98 106 111  34   0   0   0   0]]

['Hello, world!', 'Bye bye now.']
[[ 73 102 109 109 112  45  33 120 112 115 109 101  34   0]
 [ 67 122 102  33  99 122 102  33 111 112 120  47   0   0]]

Exercise

In this exercise we're going to use the functions that we just learned about to translate text into numeric input tensors.

A) A simple character encoder.

Using the examples above, write a simple encoder that takes the sentences

sents = ['Hello, world!', 'Bye bye.']

and returns both sentences encoded as index sequences, padded with the EOS index 0 to equal length (see the expected output below).


In [ ]:
# input sentences
sents = ['Hello, world!', 'Bye bye.']

# this is the expected output
out = [[ 73, 102, 109, 109, 112,  45,  33, 120, 112, 115, 109, 101,  34,   0],
       [ 67, 122, 102,  33,  99, 122, 102,  47,   0,   0,   0,   0,   0,   0]]


def encode(sents):
    '<your code here>'


print(encode(sents))
np.testing.assert_array_equal(out, encode(sents))

In [ ]:
# %load sol/ex_char_encoder.py

B) Get sparse representation.

Create a one-hot encoded (sparse) representation of the sentences that we encoded above.


In [ ]:
# clear any previous computation graphs
tf.reset_default_graph()

# dimensions
n_chars = '<your code here>'
batch_size = '<your code here>'
max_seqlen = '<your code here>'

# input placeholder
sents_enc = '<your code here>'

# sparse representation
x_one_hot = '<your code here>'

# input
sents = ['Hello, world!', 'Bye bye.']


with tf.Session() as s:
    '<your code here>'

In [ ]:
# %load sol/ex_one_hot.py

C) Get dense representation.

Same as the previous exercise, except now use an embedding matrix to create a dense representation of the sentences.


In [ ]:
# clear any previous computation graphs
tf.reset_default_graph()

# dimensions
n_chars = '<your code here>'
batch_size = '<your code here>'
emb_dim = '<your code here>'
max_seqlen = '<your code here>'

# input placeholder
sents_enc = '<your code here>'

# character embeddings
emb = '<your code here>'

# dense representation
x_dense = '<your code here>'

# input
sents = ['Hello, world!', 'Bye bye.']


with tf.Session() as s:
    '<your code here>'

In [ ]:
# %load sol/ex_embedding_lookup.py