Character-level Language Modeling with LSTMs

This notebook is adapted from Keras' lstm_text_generation.py.

Steps:

  • Download a small text corpus and preprocess it.
  • Extract a character vocabulary and use it to vectorize the text.
  • Train an LSTM-based character-level language model.
  • Use the trained model to sample random text with varying entropy levels.
  • Implement a beam-search deterministic decoder.

Note: fitting language models is very computationally intensive. It is recommended to run this notebook on a server with a GPU, or with powerful CPUs that you can leave running for several hours at a time.


In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Loading some text data

Let's use some publicly available philosophy:


In [ ]:
from tensorflow.keras.utils import get_file

URL = "https://s3.amazonaws.com/text-datasets/nietzsche.txt"

corpus_path = get_file('nietzsche.txt', origin=URL)
text = open(corpus_path).read().lower()
print('Corpus length: %d characters' % len(text))

In [ ]:
print(text[:600], "...")

In [ ]:
text = text.replace("\n", " ")
split = int(0.9 * len(text))
train_text = text[:split]
test_text = text[split:]

Building a vocabulary of all possible symbols

To simplify things, we build a vocabulary by extracting the list of all possible characters from the full dataset (train and test).

In a more realistic setting we would need to take into account that the test data may contain symbols never seen in the training set. This issue is less of a problem when working at the character level, though.
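
For reference, here is a minimal sketch of how one could handle out-of-vocabulary symbols by reserving a special "unknown" index. The names below are purely illustrative and are not reused in the rest of this notebook:


In [ ]:
# Illustrative only: build the vocabulary on the training text alone and
# map any character unseen at train time to a reserved "unknown" index (0).
unk_index = 0
train_char_indices = {c: i + 1 for i, c in enumerate(sorted(set(train_text)))}
encoded_test = [train_char_indices.get(c, unk_index) for c in test_text]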

Let's build the list of all possible characters and sort it to assign a unique integer to each possible symbol in the corpus:


In [ ]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

char_indices is a mapping from characters to integer identifiers:


In [ ]:
len(char_indices)

In [ ]:
sorted(char_indices.items())[:15]

indices_char holds the reverse mapping:


In [ ]:
len(indices_char)

In [ ]:
indices_char[52]

While not strictly required to build a language model, it's a good idea to have a look at the distribution of relative frequencies of each symbol in the corpus:


In [ ]:
from collections import Counter

counter = Counter(text)
chars, counts = zip(*counter.most_common())
indices = np.arange(len(counts))

plt.figure(figsize=(14, 3))
plt.bar(indices, counts, 0.8)
plt.xticks(indices, chars);

Let's cut the dataset into fake sentences at random, with some overlap. Instead of cutting at random we could use an English-specific sentence tokenizer: this is explained at the end of this notebook. In the meantime, random substrings will be good enough to train a first language model.


In [ ]:
max_length = 40
step = 3


def make_sequences(text, max_length=max_length, step=step):
    sequences = []
    next_chars = []
    for i in range(0, len(text) - max_length, step):
        sequences.append(text[i: i + max_length])
        next_chars.append(text[i + max_length])
    return sequences, next_chars    


sequences, next_chars = make_sequences(train_text)
sequences_test, next_chars_test = make_sequences(test_text, step=10)

print('nb train sequences:', len(sequences))
print('nb test sequences:', len(sequences_test))

Let's shuffle the sequences to break some of the dependencies:


In [ ]:
from sklearn.utils import shuffle

sequences, next_chars = shuffle(sequences, next_chars,
                                random_state=42)

In [ ]:
sequences[0]

In [ ]:
next_chars[0]

Converting the training data to one-hot vectors

Unfortunately the LSTM layer in Keras does not accept integer character indices directly: an Embedding layer would be needed to map them to dense vectors first. To keep things simple, let's use one-hot encoding instead. This is slightly less space- and time-efficient than integer coding but should be good enough when using a small character-level vocabulary.
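
For reference, the integer-coded alternative could look like the following sketch (an Embedding layer fed with integer-encoded characters and a sparse loss). This model is not used in the rest of this notebook and its hyperparameters are arbitrary assumptions:


In [ ]:
# Illustrative alternative only: integer-encoded inputs of shape
# (batch, max_length) go through an Embedding layer instead of one-hot
# vectors; targets would then be integer class ids.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

embedding_model = Sequential()
embedding_model.add(Embedding(input_dim=len(chars), output_dim=16))
embedding_model.add(LSTM(128))
embedding_model.add(Dense(len(chars), activation='softmax'))
embedding_model.compile(optimizer='rmsprop',
                        loss='sparse_categorical_crossentropy')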

Exercise:

One-hot encode the training data: sequences as X and next_chars as y:


In [ ]:
n_sequences = len(sequences)
n_sequences_test = len(sequences_test)
voc_size = len(chars)

X = np.zeros((n_sequences, max_length, voc_size),
             dtype=np.float32)
y = np.zeros((n_sequences, voc_size), dtype=np.float32)

X_test = np.zeros((n_sequences_test, max_length, voc_size),
                  dtype=np.float32)
y_test = np.zeros((n_sequences_test, voc_size), dtype=np.float32)

# TODO

In [ ]:
# %load solutions/language_model_one_hot_data.py
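
If you want to compare with your own attempt, here is one possible way to fill the arrays (the provided solution file may differ):


In [ ]:
# One possible encoding loop: set a 1 at the index of each character of
# each sequence, and at the index of the next character for the target.
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

for i, sequence in enumerate(sequences_test):
    for t, char in enumerate(sequence):
        X_test[i, t, char_indices[char]] = 1
    y_test[i, char_indices[next_chars_test[i]]] = 1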

In [ ]:
X.shape

In [ ]:
y.shape

In [ ]:
X[0]

In [ ]:
y[0]

Measuring per-character perplexity

The NLP community measures the quality of probabilistic models using perplexity.

In practice, perplexity is just the base-2 exponentiation of the average negative log2 likelihood:

$$perplexity_\theta = 2^{-\frac{1}{n} \sum_{i=1}^{n} log_2 (p_\theta(x_i))}$$

Note: here we define the per-character perplexity (because our model naturally makes per-character predictions). It is more common to report per-word perplexity, but it is not as easy to compute: we would need to tokenize the strings into sequences of words and discard the whitespace and punctuation character predictions. In practice the whitespace character is by far the most frequent character, which makes our naive per-character perplexity lower than it would be if we ignored it.

Exercise: implement a Python function that computes the per-character perplexity from the model-predicted probabilities y_pred and the one-hot encoded ground truth y_true:


In [ ]:
def perplexity(y_true, y_pred):
    """Compute the per-character perplexity of model predictions.
    
    y_true is one-hot encoded ground truth.
    y_pred is predicted likelihoods for each class.
    
    2 ** -mean(log2(p))
    """
    # TODO
    return 1.

In [ ]:
# %load solutions/language_model_perplexity.py
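
For comparison, a minimal sketch of one possible implementation (the provided solution file may differ):


In [ ]:
# One possible implementation: select the probability assigned to the
# true class for each sample, clip to avoid log2(0), then exponentiate
# the mean negative log2 likelihood.
def perplexity_sketch(y_true, y_pred):
    probabilities = np.sum(np.asarray(y_true) * np.asarray(y_pred), axis=-1)
    probabilities = np.clip(probabilities, 1e-12, 1.)
    return 2. ** (-np.mean(np.log2(probabilities)))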

In [ ]:
y_true = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

y_pred = np.array([
    [0.1, 0.9, 0.0],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

perplexity(y_true, y_pred)

A perfect model has a minimal perplexity of 1.0 (average negative log likelihood of 0.0):


In [ ]:
perplexity(y_true, y_true)

Building a recurrent model

Let's build a first model and train it on a very small subset of the data to check that it works as expected:


In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import RMSprop


model = Sequential()
model.add(LSTM(128, input_shape=(max_length, voc_size)))
model.add(Dense(voc_size, activation='softmax'))

optimizer = RMSprop(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='categorical_crossentropy')

Let's measure the perplexity of the randomly initialized model:


In [ ]:
def model_perplexity(model, X, y):
    predictions = model(X)
    return perplexity(y, predictions)

In [ ]:
model_perplexity(model, X_test, y_test)
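
A randomly initialized softmax output is close to uniform, so this value should not be far from the vocabulary size. As a rough sanity check (using the perplexity function defined above), a perfectly uniform model over voc_size symbols has a perplexity of exactly voc_size:


In [ ]:
# Sanity check: a uniform distribution over voc_size symbols has
# perplexity equal to voc_size.
uniform_predictions = np.full((len(y_test), voc_size), 1. / voc_size)
perplexity(y_test, uniform_predictions)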

Let's train the model for one epoch on a very small subset of the training set to check that it's well defined:


In [ ]:
small_train = slice(0, None, 40)

model.fit(X[small_train], y[small_train], validation_split=0.1,
          batch_size=128, epochs=1)

In [ ]:
model_perplexity(model, X[small_train], y[small_train])

In [ ]:
model_perplexity(model, X_test, y_test)

Sampling random text from the model

Recursively generate one character at a time by sampling from the distribution parameterized by the model:

$$ p_{\theta}(c_n | c_{n-1}, c_{n-2}, \ldots, c_0) \cdot p_{\theta}(c_{n-1} | c_{n-2}, \ldots, c_0) \cdot \ldots \cdot p_{\theta}(c_{0}) $$

This way of parametrizing the joint probability of a set of sequentially structured random variables is called auto-regressive modeling.


In [ ]:
def sample_one(preds, temperature=1.0):
    """Sample the next character according to the network output.
    
    Use a lower temperature to force the model to output more
    confident predictions: more peaky distribution.
    """
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw a single sample (size=1) from a multinoulli distribution
    # parameterized by the output of the softmax layer of our
    # network. A multinoulli distribution is a multinomial
    # distribution with a single trial with n_classes outcomes.
    probs = np.random.multinomial(1, preds, size=1)
    return np.argmax(probs)


def generate_text(model, seed_string, length=300, temperature=1.0):
    """Recursively sample a sequence of chars, one char at a time.
    
    Each prediction is concatenated to the past string of predicted
    chars so as to condition the next prediction.

    Feed seed string as a sequence of characters to condition the
    first predictions recursively. If seed_string is shorter than
    max_length, pad the input with zeros at the beginning of the
    conditioning string.
    """
    generated = seed_string
    # Only the last max_length characters can condition the prediction:
    prefix = seed_string[-max_length:]

    for i in range(length):
        # Vectorize prefix string to feed as input to the model:
        x = np.zeros((1, max_length, voc_size), dtype="float32")
        shift = max_length - len(prefix)
        for t, char in enumerate(prefix):
            x[0, t + shift, char_indices[char]] = 1.

        preds = model(x)[0]
        next_index = sample_one(preds, temperature)
        next_char = indices_char[next_index]

        generated += next_char
        prefix = prefix[1:] + next_char
    return generated

The temperature parameter makes it possible to increase or decrease the entropy of the multinoulli distribution parameterized by the output of the model.

Temperatures lower than 1 yield very regular text (biased towards the most frequent patterns of the training set). Temperatures higher than 1 make the model "more creative" but also noisier (with a large fraction of meaningless words). A temperature of 1 is neutral (the noise of the generated text stems only from the imperfection of the model).
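
To make this concrete, here is a small numerical illustration (with made-up probabilities) of the temperature rescaling used in sample_one:


In [ ]:
# Made-up distribution: lower temperatures concentrate the probability
# mass on the most likely symbol, higher temperatures flatten it.
probas = np.array([0.5, 0.3, 0.2])
for temperature in [0.1, 0.5, 1.0, 2.0]:
    rescaled = np.exp(np.log(probas) / temperature)
    rescaled /= rescaled.sum()
    print("temperature=%.1f:" % temperature, rescaled.round(3))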


In [ ]:
generate_text(model, 'philosophers are ', temperature=0.1)

In [ ]:
generate_text(model, 'atheism is the root of ', temperature=0.8)

Training the model

Let's train the model and monitor the perplexity after each epoch and sample some text to qualitatively evaluate the model:


In [ ]:
nb_epoch = 30
seed_strings = [
    'philosophers are ',
    'atheism is the root of ',
]

for epoch in range(nb_epoch):
    print("# Epoch %d/%d" % (epoch + 1, nb_epoch))
    print("Training on one epoch takes ~90s on a K80 GPU")
    model.fit(X, y, validation_split=0.1, batch_size=128, epochs=1,
              verbose=2)
    print("Computing perplexity on the test set:")
    test_perplexity = model_perplexity(model, X_test, y_test)
    print("Perplexity: %0.3f\n" % test_perplexity)
    
    for temperature in [0.1, 0.5, 1]:
        print("Sampling text from model at %0.2f:\n" % temperature)
        for seed_string in seed_strings:
            print(generate_text(model, seed_string, temperature=temperature))
            print()

Beam search for deterministic decoding

It is possible to improve the generation using a beam search, which will be presented in the following lab.
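
A proper implementation is left to that lab; as a rough illustration only, a minimal (and inefficient) beam search sketch, reusing the vectorization code from generate_text, could look like this:


In [ ]:
# Rough illustration only: keep the beam_width most likely partial
# sequences at each step, extend each with its most likely next
# characters, and keep the overall best candidates.
def beam_search_sketch(model, seed_string, length=50, beam_width=5):
    beams = [(seed_string, 0.)]  # (generated text, cumulative log-likelihood)
    for _ in range(length):
        candidates = []
        for generated, score in beams:
            prefix = generated[-max_length:]
            x = np.zeros((1, max_length, voc_size), dtype="float32")
            shift = max_length - len(prefix)
            for t, char in enumerate(prefix):
                x[0, t + shift, char_indices[char]] = 1.
            preds = np.asarray(model(x)[0]).astype("float64")
            for next_index in np.argsort(preds)[-beam_width:]:
                candidates.append((generated + indices_char[next_index],
                                   score + np.log(preds[next_index] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1])[-beam_width:]
    return beams[-1][0]


beam_search_sketch(model, 'philosophers are ')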

Better handling of sentence boundaries

To simplify things we used the lower-case version of the text and we ignored any sentence boundaries. This prevents our model from learning when to stop generating characters. If we want to train a model that can start generating text at the beginning of a sentence and stop at the end of a sentence, we need to provide it with sentence boundary markers in the training set and use those special markers when sampling.

The following gives an example of how to use NLTK to detect sentence boundaries in English text.

This could be used to insert an explicit "end_of_sentence" (EOS) symbol to mark separation between two consecutive sentences. This should make it possible to train a language model that explicitly generates complete sentences from start to end.

Use the following command (in a terminal) to install nltk before importing it in the notebook:

$ pip install nltk

In [ ]:
with open(corpus_path, 'rb') as f:
    text_with_case = f.read().decode('utf-8').replace("\n", " ")

In [ ]:
import nltk

nltk.download('punkt')
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text_with_case)
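
Following the idea above, one possible way to re-join the detected sentences with an explicit boundary marker. The "#" symbol is an arbitrary assumption here; check that it does not already occur in the corpus before using it:


In [ ]:
# Mark sentence boundaries by joining the detected sentences with an
# explicit EOS symbol ("#" is assumed to be absent from the corpus).
eos = "#"
print("EOS symbol already in corpus:", eos in text_with_case)
text_with_eos = eos.join(s.lower() for s in sentences)
print(text_with_eos[:300])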

In [ ]:
plt.hist([len(s.split()) for s in sentences], bins=30);
plt.title('Distribution of sentence lengths')
plt.xlabel('Approximate number of words');

Some of the sentences detected by NLTK are too short to be considered real sentences. Let's have a look at the shortest sentences with at least 20 characters:


In [ ]:
sorted_sentences = sorted([s for s in sentences if len(s) > 20], key=len)
for s in sorted_sentences[:5]:
    print(s)

Some long sentences:


In [ ]:
for s in sorted_sentences[-3:]:
    print(s)

The NLTK sentence tokenizer seems to do a reasonable job despite the weird casing and '--' signs scattered around the text.

Note that here we use the original case information because it can help the NLTK sentence boundary detection model make better split decisions. Our text corpus is probably too small to train a good sentence-aware language model though, especially with full case information. Larger corpora, such as a large collection of public domain books or Wikipedia dumps, would help. The NLTK toolkit also comes with corpus loading utilities.

The following loads a selection of famous books from the Gutenberg project archive:


In [ ]:
import nltk

nltk.download('gutenberg')
book_selection_text = nltk.corpus.gutenberg.raw().replace("\n", " ")

In [ ]:
print(book_selection_text[:300])

In [ ]:
print("Book corpus length: %d characters" % len(book_selection_text))

Let's do an arbitrary split. Note that the training set will mostly consist of text that is not authored by the author(s) of the validation set:


In [ ]:
split = int(0.9 * len(book_selection_text))
book_selection_train = book_selection_text[:split]
book_selection_validation = book_selection_text[split:]

Bonus exercises

  • Adapt the previous language model to explicitly handle sentence boundaries with a special EOS character.
  • Train a new model on random sentences sampled from the book selection corpus with full case information.
  • Adapt the random sampling code to start sampling at the beginning of a sentence and stop when the sentence ends.
  • Train a deep GRU (e.g. two GRU layers instead of a single LSTM) to see if you can improve the validation perplexity (see the sketch after this list for a possible starting point).
  • Git clone the source code of the Linux kernel and train a C programming language model on it. Instead of sentence boundary markers, we could use source file boundary markers for this exercise. Compare your results with Andrej Karpathy's https://karpathy.github.io/2015/05/21/rnn-effectiveness/.
  • Try to increase the vocabulary size to 256 using a Byte Pair Encoding strategy.
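
As a starting point for the deep GRU exercise, here is a minimal architecture sketch; the layer sizes and learning rate are arbitrary assumptions:


In [ ]:
# Minimal sketch for the deep GRU exercise: two stacked GRU layers
# instead of a single LSTM. Hyperparameters are arbitrary.
from tensorflow.keras.layers import GRU

gru_model = Sequential()
gru_model.add(GRU(128, return_sequences=True,
                  input_shape=(max_length, voc_size)))
gru_model.add(GRU(128))
gru_model.add(Dense(voc_size, activation='softmax'))
gru_model.compile(optimizer=RMSprop(learning_rate=0.01),
                  loss='categorical_crossentropy')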

In [ ]:

Why build a language model?

Building a language model is not very useful by itself. However, language models have recently been shown to be useful for transfer learning to build contextualized word embeddings as a better alternative to word2vec or GloVe.

Using language-model based word representations makes it possible to reach the state-of-the-art at many natural language understanding problems.

The workflow is the following:

  • train a (bi-directional) deep language model on a very large, unlabeled corpus (e.g. 1 billion words or more);
  • plug the resulting language model as the input layer (and sometimes also the output layer) of a task specific architecture, for instance: text classification, semantic role labeling for knowledge extraction, logical entailment, question answering and reading comprehension;
  • train the task specific parameters of the new architecture on the smaller task-labeled corpus;
  • optionally fine-tune the full architecture on the task-labeled corpus if it's big enough not to overfit.

More information on this approach:


In [ ]: