This notebook is adapted from Keras' lstm_text_generation.py.
Steps:
- download and clean a plain text corpus (Nietzsche's writings);
- build a character-level vocabulary and extract fixed-length input sequences with their next-character targets;
- train an LSTM language model and monitor its per-character perplexity;
- sample new text from the model one character at a time, controlling randomness with a temperature parameter.
Note: fitting language models is very computationally intensive. It is recommended to do this notebook on a server with a GPU or powerful CPUs that you can leave running for several hours at once.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [ ]:
from tensorflow.keras.utils import get_file
URL = "https://s3.amazonaws.com/text-datasets/nietzsche.txt"
corpus_path = get_file('nietzsche.txt', origin=URL)
text = open(corpus_path).read().lower()
print('Corpus length: %d characters' % len(text))
In [ ]:
print(text[:600], "...")
In [ ]:
text = text.replace("\n", " ")
split = int(0.9 * len(text))
train_text = text[:split]
test_text = text[split:]
To simplify things, we build a vocabulary by extracting the list of all possible characters from the full dataset (train and validation).
In a more realistic setting we would need to take into account that the test data can hold symbols never seen in the training set. This issue is less of a problem when working at the character level though.
Let's build the list of all possible characters and sort it to assign a unique integer to each possible symbol in the corpus:
In [ ]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
char_indices is a mapping from characters to integer identifiers:
In [ ]:
len(char_indices)
In [ ]:
sorted(char_indices.items())[:15]
indices_char holds the reverse mapping:
In [ ]:
len(indices_char)
In [ ]:
indices_char[52]
While not strictly required to build a language model, it's a good idea to have a look at the distribution of relative frequencies of each symbol in the corpus:
In [ ]:
from collections import Counter
counter = Counter(text)
chars, counts = zip(*counter.most_common())
indices = np.arange(len(counts))
plt.figure(figsize=(14, 3))
plt.bar(indices, counts, 0.8)
plt.xticks(indices, chars);
Let's cut the dataset into fake sentences at random with some overlap. Instead of cutting at random we could use an English-specific sentence tokenizer: this is explained at the end of this notebook. In the meantime, random substrings will be good enough to train a first language model.
In [ ]:
max_length = 40
step = 3
def make_sequences(text, max_length=max_length, step=step):
    sequences = []
    next_chars = []
    for i in range(0, len(text) - max_length, step):
        sequences.append(text[i: i + max_length])
        next_chars.append(text[i + max_length])
    return sequences, next_chars
sequences, next_chars = make_sequences(train_text)
sequences_test, next_chars_test = make_sequences(test_text, step=10)
print('nb train sequences:', len(sequences))
print('nb test sequences:', len(sequences_test))
Let's shuffle the sequences to break some of the dependencies:
In [ ]:
from sklearn.utils import shuffle
sequences, next_chars = shuffle(sequences, next_chars,
                                random_state=42)
In [ ]:
sequences[0]
In [ ]:
next_chars[0]
Unfortunately the LSTM implementation in Keras does not (yet?) accept integer indices to slice columns from an input embedding by itself. Let's use one-hot encoding. This is slightly less space and time efficient than integer coding but should be good enough when using a small character-level vocabulary.
Exercise: one-hot encode the training data sequences as X and next_chars as y:
In [ ]:
n_sequences = len(sequences)
n_sequences_test = len(sequences_test)
voc_size = len(chars)
X = np.zeros((n_sequences, max_length, voc_size),
             dtype=np.float32)
y = np.zeros((n_sequences, voc_size), dtype=np.float32)

X_test = np.zeros((n_sequences_test, max_length, voc_size),
                  dtype=np.float32)
y_test = np.zeros((n_sequences_test, voc_size), dtype=np.float32)
# TODO
In [ ]:
# %load solutions/language_model_one_hot_data.py
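Here is a minimal sketch of one possible way to fill in the TODO above (the solution file loaded in the previous cell may differ in details):
In [ ]:
# Sketch: set a 1.0 at the position of each character of each input sequence,
# and at the position of the target character in the next-char arrays.
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i, t, char_indices[char]] = 1.0
    y[i, char_indices[next_chars[i]]] = 1.0

for i, sequence in enumerate(sequences_test):
    for t, char in enumerate(sequence):
        X_test[i, t, char_indices[char]] = 1.0
    y_test[i, char_indices[next_chars_test[i]]] = 1.0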
In [ ]:
X.shape
In [ ]:
y.shape
In [ ]:
X[0]
In [ ]:
y[0]
The NLP community measures the quality of probabilistic models using perplexity.
In practice perplexity is just a base-2 exponentiation of the average negative log2 likelihood:
$$perplexity_\theta = 2^{-\frac{1}{n} \sum_{i=1}^{n} \log_2 (p_\theta(x_i))}$$
Note: here we define the per-character perplexity (because our model naturally makes per-character predictions). It is more common to report per-word perplexity. Computing per-word perplexity is not as easy though, as we would need to tokenize the strings into a sequence of words and discard whitespace and punctuation character predictions. In practice the whitespace character is by far the most frequent character, which makes our naive per-character perplexity lower than it would be if we ignored those predictions.
Exercise: implement a Python function that computes the per-character perplexity from the model-predicted probabilities y_pred and the one-hot encoded ground truth y_true:
In [ ]:
def perplexity(y_true, y_pred):
    """Compute the per-character perplexity of model predictions.

    y_true is one-hot encoded ground truth.
    y_pred is predicted likelihoods for each class.

    2 ** -mean(log2(p))
    """
    # TODO
    return 1.
In [ ]:
# %load solutions/language_model_perplexity.py
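Here is a minimal sketch of one possible implementation (the solution file loaded above may differ; the clipping threshold is an arbitrary choice to avoid log2(0)):
In [ ]:
def perplexity(y_true, y_pred):
    """Per-character perplexity: 2 ** -mean(log2(p of the true char))."""
    y_pred = np.asarray(y_pred)
    # Probability assigned by the model to the true character of each sample:
    true_char_probs = np.sum(y_true * y_pred, axis=-1)
    # Clip to avoid log2(0) when the model rules out the true character:
    log2_likelihoods = np.log2(np.maximum(true_char_probs, 1e-12))
    return 2 ** (-np.mean(log2_likelihoods))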
In [ ]:
y_true = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

y_pred = np.array([
    [0.1, 0.9, 0.0],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])
perplexity(y_true, y_pred)
A perfect model has a minimal perplexity of 1.0 (an average negative log likelihood of 0.0):
In [ ]:
perplexity(y_true, y_true)
In [ ]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import RMSprop
model = Sequential()
model.add(LSTM(128, input_shape=(max_length, voc_size)))
model.add(Dense(voc_size, activation='softmax'))
optimizer = RMSprop(learning_rate=0.01)
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
Let's measure the perplexity of the randomly initialized model:
In [ ]:
def model_perplexity(model, X, y):
    predictions = model(X)
    return perplexity(y, predictions)
In [ ]:
model_perplexity(model, X_test, y_test)
Let's train the model for one epoch on a very small subset of the training set to check that it's well defined:
In [ ]:
small_train = slice(0, None, 40)
model.fit(X[small_train], y[small_train], validation_split=0.1,
          batch_size=128, epochs=1)
In [ ]:
model_perplexity(model, X[small_train], y[small_train])
In [ ]:
model_perplexity(model, X_test, y_test)
Recursively generate one character at a time by sampling from the distribution parameterized by the model:
$$p_{\theta}(c_n | c_{n-1}, c_{n-2}, \ldots, c_0) \cdot p_{\theta}(c_{n-1} | c_{n-2}, \ldots, c_0) \cdot \ldots \cdot p_{\theta}(c_{0})$$
This way of parametrizing the joint probability of a set of random variables that are structured sequentially is called auto-regressive modeling.
In [ ]:
def sample_one(preds, temperature=1.0):
    """Sample the next character according to the network output.

    Use a lower temperature to force the model to output more
    confident predictions: more peaky distribution.
    """
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # Draw a single sample (size=1) from a multinoulli distribution
    # parameterized by the output of the softmax layer of our
    # network. A multinoulli distribution is a multinomial
    # distribution with a single trial with n_classes outcomes.
    probs = np.random.multinomial(1, preds, size=1)
    return np.argmax(probs)
def generate_text(model, seed_string, length=300, temperature=1.0):
    """Recursively sample a sequence of chars, one char at a time.

    Each prediction is concatenated to the past string of predicted
    chars so as to condition the next prediction.

    Feed the seed string as a sequence of characters to condition the
    first predictions recursively. If seed_string is shorter than
    max_length, pad the input with zeros at the beginning of the
    conditioning string.
    """
    generated = seed_string
    prefix = seed_string

    for i in range(length):
        # Vectorize prefix string to feed as input to the model:
        x = np.zeros((1, max_length, voc_size), dtype="float32")
        shift = max_length - len(prefix)
        for t, char in enumerate(prefix):
            x[0, t + shift, char_indices[char]] = 1.

        preds = model(x)[0]
        next_index = sample_one(preds, temperature)
        next_char = indices_char[next_index]

        generated += next_char
        # Keep at most the last max_length chars as conditioning context:
        prefix = (prefix + next_char)[-max_length:]
    return generated
The temperature parameter makes it possible to increase or decrease the entropy of the multinoulli distribution parametrized by the output of the model.
Temperatures lower than 1 will yield very regular text (biased towards the most frequent patterns of the training set). Temperatures higher than 1 will make the model "more creative" but also noisier (with a large fraction of meaningless words). A temperature of 1 is neutral (the noise of the generated text only stems from the imperfections of the model).
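As a small illustration of this effect (the toy distribution below is made up for the example), temperature scaling divides the log-probabilities by the temperature and re-normalizes, sharpening or flattening the distribution:
In [ ]:
toy_probs = np.array([0.1, 0.3, 0.6])  # made-up next-char distribution

for temperature in [0.1, 1.0, 2.0]:
    # Divide log-probabilities by the temperature, then re-normalize:
    scaled = np.exp(np.log(toy_probs) / temperature)
    scaled /= scaled.sum()
    print("temperature=%.1f -> %s" % (temperature, scaled.round(3)))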
In [ ]:
generate_text(model, 'philosophers are ', temperature=0.1)
In [ ]:
generate_text(model, 'atheism is the root of ', temperature=0.8)
In [ ]:
nb_epoch = 30
seed_strings = [
    'philosophers are ',
    'atheism is the root of ',
]

for epoch in range(nb_epoch):
    print("# Epoch %d/%d" % (epoch + 1, nb_epoch))
    print("Training on one epoch takes ~90s on a K80 GPU")
    model.fit(X, y, validation_split=0.1, batch_size=128, epochs=1,
              verbose=2)
    print("Computing perplexity on the test set:")
    test_perplexity = model_perplexity(model, X_test, y_test)
    print("Perplexity: %0.3f\n" % test_perplexity)

    for temperature in [0.1, 0.5, 1]:
        print("Sampling text from model at %0.2f:\n" % temperature)
        for seed_string in seed_strings:
            print(generate_text(model, seed_string, temperature=temperature))
            print()
To simplify things we used the lower-case version of the text and we ignored any sentence boundaries. This prevents our model from learning when to stop generating characters. If we want to train a model that can start generating text at the beginning of a sentence and stop at the end of a sentence, we need to provide it with sentence boundary markers in the training set and use those special markers when sampling.
The following gives an example of how to use NLTK to detect sentence boundaries in English text.
This could be used to insert an explicit "end_of_sentence" (EOS) symbol to mark separation between two consecutive sentences. This should make it possible to train a language model that explicitly generates complete sentences from start to end.
Use the following command (in a terminal) to install nltk before importing it in the notebook:
$ pip install nltk
In [ ]:
with open(corpus_path, 'rb') as f:
    text_with_case = f.read().decode('utf-8').replace("\n", " ")
In [ ]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text_with_case)
In [ ]:
plt.hist([len(s.split()) for s in sentences], bins=30);
plt.title('Distribution of sentence lengths')
plt.xlabel('Approximate number of words');
The first few sentences detected by NLTK are too short to be considered real sentences. Let's have a look at short sentences with at least 20 characters:
In [ ]:
sorted_sentences = sorted([s for s in sentences if len(s) > 20], key=len)
for s in sorted_sentences[:5]:
    print(s)
Some long sentences:
In [ ]:
for s in sorted_sentences[-3:]:
    print(s)
The NLTK sentence tokenizer seems to do a reasonable job despite the weird casing and '--' signs scattered around the text.
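As a hedged sketch of the EOS idea mentioned above (the newline marker and the text_with_eos variable are choices made here for illustration, not part of the original preprocessing), sentence boundaries could be re-injected as an explicit symbol before building the training sequences:
In [ ]:
# Use "\n" as an explicit end-of-sentence marker: it is absent from the
# corpus at this point since newlines were replaced by spaces earlier.
EOS = "\n"
text_with_eos = EOS.join(s.lower() for s in sentences)
print(text_with_eos[:300])
Training the character model on text_with_eos (with the EOS symbol added to the vocabulary) would let it learn where sentences start and stop.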
Note that here we use the original case information because it can help the NLTK sentence boundary detection model make better split decisions. Our text corpus is probably too small to train a good sentence-aware language model though, especially with full case information. Larger corpora, such as a collection of public domain books or Wikipedia dumps, would be more appropriate. The NLTK toolkit also comes with corpus loading utilities.
The following loads a selection of famous books from the Gutenberg project archive:
In [ ]:
import nltk
nltk.download('gutenberg')
book_selection_text = nltk.corpus.gutenberg.raw().replace("\n", " ")
In [ ]:
print(book_selection_text[:300])
In [ ]:
print("Book corpus length: %d characters" % len(book_selection_text))
Let's do an arbitrary split. Note the training set will have a majority of text that is not authored by the author(s) of the validation set:
In [ ]:
split = int(0.9 * len(book_selection_text))
book_selection_train = book_selection_text[:split]
book_selection_validation = book_selection_text[split:]
In [ ]:
Building a language model is not very useful by itself. However language models have recently been shown to be useful for transfer learning: they can be used to build contextualized word embeddings as a better alternative to word2vec or GloVe.
Using language-model based word representations makes it possible to reach state-of-the-art performance on many natural language understanding problems.
The workflow is the following:
- train (or download) a language model on a very large unlabeled text corpus;
- use the hidden activations of the language model as contextualized token representations;
- feed those representations to a task-specific model (classifier, tagger, ...) and fine-tune it on the supervised task at hand.
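As a hedged illustration of the feature-extraction step only (reusing the small character-level model trained above, and assuming a TF/Keras version where a Sequential model built with an input_shape exposes model.inputs; a realistic setup would use a much larger word- or subword-level model), the LSTM hidden state can be exposed as a fixed-size representation of a character sequence:
In [ ]:
from tensorflow.keras.models import Model

# Expose the 128-dimensional LSTM state of the trained language model:
feature_extractor = Model(inputs=model.inputs,
                          outputs=model.layers[0].output)

# Example: compute features for the first test sequence.
features = feature_extractor(X_test[:1])
print(features.shape)  # expected: (1, 128)
These features could then be fed to a small task-specific classifier trained on a labeled dataset.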
More information on this approach can be found in the ELMo paper ("Deep contextualized word representations", Peters et al., 2018) and the ULMFiT paper ("Universal Language Model Fine-tuning for Text Classification", Howard and Ruder, 2018).
In [ ]: