Hello World! Python Workshops @ Think Coffee
3-5pm, 7/30/17
Day 3, Alice NLP generator
@python script author (original content): Rahul
@jupyter notebook tutorial conversion: Nick Giangreco
Example script to generate text from Lewis Carroll's Alice in Wonderland. At least 20 epochs are required before the generated text starts sounding coherent. It is recommended to run this script on GPU, as recurrent networks are quite computationally intensive. If you try this script on new data, make sure your corpus has at least ~100k characters. ~1M is better.
Importing modules
In [1]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
Loading and reading the Alice.txt corpus, saving the unique characters (alphabet and punctuation found in the corpus) in a sorted list, and building two dictionaries that map each character to its position in that list and each position back to its character.
In [2]:
# get_file needs a direct link to the raw text file (the original GitHub /blob/ page serves HTML)
path = get_file('Alice.txt', origin='https://raw.githubusercontent.com/rahulremanan/python_tutorial/master/Alice.txt')
text = open(path).read().lower()
print('corpus length:', len(text))
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
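As a quick illustration of what these two dictionaries hold, the same construction applied to a tiny made-up string looks like this (toy example, not from the Alice corpus):
toy_text = "hello"
toy_chars = sorted(list(set(toy_text)))                            # ['e', 'h', 'l', 'o']
toy_char_indices = dict((c, i) for i, c in enumerate(toy_chars))   # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
toy_indices_char = dict((i, c) for i, c in enumerate(toy_chars))   # {0: 'e', 1: 'h', 2: 'l', 3: 'o'}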
Cutting the text into semi-redundant sequences: each element of the sentences list is a 40-character window, and consecutive windows overlap because the start index advances by only step (3) characters. For each window, the character that immediately follows it is stored in next_chars; this is the character the model will learn to predict.
In [3]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))
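To make the windowing concrete, here is a small sketch with a toy string and a shorter maxlen (values chosen only for readability, not from the original script):
# Toy illustration of semi-redundant windows (maxlen=5, step=3 for readability)
toy_text = "alice was beginning"
toy_maxlen, toy_step = 5, 3
toy_sentences, toy_next = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])
    toy_next.append(toy_text[i + toy_maxlen])
# toy_sentences[0] == 'alice', toy_next[0] == ' '
# toy_sentences[1] == 'ce wa', toy_next[1] == 's'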
Making X a boolean (all-False) array with shape (number of sentences) x (maxlen, i.e. 40) x (number of unique characters/punctuation in the document).
Making y a boolean (all-False) array with shape (number of sentences) x (number of unique characters/punctuation in the document).
Then, going through each sentence and each character in it, setting the corresponding entry in X to 1 (flipping False to True) so every character is one-hot encoded, and setting the entry in y that marks the character following that sentence.
In [4]:
print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
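The resulting arrays can be sanity-checked right after vectorization; the index used in the comment below is hypothetical and depends on the corpus:
# X holds one 40-step sequence of one-hot rows per sentence; y holds one one-hot row per sentence.
# For example, if sentences[0][0] == 'a' and char_indices['a'] == 3, then X[0, 0, 3] is True
# and every other entry of X[0, 0, :] is False.
print('X shape:', X.shape)                    # (number of sentences, 40, number of unique characters)
print('y shape:', y.shape)                    # (number of sentences, number of unique characters)
print('ones per timestep:', X[0, 0].sum())    # should be exactly 1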
Building the LSTM model:
Creating an empty Sequential model,
Adding the first layer --> an LSTM with 128 units and input shape (maxlen, number of unique characters/punctuation in the document),
Adding another layer --> a regular densely-connected layer with one output per unique character/punctuation mark,
Adding another layer --> a softmax activation, so the outputs form a probability distribution over the next character.
In [5]:
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
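Before compiling, it can help to confirm the layer shapes and parameter counts; model.summary() is a standard Keras call (not in the original script):
# Print output shapes and parameter counts for the LSTM -> Dense -> softmax stack.
# The LSTM layer contributes 4 * 128 * (128 + len(chars) + 1) trainable weights
# (input, forget, cell, and output gates); the Dense layer adds 128 * len(chars) + len(chars).
model.summary()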
Instantiating an RMSprop optimizer and compiling our LSTM model with it. This optimizer is usually a good choice for recurrent networks. 'lr' is the learning rate.
model.compile configures the model for training by attaching the loss function and optimizer; the actual learning happens later, in model.fit.
In [6]:
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
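As a side note on the loss being minimized: categorical cross-entropy is the negative log-probability the model assigns to the true next character. A toy numpy sketch with made-up numbers:
true_one_hot = np.array([0., 0., 1., 0.])          # pretend the correct next character is index 2
predicted = np.array([0.1, 0.2, 0.6, 0.1])         # pretend softmax output from the network
loss = -np.sum(true_one_hot * np.log(predicted))   # -log(0.6), roughly 0.51
print('toy categorical cross-entropy:', loss)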
Defining the sample function, a helper that samples an index from a probability array. The temperature parameter controls how adventurous the sampling is: values below 1 sharpen the distribution (the most likely character becomes even more likely), while values above 1 flatten it.
In [7]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
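A quick way to see what temperature does is to call sample repeatedly on a made-up distribution: low temperatures make the most likely index win almost every time, higher temperatures spread the draws out (toy numbers, not model output):
toy_preds = [0.1, 0.2, 0.7]
for temp in [0.2, 1.0, 1.2]:
    draws = [sample(toy_preds, temp) for _ in range(1000)]
    print('temperature', temp, '-> fraction of draws picking index 2:', draws.count(2) / 1000.0)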
Model training:
Fitting the one-hot encoded inputs X to the targets y. On a CPU this runs very slowly, so keep the number of iterations small if you just want it to finish; with so little training the generated text will be correspondingly poor.
After each training iteration, the loop picks a random 40-character seed from the corpus,
prints it (along with the randomly chosen start position),
then repeatedly predicts the next character, samples an index from the predicted distribution at several diversity (temperature) settings, and appends each sampled character to the seed until 400 characters have been generated.
In [62]:
# train the model, output generated text after each iteration
for iteration in range(1, 2):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y,
              batch_size=128,
              epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
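After training, the weights can be saved so the generation loop can be rerun later without retraining. A minimal sketch using the standard Keras save/load calls (the filename is arbitrary):
model.save('alice_lstm.h5')               # saves architecture, weights, and optimizer state

from keras.models import load_model
model = load_model('alice_lstm.h5')       # restores the compiled model for further generation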