Recurrent Neural Networks: Character RNNs with Keras

Often we are not interested in isolated datapoints, but rather datapoints within a context of others. A datapoint may mean something different depending on what's come before it. This can typically be represented as some kind of sequence of datapoints, perhaps the most common of which is a time series.

One of the most ubiquitous sequences of data where context is especially important is natural language. We have quite a few words in English where the meaning of a word may be totally different depending on it's context. An innocuous example of this is "bank": "I went fishing down by the river bank" vs "I deposited some money into the bank".

If we consider that each word is a datapoint, most non-recurrent methods will treat "bank" in the first sentence exactly the same as "bank" in the second sentence - they are indistinguishable. If you think about it, in isolation they are indistinguishable to us as well - it's the same word!

We can only start to discern them when we consider the previous word (or words). So we might want our neural network to consider that "bank" in the first sentence is preceded by "river" and that in the second sentence "money" comes a few words before it. That's basically what RNNs do - they "remember" some of the previous context and that influences the output it produces. This "memory" (called the network's "hidden state") works by retaining some of the previous outputs and combining it with the current input; this recursing (feedback) of the network's output back into itself is where its name comes from.

This recursing makes RNNs quite deep, and thus they can be difficult to train. The gradient gets smaller and smaller the deeper it is pushed backwards through the network until it "vanishes" (effectively becomes zero), so long-term dependencies are hard to learn. The typical practice is to only extend the RNN back a certain number of time steps so the network is still trainable.

Certain units, such as the LSTM (long short-term memory) and GRU (gated recurrent unit), have been developed to mitigate some of this vanishing gradient effect.

Let's walkthrough an example of a character RNN, which is a great approach for learning a character-level language model. A language model is essentially some function which returns a probability over possible words (or in this case, characters), based on what has been seen so far. This function can vary from region to region (e.g. if terms like "pop" are used more commonly than "soda") or from person to person. You could say that a (good) language model captures the style in which someone writes.

Language models often must make the simplifying assumption that only what came immediately (one time step) before matters (this is called the "Markov assumption"), but with RNNs we do not need to make such an assumption.

We'll use Keras which makes building neural networks extremely easy (this example is an annotated version of Keras's LSTM text generation example).

First we'll do some simple preparation - import the classes we need and load up the text we want to learn from.


In [1]:
import os

#if using Theano with GPU
#os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=gpu,floatX=float32"

import random
import numpy as np
from glob import glob
from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense, Activation, Dropout

# load up our text
text_files = glob('../data/sotu/*.txt')
text = '\n'.join([open(f, 'r').read() for f in text_files])

# extract all (unique) characters
# these are our "categories" or "labels"
chars = list(set(text))

# set a fixed vector size
# so we look at specific windows of characters
max_len = 20


Using Theano backend.
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5103)

Now we'll define our RNN. Keras makes this trivial:


In [9]:
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(max_len, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

We're framing our task as a classification task. Given a sequence of characters, we want to predict the next character. We equate each character with some label or category (e.g. "a" is 0, "b" is 1, etc).

We use the softmax activation function on our output layer - this function is used for categorical output. It turns the output into a probability distribution over the categories (i.e. it makes the values the network outputs sum to 1). So the network will essentially tell us how strongly it feels about each character being the next one.

The categorical cross-entropy loss the standard loss function for multilabel classification, which basically penalizes the network more the further off it is from the correct label.

We use dropout here to prevent overfitting - we don't want the network to just return things already in the text, we want it to have some wiggle room and create novelty! Dropout is a technique where, in training, some percent (here, 20%) of random neurons of the associated layer are "turned off" for that epoch. This prevents overfitting by preventing the network from relying on particular neurons.

That's it for the network architecture!

To train, we have to do some additional preparation. We need to chop up the text into character sequences of the length we specified (max_len) - these are our training inputs. We match them with the character that immediately follows each sequence. These are our expected training outputs.

For example, say we have the following text (this quote is from Zhuang Zi). With max_len=20, we could manually create the first couple training examples like so:


In [10]:
example_text = "The fish trap exists because of the fish. Once you have gotten the fish you can forget the trap. The rabbit snare exists because of the rabbit. Once you have gotten the rabbit, you can forget the snare. Words exist because of meaning. Once you have gotten the meaning, you can forget the words. Where can I find a man who has forgotten words so that I may have a word with him?"

# step size here is 3, but we can vary that
input_1 = example_text[0:20]
true_output_1 = example_text[20]
# >>> 'The fish trap exists'
# >>> ' '

input_2 = example_text[3:23]
true_output_2 = example_text[23]
# >>> 'fish trap exists be'
# >>> 'c'

input_3 = example_text[6:26]
true_output_3 = example_text[26]
# >>> 'sh trap exists becau'
# >>> 's'

# etc

We can generalize this like so:


In [11]:
step = 3
inputs = []
outputs = []
for i in range(0, len(text) - max_len, step):
    inputs.append(text[i:i+max_len])
    outputs.append(text[i+max_len])

We also need to map each character to a label and create a reverse mapping to use later:


In [12]:
char_labels = {ch:i for i, ch in enumerate(chars)}
labels_char = {i:ch for i, ch in enumerate(chars)}

Now we can start constructing our numerical input 3-tensor and output matrix. Each input example (i.e. a sequence of characters) is turned into a matrix of one-hot vectors; that is, a bunch of vectors where the index corresponding to the character is set to 1 and all the rest are set to zero.

For example, if we have the following:


In [13]:
# assuming max_len = 7
# so our examples have 7 characters
example = 'cab dab'
example_char_labels = {
    'a': 0,
    'b': 1,
    'c': 2,
    'd': 3,
    ' ' : 4
}

# matrix form
# the example uses only five kinds of characters,
# so the vectors only need to have five components,
# and since the input phrase has seven characters,
# the matrix has seven vectors.
[
    [0, 0, 1, 0, 0], # c
    [1, 0, 0, 0, 0], # a
    [0, 1, 0, 0, 0], # b
    [0, 0, 0, 0, 1], # (space)
    [0, 0, 0, 1, 0], # d
    [1, 0, 0, 0, 0], # a
    [0, 1, 0, 0, 0]  # b
]


Out[13]:
[[0, 0, 1, 0, 0],
 [1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 0, 0, 1],
 [0, 0, 0, 1, 0],
 [1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0]]

That matrix represents a single training example, so for our full set of training examples, we'd have a stack of those matrices (hence a 3-tensor).

And the outputs for each example are each a one-hot vector (i.e. a single character). With that in mind:


In [14]:
# using bool to reduce memory usage
X = np.zeros((len(inputs), max_len, len(chars)), dtype=np.bool)
y = np.zeros((len(inputs), len(chars)), dtype=np.bool)

# set the appropriate indices to 1 in each one-hot vector
for i, example in enumerate(inputs):
    for t, char in enumerate(example):
        X[i, t, char_labels[char]] = 1
    y[i, char_labels[outputs[i]]] = 1

Now that we have our training data, we can start training. Keras also makes this easy:


In [ ]:
# more epochs is usually better, but training can be very slow if not on a GPU
epochs = 10
model.fit(X, y, batch_size=128, epochs=epochs)

It's much more fun to see your network's ramblings as it's training, so let's write a function to produce text from the network:


In [25]:
def generate(temperature=0.35, seed=None, num_chars=100):
    predicate=lambda x: len(x) < num_chars
    
    if seed is not None and len(seed) < max_len:
        raise Exception('Seed text must be at least {} chars long'.format(max_len))

    # if no seed text is specified, randomly select a chunk of text
    else:
        start_idx = random.randint(0, len(text) - max_len - 1)
        seed = text[start_idx:start_idx + max_len]

    sentence = seed
    generated = sentence

    while predicate(generated):
        # generate the input tensor
        # from the last max_len characters generated so far
        x = np.zeros((1, max_len, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_labels[char]] = 1.

        # this produces a probability distribution over characters
        probs = model.predict(x, verbose=0)[0]

        # sample the character to use based on the predicted probabilities
        next_idx = sample(probs, temperature)
        next_char = labels_char[next_idx]

        generated += next_char
        sentence = sentence[1:] + next_char
    return generated

def sample(probs, temperature):
    """samples an index from a vector of probabilities
    (this is not the most efficient way but is more robust)"""
    a = np.log(probs)/temperature
    dist = np.exp(a)/np.sum(np.exp(a))
    choices = range(len(probs))
    return np.random.choice(choices, p=dist)

The temperature controls how random we want the network to be. Lower temperatures favors more likely values, whereas higher temperatures introduce more and more randomness. At a high enough temperature, values will be chosen at random.

With this generation function we can modify how we train the network so that we see some output at each step:


In [16]:
epochs = 10
for i in range(epochs):
    print('epoch %d'%i)

    # set nb_epoch to 1 since we're iterating manually
    # comment this out if you just want to generate text
    model.fit(X, y, batch_size=128, epochs=1)

    # preview
    for temp in [0.2, 0.5, 1., 1.2]:
        print('temperature: %0.2f'%temp)
        print('%s'%generate(temperature=temp))


epoch 0
Epoch 1/1
980939/980939 [==============================] - 990s - loss: 1.6563   
temperature: 0.20
ons, neutrals and all of the progress of the first to the first to develop the country and the fisca
temperature: 0.50
that weren't, and people who for every can be a standards of American people because the community f
temperature: 1.00
would have created the somether you whan alls. Employment. 28 the health in the final changes to mee
temperature: 1.20
ure we pass a reasonk the Nation-
But our stenfityen the efforts to sif, Mrkinn-; have To cave defen
epoch 1
Epoch 1/1
980939/980939 [==============================] - 1017s - loss: 1.2809  
temperature: 0.20
 we should not be as a major the construction of the continuing of the security of the productive of
temperature: 0.50
. This work must now have been recommended to protect the prompt and the same time, and to continue 
temperature: 1.00
ot whether we shrink on the world, and when the range put a school as rises. The difficult activitie
temperature: 1.20
 over his head and what haw the curdition of the President officed year lessing the had suching that
epoch 2
Epoch 1/1
980939/980939 [==============================] - 1252s - loss: 1.2134  
temperature: 0.20
 of Americans will be successfully and the program of the Congress and the Congress to respect the p
temperature: 0.50
f low-income families, and we have already been sever the world. It should be able to protect the mi
temperature: 1.00
ighbors who cannot to meet t gorus and our relief for a new 11rian deprivious with vigorously will s
temperature: 1.20
lity in this critical relatio bodden many of classt. They say. Thkreary school loun burge will creat
epoch 3
Epoch 1/1
980939/980939 [==============================] - 1251s - loss: 1.1781  
temperature: 0.20
same cooperation and the control of the world we have to do the control of the control of the task o
temperature: 0.50
t of the hands of victory is the beginning of a bank to strengthening our children's security and ne
temperature: 1.00
egislation now in continued law, and whatever xeeping Federal Gulf to this Union wor that important 
temperature: 1.20
Last year, I called such implementation. The organ  - budget Amendmentmred task. 
And everywhere ere
epoch 4
Epoch 1/1
980939/980939 [==============================] - 1245s - loss: 1.1542  
temperature: 0.20
cooperation with one more people will be the first time we have a security of the first time the exp
temperature: 0.50
ards and rising nations to the rest of the future of the most family of unity on the first commitmen
temperature: 1.00
and day.I shall not make the past year, batter years the was not aid, money, or helping new world fo
temperature: 1.20
, and our farms. 
But notconsuprior watening, more others are now given with America  nosuccied are 
epoch 5
Epoch 1/1
980939/980939 [==============================] - 1243s - loss: 1.1368  
temperature: 0.20
uture. Support the same time to serve the first time to provide a stable program of the world we hav
temperature: 0.50
all civilian war workers and which now will be prepared to see the large of the same time we have a 
temperature: 1.00
s.
It is my conviction. In good life for the International 
but only payment around the world can be
temperature: 1.20
mic growth in the thousand of us would foint on relationship by clear major nuges of money of ;orn w
epoch 6
Epoch 1/1
980939/980939 [==============================] - 1246s - loss: 1.1228  
temperature: 0.20
ing their demands for the world who would also be able to be able to provide a strong program will b
temperature: 0.50
ey could be allowed to be the past year, the tax cut to the country of a interests in the great serv
temperature: 1.00
 the United States is a values of world tax cuts and to our national economies, conviction, and foe 
temperature: 1.20
ncroached on one of the free people the future.
To it to violence in the care of publicn jobs is a f
epoch 7
Epoch 1/1
980939/980939 [==============================] - 1246s - loss: 1.1126  
temperature: 0.20
agement and the result of the prosperity of the people of the people of the people of the world. 
Th
temperature: 0.50
. 
This program is important to a strong program to develop the company of our people will be provid
temperature: 1.00
 36 days, and 2 days for agriculture of the $15 billion depending these great ststen our in both cau
temperature: 1.20
rican. 
The executive breakdys of but sourcess business.

epecution irprovement is terrible.
She pro
epoch 8
Epoch 1/1
980939/980939 [==============================] - 1245s - loss: 1.1039  
temperature: 0.20
reat to the security of the enemies of the world. The strength of our country. 
And we must not be a
temperature: 0.50
h I'll announce very deficit in the country now are on the safety to national resources and the lead
temperature: 1.00
gely of payments to anything creritify, reveil and sufferonous nearly new generous use government an
temperature: 1.20
an pay us greater difterbny democracy--has sufficienty reulting - triticH in managing ways. I need a
epoch 9
Epoch 1/1
980939/980939 [==============================] - 1248s - loss: 1.0954  
temperature: 0.20
e to cut harmful emission that we are the first time that we can succeed to the expenditures of the 
temperature: 0.50
hat will lead to last year, the Soviet Union. 
So I recommend that the support of the world is the o
temperature: 1.00
 the youth of our Nation's necessity for spirit prants would have had have greatly edgatting the ext
temperature: 1.20
 the organization of our Administration to turn more faiile to reduce only Now is conductiongages, P

That's about all there is to it. Let's try to generate one long sample passage with 2000 characters. We'll arbitrarily pick a temperature of 0.4, which seems to work decently well -- enough randomness without being incoherent. We'll also give it a seed this time (starting text): "Today, we are facing an important challenge"


In [28]:
print('%s' % generate(temperature=0.4, seed='Today, we are facing an important challenge.', num_chars=2000))


Council to convene our policies are all of us have been a promise of the last year of our people who have the first time that we have one of them meet the same program of the world. 
I have held the strongest nations to provide a new strategy of the past year, and we should start the public schools and well-being, and the security of the people of the Congress to maintain and a program to maintain the future of the world, and the actions of the lines of the world about the many of the price structure of the people of the situation to continue the first time that the Congress must be proud of the buildings of the operation of the area of the first process of the world. 
I cannot retain the problem of the fact of the demand of the course of the world. And the program of proposals will be made to the international scientific and police buildings of the loss of the world. 
I have proposed that no world that we must stand by a second region of the first time that the excessive and the same spending to our faith in the entire period of such a time that will also deserve the cause of the future of the future. 
I have the same time to strengthen the world that the present price and promote a stabilization and industrial programs that will see the future of the strong state of the United States that we have done a great part of the world. 
And the support of the freedom of the people. 
The process of the same time, the responsibilities that we can be a first contribution to the superiority of our country, some of the citizens of the world. It is the summer of our strength of the world. I have the many children and the economy that we can not be the deficit in the next four years. 
All people are going to the institutions and to our people are the committee of the right thing to maintain the energy state of the world. I have already in a construction of the process of our own people who have been the strongest programs of our country in the first time that is a consummer that