RNN stands for "Recurrent Neural Network".
To understand why RNNs are so hot, you must read this!
This notebook explains the minimal character-level Vanilla RNN model written by Andrej Karpathy.
The code creates an RNN that generates text, char after char, by learning char after char from a text file.
I love this character-level Vanilla RNN code because it doesn't use any library except numpy. All the NN magic is in 112 lines of code, with no dependencies to understand. Everything is there! I'll try to explain every line of it in detail. Disclaimer: I still need to use some external links for reference.
This notebook is for real beginners who want to understand the RNN concept by reading code.
Feedback welcome @dh7net
Let's see the original code and the results for the first 1000 iterations.
In [1]:
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np
# data I/O
data = open('methamorphosis.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(xrange(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
def sample(h, seed_ix, n):
  """
  sample a sequence of integers from the model
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in xrange(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while n<=1000: # was while True: in original code
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0:
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print '----\n %s \n----' % (txt, )
  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print 'iter %d, loss: %f' % (n, smooth_loss) # print progress
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                [dWxh, dWhh, dWhy, dbh, dby],
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
  p += seq_length # move data pointer
  n += 1 # iteration counter
If you are not an NN expert, the code is not easy to understand.
If you look at the results, you can see that the code iterates 1000 times, computes a loss that decreases over time, and outputs some text every 100 iterations.
The output from the first iteration looks random.
After 1000 iterations, the NN is able to create words of plausible size, doesn't use too many capital letters, and can produce correct short words like "the", "they", "be", "to".
If you let the code learn overnight, the NN will be able to create almost correct sentences:
"with home to get there was much hadinge everything and he could that ho women this tending applear space"
This is just a simple example, and there is no doubt this code can do much better.
This code builds a neural network that is able to predict the next char from the previous one.
In this example it learns from a text file, so it can learn words and sentences; if you feed it HTML or XML during training, it can produce valid HTML or XML sequences.
At each step it can reuse some results from the previous step, to keep in memory what is going on.
For instance, if the previous chars are "hello worl", the model can guess that the next char is "d".
This model contains parameters that are initialized randomly, and the training phase tries to find optimal values for each of them. During the training process we do a "gradient descent": we feed the network with sequences from the training text, compute the loss (how wrong the predictions are), compute the gradients of the loss with respect to every parameter, and move each parameter a small step in the direction that decreases the loss, as illustrated in the sketch below.
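To get a feel for gradient descent before diving into the RNN code, here is a tiny sketch (not part of the original code) that minimizes a toy loss (x - 3)**2 by repeatedly moving x against its gradient:
# toy gradient descent: minimize loss(x) = (x - 3)**2, whose gradient is 2*(x - 3)
x = 0.0
step_size = 0.1
for i in xrange(25):
  dx = 2 * (x - 3)      # gradient of the loss with respect to x
  x += -step_size * dx  # move x a small step against the gradient
print 'after 25 steps: x =', x, 'loss =', (x - 3)**2  # x is close to 3, loss is close to 0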
Let's have a closer look at every line of the code.
Disclaimer: the following code is cut and pasted from the original one, with some adaptations to make it clearer for this notebook, like adding some print statements.
The network needs a big text file as input.
The content of the file will be used to train the network.
For this example, I used Metamorphosis by Kafka (public domain).
In [2]:
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np
# data I/O
data = open('methamorphosis.txt', 'r').read() # should be simple plain text file
Neural networks can only work on vectors (a vector is an array of floats), so we need a way to encode and decode a char as a vector.
For this we count the number of unique chars (vocab_size). It will be the size of the vector. The vector contains only zeros, except at the position of the char, where the value is 1.
In [3]:
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
In [4]:
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
print char_to_ix
print ix_to_char
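As a quick sanity check (not in the original code), these two dictionaries let us encode any string of known chars into a list of integers and decode it back:
# encode a small string to integers and decode it back
# (the chars must appear in the training text, which is the case for 'hello' here)
encoded = [char_to_ix[ch] for ch in 'hello']
decoded = ''.join(ix_to_char[ix] for ix in encoded)
print encoded
print decoded  # prints 'hello' again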
In [17]:
%matplotlib notebook
import matplotlib
import matplotlib.pyplot as plt
vector_for_char_a = np.zeros((vocab_size, 1))
vector_for_char_a[char_to_ix['a']] = 1
#print vector_for_char_a
print vector_for_char_a.ravel()
x = range(0,len(chars))
plt.figure(figsize=(10,2))
plt.bar(x, vector_for_char_a.ravel(), 0.3)
plt.xticks(x, chars)
plt.show()
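To decode a 1-of-k vector back into a char, take the position of its maximum value (the single 1). A tiny check, not part of the original code:
# the position of the 1 in the one-hot vector is the char index
ix = np.argmax(vector_for_char_a)
print 'decoded char:', ix_to_char[ix]  # prints 'a'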
The neural network is made of 3 layers: an input layer (vocab_size nodes, the 1-of-k encoded char), a hidden layer (hidden_size nodes), and an output layer (vocab_size nodes, one score per possible next char).
All layers are fully connected to the next one: each node of a layer is connected to all nodes of the next layer. The hidden layer is also connected to itself: the values from one iteration are used for the next one.
To centralize the values that matter for the training (hyperparameters), we also define the sequence length and the learning rate.
In [6]:
# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1
In [7]:
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
print 'Wxh contains', Wxh.size, 'parameters'
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
print 'Whh contains', Whh.size, 'parameters'
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
print 'Why contains', Why.size, 'parameters'
bh = np.zeros((hidden_size, 1)) # hidden bias
print 'bh contains', bh.size, 'parameters'
by = np.zeros((vocab_size, 1)) # output bias
print 'by contains', by.size, 'parameters'
The model parameters are adjusted during the training.
You'll see in the next sections how these parameters are used to create a sentence.
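To see where the "more than 20k parameters" mentioned later comes from, here is a small sketch (not part of the original code) that sums the sizes printed above. The exact total depends on vocab_size, so your number will differ:
# total number of trainable parameters
# with hidden_size=100 and, for example, a vocab_size around 60:
# 100*60 + 100*100 + 60*100 + 100 + 60 = 22160
total_parameters = Wxh.size + Whh.size + Why.size + bh.size + by.size
print 'the model contains', total_parameters, 'parameters in total'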
In [23]:
def sample(h, seed_ix, n):
  """
  sample a sequence of integers from the model
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in xrange(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  txt = ''.join(ix_to_char[ix] for ix in ixes)
  print '----\n %s \n----' % (txt, )
hprev = np.zeros((hidden_size,1)) # reset RNN memory
sample(hprev,char_to_ix['a'],200)
The loss is a key concept in the training of all neural networks.
It is a value that describes how bad/good our model is.
It is always positive: the closer to zero, the better our model is.
(A good model is a model where the predicted output is close to the training output.)
During the training phase we want to minimize the loss.
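As a toy illustration (not part of the original code), here is the cross-entropy loss for a single prediction: the more probability the model puts on the correct char, the smaller the loss.
# loss = -log(probability assigned to the correct char)
p_good = np.array([0.05, 0.9, 0.05])  # the model is confident in char number 1
p_bad  = np.array([0.6, 0.1, 0.3])    # the model puts little probability on char number 1
target = 1
print 'loss for a good prediction:', -np.log(p_good[target])  # about 0.105
print 'loss for a bad prediction: ', -np.log(p_bad[target])   # about 2.303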
The loss function calculates the loss, but also the gradients (see the backward pass).
This function takes as input: the list of input chars (as integers), the list of target chars (as integers), and the previous hidden state.
This function outputs: the loss, the gradients for each model parameter, and the last hidden state.
Here is the code:
In [9]:
def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(xrange(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
The forward pass uses the parameters of the model (Wxh, Whh, Why, bh, by) to calculate the probabilities of the next char, given a char from the training set.
xs[t] is the vector that encodes the char at position t; ps[t] are the probabilities for the next char.
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
or, in dirty pseudo-code, for each char:
hs = input*Wxh + last_value_of_hidden_state*Whh + bh
ys = hs*Why + by
ps = normalized(ys)
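To make the pseudo-code concrete, here is one isolated forward step on a dummy one-hot input, showing the shape of every intermediate result (a small sketch reusing the parameters defined above, not part of the original code):
# one forward step, looking only at the shapes of the vectors involved
x_step = np.zeros((vocab_size, 1))
x_step[char_to_ix['a']] = 1                        # one-hot input (here the char 'a')
h_prev_step = np.zeros((hidden_size, 1))           # empty memory for this isolated step
h_step = np.tanh(np.dot(Wxh, x_step) + np.dot(Whh, h_prev_step) + bh)
y_step = np.dot(Why, h_step) + by                  # unnormalized scores
p_step = np.exp(y_step) / np.sum(np.exp(y_step))   # softmax: probabilities that sum to 1
print 'x:', x_step.shape, 'h:', h_step.shape, 'y:', y_step.shape, 'p:', p_step.shape
print 'probabilities sum to', np.sum(p_step)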
To dive into the code, we'll work on one char only (we set t=0 instead of using the "for each" loop).
In [24]:
# uncomment the print to get some details
xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(hprev)
# forward pass
t=0 # for t in xrange(len(inputs)):
xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
xs[t][inputs[t]] = 1
# print xs[t]
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
# print ys[t]
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
# print ps[t].ravel()
# Let's build a dict to see which probability is associated with which char
probability_per_char = { ch:ps[t].ravel()[i] for i,ch in enumerate(chars) }
# uncomment the next line to see the raw result
# print probability_per_char
# Print the probabilities in a way that is easier to read.
for x in range(vocab_size):
  print 'p('+ ix_to_char[x] + ")=", "%.4f" % ps[t].ravel()[x],
  if (x%7==0):
    print ""
  else:
    print "",
x = range(0,len(chars))
plt.figure(figsize=(10,5))
plt.bar(x, ps[t].ravel(), 0.3)
plt.xticks(x, chars)
plt.show()
In [11]:
# We can create the next char from the above distribution
ix = np.random.choice(range(vocab_size), p=ps[t].ravel())
print
print "Next char code is:", ix
print "Next char is:", ix_to_char[ix]
You can run the previous code several times. Each time, a char is drawn according to the probability distribution.
For each char in the input, the forward pass calculates the probability distribution of the next char.
The loss is the sum of the per-char losses:
loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
The loss is calculated using the softmax cross-entropy. More info here and here.
In [12]:
print 'Next char from training (target) was number', targets[t], 'which is "' + ix_to_char[targets[t]] + '"'
print 'Probability for this letter was', ps[t][targets[t],0]
loss = -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
print 'loss for this input&target pair is', loss
The goal of the backward pass is to calculate all the gradients.
Gradients tell in which direction you have to move your parameters to make a better model.
The naive way to calculate all the gradients would be to recalculate the loss for a small variation of each parameter.
This is possible but would be time consuming: we have more than 20k parameters.
There is a technique to calculate all the gradients for all the parameters at once: backpropagation.
Gradients are calculated in the opposite order of the forward pass, using simple techniques.
For instance if we have:
loss = a*x + b
If we want to minimize the loss, we need to calculate d(loss)/dx and use it to calculate a new x value.
new_x = x - d(loss)/dx * step_size
If new_loss is smaller than loss, it is a win: we succeeded in finding a better x.
Let's do the math:
d(loss)/dx = d(a*x)/dx + d(b)/dx
d(loss)/dx = (d(a)/dx)*x + a*(d(x)/dx) + 0
d(loss)/dx = 0*x + a*1
d(loss)/dx = a
In [13]:
x = 10
a = 3
b = 7
loss = a*x + b
print 'initial loss =', loss
# dx stand for d(loss)/dx
dx = a #Calculate dx=d(loss)/dx analytically
step_size = 0.1
# use dx and step size to calculate new x
new_x = x - dx * step_size
new_loss = a*new_x + b
print 'new loss =',new_loss
if (new_loss<loss): print 'New loss is smaller, Yeah!'
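The naive approach mentioned earlier (recompute the loss for a small variation of the parameter) is still useful as a check: the numerical gradient should match the analytical one. A small sketch, not part of the original code, reusing a, b and x from the cell above:
# numerical gradient check for loss(x) = a*x + b:
# nudge x a little in both directions and see how much the loss moves
eps = 1e-4
loss_plus  = a * (x + eps) + b
loss_minus = a * (x - eps) + b
numerical_dx = (loss_plus - loss_minus) / (2 * eps)
print 'analytical d(loss)/dx =', a
print 'numerical  d(loss)/dx =', numerical_dx  # both should be (almost) equal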
hs = input*Wxh + last_value_of_hidden_state*Whh + bh
ys = hs*Why + by
This part needs more work to explain the code, but here is a great source to understand this technique in detail.
# Backprop this: ys = hs*Why + by
dy=-1 # because the smaller the loss, the better the model is
dWhy = np.dot(dy, hs.T)
dby = dy
dh = np.dot(Why.T, dy) + dhnext # backprop into h
dhraw = (1 - hs * hs) * dh # backprop through tanh nonlinearity
# Backprop this: hs = input*Wxh + last_value_of_hidden_state*Whh + bh
dbh += dhraw
dWxh += np.dot(dhraw, xs.T)
dWhh += np.dot(dhraw, hs.T)
dhnext = np.dot(Whh.T, dhraw)
In [14]:
# backward pass: compute gradients going backwards
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])
t=0 #for t in reversed(xrange(len(inputs))):
dy = np.copy(ps[t])
dy[targets[t]] -= 1 # backprop into y
#print dy.ravel()
dWhy += np.dot(dy, hs[t].T)
#print dWhy.ravel()
dby += dy
#print dby.ravel()
dh = np.dot(Why.T, dy) + dhnext # backprop into h
dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
dbh += dhraw
dWxh += np.dot(dhraw, xs[t].T)
dWhh += np.dot(dhraw, hs[t-1].T)
dhnext = np.dot(Whh.T, dhraw)
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
  np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  #print dparam
Before looking at the main loop, here is how to feed the loss function: inputs and targets are consecutive slices of the data, shifted by one char:
In [15]:
p=0
inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
print "inputs", inputs
targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
print "targets", targets
The easiest technique to update the parameters of the model is this:
param += -dparam * step_size
Adagrad is a more efficient technique where the step size gets smaller and smaller during the training.
It uses a memory variable that grows over time:
mem += dparam * dparam
and uses it to scale the step size:
step_size = learning_rate / np.sqrt(mem + 1e-8)
In short:
mem += dparam * dparam
param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
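To see the effect of the Adagrad memory, here is a tiny sketch (not part of the original code) that applies the same gradient several times to a single scalar parameter: the effective step gets smaller at every update because mem keeps growing.
# Adagrad on a single scalar parameter with a constant gradient of 1.0
param, mem = 0.0, 0.0
for i in xrange(5):
  dparam = 1.0
  mem += dparam * dparam
  step = learning_rate * dparam / np.sqrt(mem + 1e-8)
  param += -step
  print 'update %d: step size %.4f, param %.4f' % (i, step, param)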
Smooth_loss doesn't play any role in the training. It is just a low pass filtered version of the loss:
smooth_loss = smooth_loss * 0.999 + loss * 0.001
It is a way to average the loss over the last iterations, to better track the progress.
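For instance (a tiny sketch, not in the original code), if the raw loss jumps around, the smoothed version barely moves, because each new loss only counts for 0.1%:
# smooth_loss is an exponential moving average of the raw loss
smooth = 100.0
for raw_loss in [90.0, 110.0, 95.0, 105.0]:
  smooth = smooth * 0.999 + raw_loss * 0.001
  print 'raw loss %.1f -> smooth loss %.3f' % (raw_loss, smooth)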
Here is the code of the main loop, which does both the training and the generation of text from time to time:
In [16]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while n<=1000*100:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  # check the "how to feed the loss function" part above to see how this works
  if p+seq_length+1 >= len(data) or n == 0:
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  # sample from the model now and then
  if n % 1000 == 0:
    print 'iter %d, loss: %f' % (n, smooth_loss) # print progress
    sample(hprev, inputs[0], 200)
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                [dWxh, dWhh, dWhy, dbh, dby],
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update
  p += seq_length # move data pointer
  n += 1 # iteration counter
Feedback welcome @dh7net!