In [1]:
# we assume that you have the pycnn module in your path.
# we also assume that LD_LIBRARY_PATH includes a pointer to where libcnn_shared.so is.
from pycnn import *

An LSTM/RNN overview:

A (1-layer) RNN can be thought of as a sequence of cells, $h_1,...,h_k$, where the index $i$ indicates the time dimension.

Each cell $h_i$ has an input $x_i$ and an output $r_i$. In addition to $x_i$, cell $h_i$ receives as input also $r_{i-1}$.

In a deep (multi-layer) RNN, we don't have a sequence, but a grid. That is, we have several layers of sequences:

  • $h_1^3,...,h_k^3$
  • $h_1^2,...,h_k^2$
  • $h_1^1,...,h_k^1$

Let $r_i^j$ be the output of cell $h_i^j$. Then:

The input to $h_i^1$ is $x_i$ and $r_{i-1}^1$.

The input to $h_i^2$ is $r_i^1$ and $r_{i-1}^2$, and so on.
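
In other words, writing $R^j$ for the cell function of layer $j$ (notation introduced here for convenience) and defining $r_i^0 = x_i$, the recursion is $r_i^j = R^j(r_i^{j-1}, r_{i-1}^j)$: each cell combines the output of the layer below at the same time step with its own layer's output from the previous time step.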

The LSTM (RNN) Interface

RNN / LSTM / GRU all follow the same interface. We have a "builder" which is in charge of creating and defining the parameters for the sequence.


In [2]:
model = Model()
NUM_LAYERS=2
INPUT_DIM=50
HIDDEN_DIM=10
builder = LSTMBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, model)
# or:
# builder = SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, model)

Note that when we create the builder, it adds the internal RNN parameters to the model. We do not need to care about them, but they will be optimized together with the rest of the network's parameters.


In [3]:
s0 = builder.initial_state()

In [4]:
x1 = vecInput(INPUT_DIM)

In [5]:
s1=s0.add_input(x1)
y1 = s1.output()
# here, we add x1 to the RNN, and the output we get from the top is y1 (a HIDDEN_DIM-dim vector)

In [6]:
y1.npvalue().shape


Out[6]:
(10,)

In [7]:
s2=s1.add_input(x1) # we can add another input
y2=s2.output()

If our LSTM/RNN were one layer deep, y2 would be equal to the hidden state. However, since it is 2 layers deep, y2 is only the hidden state (= output) of the last (top) layer.

If we want access to all the hidden states (the outputs of both the first and the last layers), we can use the .h() method, which returns a list of expressions, one for each layer:


In [8]:
print s2.h()


(exprssion 54/0, exprssion 66/0)

The same interface that we saw so far for the LSTM also holds for the Simple RNN:


In [9]:
# create a simple rnn builder
rnnbuilder=SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, model)

# initialize a new graph, and a new sequence
rs0 = rnnbuilder.initial_state()

# add inputs
rs1 = rs0.add_input(x1)
ry1 = rs1.output()
print "all layers:", s1.h()


all layers: (exprssion 32/0, exprssion 42/0)

In [10]:
print s1.s()


(exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)

To summarize, when calling .add_input(x) on an RNNState, the state creates a new RNN/LSTM column, passing it:

  1. the state of the current RNN column
  2. the input x

The new state is then returned, and we can call its output() method to get the output y, which is the output at the top of the column. We can access the outputs of all the layers (not only the last one) using the .h() method of the state.

.s() The internal state of the RNN may be more involved than just the outputs $h$. This is the case for the LSTM, which keeps an extra "memory" cell that is used when calculating $h$, and which is also passed to the next column. To access the entire hidden state, we use the .s() method.

The output of .s() differs by the type of RNN being used. For the simple-RNN, it is the same as .h(). For the LSTM, it is more involved.


In [11]:
rnn_h  = rs1.h()
rnn_s  = rs1.s()
print "RNN h:", rnn_h
print "RNN s:", rnn_s


lstm_h = s1.h()
lstm_s = s1.s()
print "LSTM h:", lstm_h
print "LSTM s:", lstm_s


RNN h: (exprssion 74/0, exprssion 76/0)
RNN s: (exprssion 74/0, exprssion 76/0)
LSTM h: (exprssion 32/0, exprssion 42/0)
LSTM s: (exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)

As we can see, the LSTM has two extra state expressions (the memory cells, one for each hidden layer) before the outputs h.
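
For our 2-layer LSTM this means the first NUM_LAYERS entries of .s() are the memory cells and the remaining entries are the layer outputs. The following minimal sketch (reusing the s1 state from above) splits the state accordingly:

In [ ]:
# split the full LSTM state into memory cells and layer outputs.
# (assuming, as shown above, that the cells come before the h outputs in .s())
full_state   = s1.s()
memory_cells = full_state[:NUM_LAYERS]
outputs      = full_state[NUM_LAYERS:]
print len(memory_cells), len(outputs)   # 2 2
print len(outputs) == len(s1.h())       # True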

Extra options in the RNN/LSTM interface

Stack LSTM The RNNs are shaped as a stack: we can remove the top and continue from the previous state. This is done either by remembering the previous state and continuing it with a new .add_input(), or by accessing the previous state of a given state using its .prev() method.

Initializing a new sequence with a given state When we call builder.initial_state(), the state gets the default (zero) initialization. If we want, we can specify a list of expressions that will serve as the initial state. The expected format is the same as the result of a call to .final_s(). TODO: this is not supported yet.


In [12]:
s2=s1.add_input(x1)
s3=s2.add_input(x1)
s4=s3.add_input(x1)

# let's continue s3 with a new input.
s5=s3.add_input(x1)

# we now have two different sequences:
# s0,s1,s2,s3,s4
# s0,s1,s2,s3,s5
# the two sequences share parameters.

assert(s5.prev() == s3)
assert(s4.prev() == s3)

s6=s3.prev().add_input(x1)
# we now have an additional sequence:
# s0,s1,s2,s6

In [13]:
s6.h()


Out[13]:
(exprssion 184/0, exprssion 196/0)

In [14]:
s6.s()


Out[14]:
(exprssion 180/0, exprssion 192/0, exprssion 184/0, exprssion 196/0)

Aside: memory efficient transduction

The RNNState interface is convenient, and allows for incremental input construction. However, sometimes we know the sequence of inputs in advance, and care only about the sequence of output expressions. In this case, we can use the add_inputs(xs) method, where xs is a list of Expressions.


In [19]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
states = state.add_inputs(xs)
outputs = [s.output() for s in states]
hs =      [s.h() for s in states]
print outputs, hs


[exprssion 248/0, exprssion 254/0, exprssion 260/0] [(exprssion 246/0, exprssion 248/0), (exprssion 251/0, exprssion 254/0), (exprssion 257/0, exprssion 260/0)]

This is convenient.

What if we do not care about .s() and .h(), and do not need to access the previous vectors? In such cases we can use the transduce(xs) method instead of add_inputs(xs). transduce takes in a sequence of Expressions, and returns a sequence of Expressions. As a consequence of not returning RNNStates, transduce is much more memory-efficient than add_inputs or a series of calls to add_input.


In [21]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
outputs = state.transduce(xs)
print outputs


[exprssion 280/0, exprssion 286/0, exprssion 292/0]

Character-level LSTM

Now that we know the basics of RNNs, let's build a character-level LSTM language model. We will have an LSTM that, at each step, gets a character as input and needs to predict the next character.
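
Concretely, the loss computed by do_one_sentence below is a sum of per-character negative log-likelihoods: with the sentence characters $c_1,...,c_n$ (wrapped in <EOS> symbols on both sides), the loss is $-\sum_i \log p(c_{i+1} \mid c_1,...,c_i)$, where $p(\cdot \mid c_1,...,c_i) = \mathrm{softmax}(R \cdot h_i + \mathrm{bias})$ and $h_i$ is the output of the LSTM after reading $c_1,...,c_i$.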


In [15]:
import random
from collections import defaultdict
from itertools import count
import sys

LAYERS = 2
INPUT_DIM = 50 
HIDDEN_DIM = 50  

characters = list("abcdefghijklmnopqrstuvwxyz ")
characters.append("<EOS>")

int2char = list(characters)
char2int = {c:i for i,c in enumerate(characters)}

VOCAB_SIZE = len(characters)

In [16]:
model = Model()


srnn = SimpleRNNBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, model)
lstm = LSTMBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, model)

model.add_lookup_parameters("lookup", (VOCAB_SIZE, INPUT_DIM))
model.add_parameters("R", (VOCAB_SIZE, HIDDEN_DIM))
model.add_parameters("bias", (VOCAB_SIZE))

# compute the loss of the RNN for one sentence
def do_one_sentence(rnn, sentence):
    # setup the sentence
    renew_cg()
    s0 = rnn.initial_state()
    
    
    R = parameter(model["R"])
    bias = parameter(model["bias"])
    lookup = model["lookup"]
    sentence = ["<EOS>"] + list(sentence) + ["<EOS>"]
    sentence = [char2int[c] for c in sentence]
    s = s0
    loss = []
    for char,next_char in zip(sentence,sentence[1:]):
        s = s.add_input(lookup[char])
        probs = softmax(R*s.output() + bias)
        loss.append( -log(pick(probs,next_char)) )
    loss = esum(loss)
    return loss
 

# generate from model:
def generate(rnn):
    def sample(probs):
        rnd = random.random()
        for i,p in enumerate(probs):
            rnd -= p
            if rnd <= 0: break
        return i
    
    # setup the sentence
    renew_cg()
    s0 = rnn.initial_state()
    
    R = parameter(model["R"])
    bias = parameter(model["bias"])
    lookup = model["lookup"]
    
    s = s0.add_input(lookup[char2int["<EOS>"]])
    out=[]
    while True:
        probs = softmax(R*s.output() + bias)
        probs = probs.vec_value()
        next_char = sample(probs)
        out.append(int2char[next_char])
        if out[-1] == "<EOS>": break
        s = s.add_input(lookup[next_char])
    return "".join(out[:-1]) # strip the <EOS>
        

# train, and generate every 5 iterations
def train(rnn, sentence):
    trainer = SimpleSGDTrainer(model)
    for i in xrange(200):
        loss = do_one_sentence(rnn, sentence)
        loss_value = loss.value()
        loss.backward()
        trainer.update()
        if i % 5 == 0: 
            print loss_value,
            print generate(rnn)

Notice that:

  1. We pass the same rnn-builder to do_one_sentence over and over again. We must re-use the same rnn-builder, as this is where the shared parameters are kept.
  2. We call renew_cg() before each sentence -- because we want to have a new graph (a new network) for this sentence. The parameters will be shared through the model and the shared rnn-builder.

In [17]:
sentence = "a quick brown fox jumped over the lazy dog"
train(srnn, sentence)


148.033477783 uloocopmczrrjnai funpgrh gn
91.8539962769 nonfg rujw  ogil rz bfagqome  doouqarbayv  nkfkq fa  ryeafeoi
54.739528656  fie qmek tgs   uzv
30.164937973 z ouwu  ol ruiheb 
10.7598457336 a fuic
3.2049407959 a quico bgow   ary dufox oumped over the lazy dog
1.03240394592 a quick brown fox jumped over the lazy dog
0.650553286076 a quick brown fox jumped over the lazy dog
0.47541359067 a qukck brown fox jumped over the lazy dog
0.374430119991 a quick brown fox jumped over the lazy dog
0.308728694916 a quick brown fox jumped over the lazy dog
0.262536674738 a quick brown fox jumped over the lazy dog
0.22830298543 a quick brown fox jumped over the lazy dog
0.201896265149 a quick brown fox jumped over the lazy dog
0.180914476514 a quick brown fox jumped over the lazy dog
0.1638764292 a quick brown fox jumped over the lazy dog
0.149730876088 a quick brown fox jumped over the lazy dog
0.137800350785 a quick brown fox jumped over the lazy dog
0.127627104521 a quick brown fox jumped over the lazy dog
0.118834555149 a quick brown fox jumped over the lazy dog
0.111203595996 a quick brown fox jumped over the lazy dog
0.104423552752 a quick brown fox jumped over the lazy dog
0.0984442383051 a quick brown fox jumped over the lazy dog
0.0930855795741 a quick brown fox jumped over the lazy dog
0.0882897824049 a quick brown fox jumped over the lazy dog
0.0839573442936 a quick brown fox jumped over the lazy dog
0.0800611972809 a quick brown fox jumped over the lazy dog
0.0764751881361 a quick brown fox jumped over the lazy dog
0.0732067674398 a quick brown fox jumped over the lazy dog
0.0701641961932 a quick brown fox jumped over the lazy dog
0.0673855319619 a quick brown fox jumped over the lazy dog
0.0648249685764 a quick brown fox jumped over the lazy dog
0.0623983256519 a quick brown fox jumped over the lazy dog
0.060231667012 a quick brown fox jumped over the lazy dog
0.0581797286868 a quick brown fox jumped over the lazy dog
0.0562158599496 a quick brown fox jumped over the lazy dog
0.0544277988374 a quick brown fox jumped over the lazy dog
0.0527582615614 a quick brown fox jumped over the lazy dog
0.0511232204735 a quick brown fox jumped over the lazy dog
0.0496219098568 a quick brown fox jumped over the lazy dog

In [18]:
sentence = "a quick brown fox jumped over the lazy dog"
train(lstm, sentence)


137.304412842 prmnuxlwid  d ux udn  o axez da lxojhfu dp z
129.910614014 skedaeip m vn eji ru w 
124.140289307 y ogzi d  uj ch oaoag  p ivo c ft ghd rl od vqo p rm pottx e  ehrto odou pky ojmoxurootff r mhe u
111.025337219 r ktue wkxu
98.0487518311 ckx roxy oinad juee reee yrhs wddt cr hej qvpn to ah jplfee h cejte yh
81.2529830933 d roqo cbn uotx uojp dncfooar gdpbooguy 
68.0369262695  lu gu zupn oxe ex ghd fhx jzr dom eahpfm
54.3006706238 jc cg uupkbkoogvwo no
42.3075561523  uqci jufk def rw ummtl uik born uuuii bkrn dokn rfow lup
31.4120750427 br dtojej
22.4915180206 aa quick brown oox fumpfeer over the ol
15.6982011795  qxic bbrwn fox jumpjoer ooe laz doe
10.5527915955 d gbrown fox jumped overat
6.46443843842 l qumpd dog
3.84945011139 a quick brown fox ee lzt jmmped over the lazy dog
2.08847999573 a quick brown fox jumped over the lazy dog
1.49571335316 a quick brown fox jumped over the lazy dog
1.15657103062 a quicb brown fox jumped over the lazy dog
0.938805937767 a quick brown fox jumped over tan fover the lazy dog
0.788115262985 a quick brown fox jumped over ohe lazy dog
0.678014576435 a quick brown fox jumped over the lazy dog
0.594275057316 a quick brown fox jumped over the lazy dog
0.52849572897 a quicqk brown fox jumped over the lazy dog
0.4755628407 a quick brown fox jumped over the lazy dog
0.432055711746 x quick brown fox jumped over the lazy dog
0.39562445879 a quick broon fox jumped over the lazy dog
0.364839941263 a quick brown fox jumped over the lazed over the lazy dog
0.338363438845 a quick brown fox jumped over the lazy dog
0.315368592739 a quick brown fox jumped over the lazy dog
0.295288294554 a quik brown fox jumped over the lazy dog
0.277546137571 a quick brown fox jumped over the lazy dog
0.261784315109 a quick brown fox jumped over the lazy dog
0.2476875633 a quick brown fox jumpeef over the lazy dog
0.234950631857 a quick brown fox jumped over the lazy dog
0.223440438509 a quick brown fox jumped over the lazy dog
0.21306720376 a quick buown fox jumped over the lazy dog
0.2035497576 a quick brown fox jumped over the lazed overr the lazy dog
0.194814369082 a kuickk qown fox jumped over the lazy dog
0.18677970767 a quick brown fox jumped over the lazy dog
0.179387688637 aa quick brown fox jumped over the lazy dog

The model seems to learn the sentence quite well.

Somewhat surprisingly, the Simple-RNN model learns faster than the LSTM!

How can that be?

The answer is that we are cheating a bit. The sentence we are trying to learn has each letter-bigram exactly once. This means a simple trigram model can memorize it very well.
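
We can check this property of the sentence directly. The following quick check (not part of the tutorial code itself) counts character bigrams and confirms that none of them repeats:

In [ ]:
# sanity check: no character bigram repeats in the training sentence,
# so the previous two characters always determine the next one.
from collections import Counter
sent = "a quick brown fox jumped over the lazy dog"
bigram_counts = Counter(zip(sent, sent[1:]))
print max(bigram_counts.values())   # 1 -- no bigram occurs more than once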

Try it out with more complex sequences.


In [19]:
train(srnn, "these pretzels are making me thirsty")


316.059265137 a quick brown fox jumped over the lazy dog
97.8145599365 a quick brown fox jumped over the lazy dog
52.8635063171 a quickxbrown fox jumped over therntz
24.4791564941 a quick brown fox jumped over thirn vr the n pritu
4.01108551025 thece prejzels are muking mre n the laz oakd gretzels are mupte
1.92300200462 these pretzels are making me thirsty
0.39481985569 these pretzels are making me thirsty
0.135318264365 these pretzels are making me thirsty
0.101885713637 these pretzels are making me thirsty
0.0829276740551 these pretzels are making me thirsty
0.0703410431743 these pretzels are making me thirsty
0.0612589642406 these pretzels are making me thirsty
0.054374050349 these pretzels are making me thirsty
0.0489393435419 these pretzels are making me thirsty
0.0445105880499 these pretzels are making me thirsty
0.0409039668739 these pretzels are making me thirsty
0.0377906225622 these pretzels are making me thirsty
0.0351626649499 these pretzels are making me thirsty
0.0328748859465 these pretzels are making me thirsty
0.0309081021696 these pretzels are making me thirsty
0.0291361920536 these pretzels are making me thirsty
0.0275705922395 these pretzels are making me thirsty
0.0261883735657 these pretzels are making me thirsty
0.0249245800078 these pretzels are making me thirsty
0.0237868223339 these pretzels are making me thirsty
0.0227445568889 these pretzels are making me thirsty
0.0217939633876 these pretzels are making me thirsty
0.020915934816 these pretzels are making me thirsty
0.0200990028679 these pretzels are making me thirsty
0.0193889811635 these pretzels are making me thirsty
0.0187018848956 these pretzels are making me thirsty
0.0180453378707 these pretzels are making me thirsty
0.017415529117 these pretzels are making me thirsty
0.016888782382 these pretzels are making me thirsty
0.0163544174284 these pretzels are making me thirsty
0.0158963911235 these pretzels are making me thirsty
0.0153925828636 these pretzels are making me thirsty
0.0149803832173 these pretzels are making me thirsty
0.0145872598514 these pretzels are making me thirsty
0.0141636235639 these pretzels are making me thirsty