RNNs tutorial


In [1]:
# we assume that the dynet module is in your path.
# OUTDATED: we also assume that LD_LIBRARY_PATH includes a pointer to where libcnn_shared.so is.
from dynet import *

An LSTM/RNN overview:

A (1-layer) RNN can be thought of as a sequence of cells, $h_1,...,h_k$, where the index $i$ in $h_i$ indicates the time dimension.

Each cell $h_i$ has an input $x_i$ and an output $r_i$. In addition to $x_i$, cell $h_i$ also receives $r_{i-1}$ as input.

In a deep (multi-layer) RNN, we don't have a sequence, but a grid. That is, we have several layers of sequences:

  • $h_1^3,...,h_k^3$
  • $h_1^2,...,h_k^2$
  • $h_1^1,...,h_k^1$

Let $r_i^j$ be the output of cell $h_i^j$. Then:

The input to $h_i^1$ is $x_i$ and $r_{i-1}^1$.

The input to $h_i^2$ is $r_i^1$ and $r_{i-1}^2$, and so on.
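
As a rough illustration of this wiring, here is a purely schematic sketch (cell and run_two_layer_rnn below are hypothetical stand-ins, not DyNet code):

# purely illustrative: `cell` is a hypothetical stand-in for whatever the
# recurrence computes (simple RNN, LSTM, GRU); this only shows the wiring.
def run_two_layer_rnn(cell, xs, r0_1, r0_2):
    r1, r2 = r0_1, r0_2     # initial states r_0^1 and r_0^2 of the two layers
    outputs = []
    for x in xs:
        r1 = cell(x, r1)    # layer 1 sees x_i and r_{i-1}^1
        r2 = cell(r1, r2)   # layer 2 sees r_i^1 and r_{i-1}^2
        outputs.append(r2)  # the output at the top of column i
    return outputs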

The LSTM (RNN) Interface

RNN / LSTM / GRU follow the same interface. We have a "builder" which is in charge of creating and defining the parameters for the sequence.


In [2]:
pc = ParameterCollection()
NUM_LAYERS=2
INPUT_DIM=50
HIDDEN_DIM=10
builder = LSTMBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
# or:
# builder = SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)

Note that when we create the builder, it adds the internal RNN parameters to the ParameterCollection. We do not need to care about them, but they will be optimized together with the rest of the network's parameters.


In [3]:
s0 = builder.initial_state()

In [4]:
x1 = vecInput(INPUT_DIM)

In [5]:
s1=s0.add_input(x1)
y1 = s1.output()
# here, we add x1 to the RNN, and the output we get from the top is y1 (a HIDDEN_DIM-dim vector)

In [6]:
y1.npvalue().shape


Out[6]:
(10,)

In [7]:
s2=s1.add_input(x1) # we can add another input
y2=s2.output()

If our LSTM/RNN were one layer deep, y2 would be equal to the hidden state. However, since it is 2 layers deep, y2 is only the hidden state (= output) of the last layer.

If we want access to all the hidden states (the outputs of both the first and the last layers), we can use the .h() method, which returns a list of expressions, one for each layer:


In [8]:
print s2.h()


(exprssion 54/0, exprssion 66/0)

The same interface that we saw until now for the LSTM also holds for the Simple RNN:


In [9]:
# create a simple rnn builder
rnnbuilder=SimpleRNNBuilder(NUM_LAYERS, INPUT_DIM, HIDDEN_DIM, pc)

# initialize a new graph, and a new sequence
rs0 = rnnbuilder.initial_state()

# add inputs
rs1 = rs0.add_input(x1)
ry1 = rs1.output()
print "all layers:", s1.h()


all layers: (exprssion 32/0, exprssion 42/0)

In [10]:
print s1.s()


(exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)

To summarize, when calling .add_input(x) on an RNNState, the state creates a new RNN/LSTM column, passing it:

  1. the state of the current RNN column
  2. the input x

The state is then returned, and we can call its output() method to get the output y, which is the output at the top of the column. We can access the outputs of all the layers (not only the last one) using the .h() method of the state.

.s() The internal state of the RNN may be more involved than just the outputs $h$. This is the case for the LSTM, which keeps an extra "memory" cell that is used when calculating $h$, and which is also passed to the next column. To access the entire hidden state, we use the .s() method.

The output of .s() differs by the type of RNN being used. For the simple-RNN, it is the same as .h(). For the LSTM, it is more involved.


In [11]:
rnn_h  = rs1.h()
rnn_s  = rs1.s()
print "RNN h:", rnn_h
print "RNN s:", rnn_s


lstm_h = s1.h()
lstm_s = s1.s()
print "LSTM h:", lstm_h
print "LSTM s:", lstm_s


RNN h: (exprssion 74/0, exprssion 76/0)
RNN s: (exprssion 74/0, exprssion 76/0)
LSTM h: (exprssion 32/0, exprssion 42/0)
LSTM s: (exprssion 28/0, exprssion 38/0, exprssion 32/0, exprssion 42/0)

As we can see, the LSTM has two extra state expressions (one for each hidden layer) before the outputs h.
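
For example, we can slice the result of .s() to separate the two memory cells from the two outputs (a small sketch using only the variables defined above):

full_state = s1.s()
memory_cells = full_state[:NUM_LAYERS]  # the LSTM memory cells, one per layer
outputs      = full_state[NUM_LAYERS:]  # the outputs, i.e. the same expressions as s1.h()
print "memory cells:", memory_cells
print "outputs:     ", outputs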

Extra options in the RNN/LSTM interface

Stack LSTM The RNNs are shaped as a stack: we can remove the top and continue from the previous state. This is done either by remembering the previous state and continuing it with a new .add_input(), or by accessing the previous state of a given state using its .prev() method.

Initializing a new sequence with a given state When we call builder.initial_state(), we are assuming the state has the default (zero) initialization. If we want, we can specify a list of expressions that will serve as the initial state. The expected format is the same as the result of a call to .final_s(). TODO: this is not supported yet.


In [12]:
s2=s1.add_input(x1)
s3=s2.add_input(x1)
s4=s3.add_input(x1)

# let's continue s3 with a new input.
s5=s3.add_input(x1)

# we now have two different sequences:
# s0,s1,s2,s3,s4
# s0,s1,s2,s3,s5
# the two sequences share parameters.

assert(s5.prev() == s3)
assert(s4.prev() == s3)

s6=s3.prev().add_input(x1)
# we now have an additional sequence:
# s0,s1,s2,s6

In [13]:
s6.h()


Out[13]:
(exprssion 184/0, exprssion 196/0)

In [14]:
s6.s()


Out[14]:
(exprssion 180/0, exprssion 192/0, exprssion 184/0, exprssion 196/0)

Aside: memory efficient transduction

The RNNState interface is convenient, and allows for incremental input construction. However, sometimes we know the sequence of inputs in advance, and care only about the sequence of output expressions. In this case, we can use the add_inputs(xs) method, where xs is a list of Expressions.


In [15]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
states = state.add_inputs(xs)
outputs = [s.output() for s in states]
hs =      [s.h() for s in states]
print outputs, hs


[exprssion 200/0, exprssion 206/0, exprssion 212/0] [(exprssion 198/0, exprssion 200/0), (exprssion 203/0, exprssion 206/0), (exprssion 209/0, exprssion 212/0)]

This is convenient.

What if we do not care about .s() and .h(), and do not need to access the previous vectors? In such cases we can use the transduce(xs) method instead of add_inputs(xs). transduce takes in a sequence of Expressions, and returns a sequence of Expressions. As a consequence of not returning RNNStates, transduce is much more memory efficient than add_inputs or a series of calls to add_input.


In [16]:
state = rnnbuilder.initial_state()
xs = [x1,x1,x1]
outputs = state.transduce(xs)
print outputs


[exprssion 216/0, exprssion 222/0, exprssion 228/0]

Character-level LSTM

Now that we know the basics of RNNs, let's build a character-level LSTM language model: at each time step the LSTM receives a character as input and needs to predict the next character.


In [17]:
import random
from collections import defaultdict
from itertools import count
import sys

LAYERS = 2
INPUT_DIM = 50 
HIDDEN_DIM = 50  

characters = list("abcdefghijklmnopqrstuvwxyz ")
characters.append("<EOS>")

int2char = list(characters)
char2int = {c:i for i,c in enumerate(characters)}

VOCAB_SIZE = len(characters)

In [18]:
pc = ParameterCollection()


srnn = SimpleRNNBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
lstm = LSTMBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)

params = {}
params["lookup"] = pc.add_lookup_parameters((VOCAB_SIZE, INPUT_DIM))
params["R"] = pc.add_parameters((VOCAB_SIZE, HIDDEN_DIM))
params["bias"] = pc.add_parameters((VOCAB_SIZE))

# compute the loss of the RNN for one sentence
def do_one_sentence(rnn, sentence):
    # setup the sentence
    renew_cg()
    s0 = rnn.initial_state()
    
    
    R = parameter(params["R"])
    bias = parameter(params["bias"])
    lookup = params["lookup"]
    sentence = ["<EOS>"] + list(sentence) + ["<EOS>"]
    sentence = [char2int[c] for c in sentence]
    s = s0
    loss = []
    for char,next_char in zip(sentence,sentence[1:]):
        s = s.add_input(lookup[char])
        probs = softmax(R*s.output() + bias)
        loss.append( -log(pick(probs,next_char)) )
    loss = esum(loss)
    return loss
 

# generate from model:
def generate(rnn):
    # sample an index from the distribution given by probs (a list of probabilities)
    def sample(probs):
        rnd = random.random()
        for i,p in enumerate(probs):
            rnd -= p
            if rnd <= 0: break
        return i
    
    # setup the sentence
    renew_cg()
    s0 = rnn.initial_state()
    
    R = parameter(params["R"])
    bias = parameter(params["bias"])
    lookup = params["lookup"]
    
    s = s0.add_input(lookup[char2int["<EOS>"]])
    out=[]
    while True:
        probs = softmax(R*s.output() + bias)
        probs = probs.vec_value()
        next_char = sample(probs)
        out.append(int2char[next_char])
        if out[-1] == "<EOS>": break
        s = s.add_input(lookup[next_char])
    return "".join(out[:-1]) # strip the <EOS>
        

# train, and generate every 5 iterations
def train(rnn, sentence):
    trainer = SimpleSGDTrainer(pc)
    for i in xrange(200):
        loss = do_one_sentence(rnn, sentence)
        loss_value = loss.value()
        loss.backward()
        trainer.update()
        if i % 5 == 0: 
            print loss_value,
            print generate(rnn)

Notice that:

  1. We pass the same rnn-builder to do_one_sentence over and over again. We must re-use the same rnn-builder, as this is where the shared parameters are kept (see the sketch right after this list).
  2. We renew_cg() before each sentence -- because we want to have a new graph (new network) for this sentence. The parameters will be shared through the ParameterCollection and the shared rnn-builder.
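
As a quick illustration of point 1, here is a hypothetical anti-example (not part of the tutorial code): creating a fresh builder for every sentence would add brand-new, untrained RNN parameters to the ParameterCollection on every call, so the recurrent part of the network would never improve.

# hypothetical anti-example -- do NOT do this:
def do_one_sentence_fresh_builder(sentence):
    # a brand-new builder means brand-new, untrained RNN parameters on every call
    fresh_rnn = LSTMBuilder(LAYERS, INPUT_DIM, HIDDEN_DIM, pc)
    return do_one_sentence(fresh_rnn, sentence)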

In [19]:
sentence = "a quick brown fox jumped over the lazy dog"
train(srnn, sentence)


142.737915039 lvawhaevbxulc yxg esuh vkyb gymj dzcnwgq dcjzzk
84.1147460938 woifoa odp jpt gxjofkaattj
44.212223053 a q io  uoopr ouxducmwi jfxa j
23.4485988617 p tctflr
9.73490333557 w
3.23773050308 yaqzteu pux oa rntd  bxumu yyvvfalejuyhed over the lazy dog
1.06309330463 a quick browe fow jumped over the lazy dog
0.671298980713 a quick broyn  ox jumped over the lazy dog
0.490513861179 a quick brown fox jumped over the lazy dog
0.386095941067 a quick brown fox jumped over the lazy dog
0.318082690239 a quick brown fox jumped over the lazy dog
0.270276993513 a quick brown fox jumped over the lazy dog
0.234851941466 a quick brown foz jumped over the lazy dog
0.207555636764 a quick brown fox jumped over the lazy dog
0.185884565115 a quick brown fox jumped over the lazy dog
0.168265148997 a quiuk brown fox jumped over jhe lazy dog
0.153665527701 a quick brown fox jumped over the lazy dog
0.141367897391 a quick brown fox jumped over the lazy dog
0.130873680115 a quick brown fox jumped over the lazy dog
0.121810980141 a quick brown fox jumped over the lazy dog
0.113908931613 a quick brown fox jumped over the lazy dog
0.106958284974 a quick brown fox jumped over the lazy dog
0.100796818733 a quick brown fox jumped over the lazy dog
0.0953008085489 a quick brown fox jumped over the lazy dog
0.090367347002 a zuick brown for jumped over the lazy dog
0.0859087407589 a quick brown fox jumped over the lazy dog
0.0818664133549 a quick brown fox jumped over the lazy dog
0.0781841799617 a quick brown fox jumped over the lazy dog
0.0748091414571 a quick brown fox jumped over the lazy dog
0.0717144161463 a quick brown fox jumped over the lazy dog
0.0688648074865 a quick brown fox jumped over the lazy dog
0.0662328600883 a quick brown fox jumped over the lazy dog
0.0637853741646 a quick brown fox jumped over the lazy dog
0.0615109689534 a quick brown fox jumped over the lazy dog
0.0593910999596 a quick brown fox jumped over the lazy dog
0.0574130378664 a quick brown fox jumped over the lazy dog
0.0555621087551 a quick brown fox jumped over the lazy dog
0.0538215488195 a quick brown fox jumped over the lazy dog
0.0521896965802 a quick brown fox jumped over the lazy dog
0.0506477579474 a quick brown fox jumped over the lazy dog

In [20]:
sentence = "a quick brown fox jumped over the lazy dog"
train(lstm, sentence)


141.891098022 aoyekppy   mocalmz xk atc jlg oaddk
128.925964355 hempeyud  ki
121.445785522 qpveti  fyobec  ztmr eioknnueh ehecdvabxmc ydpmdm
110.670722961 z buws lmy vvrw
93.5055999756 vueoa cprlnkrd o ocazk nb olegiep o fftr t
82.1586227417 zj rvsr oej c toz bnarreow  fffj
67.430847168  rzfik qoyc ohe hqe  oea uitet ou udjkpme oak kdk oe fbu kcz fox dfoprl too o rxat luurnfowrrtj rbtram to url xlj  okrr ooe otm hcy roab llsg doy ifzw rrbow rbowwb oke jxpee
54.9477920532 ba uiy doge she ueeze  oejv
43.3301696777  qquc crgibbroej  oxne ove rr
34.4687461853  uqckk owrbfo og uouk doge l
25.5408306122   reuk lfr own fox juamd ov
18.9417610168   qojn doo broww boan jover txe zacy moen crlw numk fox joge overwa trez quqk browx  ox ruor oro fow j uoez kon fror bowe luccmd ogwr foy jodmoed ox
13.1646575928  qucy dov
9.46595668793 wiuuik brttxl laed over tre lazy dog
5.6522898674  rukc irown fox juaped over the lazy dov
3.38144731522 a quick brown fox jumver the lazy dog
1.80010521412 a bfoin fox jumped ovk  fox luick brown fox jumped over the lazy dog
1.30616080761 a quic brownn fox jumped over the lazy dog
1.02201879025 a quick brown fox jumped over the lazy dog
0.83735615015  qucck brown fox jcmped over the lazy dog
0.708056390285 a quickz brown fox jumped over the lazy dog
0.612650871277 a quick brown fox jumped over the lazy dog
0.539469838142 a quick brown fox jumped over thel lazy dog
0.481610894203 va quick brown fox jumped over the lazy dog
0.434762001038 a quuck dovtbown fox jumped over the lazy dog
0.396079242229 a quick brown fox jumped over the lazy dog
0.363606244326 a quick brown fox jumped over the laza dog
0.335973978043 a quick brown fox jumped over the lazy dog
0.312186658382 a quick brown fox jumped over the lazy dog
0.291498303413 a quick brown fox qu
0.273335546255 a quick brown fox jumped ove
0.257278442383 a quick brown fox jumped over the lazy dog
0.242971763015 a quick brown fox jumped over the lazy dog
0.230153128505 a quick brown fox jumped over the lazy dog
0.218599274755 a quick brown fox jumped over the lazy dog
0.208135351539 a quick brown fox jumped over the lazy dog
0.198613137007 a quick brown fox jumped over tie lazy dog
0.189909905195 a quick brown fox jumped over the lazy dog
0.181928783655 a quick brown fox jumped over the lazy dog
0.174587100744 a quick brown fox jumped over the lazy dog

The model seems to learn the sentence quite well.

Somewhat surprisingly, the Simple-RNN model learns quicker than the LSTM!

How can that be?

The answer is that we are cheating a bit. The sentence we are trying to learn has each letter-bigram exactly once. This means a simple trigram model can memorize it very well.
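
To make this concrete, here is a small sketch (plain Python, unrelated to the DyNet model): since every letter-bigram in the sentence is unique, a table mapping each bigram to the character that follows it, i.e. a trigram model, reproduces the sentence perfectly.

# a bigram -> next-character table memorizes the sentence
sent = "a quick brown fox jumped over the lazy dog"
table = {}
for a, b, c in zip(sent, sent[1:], sent[2:]):
    table[(a, b)] = c

out = list(sent[:2])
while (out[-2], out[-1]) in table:
    out.append(table[(out[-2], out[-1])])
print "".join(out)  # prints the original sentence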

Try it out with more complex sequences.


In [21]:
train(srnn, "these pretzels are making me thirsty")


332.651580811 a quick brown fox jumped over the lazy dog
133.209350586 a quick brown fox jumped over the lazy doe hu yum xd the
65.0720596313 azquick brown fox jumped over ohe iog
31.5592880249 a quick brown fox jumpedrovtretpede pretzelz are makink ma tui idmilt
13.2322559357 theve prwtumpede mhxtjaypny mreticv
1.87829053402 thele pretzelb mre laki loet dre za tuiri mtoina ma qui irwt ere sa taetsdaca qamtuioe ma ick mrolnn mhetsirstyyza qa luijuoethetsepsaaya quirk brmtze ehersjlyaa aumu orkrbtoeqz lrea quijk jrowza quiquihi sakiny mr tui ss thels theqetursy famtzi maethehe iretza lamqzd zretsels area qhirk browna yhetza quirkt rxkwn mox ja isi mq thirsty
0.680327475071 these pretzels are makind me thirsty
0.176128521562 these pretzels are making me thirsty
0.126334354281 these pretzels are making me thirsty
0.10075186193 these pretzels are making me thirsty
0.0846510156989 these pretzels are making me thirsty
0.0734022557735 these pretzels are making me thirsty
0.0650328546762 these pretzels are making me thirsty
0.0585154108703 these pretzels are making me thirsty
0.0532807298005 these pretzels are making me thirsty
0.0489665567875 these pretzels are making me thirsty
0.0453444086015 these pretzels are making me thirsty
0.0422535128891 these pretzels are making me thirsty
0.0395833179355 these pretzels are making me thirsty
0.0372485220432 these mretzels are making me thirsty
0.0351839251816 these pretzels are making me thirsty
0.0333509668708 these pretzels are making me thirsty
0.0317104011774 these pretzels are making me thirsty
0.0302277039737 these pretzels are making me thirsty
0.0288887582719 these pretzels are making me thirsty
0.0276643745601 these pretzels are making me thirsty
0.0265435613692 these pretzels are making me thirsty
0.0255212895572 these pretzels are making me thirsty
0.0245705824345 these pretzels are making me thirsty
0.0236932244152 these pretzels are making me thirsty
0.0228785891086 these pretzels are making me thirsty
0.0221205893904 these pretzels are making me thirsty
0.0214090794325 these pretzels are making me thirsty
0.0207556784153 these pretzels are making me thirsty
0.0201329570264 these pretzels are making me thirsty
0.0195484217256 these pretzels are making me thirsty
0.0190003421158 these pretzels are making me thirsty
0.0184785164893 these pretzels are making me thirsty
0.0179911740124 these pretzels are making me thirsty
0.0175334792584 these pretzels are making me thirsty
