21 June 2017 - WNixalo - Practical Deep Learning I - Lesson 6 CodeAlong Notebook / Lecture


In [4]:
import theano
%matplotlib inline
import sys, os
sys.path.insert(1, os.path.join('../utils'))
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

Setup

We're going to download the collected works of Nietzsche to use as our data for this class.


In [5]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))


Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
corpus length: 600901

In [6]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1
print('total chars:', vocab_size)


total chars: 86

Sometimes it's useful to have a zero value in the dataset, e.g. for padding.


In [7]:
chars.insert(0, "\0")

In [8]:
''.join(chars[1:-6])


Out[8]:
'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz'

Map from chars to indices and back again


In [9]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

idx will be the data we use from now on -- it simply converts all the characters to their index (based on the mapping above)


In [10]:
idx = [char_indices[c] for c in text]
# the 1st 10 characters:
idx[:10]


Out[10]:
[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [11]:
''.join(indices_char[i] for i in idx[:70])


Out[11]:
'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'

3 Char Model

Create Inputs

Create four lists of every 3rd character, starting at the 0th, 1st, 2nd, and 3rd characters respectively.

We're going to build a model that attempts to predict the 4th character from the previous 3. To do that we're going to step through our whole list of indexes three at a time, creating a list of the 0th, 3rd, 6th, 9th, etc. characters; the 1st, 4th, 7th, etc.; the 2nd, 5th, 8th, etc.; and finally the 3rd, 6th, 9th, etc., which are the characters we'll predict.


In [12]:
cs = 3
c1_dat = [idx[i] for i in xrange(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in xrange(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in xrange(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in xrange(0, len(idx)-1-cs, cs)] # <-- gonna predict this

Our inputs


In [13]:
# we can turn these into Numpy arrays just by stacking them up together
x1 = np.stack(c1_dat[:-2]) # 1st chars
x2 = np.stack(c2_dat[:-2]) # 2nd chars
x3 = np.stack(c3_dat[:-2]) # 3rd chars
# for each 4-character piece of the collected works

Our output


In [14]:
# labels will just be the 4th characters
y = np.stack(c4_dat[:-2])

The first few inputs and outputs


In [15]:
# 1st, 2nd, 3rd chars of text
x1[:4], x2[:4], x3[:4]


Out[15]:
(array([40, 30, 29,  1]), array([42, 25,  1, 43]), array([29, 27,  1, 45]))

In [16]:
# 4th char of text
y[:3]


Out[16]:
array([30, 29,  1])

We'll try to predict 30 from (40, 42, 29); 29 from (30, 25, 27); and so on. That's our data format.


In [17]:
x1.shape, y.shape


Out[17]:
((200297,), (200297,))

The number of latent factors to create (i.e. the embedding size for each of our 3 character inputs)


In [18]:
# we're going to turn these into embeddings
n_fac = 42
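
With vocab_size = 86 and n_fac = 42, each embedding matrix created below will hold 86 x 42 = 3,612 weights (the same count that shows up later under the Embedding layer in model.summary()).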

In [19]:
# by creating an embedding matrix
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

In [20]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)
# c1, c2, c3 represent the result of putting each char through the embedding and
# getting out 42 latent factors <-- those are the inputs to the 'green arrow'.

Create and train model

Pick a size for our hidden state


In [22]:
n_hidden = 256

This is the 'green arrow' from our diagram - the layer operation from input to hidden.


In [23]:
dense_in = Dense(n_hidden, activation='relu')

Our first hidden activation is simply this function applied to the result of the embedding of the first character.


In [24]:
c1_hidden = dense_in(c1)

This is the 'orange arrow' from our diagram - the layer operation from hidden to hidden.


In [25]:
dense_hidden = Dense(n_hidden, activation='tanh')

Our second and third hidden activations add the previous hidden state (after applying dense_hidden) to the new input state (after applying dense_in).


In [26]:
c2_dense = dense_in(c2) # char-2 embedding thru greenarrow
hidden_2 = dense_hidden(c1_hidden) # output of char-1's hidden state thru orangearrow
c2_hidden = merge([c2_dense, hidden_2]) # merge the two together (default: sum)

In [27]:
c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

This is the 'blue arrow' from our diagram - the layer operation from hidden to output.


In [28]:
dense_out = Dense(vocab_size, activation='softmax') #output size: 86 <-- vocab_size

The third hidden state is the input to our output layer.


In [29]:
c4_out = dense_out(c3_hidden)

In [30]:
# passing in our 3 inputs & 1 output
model = Model([c1_in, c2_in, c3_in], c4_out)

In [ ]:
model.summary()

In [31]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
model.optimizer.lr=0.001

In [32]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)


Epoch 1/10
200297/200297 [==============================] - 11s - loss: 2.3986    
Epoch 2/10
200297/200297 [==============================] - 13s - loss: 2.2581    
Epoch 3/10
200297/200297 [==============================] - 13s - loss: 2.2052    
Epoch 4/10
200297/200297 [==============================] - 12s - loss: 2.1616    
Epoch 5/10
200297/200297 [==============================] - 12s - loss: 2.1305    
Epoch 6/10
200297/200297 [==============================] - 12s - loss: 2.1096    
Epoch 7/10
200297/200297 [==============================] - 12s - loss: 2.0952    
Epoch 8/10
200297/200297 [==============================] - 12s - loss: 2.0855    
Epoch 9/10
200297/200297 [==============================] - 12s - loss: 2.0775    
Epoch 10/10
200297/200297 [==============================] - 14s - loss: 2.0710    
Out[32]:
<keras.callbacks.History at 0x11d5979d0>

Test model

We test it by creating a function that we pass 3 letters into. It turns those letters into character indices (by looking them up in char_indices), turns each of those into a Numpy array, and calls model.predict on those 3 arrays, which gives us 86 outputs (one probability per character). We then take an argmax to find which of those 86 is the highest: that's the character index we want to return.

So basically: we give it 3 letters, it gives us back the letter it thinks is most likely next.


In [33]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict(arrs)
    i = np.argmax(p)
    return chars[i]

In [34]:
get_next('phi')


Out[34]:
'l'

In [36]:
get_next(' th')


Out[36]:
'e'

In [37]:
get_next(' an')


Out[37]:
'd'

Our first RNN:

Create inputs

This is the size of our unrolled RNN.


In [38]:
cs = 8 # use 8 characters to predict the 9th

For each 0 thru 7, create a list of every 8th character with that starting point. These will be the 8 inputs to our model.


In [39]:
c_in_dat = [[idx[i+n] for i in xrange(0, len(idx)-1-cs, cs)] for n in range(cs)]

^ This creates a list with 8 elements; each element contains a list of the 0th, 8th, 16th, 24th... characters, the 1st, 9th, 17th, 25th... characters, and so on, just as before: a sequence of inputs where each one is offset by 1 from the previous one.

Then create a list of the next character in each of these series. These will be the labels for our model: the output is built exactly the same way, except indexed across by cs (8), so it is the 8th character of each sequence, to be predicted from the previous ones.


In [40]:
c_out_dat = [idx[i+cs] for i in xrange(0, len(idx)-1-cs,cs)]

In [41]:
# go thru every one of those input lists and turn into Numpy array:
xs = [np.stack(c[:-2]) for c in c_in_dat]

In [64]:
len(xs), xs[0].shape


Out[64]:
(8, (75110,))

In [65]:
y = np.stack(c_out_dat[:-2])

So each column below is one series of 8 characters from the text.


In [66]:
# visualizing xs:
[xs[n][:cs] for n in range(cs)]


Out[66]:
[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

Reading down the first column of these arrays gives the first 8 characters of our text.

...and this is the next character after each sequence:


In [67]:
y[:cs]


Out[67]:
array([ 1, 33,  2, 72, 67, 73,  2, 68])

NOTE: this is almost the same as the first row of xs, shifted over by one: the label for each sequence is the first character of the next sequence. It's essentially the same data as before, just constructed in a more flexible way.
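
A quick sanity check of that alignment, as a minimal sketch using the xs and y arrays built above (my own check, not from the lecture):

# each label should equal the first character of the following sequence
np.all(xs[0][1:] == y[:-1])   # expect True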

Create and train Model


In [70]:
n_fac = 42
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [71]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]
n_hidden = 256

In [72]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax')

The first character of each sequence goes through dense_in(), to create our first hidden activations.


In [73]:
hidden = dense_in(c_ins[0][1])

Then for each successive layer we combine the output of dense_in() on the next character with the output of dense_hidden() on the current hidden state, to create the new hidden state.


In [77]:
for i in range(1,cs):
    c_dense = dense_in(c_ins[i][1]) #green arrow
    hidden = dense_hidden(hidden)   #orange arrow
    hidden = merge([c_dense, hidden]) #merge the two together

Putting the final hidden state through dense_out() gives us our output:


In [78]:
c_out = dense_out(hidden)

So now we can create our model


In [80]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [81]:
model.fit(xs, y, batch_size=64, nb_epoch=12)


Epoch 1/12
75110/75110 [==============================] - 10s - loss: 2.5348    
Epoch 2/12
75110/75110 [==============================] - 11s - loss: 2.2498    
Epoch 3/12
75110/75110 [==============================] - 12s - loss: 2.1498    
Epoch 4/12
75110/75110 [==============================] - 11s - loss: 2.0812    
Epoch 5/12
75110/75110 [==============================] - 12s - loss: 2.0281    
Epoch 6/12
75110/75110 [==============================] - 11s - loss: 1.9836    
Epoch 7/12
75110/75110 [==============================] - 11s - loss: 1.9452    
Epoch 8/12
75110/75110 [==============================] - 11s - loss: 1.9101    
Epoch 9/12
75110/75110 [==============================] - 11s - loss: 1.8785    
Epoch 10/12
75110/75110 [==============================] - 11s - loss: 1.8492    
Epoch 11/12
75110/75110 [==============================] - 11s - loss: 1.8215    
Epoch 12/12
75110/75110 [==============================] - 11s - loss: 1.7967    
Out[81]:
<keras.callbacks.History at 0x123b8ca10>

With 8 pieces of context instead of 3, we'd expect it to do better; and we see a loss of ~1.8 instead of ~2.0

Test model:


In [100]:
def get_next(inp):
    idxs = [np.array(char_indices[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [101]:
get_next('for thos')


Out[101]:
' '

In [86]:
get_next('part of ')


Out[86]:
't'

In [87]:
get_next('queens a')


Out[87]:
'n'

Returning Sequences

Create Inputs

Here, c_out_dat is identical to c_in_dat, but moved across 1 character.

So now, in each sequence, the 1st char will be used to predict the 2nd, the 1st & 2nd to predict the 3rd, and so on. That's a lot more predictions going on, and so a lot more opportunity for the model to learn.


In [102]:
# c_in_dat = [[idx[i+n] for i in xrange(0, len(idx)-1-cs, cs)] for n in range(cs)]
c_out_dat = [[idx[i+n] for i in xrange(1, len(idx)-cs, cs)] for n in range(cs)]

In [103]:
ys = [np.stack(c[:-2]) for c in c_out_dat]

In [104]:
[xs[n][:cs] for n in range(cs)]


Out[104]:
[array([[40],
        [ 1],
        [33],
        [ 2],
        [72],
        [67],
        [73],
        [ 2]]), array([[42],
        [ 1],
        [38],
        [44],
        [ 2],
        [ 9],
        [61],
        [73]]), array([[29],
        [43],
        [31],
        [71],
        [54],
        [ 9],
        [58],
        [61]]), array([[30],
        [45],
        [ 2],
        [74],
        [ 2],
        [76],
        [67],
        [58]]), array([[25],
        [40],
        [73],
        [73],
        [76],
        [61],
        [24],
        [71]]), array([[27],
        [40],
        [61],
        [61],
        [68],
        [54],
        [ 2],
        [58]]), array([[29],
        [39],
        [54],
        [ 2],
        [66],
        [73],
        [33],
        [ 2]]), array([[ 1],
        [43],
        [73],
        [62],
        [54],
        [ 2],
        [72],
        [67]])]

In [105]:
[ys[n][:cs] for n in range(cs)]


Out[105]:
[array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67]),
 array([ 1, 33,  2, 72, 67, 73,  2, 68])]

Now our y dataset looks exactly like our x dataset did before, but everything's shifted over by 1 character.

Create and train the model:


In [106]:
dense_in = Dense(n_hidden, activation='relu')
dense_out = Dense(vocab_size, activation='softmax', name='output')

We're going to pass a vector of all zeros as our starting point - here's our input layer for that:


In [107]:
# our char1 input is moved within the diagram's loop-box; so now need 
# initialized input (zeros)
inp1 = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(inp1)

In [108]:
outs = []

for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden], mode='sum')
    # every layer now has an output
    outs.append(dense_out(hidden))

# our loop is identical to before, except at the end of every loop, 
# we're going to append this output; so now we're going to have 
# 8 outputs for every sequence instead of just 1.

In [109]:
# model now has vector of 0s: [inp1], and array of outputs: outs
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [110]:
zeros = np.tile(np.zeros(n_fac), (len(xs[0]),1))
zeros.shape


Out[110]:
(75110, 42)

Now when we fit, we add the array of zeros to the start of our inputs; our outputs are going to be those lists of 8, offset by 1. We get 8 losses instead of 1 because each of those 8 outputs has its own loss. You'll see the model's ability to predict the 1st character from a bunch of zeros is very limited and flattens out, but predicting the 8th char with the context of 7 is much better and keeps improving.


In [112]:
model.fit([zeros]+xs, ys, batch_size=64, nb_epoch=12)


Epoch 1/12
75110/75110 [==============================] - 15s - loss: 20.4057 - output_loss_1: 2.7273 - output_loss_2: 2.6021 - output_loss_3: 2.5623 - output_loss_4: 2.5195 - output_loss_5: 2.5127 - output_loss_6: 2.4935 - output_loss_7: 2.5060 - output_loss_8: 2.4824    
Epoch 2/12
75110/75110 [==============================] - 16s - loss: 18.0430 - output_loss_1: 2.5157 - output_loss_2: 2.3603 - output_loss_3: 2.2590 - output_loss_4: 2.1993 - output_loss_5: 2.1854 - output_loss_6: 2.1718 - output_loss_7: 2.1881 - output_loss_8: 2.1634    
Epoch 3/12
75110/75110 [==============================] - 17s - loss: 17.5095 - output_loss_1: 2.4969 - output_loss_2: 2.3358 - output_loss_3: 2.2014 - output_loss_4: 2.1225 - output_loss_5: 2.1005 - output_loss_6: 2.0808 - output_loss_7: 2.0971 - output_loss_8: 2.0744    
Epoch 4/12
75110/75110 [==============================] - 17s - loss: 17.1960 - output_loss_1: 2.4896 - output_loss_2: 2.3261 - output_loss_3: 2.1684 - output_loss_4: 2.0776 - output_loss_5: 2.0481 - output_loss_6: 2.0257 - output_loss_7: 2.0425 - output_loss_8: 2.0180    
Epoch 5/12
75110/75110 [==============================] - 16s - loss: 16.9768 - output_loss_1: 2.4851 - output_loss_2: 2.3199 - output_loss_3: 2.1488 - output_loss_4: 2.0462 - output_loss_5: 2.0117 - output_loss_6: 1.9861 - output_loss_7: 2.0030 - output_loss_8: 1.9760    
Epoch 6/12
75110/75110 [==============================] - 17s - loss: 16.8223 - output_loss_1: 2.4824 - output_loss_2: 2.3167 - output_loss_3: 2.1359 - output_loss_4: 2.0232 - output_loss_5: 1.9855 - output_loss_6: 1.9565 - output_loss_7: 1.9731 - output_loss_8: 1.9490    
Epoch 7/12
75110/75110 [==============================] - 17s - loss: 16.6977 - output_loss_1: 2.4811 - output_loss_2: 2.3130 - output_loss_3: 2.1257 - output_loss_4: 2.0075 - output_loss_5: 1.9629 - output_loss_6: 1.9330 - output_loss_7: 1.9509 - output_loss_8: 1.9236    
Epoch 8/12
75110/75110 [==============================] - 17s - loss: 16.5996 - output_loss_1: 2.4795 - output_loss_2: 2.3121 - output_loss_3: 2.1197 - output_loss_4: 1.9930 - output_loss_5: 1.9474 - output_loss_6: 1.9130 - output_loss_7: 1.9294 - output_loss_8: 1.9054    
Epoch 9/12
75110/75110 [==============================] - 16s - loss: 16.5198 - output_loss_1: 2.4785 - output_loss_2: 2.3106 - output_loss_3: 2.1142 - output_loss_4: 1.9827 - output_loss_5: 1.9337 - output_loss_6: 1.8992 - output_loss_7: 1.9138 - output_loss_8: 1.8870    
Epoch 10/12
75110/75110 [==============================] - 16s - loss: 16.4499 - output_loss_1: 2.4771 - output_loss_2: 2.3079 - output_loss_3: 2.1102 - output_loss_4: 1.9729 - output_loss_5: 1.9213 - output_loss_6: 1.8853 - output_loss_7: 1.9018 - output_loss_8: 1.8732    
Epoch 11/12
75110/75110 [==============================] - 17s - loss: 16.3920 - output_loss_1: 2.4767 - output_loss_2: 2.3074 - output_loss_3: 2.1058 - output_loss_4: 1.9655 - output_loss_5: 1.9125 - output_loss_6: 1.8738 - output_loss_7: 1.8897 - output_loss_8: 1.8607    
Epoch 12/12
75110/75110 [==============================] - 17s - loss: 16.3412 - output_loss_1: 2.4762 - output_loss_2: 2.3068 - output_loss_3: 2.1031 - output_loss_4: 1.9585 - output_loss_5: 1.9031 - output_loss_6: 1.8646 - output_loss_7: 1.8782 - output_loss_8: 1.8506    
Out[112]:
<keras.callbacks.History at 0x1244f6f10>

This is what a sequence model looks like. We pass in a sequence and after every character, it returns a guess.

Test Model:


In [115]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [116]:
get_nexts(' this is')


[' ', 't', 'h', 'i', 's', ' ', 'i', 's']
Out[116]:
['t', 'h', 'e', 't', ' ', 'c', 'n', ' ']

In [118]:
get_nexts(' part of')


[' ', 'p', 'a', 'r', 't', ' ', 'o', 'f']
Out[118]:
['t', 'o', 'r', 'i', 'i', 'o', 'f', ' ']

Sequence model with Keras

return_sequences=True says: rather than put the triangle outside the loop, put it inside the recurrent loop; i.e. return an output every time you go to another time-step, instead of just a single output at the end.


In [120]:
n_hidden, n_fac, cs, vocab_size


Out[120]:
(256, 42, 8, 86)

To convert our previous Keras model into a sequence model, simply add the return_sequences=True parameter, and wrap TimeDistributed() around our dense layer.


In [121]:
model = Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, return_sequences=True, activation='relu', inner_init='identity'),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
])

In [122]:
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_4 (Embedding)          (None, 8, 42)         3612        embedding_input_1[0][0]          
____________________________________________________________________________________________________
simplernn_1 (SimpleRNN)          (None, 8, 256)        76544       embedding_4[0][0]                
____________________________________________________________________________________________________
timedistributed_1 (TimeDistribut (None, 8, 86)         22102       simplernn_1[0][0]                
====================================================================================================
Total params: 102,258
Trainable params: 102,258
Non-trainable params: 0
____________________________________________________________________________________________________

Note the 8 outputs. What TimeDistributed does is apply the same Dense layer at each of the 8 time-steps (conceptually 8 copies of the layer, all sharing one weight matrix), giving one output per step.
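
A quick check of the parameter counts above confirms the weights are shared across time-steps rather than duplicated: Embedding 86 x 42 = 3,612; SimpleRNN 42 x 256 + 256 x 256 + 256 = 76,544; TimeDistributed Dense 256 x 86 + 86 = 22,102. There is no extra factor of 8 anywhere.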

NOTE: in Keras, anytime you specify return_sequences=True, any Dense layers after that must have TimeDistributed wrapped around them, because in this case we want to apply the dense layer at each of the 8 time-steps, not just once.


In [123]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [124]:
# just some dimensionality changes required; otherwise same
x_rnn = np.stack(np.squeeze(xs), axis=1)
y_rnn = np.stack(ys, axis=1)

In [125]:
x_rnn.shape, y_rnn.shape


Out[125]:
((75110, 8), (75110, 8, 1))

In [126]:
model.fit(x_rnn, y_rnn, batch_size=64, nb_epoch=8)


/Users/WayNoxchi/Miniconda3/Theano/theano/tensor/basic.py:5130: UserWarning: flatten outdim parameter is deprecated, use ndim instead.
  "flatten outdim parameter is deprecated, use ndim instead.")
Epoch 1/8
75110/75110 [==============================] - 19s - loss: 2.4360    
Epoch 2/8
75110/75110 [==============================] - 19s - loss: 1.9980    
Epoch 3/8
75110/75110 [==============================] - 18s - loss: 1.8819    
Epoch 4/8
75110/75110 [==============================] - 18s - loss: 1.8230    
Epoch 5/8
75110/75110 [==============================] - 18s - loss: 1.7856    
Epoch 6/8
75110/75110 [==============================] - 19s - loss: 1.7589    
Epoch 7/8
75110/75110 [==============================] - 21s - loss: 1.7387    
Epoch 8/8
75110/75110 [==============================] - 23s - loss: 1.7232    
Out[126]:
<keras.callbacks.History at 0x12be0a750>

In [128]:
def get_nexts_keras(inp):
    idxs = [char_indices[c] for c in inp]
    arr = np.array(idxs)[np.newaxis,:]
    p = model.predict(arr)[0]
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [129]:
get_nexts_keras(' this is')


[' ', 't', 'h', 'i', 's', ' ', 'i', 's']
Out[129]:
['t', 'h', 'e', 's', ' ', 'i', 's', ' ']

Stateful model with Keras

A stateful model is easy to create (just add stateful=True) but harder to train. We had to add batchnorm and use LSTM to get reasonable results.

When using stateful in Keras, you have to also add batch_input_shape to the first layer, and fix the batch size there.


We need shuffle=False and stateful=True in order for the LSTM to have memory across batches. Setting stateful=True tells Keras not to reset the hidden activations to zero after each batch, but to leave them as they are, allowing the model to build up as much state as it wants. If this is done, then shuffle must be False, so it passes in the 1st 8 chars, then the 2nd 8, and so on, in order, leaving the hidden state untouched in between each one.

Training these stateful models is a lot harder than other models due to exploding gradients (exploding activations). Learning these long-term dependencies was considered all but impossible until the '90s, when researchers invented the LSTM.

In the LSTM, the plain recurrent weight-matrix loop is replaced with a loop containing a small neural network that decides how much of the state to keep, and how much of it to use, at each time-step. This way the model can learn to avoid gradient explosions, and can actually learn an effective sequence model.
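
For reference, the standard LSTM formulation (not spelled out in the lecture) makes those decisions with learned gates:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),\quad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),\quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),\qquad h_t = o_t \odot \tanh(c_t)$$

The forget gate $f_t$ controls how much of the old cell state to keep, the input gate $i_t$ how much new information to write, and the output gate $o_t$ how much of the state to expose as the hidden activation.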

Below, an LSTM and BatchNormed inputs are used because J.H. had no luck with plain RNNs and ReLUs here.


In [130]:
bs = 64

In [132]:
model = Sequential([
            Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(bs,8)),
            BatchNormalization(),
            LSTM(n_hidden, return_sequences=True, stateful=True),
            TimeDistributed(Dense(vocab_size, activation='softmax')),
])

In [ ]:
# don't forget to compile (accidentally hit `M` in the Jupyter notebook)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

Since we're using a fixed batch shape, we have to ensure our inputs and outputs are a whole multiple of the batch size.


In [135]:
mx = len(x_rnn)//bs*bs
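
With len(x_rnn) = 75,110 and bs = 64, this gives mx = 1173 x 64 = 75,072, which is exactly the sample count reported in the training runs below.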

The LSTM model takes much longer to run than the regular RNN because it isn't in parallel: each operation has to be run in order.


In [138]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)


Epoch 1/4
75072/75072 [==============================] - 84s - loss: 2.2238    
Epoch 2/4
75072/75072 [==============================] - 93s - loss: 1.9708    
Epoch 3/4
75072/75072 [==============================] - 80s - loss: 1.8960    
Epoch 4/4
75072/75072 [==============================] - 79s - loss: 1.8515    
Out[138]:
<keras.callbacks.History at 0x12dc79fd0>
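
An aside (my own note, not from the lecture): since stateful=True carries the hidden state across batches and across calls to fit, you may sometimes want to clear it explicitly, e.g. before continuing training on a different piece of text. Keras stateful models expose a method for this:

model.reset_states()   # sketch: zero out the persistent hidden state of the stateful LSTM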

In [139]:
model.optimizer.lr=1e-4
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)


Epoch 1/4
75072/75072 [==============================] - 86s - loss: 1.8191    
Epoch 2/4
75072/75072 [==============================] - 88s - loss: 1.7933    
Epoch 3/4
75072/75072 [==============================] - 84s - loss: 1.7716    
Epoch 4/4
75072/75072 [==============================] - 98s - loss: 1.7525    
Out[139]:
<keras.callbacks.History at 0x1302b4890>

One-Hot Sequence Model with Keras

It's necessary to use one-hot encoding in order to build an RNN directly in Theano, which is up next.

This is the Keras version of the Theano model we're about to create.


In [141]:
model = Sequential([
            SimpleRNN(n_hidden, return_sequences=True, input_shape=(cs, vocab_size),
                      activation='relu', inner_init='identity'),
            TimeDistributed(Dense(vocab_size, activation='softmax')),
])
model.compile(loss='categorical_crossentropy', optimizer=Adam())
# no embedding layer, so inputs must be one-hot encoded too.

In [142]:
oh_ys = [to_categorical(o, vocab_size) for o in ys]
oh_y_rnn = np.stack(oh_ys, axis=1)

oh_xs = [to_categorical(o, vocab_size) for o in xs]
oh_x_rnn = np.stack(oh_xs, axis=1)

oh_x_rnn.shape, oh_y_rnn.shape


Out[142]:
((75110, 8, 86), (75110, 8, 86))

The 86 is the one-hot dimension: the number of character classes.


In [144]:
model.fit(oh_x_rnn, oh_y_rnn, batch_size=64, nb_epoch=8)


Epoch 1/8
75110/75110 [==============================] - 21s - loss: 2.4420    
Epoch 2/8
75110/75110 [==============================] - 20s - loss: 2.0373    
Epoch 3/8
75110/75110 [==============================] - 20s - loss: 1.9232    
Epoch 4/8
75110/75110 [==============================] - 20s - loss: 1.8581    
Epoch 5/8
75110/75110 [==============================] - 25s - loss: 1.8156    
Epoch 6/8
75110/75110 [==============================] - 23s - loss: 1.7850    
Epoch 7/8
75110/75110 [==============================] - 21s - loss: 1.7609    
Epoch 8/8
75110/75110 [==============================] - 20s - loss: 1.7415    
Out[144]:
<keras.callbacks.History at 0x12ec9ec50>

In [145]:
def get_nexts_oh(inp):
    idxs = np.array([char_indices[c] for c in inp])
    arr = to_categorical(idxs, vocab_size)
    p = model.predict(arr[np.newaxis,:])[0]
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [146]:
get_nexts_oh(' this is')


[' ', 't', 'h', 'i', 's', ' ', 'i', 's']
Out[146]:
['t', 'h', 'e', 'n', ' ', 't', 's', ' ']

Theano RNN

Sometimes you just need more control over an artificial mind.


In [152]:
n_input = vocab_size
n_output = vocab_size

Using raw Theano, we have to create our weight matrices and bias vectors ourselves - here are the functions we'll use to do so (using Glorot initialization).

The return values are wrapped in shared(), which is how we tell Theano that it can manage this data (copying it to and from the GPU as necessary).
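
As a tiny illustration of shared variables (a sketch, assuming shared here is theano.shared as used in the cells below):

w = shared(np.zeros(3, dtype=np.float32))   # Theano now manages this buffer
w.get_value()                               # ...but we can still read it back as a numpy array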


In [166]:
def init_wgts(rows, cols): 
    scale = math.sqrt(2/rows) # 1st calc Glorot number to scale weights
    return shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))
def init_bias(rows): 
    return shared(np.zeros(rows, dtype=np.float32))

We return the weights and biases together as a tuple. For the hidden weights, we'll use an identity initialization (as recommended by Hinton.)


In [167]:
def wgts_and_bias(n_in, n_out): 
    return init_wgts(n_in, n_out), init_bias(n_out)
def id_and_bias(n): 
    return shared(np.eye(n, dtype=np.float32)), init_bias(n)

Unlike ordinary Python code, Theano requires us to build up a computation graph first. shared(..) basically tells Theano to keep track of something so it can be sent to the GPU later; once you wrap something in shared(), it effectively belongs to Theano.


Theano doesn't actually do any computations until we explicitly compile and evaluate the function (at which point it'll be turned into CUDA code and sent off to the GPU). So our job is to describe the computations that we'll want theano to do - the first step is to tell theano what inputs we'll be providing to our computation:
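
A minimal sketch of that describe-then-compile workflow (a toy example of my own, not from the lecture):

a = T.scalar('a'); b = T.scalar('b')      # symbolic inputs: nothing is computed yet
f = theano.function([a, b], a * b + 1)    # compile the expression graph into a callable
f(2., 3.)                                 # only now does any arithmetic happen -> 7.0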


In [168]:
# Theano variables
t_inp = T.matrix('inp')
t_outp = T.matrix('outp')
t_h0 = T.vector('h0')
lr = T.scalar('lr')

all_args = [t_h0, t_inp, t_outp, lr]

Now we're ready to create our initial weight matrices.


In [169]:
W_h = id_and_bias(n_hidden)
W_x = wgts_and_bias(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

We now need to tell Theano what happens each time we take a single step of this RNN.


Theano handles looping by using the GPU scan operation. We have to tell theano what to do at each step through the scan - this is the function we'll use, which does a single forward pass for one character.
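
Before the real thing, here's a toy example of scan (a sketch of my own, just to show the sequences / outputs_info pattern): a running sum over a vector.

v = T.vector('v'); s0 = T.scalar('s0')
sums, _ = theano.scan(lambda x, prev: prev + x, sequences=v, outputs_info=s0)
theano.function([v, s0], sums, allow_input_downcast=True)([1, 2, 3, 4], 0)
# -> array([  1.,   3.,   6.,  10.])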


In [170]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # Calculate the hidden activations
    h = nnet.relu(T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)
    # Calculate the output activations
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    # Return both (the 'Flatten()' is to work around a theano bug)
    return h, T.flatten(y, 1)

Now we can provide everything necessary for the scan operation, so we can set that up - we have to pass in the function to call at each step, the sequence to step through, the initial values of the outputs, and any other arguments to pass to the step function.


In [171]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp,
                            outputs_info=[t_h0, None], non_sequences=w_all)

You get this error if you accidentally define step as:

def step(x, h, W_h, W_x, b_h, b_x, W_y, b_y):

In [164]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
                            outputs_info=[t_h0, None], non_sequences=w_all)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-164-3e8da040c384> in <module>()
      1 [v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
----> 2                             outputs_info=[t_h0, None], non_sequences=w_all)

/Users/WayNoxchi/Miniconda3/Theano/theano/scan_module/scan.pyc in scan(fn, sequences, outputs_info, non_sequences, n_steps, truncate_gradient, go_backwards, mode, name, profile, allow_gc, strict, return_list)
   1074             pass
   1075         scan_inputs += [arg]
-> 1076     scan_outs = local_op(*scan_inputs)
   1077     if type(scan_outs) not in (list, tuple):
   1078         scan_outs = [scan_outs]

/Users/WayNoxchi/Miniconda3/Theano/theano/gof/op.pyc in __call__(self, *inputs, **kwargs)
    613         """
    614         return_list = kwargs.pop('return_list', False)
--> 615         node = self.make_node(*inputs, **kwargs)
    616 
    617         if config.compute_test_value != 'off':

/Users/WayNoxchi/Miniconda3/Theano/theano/scan_module/scan_op.pyc in make_node(self, *inputs)
    550                                   argoffset + idx,
    551                                   outer_sitsot.type.ndim,
--> 552                                   inner_sitsot_out.type.ndim))
    553 
    554         argoffset += len(self.outer_sitsot(inputs))

ValueError: When compiling the inner function of scan (the function called by scan in each of its iterations) the following error has been encountered: The initial state (`outputs_info` in scan nomenclature) of variable IncSubtensor{Set;:int64:}.0 (argument number 1) has 2 dimension(s), while the corresponding variable in the result of the inner function of scan (`fn`) has 2 dimension(s) (it should be one less than the initial state). For example, if the inner function of scan returns a vector of size d and scan uses the values of the previous time-step, then the initial state in scan should be a matrix of shape (1, d). The first dimension of this matrix corresponds to the number of previous time-steps that scan uses in each of its iterations. In order to solve this issue if the two varialbe currently have the same dimensionality, you can increase the dimensionality of the variable in the initial state of scan by using dimshuffle or shape_padleft. 

We can now calculate our loss function, and all of our gradients, with just a couple lines of code!


In [172]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

We even have to show Theano how to do SGD - so we set up this dictionary of updates to apply after every forward pass, which applies the standard SGD update rule to every weight.
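
In symbols, each shared weight gets the standard rule $w \leftarrow w - \eta \, \frac{\partial E}{\partial w}$, where $\eta$ is the learning rate lr and $E$ is the summed cross-entropy error defined above.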


In [173]:
def upd_dict(wgts, grads, lr):
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts,grads)})

In [176]:
upd = upd_dict(w_all, g_all, lr)

# we're finally ready to compile the function!:
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)

In [177]:
X = oh_x_rnn
Y = oh_y_rnn
X.shape, Y.shape


Out[177]:
((75110, 8, 86), (75110, 8, 86))

To use it, we simply loop through our input data, calling the function compiled above, and printing our progress from time to time.


We have to manually define our loop because Theano doesn't have it built-in.


In [178]:
err=0.0; l_rate=0.01
for i in xrange(len(X)):
    err += fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 1000 == 999:
        print ("Error:{:.3f}".format(err/1000))
        err=0.0


Error:25.120
Error:21.431
Error:20.914
Error:19.876
Error:18.779
Error:19.194
Error:19.000
Error:18.420
Error:17.936
Error:18.240
Error:17.485
Error:17.656
Error:18.457
Error:17.288
Error:16.788
Error:17.815
Error:17.392
Error:17.204
Error:16.853
Error:16.688
Error:16.567
Error:16.392
Error:16.697
Error:16.234
Error:16.807
Error:16.642
Error:16.033
Error:16.312
Error:16.290
Error:16.460
Error:16.745
Error:16.408
Error:16.716
Error:16.333
Error:16.022
Error:16.710
Error:16.056
Error:16.427
Error:16.097
Error:16.295
Error:15.366
Error:15.766
Error:15.756
Error:15.999
Error:16.021
Error:15.937
Error:15.677
Error:16.150
Error:16.075
Error:16.100
Error:15.274
Error:15.574
Error:14.976
Error:14.878
Error:15.590
Error:15.355
Error:14.704
Error:15.438
Error:15.136
Error:15.035
Error:15.059
Error:15.390
Error:15.336
Error:15.070
Error:14.814
Error:14.856
Error:14.295
Error:14.776
Error:15.243
Error:14.872
Error:15.136
Error:14.681
Error:14.452
Error:14.532
Error:14.464

In [179]:
f_y = theano.function([t_h0, t_inp], v_y, allow_input_downcast=True)

In [180]:
pred = np.argmax(f_y(np.zeros(n_hidden), X[6]), axis=1)

In [181]:
act = np.argmax(X[6], axis=1)

In [182]:
[indices_char[o] for o in act]


Out[182]:
['t', 'h', 'e', 'n', '?', ' ', 'I', 's']

In [183]:
[indices_char[o] for o in pred]


Out[183]:
['h', 'e', ' ', ' ', ' ', 'T', 't', ' ']

In [4]:
# looking at how to use Python debugger
import numpy as np
import pdb
err=0.; lrate=0.01
for i in range(len(np.zeros(10))):
    err += np.sin(lrate+np.e**i)
    pdb.set_trace()


> <ipython-input-4-e555df94ff9a>(5)<module>()
-> for i in range(len(np.zeros(10))):
(Pdb) err
0.84683184461801519
(Pdb) l
  1  	# looking at how to use Python debugger
  2  	import numpy as np
  3  	import pdb
  4  	err=0.; lrate=0.01
  5  ->	for i in range(len(np.zeros(10))):
  6  	    err += np.sin(lrate+np.e**i)
  7  	    pdb.set_trace()
[EOF]
(Pdb) n
> <ipython-input-4-e555df94ff9a>(6)<module>()
-> err += np.sin(lrate+np.e**i)
(Pdb) q
---------------------------------------------------------------------------
BdbQuit                                   Traceback (most recent call last)
<ipython-input-4-e555df94ff9a> in <module>()
      4 err=0.; lrate=0.01
      5 for i in range(len(np.zeros(10))):
----> 6     err += np.sin(lrate+np.e**i)
      7     pdb.set_trace()

<ipython-input-4-e555df94ff9a> in <module>()
      4 err=0.; lrate=0.01
      5 for i in range(len(np.zeros(10))):
----> 6     err += np.sin(lrate+np.e**i)
      7     pdb.set_trace()

/Users/WayNoxchi/Miniconda3/envs/FAI/lib/python2.7/bdb.pyc in trace_dispatch(self, frame, event, arg)
     47             return # None
     48         if event == 'line':
---> 49             return self.dispatch_line(frame)
     50         if event == 'call':
     51             return self.dispatch_call(frame, arg)

/Users/WayNoxchi/Miniconda3/envs/FAI/lib/python2.7/bdb.pyc in dispatch_line(self, frame)
     66         if self.stop_here(frame) or self.break_here(frame):
     67             self.user_line(frame)
---> 68             if self.quitting: raise BdbQuit
     69         return self.trace_dispatch
     70 

BdbQuit: 

In [ ]: