In [8]:
from theano.sandbox import cuda
cuda.use('gpu1')


WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
WARNING (theano.sandbox.cuda): Ignoring call to use(1), GPU number 0 is already in use.

In [9]:
%matplotlib inline
import utils
from utils import *
from keras.layers import TimeDistributed, Activation
from numpy.random import choice


Using Theano backend.

Setup


In [10]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))


corpus length: 600893

In [11]:
!tail {path} -n10


whole of antiquity swarmed with sons of god--he attained the same goal,
the sense of complete sinlessness, complete irresponsibility, that can
now be attained by every individual through science.--In the same manner
I have viewed the saints of India who occupy an intermediate station
between the christian saints and the Greek philosophers and hence are
not to be regarded as a pure type. Knowledge and science--as far as they
existed--and superiority to the rest of mankind by logical discipline
and training of the intellectual powers were insisted upon by the
Buddhists as essential to sanctity, just as they were denounced by the
christian world as the indications of sinfulness.

In [12]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars: ', vocab_size)


total chars:  85

Sometimes it's useful to have a zero value in the dataset, e.g. for padding


In [13]:
chars.insert(0, "\0")
''.join(chars[:-6])


Out[13]:
'\x00\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxy'

In [14]:
char_indices = dict((c, i) for i,c in enumerate(chars))
indices_char = dict((i, c) for i,c in enumerate(chars))
idx = [char_indices[c] for c in text]

In [15]:
idx[:10]


Out[15]:
[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [16]:
''.join(indices_char[i] for i in idx[:20])


Out[16]:
'PREFACE\n\n\nSUPPOSING '

3 Char Model

Create inputs

Create four lists by stepping through the text three characters at a time: the 1st, 2nd and 3rd characters of each group will be our inputs, and the 4th (i.e. the character that follows the group) will be our label


In [17]:
cs=3
c1_dat = [idx[i] for i in range(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-1-cs, cs)]

In [18]:
c1_dat[:5]
?np.stack

Our inputs


In [19]:
# Turn them into numpy arrays
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

In [20]:
print(x1.shape)
x1[:5]


(200295,)
Out[20]:
array([40, 30, 29,  1, 40])

Our output


In [21]:
y = np.stack(c4_dat[:-2])

The number of latent factors to create (i.e. the size of each character's embedding)


In [22]:
n_fac = 42

Create inputs and embedding outputs for each of our 3 character inputs


In [23]:
def embedding_input(name, n_in, n_out):
    """ Create embedding by first create an input layer
    then apply an embedding layer to it
    """
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

Of course, you could always use one-hot encoding for each character. But with an embedding we can capture similarities, for example between 'A' and 'a', whereas with one-hot encoding 'A' and 'a' are treated no differently than 'A' and 'Z'.
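
For a concrete sense of the difference, here is a minimal one-hot sketch (an illustration only, using the vocab_size and char_indices defined above; it is not part of the model):

onehot_a = np.eye(vocab_size)[char_indices['a']]  # 85-dim vector: all zeros except a single 1
onehot_A = np.eye(vocab_size)[char_indices['A']]
# every pair of one-hot vectors is orthogonal, so 'a' is exactly as "far" from 'A' as from 'Z';
# a learned 42-dimensional embedding can instead place 'a' and 'A' close together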


In [24]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

Create and train model

We choose to have 256 activations


In [30]:
n_hidden = 256

Now create the 'green arrow' from our diagram - the layer operation from input to hidden


In [19]:
dense_in = Dense(n_hidden, activation='relu')

Our first hidden activation is simply this function applied to the result of the embedding of the first character(s)


In [20]:
c1_hidden = dense_in(c1)

Now create the 'orange arrow' from our diagram - the layer operation from hidden to hidden


In [21]:
dense_hidden = Dense(n_hidden, activation='tanh')

Our 2nd and 3rd hidden activations sum the previous hidden state with the dense output of the new input


In [22]:
c2_dense = dense_in(c2)
hidden_2 = dense_hidden(c1_hidden)
c2_hidden = merge([c2_dense, hidden_2])
# merge's default mode is 'sum', i.e. element-wise addition
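
In Keras 1.x the same line can be written with the merge mode spelled out, as we do later in this notebook:

c2_hidden = merge([c2_dense, hidden_2], mode='sum')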

In [23]:
c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

Now create the 'blue arrow' from our diagram - the layer operation from hidden to output


In [24]:
dense_out = Dense(vocab_size, activation='softmax')

In [25]:
c4_out = dense_out(c3_hidden)

At this point, c4_out represents the entire computation graph, from the three character inputs through to the softmax output


In [26]:
c4_out


Out[26]:
Softmax.0

In [27]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [28]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [29]:
model.optimizer.lr=0.001

In [30]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)


Epoch 1/10
200295/200295 [==============================] - 16s - loss: 2.3987    
Epoch 2/10
200295/200295 [==============================] - 16s - loss: 2.2637    
Epoch 3/10
200295/200295 [==============================] - 16s - loss: 2.2166    
Epoch 4/10
200295/200295 [==============================] - 16s - loss: 2.1747    
Epoch 5/10
200295/200295 [==============================] - 17s - loss: 2.1403    
Epoch 6/10
200295/200295 [==============================] - 17s - loss: 2.1159    
Epoch 7/10
200295/200295 [==============================] - 16s - loss: 2.0989    
Epoch 8/10
200295/200295 [==============================] - 17s - loss: 2.0871    
Epoch 9/10
200295/200295 [==============================] - 17s - loss: 2.0777    
Epoch 10/10
200295/200295 [==============================] - 17s - loss: 2.0715    
Out[30]:
<keras.callbacks.History at 0x7efc02aa7fd0>

Test model


In [31]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict(arrs)
    i = np.argmax(p)
    return chars[i]

In [32]:
get_next('phi')


Out[32]:
'l'

In [33]:
get_next(' th')


Out[33]:
'e'

In [34]:
get_next(' an')


Out[34]:
'd'

Our First RNN!

Now we will implement the typical structure of an RNN - i.e. the 'rolled up' version of the diagram.

That is, we can no longer use separate variables c1, c2, c3, ...; instead we need an array of inputs, all at once.


In [15]:
cs=8
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)]
            for n in range(cs)]

Then create the labels for our model (i.e. the 9th char)


In [16]:
c_out_dat = [idx[i+cs] for i in range(0, len(idx)-1-cs, cs)]

In [17]:
xs = [np.stack(c[:-2]) for c in c_in_dat]
len(xs), xs[0].shape


Out[17]:
(8, (75109,))

In [18]:
y = np.stack(c_out_dat[:-2])

So each column below is one series of 8 chars from the text


In [19]:
[xs[n][:cs] for n in range(4)]


Out[19]:
[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58])]

And this is the next char after each sequence


In [26]:
y[:4]


Out[26]:
array([ 1, 33,  2, 72])

In [27]:
n_fac=42

Create and train model


In [28]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [29]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [30]:
n_hidden = 256

In [31]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax')

The first char of each sequence goes through dense_in(), to create our first hidden activations.


In [32]:
# c_ins[i] is an (input layer, flattened embedding) tuple; [1] selects the embedding output
hidden = dense_in(c_ins[0][1])

Then for each successive layer we combine the output of dense_in() on the next character with the output of dense_hidden() on the current hidden state, to create the new hidden state.


In [33]:
for i in range(1, cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])

Putting the final hidden state through dense_out() gives us our output


In [34]:
c_out = dense_out(hidden)

So now we can create our model.


In [37]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
model.fit(xs, y, batch_size=64, nb_epoch=10)


Epoch 1/10
75109/75109 [==============================] - 9s - loss: 2.5319     
Epoch 2/10
75109/75109 [==============================] - 9s - loss: 2.2567     
Epoch 3/10
75109/75109 [==============================] - 9s - loss: 2.1509     
Epoch 4/10
75109/75109 [==============================] - 9s - loss: 2.0792     
Epoch 5/10
75109/75109 [==============================] - 9s - loss: 2.0229     
Epoch 6/10
75109/75109 [==============================] - 9s - loss: 1.9776     
Epoch 7/10
75109/75109 [==============================] - 10s - loss: 1.9380    
Epoch 8/10
75109/75109 [==============================] - 10s - loss: 1.9027    
Epoch 9/10
75109/75109 [==============================] - 10s - loss: 1.8743    
Epoch 10/10
75109/75109 [==============================] - 9s - loss: 1.8456     
Out[37]:
<keras.callbacks.History at 0x7f62587be7b8>

Test model


In [38]:
def get_next(inp):
    idxs = [np.array(char_indices[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [39]:
get_next('for thos')


Out[39]:
' '

In [40]:
get_next('part of ')


Out[40]:
't'

In [41]:
model.fit(xs, y, batch_size=64, nb_epoch=5)


Epoch 1/5
75109/75109 [==============================] - 10s - loss: 1.8194    
Epoch 2/5
75109/75109 [==============================] - 9s - loss: 1.7964     
Epoch 3/5
75109/75109 [==============================] - 10s - loss: 1.7754    
Epoch 4/5
75109/75109 [==============================] - 10s - loss: 1.7553    
Epoch 5/5
75109/75109 [==============================] - 10s - loss: 1.7381    
Out[41]:
<keras.callbacks.History at 0x7f624f817518>

In [42]:
get_next('for thos')


Out[42]:
' '

Our first RNN with keras!


In [43]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 85)

This is nearly exactly equivalent to the RNN we built ourselves in the previous section


In [44]:
model = Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        # rather than initializing the recurrent weights randomly,
        # we init them as an identity matrix - this tends to work well with relu
        SimpleRNN(n_hidden, activation='relu', inner_init='identity'),
        Dense(vocab_size, activation='softmax')
    ])

In [45]:
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_3 (Embedding)          (None, 8, 42)         3570        embedding_input_3[0][0]          
____________________________________________________________________________________________________
simplernn_3 (SimpleRNN)          (None, 256)           76544       embedding_3[0][0]                
____________________________________________________________________________________________________
dense_6 (Dense)                  (None, 85)            21845       simplernn_3[0][0]                
====================================================================================================
Total params: 101,959
Trainable params: 101,959
Non-trainable params: 0
____________________________________________________________________________________________________

'sparse_categorical_crossentropy' expects integer class labels as targets.

It computes the same loss as 'categorical_crossentropy' does with one-hot encoded targets.
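
If you prefer the one-hot form, a minimal sketch of the equivalent setup (shown for illustration, not run here) looks like this, using keras' to_categorical:

from keras.utils.np_utils import to_categorical
y_oh = to_categorical(y, vocab_size)   # (75109,) integer labels -> (75109, 85) one-hot rows
# model.compile(loss='categorical_crossentropy', optimizer=Adam())
# model.fit(np.concatenate(xs, axis=1), y_oh, batch_size=64, nb_epoch=8)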


In [46]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [47]:
model.fit(np.concatenate(xs, axis=1), y, batch_size=64, nb_epoch=8)


WARNING (theano.configdefaults): install mkl with `conda install mkl-service`: No module named 'mkl'
Epoch 1/8
75109/75109 [==============================] - 10s - loss: 2.7813    
Epoch 2/8
75109/75109 [==============================] - 10s - loss: 2.2769    
Epoch 3/8
75109/75109 [==============================] - 10s - loss: 2.0789    
Epoch 4/8
75109/75109 [==============================] - 10s - loss: 1.9435    
Epoch 5/8
75109/75109 [==============================] - 10s - loss: 1.8390    
Epoch 6/8
75109/75109 [==============================] - 10s - loss: 1.7585    
Epoch 7/8
75109/75109 [==============================] - 10s - loss: 1.6933    
Epoch 8/8
75109/75109 [==============================] - 10s - loss: 1.6381    
Out[47]:
<keras.callbacks.History at 0x7f62493e86d8>

In [49]:
def get_next_keras(inp):
    idxs = [char_indices[c] for c in inp]
    # np.newaxis adds a leading batch dimension, giving shape (1, cs)
    arrs = np.array(idxs)[np.newaxis, :]
    p = model.predict(arrs)[0]
    return chars[np.argmax(p)]
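
As a quick illustration of the np.newaxis trick (just a sketch, not part of the model):

np.array([1, 2, 3]).shape                  # (3,)
np.array([1, 2, 3])[np.newaxis, :].shape   # (1, 3) - a "batch" containing one sequence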

In [50]:
get_next_keras('this is ')


Out[50]:
't'

In [51]:
get_next_keras('part of ')


Out[51]:
't'

In [52]:
get_next_keras('queens a')


Out[52]:
'n'

Returning sequences

Now, instead of predicting only char n from chars 1 through n-1, we will predict chars 2 through n from chars 1 through n-1

Create inputs

To use a sequence model, we can leave our input unchanged - but we have to change our output to a sequence.

Here, c_out_dat is identical to c_in_dat, but shifted forward by one character


In [58]:
# c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)]
#            for n in range(cs)]
c_out_dat = [[idx[i+n] for i in range(1, len(idx)-cs, cs)]
            for n in range(cs)]
xs = [np.stack(c[:-2]) for c in c_in_dat]
ys = [np.stack(c[:-2]) for c in c_out_dat]

Reading down each column shows one set of inputs and outputs.


In [59]:
[xs[n][:cs] for n in range(cs)]


Out[59]:
[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

In [60]:
[ys[n][:cs] for n in range(cs)]


Out[60]:
[array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67]),
 array([ 1, 33,  2, 72, 67, 73,  2, 68])]

Create and train model


In [62]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu')
dense_out = Dense(vocab_size, activation='softmax')

In [63]:
inp1 = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(inp1)

In [64]:
outs = []
for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden], mode='sum')
    outs.append(dense_out(hidden))

In [65]:
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [66]:
# an array of zeros to feed the 'zeros' input: one row of n_fac zeros per sequence
zeros = np.tile(np.zeros(n_fac), (len(xs[0]),1))
zeros.shape


Out[66]:
(75109, 42)

In [71]:
model.fit([zeros]+xs, ys, batch_size=64, nb_epoch=8)


Epoch 1/8
75109/75109 [==============================] - 22s - loss: 16.0193 - dense_11_loss_1: 2.4812 - dense_11_loss_2: 2.3084 - dense_11_loss_3: 2.0844 - dense_11_loss_4: 1.9269 - dense_11_loss_5: 1.8353 - dense_11_loss_6: 1.8011 - dense_11_loss_7: 1.8027 - dense_11_loss_8: 1.7793    
Epoch 2/8
75109/75109 [==============================] - 22s - loss: 15.9434 - dense_11_loss_1: 2.4801 - dense_11_loss_2: 2.3063 - dense_11_loss_3: 2.0790 - dense_11_loss_4: 1.9166 - dense_11_loss_5: 1.8222 - dense_11_loss_6: 1.7887 - dense_11_loss_7: 1.7880 - dense_11_loss_8: 1.7625    
Epoch 3/8
75109/75109 [==============================] - 22s - loss: 15.8804 - dense_11_loss_1: 2.4796 - dense_11_loss_2: 2.3054 - dense_11_loss_3: 2.0757 - dense_11_loss_4: 1.9100 - dense_11_loss_5: 1.8124 - dense_11_loss_6: 1.7753 - dense_11_loss_7: 1.7738 - dense_11_loss_8: 1.7483    
Epoch 4/8
75109/75109 [==============================] - 22s - loss: 15.8238 - dense_11_loss_1: 2.4786 - dense_11_loss_2: 2.3044 - dense_11_loss_3: 2.0715 - dense_11_loss_4: 1.9039 - dense_11_loss_5: 1.8027 - dense_11_loss_6: 1.7644 - dense_11_loss_7: 1.7611 - dense_11_loss_8: 1.7372    
Epoch 5/8
75109/75109 [==============================] - 22s - loss: 15.7722 - dense_11_loss_1: 2.4781 - dense_11_loss_2: 2.3035 - dense_11_loss_3: 2.0702 - dense_11_loss_4: 1.8965 - dense_11_loss_5: 1.7934 - dense_11_loss_6: 1.7538 - dense_11_loss_7: 1.7507 - dense_11_loss_8: 1.7261    
Epoch 6/8
75109/75109 [==============================] - 22s - loss: 15.7318 - dense_11_loss_1: 2.4775 - dense_11_loss_2: 2.3025 - dense_11_loss_3: 2.0680 - dense_11_loss_4: 1.8930 - dense_11_loss_5: 1.7879 - dense_11_loss_6: 1.7451 - dense_11_loss_7: 1.7417 - dense_11_loss_8: 1.7160    
Epoch 7/8
75109/75109 [==============================] - 22s - loss: 15.6889 - dense_11_loss_1: 2.4772 - dense_11_loss_2: 2.3021 - dense_11_loss_3: 2.0650 - dense_11_loss_4: 1.8883 - dense_11_loss_5: 1.7813 - dense_11_loss_6: 1.7368 - dense_11_loss_7: 1.7319 - dense_11_loss_8: 1.7063    
Epoch 8/8
75109/75109 [==============================] - 22s - loss: 15.6541 - dense_11_loss_1: 2.4773 - dense_11_loss_2: 2.3006 - dense_11_loss_3: 2.0619 - dense_11_loss_4: 1.8831 - dense_11_loss_5: 1.7754 - dense_11_loss_6: 1.7295 - dense_11_loss_7: 1.7267 - dense_11_loss_8: 1.6997    
Out[71]:
<keras.callbacks.History at 0x7f6240f0c390>

Test model


In [69]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [72]:
get_nexts(' this is')


[' ', 't', 'h', 'i', 's', ' ', 'i', 's']
Out[72]:
['t', 'h', 'e', 't', ' ', 'p', 'n', ' ']

In [73]:
get_nexts(' part of')


[' ', 'p', 'a', 'r', 't', ' ', 'o', 'f']
Out[73]:
['t', 'o', 'r', 't', ' ', 'o', 'f', ' ']

Sequence model with keras

To convert our previous keras model into a sequence model, simply add the 'return_sequences=True' parameter, and add TimeDistributed() around our dense layer.


In [80]:
model = Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, return_sequences=True, activation='relu', inner_init='identity'),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])


/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/keras/engine/topology.py:368: UserWarning: The `regularizers` property of layers/models is deprecated. Regularization losses are now managed via the `losses` layer/model property.
  warnings.warn('The `regularizers` property of '

In [81]:
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_5 (Embedding)          (None, 8, 42)         3570        embedding_input_5[0][0]          
____________________________________________________________________________________________________
simplernn_5 (SimpleRNN)          (None, 8, 256)        76544       embedding_5[0][0]                
____________________________________________________________________________________________________
timedistributed_2 (TimeDistribut (None, 8, 85)         21845       simplernn_5[0][0]                
====================================================================================================
Total params: 101,959
Trainable params: 101,959
Non-trainable params: 0
____________________________________________________________________________________________________

In [82]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [83]:
xs[0].shape


Out[83]:
(75109,)

In [90]:
x_rnn=np.stack(xs, axis=1)
c_out_dat = [[idx[i+n] for i in range(1, len(idx)-cs, cs)]
            for n in range(cs)]
ys = [np.stack(c[:-2]) for c in c_out_dat]
y_rnn=np.expand_dims(np.stack(ys, axis=1),-1)
x_rnn.shape, y_rnn.shape


Out[90]:
((75109, 8), (75109, 8, 1))

In [91]:
model.fit(x_rnn, y_rnn, batch_size=64, nb_epoch=8)


Epoch 1/8
75109/75109 [==============================] - 11s - loss: 2.4334    
Epoch 2/8
75109/75109 [==============================] - 11s - loss: 2.0022    
Epoch 3/8
75109/75109 [==============================] - 11s - loss: 1.8840    
Epoch 4/8
75109/75109 [==============================] - 11s - loss: 1.8225    
Epoch 5/8
75109/75109 [==============================] - 11s - loss: 1.7839    
Epoch 6/8
75109/75109 [==============================] - 11s - loss: 1.7570    
Epoch 7/8
75109/75109 [==============================] - 11s - loss: 1.7368    
Epoch 8/8
75109/75109 [==============================] - 11s - loss: 1.7214    
Out[91]:
<keras.callbacks.History at 0x7f623cc891d0>

Stateful model with keras

stateful=True means that at the end of each sequence the hidden activations are not reset to zero, but are carried over to the next batch. Also make sure you pass shuffle=False when training the model.

A stateful model is easy to create (just add "stateful=True") but harder to train. We had to add batchnorm and use an LSTM to get reasonable results.

When using stateful layers in keras, you also have to add 'batch_input_shape' to the first layer, fixing the batch size there.


In [92]:
bs=64

In [93]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(bs,8)),
        BatchNormalization(),
        LSTM(n_hidden, return_sequences=True, stateful=True),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])


/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/keras/engine/topology.py:368: UserWarning: The `regularizers` property of layers/models is deprecated. Regularization losses are now managed via the `losses` layer/model property.
  warnings.warn('The `regularizers` property of '

In [94]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

Since we're using a fixed batch shape, we have to ensure the number of inputs and outputs is an exact multiple of the batch size.


In [95]:
mx = len(x_rnn)//bs*bs
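
For example, with the 75,109 sequences we have here and bs=64, mx = 75109 // 64 * 64 = 75072, which is why the training output below reports 75072 samples per epoch.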

In [96]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)


Epoch 1/4
75072/75072 [==============================] - 34s - loss: 2.2216    
Epoch 2/4
75072/75072 [==============================] - 34s - loss: 1.9729    
Epoch 3/4
75072/75072 [==============================] - 34s - loss: 1.8982    
Epoch 4/4
75072/75072 [==============================] - 34s - loss: 1.8535    
Out[96]:
<keras.callbacks.History at 0x7f622fec33c8>
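
One practical consequence of carrying state across batches: when you want to start a fresh, unrelated pass (e.g. generating from a new seed), the carried-over activations should be cleared first. A minimal sketch using the model above:

# discard the carried-over hidden state of all stateful layers
model.reset_states()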

Theano RNN

As we start to add more and more on top of (or into) Keras, you'll increasingly find yourself wanting to use Theano directly, because Theano is the language Keras is using behind the scenes.

In the process of doing it in Theano, we're going to force ourselves to think through a lot more of the details than before, because Theano has none of Keras's conveniences: there's no such thing as a layer, so we have to think about all of the weight matrices and activation functions ourselves.

Theano works with symbolic variables: rather than starting off by giving it data, we start off by describing the types of the data that we will give it later.
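
As a minimal standalone sketch of that idea (an illustration only, separate from the model we build below):

import theano
import theano.tensor as T

a = T.scalar('a')                                              # describe the type of an input; no data yet
b = a * 2                                                      # describe a computation symbolically
double = theano.function([a], b, allow_input_downcast=True)    # compile the computation
double(3.0)                                                    # -> array(6.0)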


In [44]:
# imports for the raw-theano section (some may already come in via utils)
import math
import theano
from theano import shared, tensor as T
from theano.tensor import nnet
from itertools import chain
from collections import OrderedDict
from numpy.random import normal
n_input = vocab_size
n_output = vocab_size
n_hidden = 256
cs=8

Using raw theano, we have to create our weight matrices and bias vectors ourselves - here are the functions we'll use to do so (scaling the weights by sqrt(2/fan_in), the relu-friendly variant of glorot initialization).

The return values are wrapped in shared(), which is how we tell theano that it can manage this data (copying it to and from the GPU as necessary).


In [26]:
def init_wgts(rows, cols):
    scale = math.sqrt(2/rows)
    return shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))
def init_bias(rows):
    return shared(np.zeros(rows, dtype=np.float32))

We return the weights and biases together as a tuple. For the hidden weights, we'll use an identity initialization (as recommended by Hinton.)


In [27]:
def wgts_and_bias(n_in, n_out):
    return init_wgts(n_in, n_out), init_bias(n_out)
def id_and_bias(n):
    return shared(np.eye(n, dtype=np.float32)), init_bias(n)

Theano doesn't actually do any computations until we explicitly compile and evaluate the function (at which point it'll be turned into CUDA code and sent off to the GPU). So our job is to describe the computations that we'll want theano to do - the first step is to tell theano what inputs we'll be providing to our computation:


In [28]:
t_inp = T.matrix('inp')
t_outp = T.matrix('outp')
t_h0 = T.vector('h0')
lr = T.scalar('lr')

all_args = [t_h0, t_inp, t_outp, lr]

Now, we're ready to create our initial weight matrices


In [34]:
W_h = id_and_bias(n_hidden)
W_x = wgts_and_bias(n_input, n_hidden)
W_y = wgts_and_bias(n_hidden, n_output)
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

Theano handles looping by using the GPU scan operation. We have to tell theano what to do at each step through the scan - this is the function we'll use, which does a single forward pass for one char:


In [36]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # calculate the hidden activations
    h = nnet.relu(T.dot(x, W_x)+b_x+T.dot(h, W_h)+b_h)
    # calculate the output activations
    y = nnet.softmax(T.dot(h, W_y)+b_y)
    # return both (T.flatten is to work around a theano bug)
    return h, T.flatten(y,1)
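
If theano.scan itself is unfamiliar, here is a tiny standalone sketch (separate from the model) of how it steps through a sequence while carrying state forward - a running sum:

v = T.vector('v')
def add_step(x, acc): return acc + x                      # current element, previous accumulator
sums, _ = theano.scan(add_step, sequences=v, outputs_info=T.zeros_like(v[0]))
running_sum = theano.function([v], sums, allow_input_downcast=True)
running_sum([1., 2., 3.])                                 # -> array([1., 3., 6.])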

Now we can provide everything necessary for the scan operation, so we can set that up - we have to pass in the function to call at each step, the sequence to step through, the initial values of the outputs, and any other arguments to pass to the step function.


In [37]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
                            outputs_info=[t_h0, None], non_sequences=w_all)

We can now calculate our loss function, and all of our gradients, with just a couple of lines of code!


In [38]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

We even have to show theano how to do SGD - so we set up this dictionary of updates to apply after every forward pass, which applies the standard SGD update rule to every weight.


In [39]:
def upd_dict(wgts, grads, lr):
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts, grads)})

upd = upd_dict(w_all, g_all, lr)

We're finally ready to compile the function!


In [40]:
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)


WARNING (theano.configdefaults): install mkl with `conda install mkl-service`: No module named 'mkl'

In [45]:
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)]
            for n in range(cs)]
xs = [np.stack(c[:-2]) for c in c_in_dat]
c_out_dat = [[idx[i+n] for i in range(1, len(idx)-cs, cs)]
            for n in range(cs)]
ys = [np.stack(c[:-2]) for c in c_out_dat]
oh_ys = [to_categorical(o, vocab_size) for o in ys]
oh_y_rnn=np.stack(oh_ys, axis=1)

oh_xs = [to_categorical(o, vocab_size) for o in xs]
oh_x_rnn=np.stack(oh_xs, axis=1)

X = oh_x_rnn
Y = oh_y_rnn
X.shape, Y.shape


Out[45]:
((75109, 8, 85), (75109, 8, 85))

To use it, we simply loop through our input data, calling the function compiled above, and printing our progress from time to time.


In [ ]:
err=0.0; l_rate=0.01
for i in range(len(X)):
    err += fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 1000 == 999:
        print("Error:{:.3f}".format(err/1000))
        err=0.0


Error:25.370
Error:21.578
Error:20.992
Error:19.928
Error:18.853
Error:19.251
Error:19.045
Error:18.480
Error:17.985
Error:18.197
Error:17.503
Error:17.633
Error:18.445
Error:17.311
Error:16.842
Error:17.803
Error:17.468
Error:17.261
Error:16.921
Error:16.722
Error:16.663
Error:16.428
Error:16.803
Error:16.243
Error:16.842
Error:16.624
Error:16.079
Error:16.348
Error:16.332
Error:16.560
Error:16.825
Error:16.489
Error:16.720
Error:16.394
Error:16.085
Error:16.779
Error:16.073
Error:16.473
Error:16.101
Error:16.359
Error:15.399
Error:15.769
Error:15.835
Error:16.094
Error:16.029
Error:15.969
Error:15.722
Error:16.176
Error:16.030
Error:16.135
Error:15.344
Error:15.506
Error:15.083
Error:14.980
Error:15.593
Error:15.404
Error:14.786
Error:15.472
Error:15.123
Error:15.152
Error:15.039
Error:15.448
Error:15.398
