Sequence Modeling: Train a model to do basic math

@sunilmallya

@jrhunt

Code will be available after the session @ https://github.com/sunilmallya/dl-twitch-series

Dataset

We know how to generate basic math sequences; the input/output pairs look like this:

5+4+9 = 18

1+4+2 = 7

3*2*5 = 30

9*2*2 = 36

Recurrent Neural Networks (RNN)

Certain problems require you to model dependencies in the data: these can be external, or they can come from previous context, which is what we explore in this notebook. Recurrent neural networks (RNNs) are designed to address this. They are networks with loops in them, allowing information to persist, which makes them ideal for modeling sequences. Speech recognition, image captioning, video analysis, language modeling, and many other use cases can be addressed with RNNs. Check out Andrej Karpathy's blog for more details. Let's visualize a few examples of sequences in the input or output:

img src: http://karpathy.github.io

An RNN can be imagined as repeated copies of the same network, each passing a message to the next step, in this case possibly to the same layer. But RNNs have trouble with long-range dependencies, i.e. interactions between sequence elements that are many steps apart; this is explained in detail in this blog. RNNs also tend to be deep when unrolled, and hence more vulnerable to the vanishing gradient problem.
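As a rough numpy sketch of that repeated-copy view (illustrative only, not the model built below): the same weights are applied at every step, and the hidden state h carries information forward.

import numpy as np

input_dim, hidden_dim, seq_len = 13, 32, 8   # illustrative sizes

Wx = np.random.randn(hidden_dim, input_dim) * 0.01   # input-to-hidden weights
Wh = np.random.randn(hidden_dim, hidden_dim) * 0.01  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # the same weights are reused at every time step -- this is the "loop"
    return np.tanh(Wx.dot(x_t) + Wh.dot(h_prev) + b)

h = np.zeros(hidden_dim)
for t in range(seq_len):
    x_t = np.random.randn(input_dim)  # stand-in for the one-hot input at step t
    h = rnn_step(x_t, h)              # information persists through h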

To address these issues, Hochreiter & Schmidhuber (1997) introduced LSTMs, which are now widely used to model the problems described above.

Sequence modeling basics

What is sequence modeling?

Encoder - Decoder

The encoder transforms the input into a hidden state, which can then be translated or converted into any desirable form. The decoder tries to predict the next word in the output (decoder) sequence, given the current word in the decoder sequence and the context from the encoder sequence.
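As a toy sketch of that flow (illustrative only; the actual LSTM-based MXNet implementation is built later in this notebook), the encoder compresses the whole input into a fixed-size context, and the decoder unrolls from that context for a fixed number of output steps:

import numpy as np

hidden_dim = 4  # tiny illustrative size

def encoder_step(x_t, h):
    # toy update: fold the current input token id into the state
    return np.tanh(h + x_t)

def decoder_step(h):
    # toy update producing the next decoder state
    return np.tanh(h * 0.5)

def encode(token_ids):
    h = np.zeros(hidden_dim)
    for x_t in token_ids:            # read the entire input sequence
        h = encoder_step(x_t, h)
    return h                         # fixed-size context vector

def decode(context, output_length):
    h, outputs = context, []
    for _ in range(output_length):   # emit one symbol per output step
        h = decoder_step(h)
        outputs.append(int(h.argmax()))  # stand-in for "predict the next token"
    return outputs

# e.g. the integer-encoded characters of "3+2+5" under the vocabulary defined below
print(decode(encode([3, 10, 2, 10, 5]), output_length=4))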

What problems can we solve with this?

  • Language translation
  • Image captioning
  • Sequence conversion

img src: https://indico.io/blog/sequence-modeling-neuralnets-part1/

Why one-hot?

In general, most ML algorithms don't understand label data directly; they expect input and output variables to be numbers.

For categorical data where there is no natural ordering, it's not desirable to let the model assume one. One-hot encoding avoids implying an ordering and is a more convenient format for the neural network to work on.

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
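For instance, with the 13-character vocabulary used in this notebook, each character index becomes a 13-dimensional vector with a single 1 (a minimal numpy illustration):

import numpy as np

character_set = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '*', ' ']
char_to_int = dict((c, i) for i, c in enumerate(character_set))

def one_hot(index, depth):
    vec = np.zeros(depth)
    vec[index] = 1.0
    return vec

# '+' is index 10, so its one-hot vector has a 1 in position 10 and 0s elsewhere
print(one_hot(char_to_int['+'], depth=len(character_set)))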

References

This notebook is inspired by several blogs, but mainly by this paper: "Sequence to Sequence Learning with Neural Networks"

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf


In [12]:
import numpy as np
import mxnet as mx

import random
random.seed(10)

n_samples = 10000
n_numbers = 3 # numbers to operate on
largest = 10

character_set = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '*', ' ']

input_sequence_length =  8 # longest expression, e.g. "10+10+10", is 8 characters
output_sequence_length = 4 # largest result, 10*10*10 = 1000, is 4 characters
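A quick sanity check on those widths (an illustrative aside using the variables above, not part of the original notebook):

# sanity check (illustrative): the widest possible input and output still fit the padded widths
longest_expr = '+'.join([str(largest)] * n_numbers)   # "10+10+10"
largest_result = str(largest ** n_numbers)            # "1000"
assert len(longest_expr) <= input_sequence_length
assert len(largest_result) <= output_sequence_length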

In [13]:
def generate_data(n_samples):
    inputs = []
    labels = []
    
    char_to_int = dict((c,i)  for i,c in enumerate(character_set))
    
    for i in range(n_samples):
        # draw the operands and pick an operator at random
        lhs = [random.randint(1, largest) for _ in range(n_numbers)]
        op = random.choice(['+', '*'])
        if op == '+':
            rhs = sum(lhs)
        elif op == '*':
            rhs = 1
            for l in lhs:
                rhs *= l
        
        # LHS: left-pad the expression to a fixed width, then encode each character as an integer
        lhs = [str(l) for l in lhs]
        strng = op.join(lhs)
        padded_strng = "%*s" % (input_sequence_length, strng)
        enc_input = [char_to_int[ch] for ch in padded_strng]
        
        # RHS: same padding and encoding for the result
        padded_strng = "%*s" % (output_sequence_length, rhs)
        enc_lbl = [char_to_int[ch] for ch in padded_strng]
    
        inputs.append(enc_input)
        labels.append(enc_lbl)
        
    return np.array(inputs), np.array(labels)


dataX, dataY = generate_data(n_samples)

print dataX.shape, dataY.shape


(10000, 8) (10000, 4)
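It can also help to map one encoded sample back to characters, to confirm the padding and encoding round-trip (a small sketch using the arrays defined above):

# decode the first training pair back into strings (illustrative check)
int_to_char = dict((i, c) for i, c in enumerate(character_set))
sample_x = ''.join(int_to_char[j] for j in dataX[0])
sample_y = ''.join(int_to_char[j] for j in dataY[0])
print(sample_x + ' = ' + sample_y)  # a space-padded expression and its result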

In [14]:
# Iterators

batch_size = 32
data_dim = len(character_set)

train_iter = mx.io.NDArrayIter(data=dataX, label=dataY,
                               data_name='data', label_name='target',
                               batch_size=batch_size, shuffle=True)

train_iter.provide_data, train_iter.provide_label


Out[14]:
([DataDesc[data,(32, 8L),<type 'numpy.float32'>,NCHW]],
 [DataDesc[target,(32, 4L),<type 'numpy.float32'>,NCHW]])

In [15]:
# Let's build the model!

data = mx.sym.var('data')
target = mx.sym.var('target')

# Encoder - Decoder

# lstm1 is the encoder: get_next_state=True so its final state can be handed to the decoder.
# lstm2 is the decoder: only its per-step outputs are needed, so get_next_state=False.
lstm1 = mx.rnn.FusedRNNCell(num_hidden=32, prefix="lstm1_", get_next_state=True)
lstm2 = mx.rnn.FusedRNNCell(num_hidden=32, prefix="lstm2_", get_next_state=False)

# convert to one-hot encoding

data_one_hot = mx.sym.one_hot(data, depth=len(character_set))
data_one_hot = mx.sym.transpose(data_one_hot, axes=(1,0,2))  # NTC -> TNC, the layout used when unrolling below

# unroll the loop/lstm

# Note that when unrolling, if 'merge_outputs' is set to True, the 'outputs' is merged into a single symbol
# In the layout, 'N' represents batch size, 'T' represents sequence length, and 'C' represents the
# number of dimensions in hidden states.

l_out, encode_state = lstm1.unroll(length=input_sequence_length, inputs=data_one_hot, layout="TNC")
# repeat the encoder's final hidden state once per output step to feed the decoder
encode_state_h = mx.sym.broadcast_to(encode_state[0], shape=(output_sequence_length, 0, 0))

# Decoder

decode_out, l2 = lstm2.unroll(length=output_sequence_length, inputs=encode_state_h, layout="TNC")
# collapse (T, N, C) to (T*N, C) so the FullyConnected layer acts per output step;
# note num_hidden happens to equal batch_size (32) here
decode_out = mx.sym.reshape(decode_out, shape=(-1, batch_size))

out = mx.sym.FullyConnected(decode_out, num_hidden=data_dim)
out = mx.sym.reshape(out, shape=(output_sequence_length, -1, data_dim))
out = mx.sym.transpose(out, axes=(1,0,2))

# cross-entropy: mean negative log-probability of the target character at each output position
loss = mx.sym.mean(-mx.sym.pick(mx.sym.log_softmax(out), target, axis=-1))
loss = mx.sym.make_loss(loss)

shape = {"data": (batch_size, dataX[0].shape[0])}
#mx.viz.plot_network(out, shape=shape)
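Before wiring up the module, the output shape of the symbol can be checked with infer_shape (a quick sketch; it should report (batch_size, output_sequence_length, data_dim)):

# infer the prediction shape from the input shape: expect [(32, 4, 13)]
_, out_shapes, _ = out.infer_shape(data=(batch_size, input_sequence_length))
print(out_shapes)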

In [16]:
#["cats", "dogs"] ==> [0, 1] ==> [[1, 0], [0, 1]]

In [17]:
# Module


net = mx.mod.Module(symbol=loss,
                   data_names=['data'], label_names=['target'],
                    context=mx.gpu(7)
                   )

net.bind(data_shapes=train_iter.provide_data, label_shapes=train_iter.provide_label)
net.init_params(initializer=mx.init.Xavier())
net.init_optimizer(optimizer='adam',
                  optimizer_params={'learning_rate': 1E-3},
                   kvstore=None
                  )

In [18]:
#Train

epochs = 100

total_batches = len(dataX) // batch_size

for epoch in range(epochs):
    avg_loss = 0.0
    train_iter.reset()
    
    for i, data_batch in enumerate(train_iter):
        net.forward_backward(data_batch=data_batch)
        batch_loss = net.get_outputs()[0].asscalar()  # scalar loss for this mini-batch
        avg_loss += batch_loss
        net.update()
    avg_loss /= total_batches

    print epoch, "%.7f" % avg_loss


0 1.3908080
1 0.9489369
2 0.8515543
3 0.7824094
4 0.7301234
5 0.6806088
6 0.6317520
7 0.5849520
8 0.5348147
9 0.4884985
10 0.4454728
11 0.4070407
12 0.3729887
13 0.3429023
14 0.3168791
15 0.2946095
16 0.2736928
17 0.2522101
18 0.2382104
19 0.2220520
20 0.2054102
21 0.1895113
22 0.1765906
23 0.1635755
24 0.1484580
25 0.1378150
26 0.1310562
27 0.1216568
28 0.1063497
29 0.0976198
30 0.0942983
31 0.0819174
32 0.0779237
33 0.0728757
34 0.0661058
35 0.0574462
36 0.0595423
37 0.0478818
38 0.0763143
39 0.0380665
40 0.0342471
41 0.0314132
42 0.0370409
43 0.0346601
44 0.0237405
45 0.0223978
46 0.0493947
47 0.0173508
48 0.0156706
49 0.0141608
50 0.0129971
51 0.0391612
52 0.0143734
53 0.0101149
54 0.0091349
55 0.0083722
56 0.0077632
57 0.0786065
58 0.0402763
59 0.0077026
60 0.0066978
61 0.0061074
62 0.0056646
63 0.0053165
64 0.0050393
65 0.0046588
66 0.0044155
67 0.0537669
68 0.0054095
69 0.0040832
70 0.0037208
71 0.0034542
72 0.0032270
73 0.0030249
74 0.0028477
75 0.0026960
76 0.0025183
77 0.0023169
78 0.0021388
79 0.0478033
80 0.0035333
81 0.0024058
82 0.0021013
83 0.0019403
84 0.0018133
85 0.0017037
86 0.0016035
87 0.0015086
88 0.0014173
89 0.0013297
90 0.0012481
91 0.0011718
92 0.0011023
93 0.0010249
94 0.0698479
95 0.0022211
96 0.0013886
97 0.0012301
98 0.0011327
99 0.0010600

In [19]:
# test module
test_net = mx.mod.Module(symbol=out,
                         data_names=['data'],
                         label_names=None,
                         context=mx.gpu(7)) # FusedRNNCell works only with GPU

# data descriptor
data_desc = train_iter.provide_data[0]

# set shared_module = model used for training so as to share same parameters and memory
test_net.bind(data_shapes=[data_desc],
              label_shapes=None,
              for_training=False,
              grad_req='null',
              shared_module=net)

n_test = 100
testX, testY = generate_data(n_test)

testX = np.array(testX, dtype=np.int)

test_net.reshape(data_shapes=[mx.io.DataDesc('data', (1, input_sequence_length))])
predictions = test_net.predict(mx.io.NDArrayIter(testX, batch_size=1)).asnumpy()

print "expression", "predicted", "actual"

correct = 0
for i, prediction in enumerate(predictions):
    x_str = [character_set[j] for j in testX[i]]
    index = np.argmax(prediction, axis=1)
    result = [character_set[j] for j in index]
    label = [character_set[j] for j in testY[i]]
    #print result, label
    if result == label:
        correct += 1
    print "".join(x_str), "".join(result), "    ", "".join(label)
    
print correct, correct/(n_test*1.0)  # total correct and accuracy


expression predicted actual
   4+3+5   12        12
   5*2*8   80        80
   9*7*1   63        63
   7+5+9   21        21
   5+6+9   20        20
   6*7*9  378       378
   6+2+8   16        16
   1*9*8   72        72
   4+9+4   17        17
   8+8+3   19        19
   2*3*7   42        42
   1+9+5   15        15
   2+4+1    7         7
   2*3*4   24        24
   3+5+8   16        16
   1*8*3   24        24
   6*8*6  288       288
 10+10+5   25        25
  10+1+7   18        18
  8+4+10   22        22
   7+4+4   15        15
   6*3*5   90        90
 10*10*7  700       700
  10+8+8   26        26
   7*8*7  392       392
   1*3*8   24        24
   2*5*9   90        90
  4*10*1   40        40
   9*4*8  288       288
   6+7+2   15        15
  8*10*1   80        80
   9+5+8   22        22
   5+4+3   12        12
   7*9*9  567       567
  8*10*7  560       560
   7+3+1   11        11
   9*9*5  405       405
   4*2*7   56        56
   6+1+2    9         9
  1+10+6   17        17
   8*8*9  576       576
   4+5+6   15        15
   3*8*9  216       216
   1*1*5    5         5
   3*3*2   18        18
  10*6*6  360       360
  5*10*3  150       150
  8+10+9   27        27
   1*9*8   72        72
   2*6*7   84        84
   7+6+7   20        20
   2*3*9   54        54
  10*3*4  120       120
   6*6*7  252       252
   2+2+4    8         8
  1*2*10   20        20
   1+4+5   10        10
   6+6+1   13        13
  7+8+10   25        25
   3*8*6  144       144
   2*7*4   56        56
  10+3+5   18        18
   9+3+1   13        13
   2*4*6   48        48
   3+7+3   13        13
   2*7*5   70        70
   4*8*7  224       224
   9+7+4   20        20
   3+7+8   18        18
   7*8*4  224       224
   6*8*9  432       432
   8*5*9  360       360
   4+5+1   10        10
   4+3+5   12        12
   5*3*2   30        30
  6*10*3  180       180
  7+10+3   20        20
   7*2*9  126       126
   4+6+3   13        13
  8*10*7  560       560
   2*9*1   18        18
   3+8+7   18        18
   6+8+2   16        16
   8*5*9  360       360
   3+7+9   19        19
   7*3*4   84        84
   9*8*9  648       648
  10+7+3   20        20
   8+9+2   19        19
   6*8*9  432       432
   4+8+5   17        17
   5*7*2   70        70
   8+4+2   14        14
   1*3*7   21        21
   3*9*9  243       243
   2*8*2   32        32
   9*3*6  162       162
   3*4*5   60        60
   1*7*1    7         7
   1+1+1    3         3
100 1.0
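The same test module can also be queried with a hand-written expression; the encoding mirrors generate_data and the decoding mirrors the loop above (an illustrative helper, not part of the original notebook):

# predict a single hand-written expression using the trained test module (illustrative)
def predict_expression(expr):
    char_to_int = dict((c, i) for i, c in enumerate(character_set))
    padded = "%*s" % (input_sequence_length, expr)
    encoded = np.array([[char_to_int[ch] for ch in padded]], dtype=np.int)
    pred = test_net.predict(mx.io.NDArrayIter(encoded, batch_size=1)).asnumpy()[0]
    return ''.join(character_set[j] for j in np.argmax(pred, axis=1)).strip()

print(predict_expression("2*3*4"))  # should print "24" if the model has generalized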

In [ ]: