A basic sequence-to-sequence model, as introduced in Cho et al., 2014, consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output.

Every seq2seq model has two primary components: the encoder and the decoder. The encoder compresses the input sequence into an internal representation called the 'context vector', which the decoder then uses to generate the output sequence.

The input and output sequences can have different lengths, as there is no explicit one-to-one correspondence between their elements.

Source: https://github.com/farizrahman4u/seq2seq

An implementation of sequence to sequence learning for performing addition

Input: "535+61", Output: "596". Padding is handled by using a repeated sentinel character (space).

The input may optionally be inverted (reversed), which was shown to increase performance on many tasks in "Learning to Execute" (http://arxiv.org/abs/1410.4615) and "Sequence to Sequence Learning with Neural Networks" (http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf). Theoretically, it introduces shorter-term dependencies between source and target.

Reference results with inverted input:

  • Two digits: one-layer LSTM (128 hidden units), 5k training examples = 99% train/test accuracy in 55 epochs
  • Three digits: one-layer LSTM (128 hidden units), 50k training examples = 99% train/test accuracy in 100 epochs
  • Four digits: one-layer LSTM (128 hidden units), 400k training examples = 99% train/test accuracy in 20 epochs
  • Five digits: one-layer LSTM (128 hidden units), 550k training examples = 99% train/test accuracy in 30 epochs

In [3]:
from __future__ import print_function
from keras.models import Sequential
from keras.engine.training import slice_X
from keras.layers import Activation, TimeDistributed, Dense, RepeatVector, recurrent
import numpy as np
from six.moves import range


Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.

In [4]:
class CharacterTable(object):
    '''
    Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    '''
    def __init__(self, chars, maxlen):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        self.maxlen = maxlen

    def encode(self, C, maxlen=None):
        maxlen = maxlen if maxlen else self.maxlen
        X = np.zeros((maxlen, len(self.chars)))
        for i, c in enumerate(C):
            X[i, self.char_indices[c]] = 1
        return X

    def decode(self, X, calc_argmax=True):
        if calc_argmax:
            X = X.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in X)
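
A quick round-trip check of CharacterTable (the ctable_demo name and the sample string below are purely illustrative): encoding a padded question and decoding it back should recover the original string.

# Illustrative round-trip: encode a padded question, then decode it back
ctable_demo = CharacterTable('0123456789+ ', 7)
onehot = ctable_demo.encode('12+345 ')   # (7, 12) one-hot matrix
print(onehot.shape)
print(ctable_demo.decode(onehot))        # should print '12+345 '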

In [5]:
class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'

In [6]:
# Parameters for the model and dataset
TRAINING_SIZE = 50000
DIGITS = 3
INVERT = True
# Try replacing LSTM with GRU or SimpleRNN
RNN = recurrent.LSTM
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1
MAXLEN = DIGITS + 1 + DIGITS

chars = '0123456789+ '
ctable = CharacterTable(chars, MAXLEN)

questions = []
expected = []
seen = set()

In [7]:
%%time

#Generate random numbers to perform addition on
print('Generating data...')
while len(questions) < TRAINING_SIZE:
    f = lambda: int(''.join(np.random.choice(list('0123456789')) for i in range(np.random.randint(1, DIGITS + 1))))
    a, b = f(), f()
    # Skip any addition questions we've already seen
    # (a+b and b+a count as the same question, hence the sorted key)
    key = tuple(sorted((a, b)))
    if key in seen:
        continue
    seen.add(key)
    # Pad the data with spaces such that it is always MAXLEN
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))
    ans = str(a + b)
    # Answers can be of maximum size DIGITS + 1
    ans += ' ' * (DIGITS + 1 - len(ans))
    if INVERT:
        query = query[::-1]
    questions.append(query)
    expected.append(ans)
print('Total addition questions:', len(questions))

#We now have 50000 examples of addition; each example is the addition of two numbers.
#Each example consists of the first number, followed by the '+' operator, followed by the second number,
#e.g. 85+96, 353+551, 6+936.
#The answers to the addition are stored in expected.


Generating data...
Total addition questions: 50000
CPU times: user 4.89 s, sys: 41.9 ms, total: 4.93 s
Wall time: 4.94 s

In [9]:
#Look into the training data
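
A small, optional inspection (an illustrative sketch using the lists built above) is to print a few question/answer pairs; remember that with INVERT enabled the stored question strings are reversed and space-padded.

# Print a handful of generated examples
for q, a in zip(questions[:3], expected[:3]):
    print(repr(q), '->', repr(a))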

In [10]:
%%time

#The questions and answers generated above are one-hot encoded before training;
#the encoded arrays are what the model is trained on.

#The maximum length of a question is 7 characters
#(3 digits, followed by '+', followed by 3 digits).
#The maximum length of an answer is 4 characters
#(the sum of two 3-digit numbers is at most a 4-digit number).

#Each character is one-hot encoded over the 12 possibilities in '0123456789+ '
#(the last character is a space). Questions shorter than 7 characters were padded
#with spaces, and answers shorter than 4 characters were padded with spaces as well,
#so a 1- or 2-digit operand simply leaves space characters in the encoding.

#Each question therefore becomes a 7 x 12 boolean matrix (one row per character
#position, one column per possible character); these are stored in X_train and X_val.
#Each answer becomes a 4 x 12 boolean matrix, stored in y_train and y_val.
#Which row holds the one-hot encoding of '+' depends on the length of the
#first operand (and on whether the question string was inverted).


print('Vectorization...')
X = np.zeros((len(questions), MAXLEN, len(chars)), dtype=np.bool)
y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=np.bool)
for i, sentence in enumerate(questions):
    X[i] = ctable.encode(sentence, maxlen=MAXLEN)
for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, maxlen=DIGITS + 1)

# Shuffle (X, y) in unison as the later parts of X will almost all be larger digits
indices = np.arange(len(y))
np.random.shuffle(indices)
X = X[indices]
y = y[indices]

# Explicitly set apart 10% for validation data that we never train over
split_at = len(X) - len(X) // 10  # integer division so split_at is a valid slice index
(X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
(y_train, y_val) = (y[:split_at], y[split_at:])


Vectorization...
CPU times: user 472 ms, sys: 11.3 ms, total: 483 ms
Wall time: 481 ms

In [11]:
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)


(45000, 7, 12)
(45000, 4, 12)
(5000, 7, 12)
(5000, 4, 12)
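
As an optional sanity check (a sketch, assuming the vectorization above worked as intended), a single encoded row can be decoded back to text with the same CharacterTable; the question will appear reversed and space-padded because of INVERT.

# Decode one encoded question and its answer back to strings
print(ctable.decode(X_train[0]))
print(ctable.decode(y_train[0]))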

In [12]:
%%time

#Build the model that will be trained on the encoded inputs
print('Build model...')
model = Sequential()
# "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE
# note: in a situation where your input sequences have a variable length,
# use input_shape=(None, nb_feature).
model.add(RNN(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars))))
# For the decoder's input, we repeat the encoded input for each time step
model.add(RepeatVector(DIGITS + 1))
# The decoder RNN could be multiple layers stacked or a single layer
for _ in range(LAYERS):
    model.add(RNN(HIDDEN_SIZE, return_sequences=True))

# For each step of the output sequence, decide which character should be chosen
model.add(TimeDistributed(Dense(len(chars))))
model.add(Activation('softmax'))


Build model...
CPU times: user 1.58 s, sys: 65.7 ms, total: 1.65 s
Wall time: 724 ms
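
The layer shapes can be inspected with the standard Keras summary: the encoder LSTM emits a single HIDDEN_SIZE vector, RepeatVector copies it DIGITS + 1 times as the decoder input, the decoder LSTM returns a sequence, and the TimeDistributed Dense with softmax yields a distribution over the 12 characters at every output step.

# Inspect the layer stack and output shapes
model.summary()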

In [13]:
%%time

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train the model each generation and show predictions against the validation dataset
for iteration in range(1, 2):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=1,
              validation_data=(X_val, y_val))
    
    score = model.evaluate(X_val, y_val, verbose=0)
    print('\n')
    print('Test score:', score[0])
    print('Test accuracy:', score[1])
    print('\n')


--------------------------------------------------
Iteration 1
Train on 45000 samples, validate on 5000 samples
Epoch 1/1
45000/45000 [==============================] - 15s - loss: 1.8436 - acc: 0.3339 - val_loss: 1.7053 - val_acc: 0.3648


Test score: 1.70534285183
Test accuracy: 0.3648


CPU times: user 1min 21s, sys: 4.57 s, total: 1min 26s
Wall time: 39.6 s
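
A single generation is clearly not enough; the reference results above quote roughly 100 epochs for three-digit addition with 50k examples. A sketch of how the same fit/evaluate loop could be extended to approach that accuracy:

# Run the same loop for more generations (e.g. ~100) to approach the quoted accuracy
for iteration in range(1, 100):
    model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=1,
              validation_data=(X_val, y_val))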

In [14]:
%%time

#For predicting the outputs, predict_classes returns the index of the
#most likely character at each time step; we decode those indices back
#to characters to get the final output string

# Select 10 samples from the validation set at random so we can visualize errors
for i in range(10):
    ind = np.random.randint(0, len(X_val))
    rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
    preds = model.predict_classes(rowX, verbose=0)
    q = ctable.decode(rowX[0])
    correct = ctable.decode(rowy[0])
    guess = ctable.decode(preds[0], calc_argmax=False)
    print('Q', q[::-1] if INVERT else q)
    print('T', correct)
    print(colors.ok + '☑' + colors.close if correct == guess else colors.fail + '☒' + colors.close, guess)
    print('---')


Q 19+66
T 85
☒ 10
---
Q 525+14
T 539
☒ 556
---
Q 0+784
T 784
☒ 101
---
Q 153+726
T 879
☒ 111
---
Q 885+55
T 940
☒ 109
---
Q 234+129
T 363
☒ 321
---
Q 276+880
T 1156
☒ 1011
---
Q 35+35
T 70
☒ 55
---
Q 383+766
T 1149
☒ 101
---
Q 250+492
T 742
☒ 101
---
CPU times: user 1.65 s, sys: 35.5 ms, total: 1.68 s
Wall time: 1.65 s
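
Beyond spot-checking ten random samples, a stricter measure than the per-character accuracy reported by Keras is the exact-match accuracy over the whole validation set; a minimal sketch using the helpers defined above:

# Fraction of validation questions whose full predicted string matches the target
preds = model.predict_classes(X_val, verbose=0)
n_correct = sum(ctable.decode(p, calc_argmax=False) == ctable.decode(t)
                for p, t in zip(preds, y_val))
print('Exact-match accuracy: {:.3f}'.format(float(n_correct) / len(y_val)))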

In [ ]: