Self-explanatory Examples of Stateful LSTMs

http://philipperemy.github.io/keras-stateful-lstm/

Within a batch, each sequence has its OWN states, and each sequence is treated as independent from the other sequences in the batch. If your batch is B, sequences B[1] and B[2] will never share anything (whether stateful=False or stateful=True).

If the LSTM is stateful, the final states of the current batch are propagated to the next batch: the state computed for the sequence at index i of batch k is used as the initial state for the sequence at index i of batch k+1. Batching sequences only serves to speed up the computations (e.g. on a GPU).
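To make the batch-to-batch propagation concrete, here is a minimal numpy sketch (the array names and sizes are hypothetical, not the dataset used below) of how long sequences have to be cut into consecutive chunks so that batch index i always carries the continuation of the same original sequence:

import numpy as np

# Hypothetical example: 2 long sequences of length 100, cut into chunks of 20.
n_seq, seq_len, chunk_len = 2, 100, 20
long_sequences = np.random.rand(n_seq, seq_len, 1)

# batches[k] has shape (n_seq, chunk_len, 1). Batch index i always holds a piece
# of the original sequence i, so with stateful=True the final state computed on
# batches[k][i] is reused as the initial state for batches[k + 1][i].
# With stateful=False, every chunk would instead start from a zero state.
batches = np.split(long_sequences, seq_len // chunk_len, axis=1)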

Let's consider a matrix X with 1024 rows and a batch size of 32, treat each row of X as a sequence (1024 sequences), and set stateful=False. This is the default Keras setup. Each sequence has its states propagated throughout that sequence, BUT each sequence has its own states and shares nothing with the other sequences in the batch (the first dimension of the states tensor is batch_size). So, in your own terms, each sequence within a batch is already "stateful" along its time axis, because that is the whole point of a recurrent neural network.

If instead you consider each sample (a sub-sequence of length 1) as its own sequence and set stateful=True, the case is different: since we are working in a supervised learning context, you have to provide more targets Y, because you now have more sequences (1024 × nb_columns of them, where shape(X) = (1024, nb_columns)).
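As a sketch of that bookkeeping (the shapes and array names here are hypothetical, not the dataset used below), treating every value as its own sub-sequence of length 1 means repeating the row-level target once per timestep:

import numpy as np

# Hypothetical shapes: 1024 rows of nb_columns values, one binary target per row.
nb_rows, nb_columns = 1024, 20
X_demo = np.random.rand(nb_rows, nb_columns)
Y_demo = np.random.randint(0, 2, size=nb_rows)

# 1024 * nb_columns samples of shape (1, 1), and the row target repeated
# for every timestep of that row.
X_sub = X_demo.reshape(nb_rows * nb_columns, 1, 1)
Y_sub = np.repeat(Y_demo, nb_columns)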

Let’s consider the easiest example, with batch_size=1 and shuffle=False, to avoid any confusion.

Let the input matrix X be of the arbitrary shape (1200,20). Those values are totally arbitrary, trust me! Let’s also assume that X is made exclusively of zeros except in the first column where exactly half of the values are 1.


In [6]:
import numpy as np
from numpy.random import choice
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

N = 1200
N_train = 1000

# Zeros everywhere, except a 1 in the first column of exactly half of the rows.
X = np.zeros((N, 20))
one_indexes = choice(a=N, size=int(N / 2), replace=False)
X[one_indexes, 0] = 1  # very long term memory.

# The target is the first column; reshape to (samples, timesteps, features).
Y = X[:, 0]
X = X.reshape(X.shape + (1,))
Y = Y.reshape(Y.shape + (1,))

X[:3,:5, 0]
Y[:3]


Out[6]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
Out[6]:
array([[ 1.],
       [ 1.],
       [ 0.]])

In [7]:
X_train = X[:N_train,:]
Y_train = Y[:N_train]
X_val = X[N_train:,:]
Y_val = Y[N_train:]
X_val[:5,:10, 0]
Y_val[:5]


Out[7]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
Out[7]:
array([[ 1.],
       [ 0.],
       [ 1.],
       [ 0.],
       [ 1.]])

The first column is also stored in Y, which will be the target of the model. In other words, the model must learn that if the first value of the sequence is 1, the target is 1, and 0 otherwise.

We split (X, Y) into a training set (1000 rows) and a validation set (200 rows).

Now that we have the setup, let’s try to illustrate each point that we have discussed above.


In [8]:
X_train.shape, Y_train.shape
X_val.shape, Y_val.shape


Out[8]:
((1000, 20, 1), (1000, 1))
Out[8]:
((200, 20, 1), (200, 1))

Impact of sequence subsampling

In our first experiment, we consider a stateless LSTM on our dataset (X, Y). Our objective is to show that if we split the long sequences contained in X into smaller subsequences, the model cannot converge. The function we use to split a long sequence into subsequences is:


In [10]:
max_len = 10

def prepare_sequences(x_train, y_train, window_length):
    # Slide a window of length `window_length` over each sequence (stride 1),
    # always pairing the window with the target of the full sequence.
    windows = []
    windows_y = []
    for i, seq in enumerate(x_train):
        len_seq = len(seq)
        for window_start in range(0, len_seq - window_length + 1):
            window_end = window_start + window_length
            windows.append(seq[window_start:window_end])
            windows_y.append(y_train[i])  # always the same target!
    return np.array(windows), np.array(windows_y)

X2_train, Y2_train = prepare_sequences(X_train, Y_train, max_len)
X2_train.shape, Y2_train.shape
X2_val, Y2_val = prepare_sequences(X_val, Y_val, max_len)
X2_val.shape, Y2_val.shape


Out[10]:
((11000, 10, 1), (11000, 1))
Out[10]:
((2200, 10, 1), (2200, 1))

Indeed, in a sequence of length 20 there are exactly 20 - 10 + 1 = 11 subsequences of length 10 (sketch it if you’re not convinced). Because we have 1000 samples (sequences) in the training set, we end up with 1000 × 11 = 11000 subsequences in the resulting training set (and we make sure to keep their order).
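As a quick sanity check on those counts, using the arrays defined above:

# 11 windows of length 10 per sequence of length 20, for 1000 (resp. 200) sequences.
n_windows_per_seq = X_train.shape[1] - max_len + 1   # 20 - 10 + 1 = 11
assert X2_train.shape == (len(X_train) * n_windows_per_seq, max_len, 1)
assert X2_val.shape == (len(X_val) * n_windows_per_seq, max_len, 1)

Let’s now feed these subsequences to our LSTM: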


In [11]:
batch_size = 64

print('Building STATELESS model...')
model = Sequential()
model.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False, stateful=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


Building STATELESS model...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lstm_1 (LSTM)                    (None, 10)            480         lstm_input_1[0][0]               
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             11          lstm_1[0][0]                     
====================================================================================================
Total params: 491
____________________________________________________________________________________________________

In [19]:
%%capture output
%%time
model.fit(X2_train, Y2_train, batch_size=batch_size, nb_epoch=1, validation_data=(X2_val, Y2_val), shuffle=False)
score, acc = model.evaluate(X2_val, Y2_val, batch_size=batch_size, verbose=0)

In [20]:
output.show()


Train on 11000 samples, validate on 2200 samples
Epoch 1/1
11000/11000 [==============================] - 2s - loss: 0.6332 - acc: 0.5455 - val_loss: 0.6304 - val_acc: 0.6000
CPU times: user 4.16 s, sys: 188 ms, total: 4.35 s
Wall time: 2.69 s

Because the targets Y are approximately equally distributed (half are 0, half are 1), the baseline is an accuracy of 0.5.

After 15 epochs, on this very simple task, we have:

Epoch 15/15 11000/11000 [==============================] - 21s - loss: 0.6623 - acc: 0.5343 - val_loss: 0.6606 - val_acc: 0.5455

The interpretation of this result is clear: the LSTM cannot find the optimal solution when it is fed subsequences. On such an easy problem, we would expect an accuracy above 0.99.
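The reason is in the data itself: only the window that starts at position 0 can contain the informative first value, so roughly 10 out of 11 windows extracted from each sequence are all zeros and carry no signal about their target. A quick check on the windowed training set built above:

# Windows whose content is all zeros carry no information about their target.
blind = np.all(X2_train == 0, axis=(1, 2))
print('uninformative windows: %.3f' % blind.mean())   # roughly 10.5 / 11

# Best achievable accuracy: get the informative windows right and predict the
# majority class on the blind ones.
majority = max(Y2_train[blind].mean(), 1 - Y2_train[blind].mean())
best_acc = (1 - blind.mean()) + blind.mean() * majority
print('accuracy ceiling on this training set: %.3f' % best_acc)  # around 0.55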

Activating the statefulness of the model does not help at all (we’re going to see why in the next section):


In [16]:
batch_size = 110 

print('Building STATEFUL model...')
model = Sequential()
model.add(LSTM(10, batch_input_shape=(batch_size, max_len, 1), return_sequences=False, stateful=True))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


Building STATEFUL model...
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lstm_2 (LSTM)                    (110, 10)             480         lstm_input_2[0][0]               
____________________________________________________________________________________________________
dense_2 (Dense)                  (110, 1)              11          lstm_2[0][0]                     
====================================================================================================
Total params: 491
____________________________________________________________________________________________________

In [22]:
%%capture output
%%time
model.fit(X2_train, Y2_train, batch_size=batch_size, nb_epoch=15, validation_data=(X2_val, Y2_val), shuffle=False)
score, acc = model.evaluate(X2_val, Y2_val, batch_size=batch_size, verbose=0)
score, acc

In [23]:
output.show()


Train on 11000 samples, validate on 2200 samples
Epoch 1/15
11000/11000 [==============================] - 2s - loss: 0.6328 - acc: 0.5400 - val_loss: 0.6304 - val_acc: 0.6000
Epoch 2/15
11000/11000 [==============================] - 2s - loss: 0.6325 - acc: 0.5400 - val_loss: 0.6304 - val_acc: 0.6000
Epoch 3/15
11000/11000 [==============================] - 2s - loss: 0.6322 - acc: 0.5400 - val_loss: 0.6305 - val_acc: 0.6000
Epoch 4/15
11000/11000 [==============================] - 2s - loss: 0.6320 - acc: 0.5418 - val_loss: 0.6305 - val_acc: 0.6000
Epoch 5/15
11000/11000 [==============================] - 2s - loss: 0.6319 - acc: 0.5418 - val_loss: 0.6306 - val_acc: 0.6000
Epoch 6/15
11000/11000 [==============================] - 2s - loss: 0.6318 - acc: 0.5382 - val_loss: 0.6306 - val_acc: 0.6000
Epoch 7/15
11000/11000 [==============================] - 2s - loss: 0.6316 - acc: 0.5382 - val_loss: 0.6332 - val_acc: 0.5814
Epoch 8/15
11000/11000 [==============================] - 2s - loss: 0.6315 - acc: 0.5491 - val_loss: 0.6339 - val_acc: 0.5264
Epoch 9/15
11000/11000 [==============================] - 2s - loss: 0.6326 - acc: 0.5326 - val_loss: 0.6308 - val_acc: 0.5273
Epoch 10/15
11000/11000 [==============================] - 2s - loss: 0.6314 - acc: 0.5418 - val_loss: 0.6308 - val_acc: 0.6000
Epoch 11/15
11000/11000 [==============================] - 2s - loss: 0.6313 - acc: 0.5418 - val_loss: 0.6308 - val_acc: 0.5545
Epoch 12/15
11000/11000 [==============================] - 2s - loss: 0.6313 - acc: 0.5382 - val_loss: 0.6308 - val_acc: 0.5545
Epoch 13/15
11000/11000 [==============================] - 2s - loss: 0.6316 - acc: 0.5382 - val_loss: 0.6313 - val_acc: 0.5455
Epoch 14/15
11000/11000 [==============================] - 2s - loss: 0.6312 - acc: 0.5564 - val_loss: 0.6310 - val_acc: 0.4909
Epoch 15/15
11000/11000 [==============================] - 2s - loss: 0.6311 - acc: 0.5473 - val_loss: 0.6311 - val_acc: 0.4909
Training time: 39.302290201187134
CPU times: user 1min 1s, sys: 2.87 s, total: 1min 3s
Wall time: 39.3 s

In conclusion, subsampling does not help the LSTM converge. So when you have a long time series (e.g. in financial markets), the lookback window length is crucial, and it can be found with Bayesian optimization.

I think sub-sampling can work well if you enforce that the full sequence is preserved across the batches, meaning that consecutive windows X[i] and X[i+1] can be concatenated and still make sense as a piece of the original sequence.

I think the advantage of subsampling is that you can associate different targets with your sequence (like a many-to-many setup, but at specific intervals: e.g. every period of length 10, I generate an output). Consider a case where you have a sequence of length 1000 and you want to apply a stateful LSTM, but with smaller sequences of length 10 (to have more data points). You can sub-sample with the prepare_sequences() function from above. Instead of one big sequence of length 999 used to predict the 1000th value, you get many smaller sequences of length 9, each used to predict its 10th value.

To make this work nicely, you have to enforce the ordering of your sequences (i.e. ban any shuffling when you call keras.model.fit()), because otherwise you lose the long-term memory of your big sequence: sub-sampling from the big sequence and then shuffling all the generated smaller sequences destroys exactly that. It is important that X[0:9] comes before X[10:19], to preserve the time axis.
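Here is a minimal sketch of that windowing idea; the series, its length, and the variable names are hypothetical and not part of the experiment above:

import numpy as np

# Hypothetical long series of length 1000 with a single feature.
series = np.random.rand(1000, 1)

window = 10                        # 9 input steps, the 10th value is the target
X_windows, y_windows = [], []
for start in range(len(series) - window + 1):
    X_windows.append(series[start:start + window - 1])   # input of length 9
    y_windows.append(series[start + window - 1])          # the 10th value
X_windows = np.array(X_windows)    # shape (991, 9, 1)
y_windows = np.array(y_windows)    # shape (991, 1)

# Keep the windows in chronological order and pass shuffle=False to model.fit(),
# otherwise the long-term structure of the original series is lost.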

The gradient is always computed at the end of each batch. If your batch_size = 1, a batch becomes a single sequence and the gradient is computed at the end of this sequence.

Mastering stateful models

Let’s consider the setup without sequence subsampling, where the dimensions of our input matrices are:


In [25]:
max_len = 20
X.shape, Y.shape


Out[25]:
((1200, 20, 1), (1200, 1))

In [26]:
# to make things easier
batch_size = 1

In [27]:
%%time
print('Build STATEFUL model...')
model = Sequential()
model.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=True))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


Build STATEFUL model...
CPU times: user 3.9 s, sys: 4 ms, total: 3.9 s
Wall time: 3.86 s

The idea is to split each sequence Xi (of length 20) into elements of size 1 and feed them one at a time to the LSTM. Once a sequence is over, we manually reset the states of the LSTM to get a clean setup for the next one. Each element is paired with the target Yi of its sequence.

For instance, if we have Xi = [1, 0, 0, ..., 0] and Yi = [1], we feed ((xi,0=[1], yi=[1]), (xi,1=[0], yi=[1]), ..., (xi,19=[0], yi=[1])) to the LSTM.
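Concretely, with the arrays defined above, the stream of (element, target) pairs generated from the first training sequence looks like this:

# 20 pairs for the first training sequence: one element of shape (1, 1, 1) per
# timestep, always paired with the same sequence-level target of shape (1, 1).
pairs = [(X_train[0][j].reshape(1, 1, 1), Y_train[0].reshape(1, 1))
         for j in range(max_len)]
len(pairs), pairs[0][0].shape, pairs[0][1].shape   # (20, (1, 1, 1), (1, 1))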

The operation is lossless and reversible. We repeat the procedure for every (Xi, Yi) to recreate, pair by pair, the inputs that will be fed to the LSTM.

Because the LSTM is stateful, the state is propagated to the next batch, and because batch_size=1, we are sure that the state computed after one element is used as the initial state for the next element of the same sequence. The code is below and is a bit more involved:


In [32]:
import time

def train(X_train, X_val, Y_train, Y_val, nb_epochs):
    print('Train...')
    for epoch in range(nb_epochs):
        start_time = time.time()
        print("Epoch {}".format(epoch))
        mean_tr_acc = []
        mean_tr_loss = []
        for i in range(len(X_train)):
            Y_true = Y_train[i]
            # Feed the sequence one element at a time, always with the same target.
            for j in range(max_len):
                tr_loss, tr_acc = model.train_on_batch(np.expand_dims(np.expand_dims(X_train[i][j], axis=1), axis=1),
                                                       np.array([Y_true]))
                mean_tr_acc.append(tr_acc)
                mean_tr_loss.append(tr_loss)
            # The sequence is over: reset the states before the next one.
            model.reset_states()

        print('accuracy training = {}'.format(np.mean(mean_tr_acc)))
        print('loss training = {}'.format(np.mean(mean_tr_loss)))
        print('___________________________________')

        mean_te_acc = []
        mean_te_loss = []
        for i in range(len(X_val)):
            for j in range(max_len):
                te_loss, te_acc = model.test_on_batch(np.expand_dims(np.expand_dims(X_val[i][j], axis=1), axis=1),
                                                      np.array([Y_val[i]]))
                mean_te_acc.append(te_acc)
                mean_te_loss.append(te_loss)
            model.reset_states()

            # Same element-by-element loop for prediction (the result is not used here).
            for j in range(max_len):
                Y_pred = model.predict_on_batch(np.expand_dims(np.expand_dims(X_val[i][j], axis=1), axis=1))
            model.reset_states()

        print('accuracy testing = {}'.format(np.mean(mean_te_acc)))
        print('loss testing = {}'.format(np.mean(mean_te_loss)))
        print('epoch duration: ', time.time() - start_time)
        print('___________________________________')

In [31]:
%%time
train(X_train, X_val, Y_train, Y_val, 2)


Train...
Epoch 0
accuracy training = 1.0
loss training = 0.0015426294412463903
___________________________________
accuracy testing = 1.0
loss testing = 3.906428389655048e-07
___________________________________
Epoch 1
accuracy training = 1.0
loss training = 1.346518700984234e-07
___________________________________
accuracy testing = 1.0
loss testing = 1.0845209175158743e-07
___________________________________
CPU times: user 10min 30s, sys: 29.5 s, total: 11min
Wall time: 6min 56s

As expected, after just a few epochs our accuracy is very close to 1, which means that our model successfully caught the pattern.
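For completeness, here is how the trained stateful model would be used on a single new sequence at inference time (a small sketch reusing the evaluation loop above): feed the elements in order, keep the last prediction, and reset the states before moving to the next sequence.

# Score one held-out sequence element by element with the stateful model.
seq = X_val[0]                                      # shape (20, 1)
for j in range(max_len):
    p = model.predict_on_batch(seq[j].reshape(1, 1, 1))
model.reset_states()                                # clean state for the next sequence
print('P(first value was 1) = %.4f' % float(p))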

Finally, let’s see how a stateless model performs in the same element-by-element setup:


In [33]:
%%time
print('Build STATELESS model...')
model = Sequential()
model.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


Build STATELESS model...
CPU times: user 4.72 s, sys: 12 ms, total: 4.73 s
Wall time: 4.67 s

In [34]:
%%time
train(X_train, X_val, Y_train, Y_val, 3)


Train...
Epoch 0
accuracy training = 0.5257499814033508
loss training = 0.680762767791748
___________________________________
accuracy testing = 0.5820000171661377
loss testing = 0.6716867685317993
epoch duration:  207.85303044319153
___________________________________
Epoch 1
accuracy training = 0.5177500247955322
loss training = 0.677130937576294
___________________________________
accuracy testing = 0.5820000171661377
loss testing = 0.6700948476791382
epoch duration:  201.38860535621643
___________________________________
Epoch 2
accuracy training = 0.5162000060081482
loss training = 0.6766434907913208
___________________________________
accuracy testing = 0.5820000171661377
loss testing = 0.6693295240402222
epoch duration:  200.68656373023987
___________________________________
CPU times: user 15min 34s, sys: 43.1 s, total: 16min 17s
Wall time: 10min 9s
