http://philipperemy.github.io/keras-stateful-lstm/

Within a batch, each sequence has its OWN states. Also each sequence is seen as independent from the other sequences in the batch. If your batch is Y, sequences Y[1] and Y[2] will never have something in common (whatever stateful=False or stateful=True).

If the LSTM has the stateful mode, the states of the current batch will be propagated to the next batch (at index i). Batching sequences is only to speed up the computations on a GPU.

Let's consider a matrix X with 1024 rows and a batch size of 32. In the first place, let's consider each row of X as a sequence (1024 sequences) and stateful=False. You're in the default Keras setup. It means that each sequence has its states propagated throughout all the sequence BUT each sequence has its own states and there is nothing shared in the batch with the other sequences (the first dimension of the states tensor is the batch_size). So in your own terms, each sequence within a batch will be stateful because that's the purpose of a recurrent neural network.

**Again if you consider sample as sub-sequence (of length 1), and you set stateful=True, the case is different, because we are working in a supervised learning context. You have to provide more targets Y because you have more sequences (= 1024 x nb_columns) where shape(X) = (1024, nb_columns).**

Let’s consider the easiest example where batch_size=1 and shuffle=False to avoid any confusions.

Let the input matrix X be of the arbitrary shape (1200,20). Those values are totally arbitrary, trust me! Let’s also assume that X is made exclusively of zeros except in the first column where exactly half of the values are 1.

```
In [6]:
```import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
N = 1200
N_train = 1000
X = np.zeros((1200, 20))
from numpy.random import choice
#one_indexes = choice(a=N, size=int(N_train / 2), replace=False)
one_indexes = choice(a=N, size=int(N / 2), replace=False)
X[one_indexes, 0] = 1 # very long term memory.
Y = X[:,0]
X = X.reshape(X.shape + (1,))
Y = Y.reshape(Y.shape + (1,))
X[:3,:5, 0]
Y[:3]

```
Out[6]:
Out[6]:
```

```
In [7]:
```X_train = X[:N_train,:]
Y_train = Y[:N_train]
X_val = X[N_train:,:]
Y_val = Y[N_train:]
X_val[:5,:10, 0]
Y_val[:5]

```
Out[7]:
Out[7]:
```

The first column is also stored in Y and it will be the target of the model. In other words, our model must understand that if the first value of the sequence is 1, then the result is 1, 0 otherwise.

We split (X,Y) in training (1000 rows) and validation sets (200 rows).

Now that we have the setup, let’s try to illustrate each point that we have discussed above.

```
In [8]:
```X_train.shape, Y_train.shape
X_val.shape, Y_val.shape

```
Out[8]:
Out[8]:
```

We consider a stateless LSTM in our first experiment and our dataset (X,Y). Our objective is to prove that if we split the big sequences contained in X into smaller sequence, the model cannot converge. The function that we use to split a big sequence into subsequences is:

```
In [10]:
```max_len = 10
def prepare_sequences(x_train, y_train, window_length):
windows = []
windows_y = []
for i, sequence in enumerate(x_train):
len_seq = len(sequence)
for window_start in range(0, len_seq - window_length + 1):
window_end = window_start + window_length
window = sequence[window_start:window_end]
windows.append(window)
windows_y.append(y_train[i]) #always same target!!
return np.array(windows), np.array(windows_y)
X2_train, Y2_train = prepare_sequences(X_train, Y_train, 10)
X2_train.shape, Y2_train.shape
X2_val, Y2_val = prepare_sequences(X_val, Y_val, max_len)
X2_val.shape, Y2_val.shape

```
Out[10]:
Out[10]:
```

```
In [11]:
```batch_size = 64
print('Building STATELESS model...')
model = Sequential()
model.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False, stateful=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

```
```

```
In [19]:
```%%capture output
%%time
model.fit(X2_train, Y2_train, batch_size=batch_size, nb_epoch=1, validation_data=(X2_val, Y2_val), shuffle=False)
score, acc = model.evaluate(X2_val, Y2_val, batch_size=batch_size, verbose=0)

```
In [20]:
```output.show()

```
```

Because the targets Y are approximately equally distributed (half are 0, half are 1), the baseline is an accuracy of 0.5.

After 15 epochs, on this very simple task, we have:

Epoch 15/15 11000/11000 [==============================] - 21s - loss: 0.6623 - acc: 0.5343 - val_loss: 0.6606 - val_acc: 0.5455

The interpretation of this result is very clear. The LSTM cannot find the optimal solution when working with subsequences. On such an easy problem, we expect an accuracy of more than 0.99.

Activating the statefulness of the model does not help at all (we’re going to see why in the next section):

```
In [16]:
```batch_size = 110
print('Building STATEFULL model...')
model = Sequential()
# model.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False, stateful=False))
model.add(LSTM(10, batch_input_shape=(batch_size, max_len, 1), return_sequences=False, stateful=True))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

```
```

```
In [22]:
```%%capture output
%%time
model.fit(X2_train, Y2_train, batch_size=batch_size, nb_epoch=15, validation_data=(X2_val, Y2_val), shuffle=False)
score, acc = model.evaluate(X2_val, Y2_val, batch_size=batch_size, verbose=0)
score, acc

```
In [23]:
```output.show()

```
```

As a conclusion, subsampling does not help the LSTM converge. So when you have a big time series (e.g. in financial markets), the lookback window length is crucial and can be found with Bayesian Optimization .

I think the sub-sampling can work good if you enforce that the full sequence is preserved through the batches, meaning that if X is my batch, I can concatenate X[i] and X[i+1] and it makes sense.

I think the advantage of subsampling is that you can associate different targets to your sequence (like a many-to-many would do but at specific intervals e.g. every period of length 10, I generate an output). Let's consider a case where you have a sequence of length 1000 and you want to apply a stateful LSTM but with smaller sequences of length 10 (to have more data points). You can sub-sample by using the function prepare_sequences() from above. Instead of having a big sequence X of length 999 and trying to predict the 1000th value, you will have lots of smaller sequences of length 9 where you want to predict the 10th value.

To make it work nicely, you have to enforce the ordering of your sequences (i.e. ban any shuffling when you call keras.model.fit()) because otherwise you will lose the long term memory of your big sequence. It's like sub-sampling from your big sequence and shuffle all your generated smaller sequences. It's important that X[0:9] comes before X[10:19] to keep the time axis value.

The gradient is always computed at the end of each batch. If your batch_size = 1, a batch becomes a single sequence and the gradient is computed at the end of this sequence.

```
In [25]:
```max_len = 20
X.shape, Y.shape

```
Out[25]:
```

```
In [26]:
```# to make things easier
batch_size = 1

```
In [27]:
```%%time
print('Build STATEFUL model...')
model = Sequential()
model.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=True))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

```
```

The idea is to split each sequence Xi (of length 20) into elements of size 1 and feed them to the LSTM. Once the sequence is over, we manually reset the states of the LSTM to have a clean setup for the next one. For each element, we associate the related target Yi.

For instance, if we have Xi=[1,0,0...,0] and Yi=[1], we feed ((xi0=[1],yi=[1]), (xi1=[0],yi=[1]),…, (xi20=[0],yi=[1])) to the LSTM.

The operation is not destructive but reversible. We repeat the procedure for every (Xi,Yi) to re create our new matrices (X,Y) that will used as inputs to the LSTM.

Because the LSTM is stateful, the state will be propagated to the next batch. Also because the batch_size=1, we are sure that the state of the last element will be used as input to the current element. The code is available below and is a bit more involved:

```
In [32]:
```import time
def train(X_train, X_val, Y_train, Y_val, epoch):
print('Train...')
for epoch in range(epoch):
start_time = time.time()
print("Epoch {}".format(epoch))
mean_tr_acc = []
mean_tr_loss = []
for i in range(len(X_train)):
Y_true = Y_train[i]
for j in range(max_len):
tr_loss, tr_acc = model.train_on_batch(np.expand_dims(np.expand_dims(X_train[i][j], axis=1), axis=1),
np.array([Y_true]))
mean_tr_acc.append(tr_acc)
mean_tr_loss.append(tr_loss)
model.reset_states()
print('accuracY training = {}'.format(np.mean(mean_tr_acc)))
print('loss training = {}'.format(np.mean(mean_tr_loss)))
print('___________________________________')
mean_te_acc = []
mean_te_loss = []
for i in range(len(X_val)):
for j in range(max_len):
te_loss, te_acc = model.test_on_batch(np.expand_dims(np.expand_dims(X_val[i][j], axis=1), axis=1),
Y_val[i])
mean_te_acc.append(te_acc)
mean_te_loss.append(te_loss)
model.reset_states()
for j in range(max_len):
Y_pred = model.predict_on_batch(np.expand_dims(np.expand_dims(X_val[i][j], axis=1), axis=1))
model.reset_states()
print('accuracY testing = {}'.format(np.mean(mean_te_acc)))
print('loss testing = {}'.format(np.mean(mean_te_loss)))
print('epoch duration: ', time.time() - start_time)
print('___________________________________')

```
In [31]:
```%%time
train(X_train, X_val, Y_train, Y_val, 2)

```
```

As expected, after just a few epochs, our accuracy is very close to 1, which means that our model could successfully catch the pattern.

Finally, let’s see how the stateless model would perform stateless:

```
In [33]:
```%%time
print('Build STATELESS model...')
model = Sequential()
model.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

```
```

```
In [34]:
```%%time
train(X_train, X_val, Y_train, Y_val, 3)

```
```

```
In [ ]:
```