In [ ]:
from __future__ import print_function
import os
import mxnet as mx
from mxnet import nd, autograd
import numpy as np
import warnings
from collections import defaultdict
# we run everything on the CPU here
ctx = mx.cpu(0)
In [ ]:
warnings.filterwarnings('ignore', category=DeprecationWarning, module='.*/IPython/.*')
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from datetime import datetime
sns.set_style('whitegrid')
sns.set_context('poster')
# Register PDF (vector) and PNG output formats for inline plots
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')
In [ ]:
# +1 because of the time shift between inputs and labels
SEQ_LENGTH = 100 + 1
NUM_SAMPLES_TRAINING = 5000 + 1
# a few samples for testing our prediction outputs vs true labels
NUM_SAMPLES_TESTING = 100 + 1
# set to False if you already have the data files locally
CREATE_DATA_SETS = True
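To make the one-step shift concrete, here is a tiny throwaway illustration (not used later): a series of n points yields n - 1 (input, label) pairs, where the label at each timestep is simply the next value.
In [ ]:
# illustrative only: a 5-point series gives 4 (input, label) pairs
toy = [10, 11, 12, 13, 14]
toy_inputs, toy_labels = toy[:-1], toy[1:]
print(list(zip(toy_inputs, toy_labels)))  # [(10, 11), (11, 12), (12, 13), (13, 14)]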
In [ ]:
# return one random scalar from 0 to 1 using mxnet's nd
def gimme_one_random_number():
    return nd.random_uniform(low=0, high=1, shape=(1, 1)).asnumpy()[0][0]

# create one sine time series with fixed random frequency and amplitude, from Lakshmanan V.'s Medium post
def create_one_time_series(seq_length=10):
    freq = (gimme_one_random_number() * 0.5) + 0.1  # 0.1 to 0.6
    ampl = gimme_one_random_number() + 0.5          # 0.5 to 1.5
    x = np.sin(np.arange(0, seq_length) * freq) * ampl
    return x
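As a quick optional sanity check, we can draw one series and verify its length and amplitude bounds:
In [ ]:
# optional sanity check (illustrative, not used later): one random sine series of 10 steps
example_series = create_one_time_series(seq_length=10)
print(example_series.shape, example_series.min(), example_series.max())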
In [ ]:
def create_batch_time_series(seq_length=10, num_samples=4):
    """Create a dataframe with a batch of random sine time series.
    inputs:
        seq_length: number of time steps in each time series
        num_samples: number of time series
    outputs:
        df: pandas dataframe of shape (num_samples, seq_length), each row being one time series
            and each column one timestep
    """
    column_labels = ['t' + str(i) for i in range(0, seq_length)]
    df = pd.DataFrame(create_one_time_series(seq_length=seq_length)).transpose()
    df.columns = column_labels
    df.index = ['s' + str(0)]
    for i in range(1, num_samples):
        more_df = pd.DataFrame(create_one_time_series(seq_length=seq_length)).transpose()
        more_df.columns = column_labels
        more_df.index = ['s' + str(i)]
        df = pd.concat([df, more_df], axis=0)
    return df
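For example, a small call should return a (num_samples, seq_length) dataframe:
In [ ]:
# optional sanity check (illustrative, not used later) on the expected dataframe shape
demo_df = create_batch_time_series(seq_length=10, num_samples=4)
print(demo_df.shape)  # (4, 10)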
In [ ]:
########################
# Create some time series for training and testing
########################
# build predictable (seeded) random time series
mx.random.seed(123)
# store the data in this directory, create it if it does not exist
if not os.path.exists('../data/timeseries/'):
    os.makedirs('../data/timeseries/')
if CREATE_DATA_SETS:
    data_train = create_batch_time_series(seq_length=SEQ_LENGTH, num_samples=NUM_SAMPLES_TRAINING)
    data_test = create_batch_time_series(seq_length=SEQ_LENGTH, num_samples=NUM_SAMPLES_TESTING)
    # Write data to csvs
    data_train.to_csv("../data/timeseries/train.csv")
    data_test.to_csv("../data/timeseries/test.csv")
else:
    data_train = pd.read_csv("../data/timeseries/train.csv", index_col=0)
    data_test = pd.read_csv("../data/timeseries/test.csv", index_col=0)
In [ ]:
# pick 3 time series at random and plot them
(data_train.sample(3).transpose().iloc[range(0, SEQ_LENGTH)]).plot()
In [ ]:
# number of samples to use for each batch when doing minibatch training on the deep net
batch_size = 64
# number of samples to use for testing. 1 is simplest: it lets us quickly evaluate prediction power on one time series
batch_size_test = 1
# sequence length for training & testing. This is the number of RNN input (and output) cells when unrolling the RNN.
# Adjust it to your liking or to the problem at hand; longer sequences may require deeper nets.
seq_length = 16
# number of minibatches available for training and testing
num_batches_train = data_train.shape[0] // batch_size
num_batches_test = data_test.shape[0] // batch_size_test
# we do 1D time series for now; this is equivalent to vocab_size = 1 for text
num_features = 1
# inputs run from t0 to t_{seq_length-1}, because the last point is kept as the label of the penultimate point
data_train_inputs = data_train.loc[:, data_train.columns[:-1]]
data_train_labels = data_train.loc[:, data_train.columns[1:]]
data_test_inputs = data_test.loc[:, data_test.columns[:-1]]
data_test_labels = data_test.loc[:, data_test.columns[1:]]
# reshape the data to prepare for training & testing ingestion.
# we keep only as many rows as fit into whole minibatches, and only the first seq_length steps of each series,
# so that the reshape below is size-consistent (the remaining timesteps are not used in this simple demo)
train_data_inputs = nd.array(data_train_inputs.values[:num_batches_train * batch_size, :seq_length]).reshape((num_batches_train, batch_size, seq_length, num_features))
train_data_labels = nd.array(data_train_labels.values[:num_batches_train * batch_size, :seq_length]).reshape((num_batches_train, batch_size, seq_length, num_features))
test_data_inputs = nd.array(data_test_inputs.values[:num_batches_test * batch_size_test, :seq_length]).reshape((num_batches_test, batch_size_test, seq_length, num_features))
test_data_labels = nd.array(data_test_labels.values[:num_batches_test * batch_size_test, :seq_length]).reshape((num_batches_test, batch_size_test, seq_length, num_features))
# move the timestep axis ahead of the batch axis: (num_batches, seq_length, batch_size, num_features)
train_data_inputs = nd.swapaxes(train_data_inputs, 1, 2)
train_data_labels = nd.swapaxes(train_data_labels, 1, 2)
test_data_inputs = nd.swapaxes(test_data_inputs, 1, 2)
test_data_labels = nd.swapaxes(test_data_labels, 1, 2)
print('num_samples_training={0} | num_batches_train={1} | batch_size={2} | seq_length={3}'.format(NUM_SAMPLES_TRAINING, num_batches_train, batch_size, seq_length))
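Printing the tensor shapes is a cheap way to confirm the layout that the forward pass below expects, namely (num_batches, seq_length, batch_size, num_features) after the swapaxes calls:
In [ ]:
# sanity check of the final tensor layout
print('train inputs:', train_data_inputs.shape)
print('train labels:', train_data_labels.shape)
print('test inputs :', test_data_inputs.shape)
print('test labels :', test_data_labels.shape)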
An LSTM block has mechanisms to enable "memorizing" information for an extended number of time steps. We use the LSTM block with the following transformations that map inputs to outputs across blocks at consecutive layers and consecutive time steps: $\newcommand{\xb}{\mathbf{x}} \newcommand{\RR}{\mathbb{R}}$
$$g_t = \text{tanh}(X_t W_{xg} + h_{t-1} W_{hg} + b_g),$$$$i_t = \sigma(X_t W_{xi} + h_{t-1} W_{hi} + b_i),$$$$f_t = \sigma(X_t W_{xf} + h_{t-1} W_{hf} + b_f),$$$$o_t = \sigma(X_t W_{xo} + h_{t-1} W_{ho} + b_o),$$$$c_t = f_t \odot c_{t-1} + i_t \odot g_t,$$$$h_t = o_t \odot \text{tanh}(c_t),$$where $\odot$ is an element-wise multiplication operator, and for all $\xb = [x_1, x_2, \ldots, x_k]^\top \in \RR^k$ the two activation functions:
$$\sigma(\xb) = \left[\frac{1}{1+\exp(-x_1)}, \ldots, \frac{1}{1+\exp(-x_k)}\right]^\top,$$$$\text{tanh}(\xb) = \left[\frac{1-\exp(-2x_1)}{1+\exp(-2x_1)}, \ldots, \frac{1-\exp(-2x_k)}{1+\exp(-2x_k)}\right]^\top.$$In the transformations above, the memory cell $c_t$ stores the "long-term" memory in vector form. In other words, the information accumulated and encoded up to time step $t$ is stored in $c_t$ and is only passed along the same layer over different time steps.
Given the inputs $c_{t-1}$ and $h_{t-1}$, the input gate $i_t$ and forget gate $f_t$ help the memory cell decide how to overwrite or keep the memory information. The output gate $o_t$ further lets the LSTM block decide how to retrieve the memory information to generate the current state $h_t$, which is passed to both the next layer of the current time step and the next time step of the current layer. Such decisions are made using the hidden-layer parameters $W$ and $b$ with different subscripts: these parameters are learned during the training phase.
Deep LSTMs are multi-layered: they contain at least two LSTM layers stacked together. The $i$-th hidden LSTM layer (starting at $i=1$) receives two inputs at time $t$: the hidden state $h^{[i-1]}_t$ of the layer below it, and its own hidden state $h^{[i]}_{t-1}$ from the previous timestep, just like any regular single-layer LSTM RNN.
As with a regular single-layer LSTM, the entire sequence is fed through the network one timestep at a time, with all sequences of the minibatch processed in parallel, before backpropagation and the parameter update take place.
Everything else is left unchanged. The output layer will receive inputs from the last hidden layer.
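For reference only, the same stacked architecture can be declared in a few lines with gluon's built-in LSTM block (a minimal sketch with illustrative sizes; this notebook deliberately implements everything from scratch instead):
In [ ]:
# NOT used below -- the gluon equivalent of a 3-layer deep LSTM, for comparison only
from mxnet.gluon import nn, rnn
gluon_lstm = rnn.LSTM(hidden_size=8, num_layers=3, layout='TNC')  # T=timesteps, N=batch, C=features
gluon_output_layer = nn.Dense(1, flatten=False)  # maps the last hidden state to the 1D output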
In [ ]:
# for a 1D time series, this is just a scalar equal to 1
num_inputs = num_features
# same comment
num_outputs = num_features
# number of hidden units in each hidden LSTM layer; the length of this list sets the depth of the net
num_hidden_units = [8, 8, 8]
# number of hidden LSTM layers
num_hidden_layers = len(num_hidden_units)
# number of units in each layer but the output layer
num_units_layers = [num_features] + num_hidden_units
########################
# Weights connecting the inputs to the hidden layers
########################
# weights and biases are now dictionaries because there is one set of them per layer
Wxg, Wxi, Wxf, Wxo, Whg, Whi, Whf, Who, bg, bi, bf, bo = {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}
for i_layer in range(1, num_hidden_layers+1):
    # use local names so we do not clobber the num_inputs / num_hidden_units defined above
    n_in = num_units_layers[i_layer-1]
    n_hid = num_units_layers[i_layer]
    Wxg[i_layer] = nd.random_normal(shape=(n_in, n_hid), ctx=ctx) * .01
    Wxi[i_layer] = nd.random_normal(shape=(n_in, n_hid), ctx=ctx) * .01
    Wxf[i_layer] = nd.random_normal(shape=(n_in, n_hid), ctx=ctx) * .01
    Wxo[i_layer] = nd.random_normal(shape=(n_in, n_hid), ctx=ctx) * .01
    ########################
    # Recurrent weights connecting the hidden layer across time steps
    ########################
    Whg[i_layer] = nd.random_normal(shape=(n_hid, n_hid), ctx=ctx) * .01
    Whi[i_layer] = nd.random_normal(shape=(n_hid, n_hid), ctx=ctx) * .01
    Whf[i_layer] = nd.random_normal(shape=(n_hid, n_hid), ctx=ctx) * .01
    Who[i_layer] = nd.random_normal(shape=(n_hid, n_hid), ctx=ctx) * .01
    ########################
    # Bias vectors for the hidden layer
    ########################
    bg[i_layer] = nd.random_normal(shape=n_hid, ctx=ctx) * .01
    bi[i_layer] = nd.random_normal(shape=n_hid, ctx=ctx) * .01
    bf[i_layer] = nd.random_normal(shape=n_hid, ctx=ctx) * .01
    bo[i_layer] = nd.random_normal(shape=n_hid, ctx=ctx) * .01
########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_units_layers[-1], num_outputs), ctx=ctx) * .01
by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
In [ ]:
params = []
for i_layer in range(1, num_hidden_layers+1):
    params += [Wxg[i_layer], Wxi[i_layer], Wxf[i_layer], Wxo[i_layer],
               Whg[i_layer], Whi[i_layer], Whf[i_layer], Who[i_layer],
               bg[i_layer], bi[i_layer], bf[i_layer], bo[i_layer]]
# add the output layer
params += [Why, by]
for param in params:
    param.attach_grad()
In [ ]:
# we use a standard RMSE loss here, good enough for a 1D time series
def rmse(yhat, y):
    # sum of squared errors over all non-batch axes, then sqrt, then mean over the batch
    return nd.mean(nd.sqrt(nd.sum(nd.power(y - yhat, 2), axis=0, exclude=True)))

def average_rmse_loss(outputs, labels):
    assert len(outputs) == len(labels)
    total_loss = 0.
    for (output, label) in zip(outputs, labels):
        total_loss = total_loss + rmse(output, label)
    return total_loss / len(outputs)
In [ ]:
# Standard Gradient Descent
# Standard Stochastic Gradient Descent
def SGD(params, learning_rate):
    for param in params:
        param[:] = param - learning_rate * param.grad

# Adam Optimizer. Inspired by Agustinus Kristiadi's blog post on optimizers
def adam(params, learning_rate, M, R, index_adam_call, beta1, beta2, eps):
    for k, param in enumerate(params):
        M[k] = beta1 * M[k] + (1. - beta1) * param.grad
        R[k] = beta2 * R[k] + (1. - beta2) * (param.grad)**2
        # bias correction: since we initialized M & R to zeros, they're biased toward zero on the first few iterations
        m_k_hat = M[k] / (1. - beta1**index_adam_call)
        r_k_hat = R[k] / (1. - beta2**index_adam_call)
        if np.isnan(M[k].asnumpy()).any() or np.isnan(R[k].asnumpy()).any():
            raise ValueError('NaNs propagating in the Adam optimizer')
        param[:] = param - learning_rate * m_k_hat / (nd.sqrt(r_k_hat) + eps)
    return params, M, R
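As a cross-check (not used in the training loop below), MXNet also ships a built-in Adam implementation that operates on raw NDArrays and should behave like the hand-written adam() above; a minimal sketch:
In [ ]:
# alternative, for comparison only: MXNet's built-in Adam on the same parameter list
adam_opt = mx.optimizer.Adam(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
adam_states = [adam_opt.create_state(k, p) for k, p in enumerate(params)]
# after loss.backward(), one would update each parameter in place:
# for k, p in enumerate(params):
#     adam_opt.update(k, p, p.grad, adam_states[k])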
In [ ]:
def single_lstm_unit_calcs(X, c, Wxg, h, Whg, bg, Wxi, Whi, bi, Wxf, Whf, bf, Wxo, Who, bo):
    """Do the LSTM computations for a single cell.
    input:
        all parameters (W, biases), input X, memory cell c, hidden state h
    output:
        c: updated value of the memory cell; 'updated' means moved from timestep t-1 to timestep t
        h: updated value of the hidden state
    """
    g = nd.tanh(nd.dot(X, Wxg) + nd.dot(h, Whg) + bg)
    i = nd.sigmoid(nd.dot(X, Wxi) + nd.dot(h, Whi) + bi)
    f = nd.sigmoid(nd.dot(X, Wxf) + nd.dot(h, Whf) + bf)
    o = nd.sigmoid(nd.dot(X, Wxo) + nd.dot(h, Who) + bo)
    #######################
    c = f * c + i * g
    h = o * nd.tanh(c)
    return c, h
def deep_lstm_rnn(inputs, h, c, temperature=1.0):
    """Do one forward pass of the entire deep net (across all layers) for one minibatch 'inputs'.
    h: dict of nd.arrays, each key being the index of a hidden layer (from 1 to num_hidden_layers).
       Index 0, if any, is the input layer.
    """
    outputs = []
    # 'inputs' is one MINIBATCH of sequences, so its shape is (seq_length, number_of_seq, features_dim).
    # Note that features_dim is 1 for a 1D time series, vocab_size for a character, n for an n-D time series.
    for X in inputs:  # this loop is therefore a loop over the timesteps
        # X is one minibatch of one timestep's values, NOT the entire sequence. E.g. if each batch has 37 sequences,
        # the first X holds the first values of those 37 sequences: each iteration on X corresponds to one timestep,
        # processed in batches of different sequences.
        h[0] = X  # the first hidden layer takes the input X as input
        for i_layer in range(1, num_hidden_layers+1):
            # lstm units now have the 2 following inputs:
            # i) h_t from the previous layer (equivalent to the input X for a non-deep lstm net),
            # ii) h_t-1 from the current layer (same as for non-deep lstm nets)
            c[i_layer], h[i_layer] = single_lstm_unit_calcs(h[i_layer-1], c[i_layer],
                                                            Wxg[i_layer], h[i_layer], Whg[i_layer], bg[i_layer],
                                                            Wxi[i_layer], Whi[i_layer], bi[i_layer],
                                                            Wxf[i_layer], Whf[i_layer], bf[i_layer],
                                                            Wxo[i_layer], Who[i_layer], bo[i_layer])
        yhat_linear = nd.dot(h[num_hidden_layers], Why) + by
        # yhat is a batch of several values of the same timestep: the prediction of the next timestep's value.
        # we cannot use a 1.0-bounded activation (e.g. tanh) here, since amplitudes can be greater than 1.0
        yhat = yhat_linear
        # outputs has the same shape as inputs, i.e. a list of minibatches of data points;
        # outputs is thus a minibatch of output sequences
        outputs.append(yhat)
    return (outputs, h, c)
In [ ]:
def test_prediction(one_input_seq, one_label_seq, temperature=1.0):
    #####################################
    # Set the initial state of the hidden representation ($h_0$) to the zero vector
    #####################################
    # some better initialization needed?
    h, c = {}, {}
    for i_layer in range(1, num_hidden_layers+1):
        h[i_layer] = nd.zeros(shape=(batch_size_test, num_units_layers[i_layer]), ctx=ctx)
        c[i_layer] = nd.zeros(shape=(batch_size_test, num_units_layers[i_layer]), ctx=ctx)
    outputs, h, c = deep_lstm_rnn(one_input_seq, h, c, temperature=temperature)
    loss = rmse(outputs[-1][0], one_label_seq)
    return outputs[-1][0].asnumpy()[-1], one_label_seq.asnumpy()[-1], loss.asnumpy()[-1], outputs, one_label_seq

def check_prediction(index):
    """Compute one prediction and its associated loss on the index-th sequence of the test dataset.
    input:
        index: index of the sequence in test_data_inputs
    output:
        df: pandas dataframe that contains predicted values vs true values (called 'labels')
    """
    o, label, loss, outputs, labels = test_prediction(test_data_inputs[index], test_data_labels[index], temperature=1.0)
    outputs = [float(i.asnumpy().flatten()[0]) for i in outputs]
    true_labels = list(test_data_labels[index].asnumpy().flatten())
    df = pd.DataFrame([outputs, true_labels]).transpose()
    df.columns = ['predicted', 'true']
    return df
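Once the parameters have been trained (next cell), check_prediction can also be called on its own, for example:
In [ ]:
# example usage: predicted vs true values for the first test sequence
df_check = check_prediction(index=0)
print(df_check.head())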
In [ ]:
# one epoch is one pass over the entire training dataset.
epochs = 32
moving_loss = 0.
# 0.001 works for an [8, 8, 8] net after about 70 epochs of 32-sized batches.
# It can be somewhat larger for deeper nets or when using SGD, to speed up training.
learning_rate = 0.001
# Adam Optimizer parameters
beta1 = .9
beta2 = .999
index_adam_call = 0
# M & R arrays to keep track of momenta in adam optimizer. params is a list that contains all ndarrays of parameters
M = {k: nd.zeros_like(v) for k, v in enumerate(params)}
R = {k: nd.zeros_like(v) for k, v in enumerate(params)}
# initialize dataframes that keep track of Loss and Errors of test predictions, for visualization purposes
df_moving_loss = pd.DataFrame(columns=['Loss', 'Error'])
df_moving_loss.index.name = 'Epoch'
# needed to update plots on the fly
%matplotlib notebook
fig, axes_fig1 = plt.subplots(1,1, figsize=(6,3)) # for plotting predicted vs true time series
fig2, axes_fig2 = plt.subplots(1,1, figsize=(6,3)) # for plotting training Loss and Relative error on predictions for a randomly chosen test sequence
for e in range(epochs):
    ############################
    # Attenuate the learning rate by a factor of 2 every 100 epochs. Only useful for SGD; not needed (nor valid?) for Adam.
    ############################
    if (e+1) % 100 == 0:
        learning_rate = learning_rate / 2.0
    # initialize hidden and memory states of the net
    h, c = {}, {}
    for i_layer in range(1, num_hidden_layers+1):
        h[i_layer] = nd.zeros(shape=(batch_size, num_units_layers[i_layer]), ctx=ctx)
        c[i_layer] = nd.zeros(shape=(batch_size, num_units_layers[i_layer]), ctx=ctx)
    for i in range(num_batches_train):
        data_minibatch = train_data_inputs[i]
        label_minibatch = train_data_labels[i]
        with autograd.record():
            outputs, h, c = deep_lstm_rnn(data_minibatch, h, c)
            loss = average_rmse_loss(outputs, label_minibatch)
        loss.backward()
        # SGD(params, learning_rate)
        index_adam_call += 1  # needed for bias correction in the Adam optimizer
        params, M, R = adam(params, learning_rate, M, R, index_adam_call, beta1, beta2, 1e-8)
        ##########################
        # Keep a moving average of the losses
        ##########################
        if (i == 0) and (e == 0):
            moving_loss = nd.mean(loss).asscalar()
        else:
            moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
    df_moving_loss.loc[e] = round(moving_loss, 4)
    ############################
    # Predictions and plots
    ############################
    # use the epoch index to pick one test sequence (not random, but it varies from epoch to epoch)
    data_prediction_df = check_prediction(index=e)
    axes_fig1.clear()
    data_prediction_df.plot(ax=axes_fig1)
    fig.canvas.draw()
    prediction = round(data_prediction_df.tail(1)['predicted'].values.flatten()[-1], 3)
    true_label = round(data_prediction_df.tail(1)['true'].values.flatten()[-1], 3)
    rel_error = round(100. * np.abs(prediction / true_label - 1.0), 2)
    axes_fig2.clear()
    if e == 0:
        moving_rel_error = rel_error
    else:
        moving_rel_error = .9 * moving_rel_error + .1 * rel_error
    df_moving_loss.loc[e, ['Error']] = moving_rel_error
    axes_loss_plot = df_moving_loss.plot(ax=axes_fig2, secondary_y='Loss', color=['r', 'b'])
    axes_loss_plot.right_ax.grid(False)
    axes_loss_plot.right_ax.set_yscale('log')
    fig2.canvas.draw()
    print("Epoch = {0} | Loss = {1} | Out-of-sample prediction = {2} True = {3} Error = {4}".format(e, moving_loss, prediction, true_label, moving_rel_error))
%matplotlib inline
After about 10 epochs, the deep net already exhibits decent prediction power.
After about 30 epochs, convergence slows down and the prediction error is of the order of a few percent. At some point, the deep net may diverge and NaNs may appear.
The deep net can be customized by changing its shape via num_hidden_units, as sketched below. A deeper net will require more training data and longer training, without bringing much extra performance here: the underlying data is just sine waves, so its complexity is limited. A super-complex net (i.e. deep, with many hidden cells) that tries to learn a not-so-complex time series is a sure path to unnecessarily slow training at best, and to overfitting at worst.
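For instance, a wider-but-shallower variant only requires changing the list of hidden units before re-running the parameter-initialization cells (the values below are illustrative, not tuned):
In [ ]:
# e.g. two wider hidden layers instead of three narrow ones (illustrative values)
num_hidden_units = [16, 16]
num_hidden_layers = len(num_hidden_units)
num_units_layers = [num_features] + num_hidden_units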