TimeSeries Collection

LSTM 20 steps ahead

http://datascience.stackexchange.com/questions/15694/sequence-forecasting-in-keras-not-possible-for-variable-length-sequence-forecast

I'm using keras for multiple-step ahead time series forecasting of a univariate time series of type float. Judging by the results I got, the approach works perfectly well. There is, however, an important detail in the training process that baffles me:

keras requires the sequence length of the input sequences (X matrix) to be equal to the forecasting horizon (y matrix). That means, for example, that keras needs input sequences of length 20 in order to forecast the next 20 time steps. My goal is to be able to forecast as many time steps as I specify, given the last 20 time steps. With the below code, this is not possible:

I get a dimension error whenever the sequence lengths of the sequences in X_train and in y_train differ from one another.


In [ ]:
import numpy as np
from keras.layers.core import Dense, Activation, Dropout, TimeDistributedDense
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.optimizers import RMSprop
np.random.seed(1234)


model = Sequential()
layers = [1, 20, 40, 1]  # [input dim, LSTM 1 units, LSTM 2 units, output dim]

model.add(LSTM(
    input_dim=layers[0],
    output_dim=layers[1],
    return_sequences=True))
model.add(Dropout(0.3))

model.add(LSTM(
    layers[2],
    return_sequences=True))
model.add(Dropout(0.3))

model.add(TimeDistributedDense(
    output_dim=layers[3]))
model.add(Activation("linear"))

rms = RMSprop(lr=0.001)
model.compile(loss="mse", optimizer=rms)
model.fit(
    X_train, y_train,
    batch_size=512, nb_epoch=100, validation_split=0.1)
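
One common way around this constraint (not from the original question; a minimal sketch in the same Keras 1-style API, with illustrative layer sizes and the hypothetical names n_in and n_out) is an encoder-decoder: the first LSTM encodes the fixed-length input window into a single vector, RepeatVector copies that vector once per desired output step, and a second LSTM plus a time-distributed Dense layer decode it, so the forecast horizon no longer has to match the input length.

In [ ]:
from keras.models import Sequential
from keras.layers.core import Dense, RepeatVector
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed

n_in, n_out = 20, 5  # read the last 20 steps, forecast any horizon (here 5)

model = Sequential()
model.add(LSTM(64, input_shape=(n_in, 1)))    # encoder: summarize the input window
model.add(RepeatVector(n_out))                # repeat the summary once per output step
model.add(LSTM(64, return_sequences=True))    # decoder: unroll over the forecast horizon
model.add(TimeDistributed(Dense(1)))          # one value per forecast step
model.compile(loss="mse", optimizer="rmsprop")
# expects X of shape (n_samples, n_in, 1) and y of shape (n_samples, n_out, 1)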

Household example

http://vict0rsch.github.io/tutorials/keras/recurrent/

Background info: Recurrent Nets and Data dimensions https://github.com/Vict0rSch/deep_learning/blob/master/recurrent.md

Source: https://github.com/Vict0rSch/deep_learning/blob/master/keras/recurrent/recurrent_keras_power.py

The task here is to predict values of a time series: the history of 2 million minutes of a household's power consumption. We are going to use a multi-layered LSTM recurrent neural network to predict the last value of a sequence of values. Put another way: given 49 timesteps of consumption, what will be the 50th value?

Data: https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption#


In [31]:
### prevent the dying jupyter notebook
import sys
stdout = sys.stdout
#sys.stdout = sys.__stdout__  # did not work to restore print output -> console
#sys.stdout = open('keras_output.txt', 'a+')
#sys.stdout = stdout

import matplotlib.pyplot as plt
%matplotlib inline

dpath = 'data/power/'
path_to_dataset = dpath + 'household_power_consumption.txt'

In [ ]:
%mkdir -p $dpath
!wget -P $dpath https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip
!unzip -d $dpath data/power/household_power_consumption.zip

In [2]:
import numpy as np
import time
import csv
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from keras.models import Sequential
np.random.seed(1234)


Using TensorFlow backend.

The initial file contains lots of different pieces of data. We will focus here on a single value: the house's Global_active_power history, minute by minute for almost 4 years, i.e. roughly 2 million points. Some values are missing (they are marked with a ?), which is why we try to load each value as a float into the list and simply ignore it if it is not a number.

Also, if we do not want to load the entire dataset, there is a condition that stops loading the data once a certain ratio of the values has been read.

Once all the data points are loaded as one large time series, we have to split it into examples. Again, one example is made of a sequence of 50 values. Using the first 49, we try to predict the 50th. Moreover, we do this for every minute, given the 49 previous ones, so we use a sliding window of size 50.
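
As a toy illustration of this windowing (hypothetical numbers, not the real dataset), the same indexing as in the loading function below turns a short series into overlapping windows:

In [ ]:
import numpy as np

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
window = 4
# same construction as in data_power_consumption below
examples = np.array([signal[i:i + window] for i in range(len(signal) - window)])
print(examples)
# [[ 1.  2.  3.  4.]
#  [ 2.  3.  4.  5.]]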

Neural networks usually learn much better when the data is pre-processed (cf. Y. LeCun's 1995 paper, section 4.3). However, with time series we do not want the network to learn on data too far from the real values, so here we keep it simple and only center the data to have a 0 mean.

Now that the examples are formatted, we need to split them into train and test sets, and into inputs and targets. Here we select 10% of the data as test and 90% to train. We also select the last value of each example to be the target, the rest being the input sequence.

We shuffle the training examples so that training sees them in no particular order and the batch estimates of the loss are drawn from a roughly uniform distribution. We do not shuffle the test set, so that we can visualize our predictions against the real signal.


In [20]:
path_to_dataset
sequence_length = 50
h5Path = '{}power_TF_s{}.h5'.format(dpath, sequence_length)
h5Path


Out[20]:
'data/power/household_power_consumption.txt'
Out[20]:
'data/power/power_TF_s50.h5'

In [21]:
def data_power_consumption(path_to_dataset,
                           sequence_length=50,
                           ratio=1.0):

    max_values = ratio * 2049280

    with open(path_to_dataset, encoding = "ISO-8859-1") as f:
        data = csv.reader(f, delimiter=";")
        power = []
        nb_of_values = 0
        for line in data:
            try:
                power.append(float(line[2]))
                nb_of_values += 1
            except ValueError:
                pass
            # 2049280.0 is the total number of valid values, i.e. ratio = 1.0
            if nb_of_values >= max_values:
                break

    print("Data loaded from csv. Formatting...")

    result = []
    for index in range(len(power) - sequence_length):
        result.append(power[index: index + sequence_length])
    result = np.array(result)  # shape (2049230, 50)

    result_mean = result.mean()
    result -= result_mean
    print("Shift : ", result_mean)
    print("Data  : ", result.shape)

    row = round(0.9 * result.shape[0])
    train = result[:row, :]
    np.random.shuffle(train)
    X_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = result[row:, :-1]
    y_test = result[row:, -1]

    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

    return [X_train, y_train, X_test, y_test]

X_train, y_train, X_test, y_test = data_power_consumption(path_to_dataset, sequence_length)


Data loaded from csv. Formatting...
Shift :  1.09157810417
Data  :  (2049230, 50)

In [37]:
fig = plt.figure(facecolor='white', figsize=(20,10))
ax = fig.add_subplot(111)
ax.plot(X_train[:100,0,0], label='X_train')


Out[37]:
[<matplotlib.lines.Line2D at 0x7effd4a99a90>]

In [22]:
X_train.shape, y_train.shape
X_train[0,:,0]
y_train[0]


Out[22]:
((1844307, 49, 1), (1844307,))
Out[22]:
array([-0.8775781, -0.8775781, -0.8795781, -0.8115781, -0.7955781,
       -0.7955781, -0.7955781, -0.7955781, -0.7975781, -0.7975781,
       -0.7995781, -0.7975781, -0.7995781, -0.7315781, -0.7095781,
       -0.7675781, -0.7715781, -0.7735781, -0.7695781, -0.7655781,
       -0.7655781, -0.7675781, -0.7675781, -0.7675781, -0.7695781,
       -0.7715781, -0.7715781, -0.7715781, -0.7795781, -0.7795781,
       -0.7815781, -0.7795781, -0.7815781, -0.8715781, -0.8695781,
       -0.8695781, -0.8715781, -0.8695781, -0.8715781, -0.8715781,
       -0.8775781, -0.8815781, -0.8815781, -0.8815781, -0.8815781,
       -0.8815781, -0.8815781, -0.8835781, -0.8775781])
Out[22]:
-0.87357810416595072

The list [ X_test[i][0] ] (the first value of each test example) is the entire test signal, minus the last 49 values, from which the examples were built, since we used a sliding window with a stride of 1 timestep.

predict(X_test[0]) is therefore the prediction for the 50th value and its associated target is y_test[0]. Moreover, by construction, y_test[0] = X_test[1][48] = X_test[2][47] = ...

then predict(X_test[1]) is the prediction of the 51st value, associated with y_test[1] as a target.

therefore predict(X_test) is the predicted signal, one step ahead, and y_test is its target.
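
A quick sanity check of that identity on the arrays built above (just assertions, no new data):

In [ ]:
import numpy as np

# by construction of the 1-timestep sliding window:
assert np.isclose(y_test[0], X_test[1][48, 0])
assert np.isclose(y_test[0], X_test[2][47, 0])
assert np.isclose(y_test[1], X_test[2][48, 0])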


In [23]:
X_test.shape, y_test.shape
X_test[0,:10,0]
X_test[1,:10,0]
X_test[0][0]


Out[23]:
((204923, 49, 1), (204923,))
Out[23]:
array([ 0.1644219,  0.6324219,  0.6564219,  0.6764219,  0.6844219,
        0.6924219,  0.2564219, -0.0355781, -0.0375781, -0.0375781])
Out[23]:
array([ 0.6324219,  0.6564219,  0.6764219,  0.6844219,  0.6924219,
        0.2564219, -0.0355781, -0.0375781, -0.0375781, -0.0395781])
Out[23]:
array([ 0.1644219])

In [24]:
data = [X_train, y_train, X_test, y_test]

The last point concerns input formats. Read through the recurrent post linked above to get more familiar with data dimensions. We reshape the inputs to have dimensions (#examples, #values in the sequence, dimension of each value). Here each value is 1-dimensional, since it is only one measure (of power consumption at time t); if we were to predict speed vectors, for instance, each value could be 3-dimensional. Finally, we return X_train, y_train, X_test, y_test in a list, so that we can feed it to our run function as a single object.
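
A toy check of that reshape (illustrative values only, not part of the pipeline):

In [ ]:
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)      # 2 examples, 3 timesteps, scalar values
X = np.reshape(X, (X.shape[0], X.shape[1], 1))   # -> (#examples, #timesteps, dim of each value)
print(X.shape)  # (2, 3, 1)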

So here we are going to build our Sequential model, which means we stack layers in this object. layers is the list containing the sizes of each layer. We will therefore have a network with a 1-dimensional input, two hidden layers of sizes 50 and 100, and finally a 1-dimensional output layer.

After the model is initialized, we add a first layer, in this case an LSTM layer. Here we use the default parameters, so it behaves as a standard recurrent layer. Since our input is 1-dimensional, we declare that it should expect an input_dim of 1. Then we say we want layers[1] units in this layer. We also add 20% Dropout after this layer.

The last layer we use is a Dense (i.e. feedforward) layer. Since we are doing a regression, its activation is linear.

Lastly, we compile the model with a mean squared error loss (again, standard for regression) and the RMSprop optimizer. See the mnist example to learn more about RMSprop.

  1. The first value of the example sequence is fed to the network at the input
    1. The first hidden layer’s activation is computed and passed both to the second hidden layer and to itself (for the next timestep)
    2. The second hidden layer takes the first hidden layer’s activation as input, computes its own activation and passes it only to itself
  2. The second value of the same sequence is fed in
    1. The first hidden layer takes as input both this value and its own activation from the first timestep. The computed activation is again fed both to the second layer and to the first hidden layer itself
    2. The second layer behaves likewise: it takes its own previous activation and the first hidden layer’s output as inputs and outputs an activation. This activation is, once again, kept by the second hidden layer for the next timestep
  3. The last value of the sequence is fed into the network
    1. The first hidden layer behaves as before (2.a)
    2. The second layer also behaves as before (2.b), except that this time its activation is also passed to the last, Dense layer
    3. The Dense layer computes its activation from the second hidden layer’s activation. This activation is the prediction our network makes for the next timestep (the 4th in this three-step illustration, the 50th in our data)

To conclude, the fact that return_sequences=True for the first layer means that its output is always fed to the second layer. Taken over time as a whole, all its activations can be seen as the sequence of predictions this first layer has made from the input sequence.

On the other hand, return_sequences=False for the second layer because its output is only fed to the next layer at the end of the sequence. Taken over time as a whole, it does not output one prediction per timestep but only one prediction vector (of size layers[2]) for the whole input sequence.

The linear Dense layer is used to aggregate all the information from this prediction vector into one single value, the predicted next timestep of the sequence.

In our case, the first LSTM layer returns sequences because we want it to transfer its information both to the next layer up in the stack and to itself for the next timestep.

For the second one, however, we only need its prediction at the last timestep, which is compared to the target.

This means that for inputs 0 to sequence_length - 2 the prediction is only passed back to the layer itself for the next timestep, not forwarded to the next (output) layer. Only the output for the last input (index sequence_length - 1) is passed forward to the Dense layer for the loss computation against the target.
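
A small shape check of that difference (a sketch with arbitrary layer sizes, in the same Keras 1-style API as above): with return_sequences=True the layer emits one output per timestep, with return_sequences=False only one vector for the whole sequence.

In [ ]:
from keras.models import Sequential
from keras.layers.recurrent import LSTM

per_step = Sequential()
per_step.add(LSTM(8, input_shape=(49, 1), return_sequences=True))
print(per_step.output_shape)   # (None, 49, 8): one activation per timestep

last_only = Sequential()
last_only.add(LSTM(8, input_shape=(49, 1), return_sequences=False))
print(last_only.output_shape)  # (None, 8): one vector for the whole sequence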


In [25]:
def build_model():
    model = Sequential()
    layers = [1, sequence_length, sequence_length*2, 1]  # [input dim, LSTM 1 units, LSTM 2 units, output dim]

    model.add(LSTM(
        input_dim=layers[0],
        output_dim=layers[1],
        return_sequences=True))
    model.add(Dropout(0.2))

    model.add(LSTM(
        layers[2],
        return_sequences=False))
    model.add(Dropout(0.2))

    model.add(Dense(
        output_dim=layers[3]))
    model.add(Activation("linear"))

    start = time.time()
    model.compile(loss="mse", optimizer="rmsprop")
    print("Compilation Time : ", time.time() - start)
    return model

model = build_model()
model.summary()


Compilation Time :  0.2741434574127197
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
lstm_3 (LSTM)                    (None, None, 50)      10400       lstm_input_2[0][0]               
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, None, 50)      0           lstm_3[0][0]                     
____________________________________________________________________________________________________
lstm_4 (LSTM)                    (None, 100)           60400       dropout_3[0][0]                  
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 100)           0           lstm_4[0][0]                     
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 1)             101         dropout_4[0][0]                  
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 1)             0           dense_2[0][0]                    
====================================================================================================
Total params: 70901
____________________________________________________________________________________________________

In [49]:
from keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint(filepath= dpath + 'checkpoint-{epoch:02d}-{val_loss:.2f}.hdf5')

def run_network(model=None, data=None, epochs=1):
    global_start_time = time.time()
    ratio = 0.5
    sequence_length = 50
    path_to_dataset = dpath + 'household_power_consumption.txt'

    if data is None:
        print('Loading data... ')
        X_train, y_train, X_test, y_test = data_power_consumption(
            path_to_dataset, sequence_length, ratio)
    else:
        X_train, y_train, X_test, y_test = data

    print('\nData Loaded. Compiling...\n')

    if model is None:
        model = build_model()

    try:
        history = model.fit(
            X_train, y_train,
            batch_size=512, nb_epoch=epochs, validation_split=0.05,
            verbose=1, callbacks=[checkpoint])
    except KeyboardInterrupt:
        print('Training duration (s) : ', time.time() - global_start_time)
        return model, y_test

    print('Training duration (s) : ', time.time() - global_start_time)
    
    return model, y_test

In [39]:
def predict_sequence_full(model, data, window_size):
    # Start from the first test window, then shift the window by one step at a time,
    # feeding each new prediction back in as the latest observation.
    curr_frame = data[0]
    print("-D-", curr_frame.shape)
    predicted = []

    # one prediction per test example, using only predicted values after the first window
    for i in range(len(data)):
        # model.predict returns shape (1, 1); keep the scalar
        predicted.append(model.predict(curr_frame[np.newaxis, :, :])[0, 0])
        curr_frame = curr_frame[1:]  # drop the oldest value
        # append the new prediction at the end of the window
        curr_frame = np.insert(curr_frame, [window_size - 2], predicted[-1], axis=0)
    return predicted

In [40]:
def plot_result(y_test, predicted, seq_len=100):
    try:
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.plot(y_test[:seq_len])
        plt.plot(predicted[:seq_len])
        plt.show()
    except Exception as e:
        print(str(e))

In [51]:
### prevent the dying jupyter notebook
sys.stdout = open('keras_output.txt', 'a+')
run_network(model, data, 2)
sys.stdout = stdout


Out[51]:
(<keras.models.Sequential at 0x7effd50d28d0>,
 array([ 0.5004219,  0.8504219,  0.6104219, ..., -0.1475781, -0.1535781,
        -0.1575781]))

In [52]:
model.save_weights(h5Path)

In [53]:
print("asdf")


asdf

In [56]:
%%time
length = 150
predicted = predict_sequence_full(model, X_test[:length], sequence_length)
predicted[:5]
plot_result(y_test, predicted, length)


-D- (49, 1)
CPU times: user 5.13 s, sys: 280 ms, total: 5.41 s
Wall time: 5.05 s

Seq 100


In [11]:
#model.load_weights(h5Path)
model.load_weights(dpath + 'power_TF_s150.h5')

In [ ]:
sys.stdout = open('keras_output.txt', 'a+')
run_network(model, data, 2)
sys.stdout = stdout

model.save_weights(h5Path)

In [16]:
%%time
length=1500
X_test[:length].shape
predicted = model.predict(X_test[:length])
plot_result(y_test, predicted, length)


CPU times: user 5.56 s, sys: 344 ms, total: 5.91 s
Wall time: 5.51 s

In [17]:
%%time
predicted = predict_sequence_full(model, X_test[:length], sequence_length)
predicted[:5]
plot_result(y_test, predicted, length)


-D- (149, 1)
Out[17]:
[1.0173913, 0.95691556, 0.92827827, 0.92634255, 0.94639927]

In [19]:
predicted[:100]


Out[19]:
[1.0173913,
 0.95691556,
 0.92827827,
 0.92634255,
 0.94639927,
 0.97766864,
 1.0080423,
 1.0296339,
 1.039916,
 1.0403829,
 1.0344633,
 1.0256516,
 1.016338,
 1.0074536,
 0.99875039,
 0.98939282,
 0.97855055,
 0.9657979,
 0.95125127,
 0.93547577,
 0.91925097,
 0.90330642,
 0.88812155,
 0.87384444,
 0.86033064,
 0.84726268,
 0.83429247,
 0.82115608,
 0.80773216,
 0.79404157,
 0.78020447,
 0.76638013,
 0.75271255,
 0.73929745,
 0.72617418,
 0.71333724,
 0.70075816,
 0.68840593,
 0.67626071,
 0.66431826,
 0.65258664,
 0.64107859,
 0.62980503,
 0.61877,
 0.60797042,
 0.59739739,
 0.58703965,
 0.57688689,
 0.56693155,
 0.55716944,
 0.54759878,
 0.53821927,
 0.52903074,
 0.52003241,
 0.51122248,
 0.50259876,
 0.4941586,
 0.48589951,
 0.47781932,
 0.46991614,
 0.46218827,
 0.45463392,
 0.44725108,
 0.44003749,
 0.43299055,
 0.4261075,
 0.41938534,
 0.4128212,
 0.40641215,
 0.40015534,
 0.39404786,
 0.38808689,
 0.38226959,
 0.37659317,
 0.37105486,
 0.36565188,
 0.36038145,
 0.355241,
 0.35022792,
 0.34533969,
 0.34057382,
 0.3359279,
 0.33139953,
 0.32698634,
 0.32268599,
 0.3184962,
 0.31441471,
 0.31043926,
 0.30656761,
 0.30279759,
 0.29912701,
 0.29555377,
 0.29207572,
 0.28869078,
 0.28539687,
 0.28219193,
 0.27907395,
 0.27604097,
 0.27309099,
 0.27022207]

In [ ]:
# A plain feedforward (MLP) baseline on the flattened window.
# TRAIN_SIZE is assumed to be the length of the input window (defined elsewhere).
model = Sequential()
model.add(Dense(500, input_shape=(TRAIN_SIZE, )))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(250))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(optimizer='adam', loss='mse')

In [ ]:
model.save_weights(dpath+'power_TF_e1.h5')

In [ ]:
# A 1-D convolutional model on the same windows.
# TRAIN_SIZE (window length) and EMB_SIZE (dimension of each value) are assumed defined elsewhere.
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.layers.core import Flatten

model = Sequential()
model.add(Convolution1D(input_shape=(TRAIN_SIZE, EMB_SIZE),
                        nb_filter=64,
                        filter_length=2,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
model.add(MaxPooling1D(pool_length=2))

model.add(Convolution1D(nb_filter=64,
                        filter_length=2,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
model.add(MaxPooling1D(pool_length=2))

model.add(Dropout(0.25))
model.add(Flatten())

model.add(Dense(250))
model.add(Dropout(0.25))
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('linear'))

In [ ]:
# A two-layer LSTM on sequences of shape (TRAIN_SIZE, EMB_SIZE).
# TRAIN_SIZE, EMB_SIZE and HIDDEN_RNN are assumed defined elsewhere.
model = Sequential()
model.add(LSTM(output_dim=HIDDEN_RNN, input_shape=(TRAIN_SIZE, EMB_SIZE), return_sequences=True))
model.add(LSTM(output_dim=HIDDEN_RNN, return_sequences=False))
model.add(Dense(1))
model.add(Activation('linear'))