I'm using Keras for multi-step-ahead time series forecasting of a univariate time series of type float. Judging by the results, the approach works perfectly well. There is, however, an important detail in the training process that baffles me:
Keras requires the sequence length of the input sequences (X matrix) to be equal to the forecasting horizon (y matrix). That means, for example, that Keras needs input sequences of length 20 in order to forecast the next 20 time steps. My goal is to be able to forecast as many time steps as I specify, given the last 20 time steps. With the code below, this is not possible:
I get a dimension error whenever the sequence length of the sequences in X_train differs from the one in y_train.
In [ ]:
import numpy as np
from keras.layers.core import Dense, Activation, Dropout, TimeDistributedDense
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.optimizers import RMSprop

np.random.seed(1234)

model = Sequential()
layers = [1, 20, 40, 1]
model.add(LSTM(
    input_dim=layers[0],
    output_dim=layers[1],
    return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(
    layers[2],
    return_sequences=True))
model.add(Dropout(0.3))
model.add(TimeDistributedDense(
    output_dim=layers[3]))
model.add(Activation("linear"))

rms = RMSprop(lr=0.001)
model.compile(loss="mse", optimizer=rms)
model.fit(
    X_train, y_train,
    batch_size=512, nb_epoch=100, validation_split=0.1)
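One possible workaround (not part of the original question; a minimal sketch assuming the same Keras 1.x API as above): decouple the forecast horizon from the input length with an encoder-decoder layout, where RepeatVector repeats the encoded 20-step history once per forecast step, so the horizon can be chosen freely.
In [ ]:
from keras.layers.core import Activation, RepeatVector, TimeDistributedDense
from keras.layers.recurrent import LSTM
from keras.models import Sequential

input_length = 20  # last 20 observed time steps
horizon = 5        # forecast horizon, freely chosen

model = Sequential()
model.add(LSTM(20, input_shape=(input_length, 1)))  # encode the 20-step history into one vector
model.add(RepeatVector(horizon))                     # repeat it once per forecast step
model.add(LSTM(20, return_sequences=True))           # decode into a sequence of length `horizon`
model.add(TimeDistributedDense(output_dim=1))        # one output value per forecast step
model.add(Activation("linear"))
model.compile(loss="mse", optimizer="rmsprop")
# X_train keeps shape (n_samples, 20, 1); y_train must then have shape (n_samples, horizon, 1)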
Tutorial: http://vict0rsch.github.io/tutorials/keras/recurrent/
Background info (Recurrent Nets and Data dimensions): https://github.com/Vict0rSch/deep_learning/blob/master/recurrent.md
Source: https://github.com/Vict0rSch/deep_learning/blob/master/keras/recurrent/recurrent_keras_power.py
The task here is to predict values of a time series: the history of roughly 2 million minutes of a household's power consumption. We are going to use a multi-layered LSTM recurrent neural network to predict the last value of a sequence of values. Put another way: given 49 timesteps of consumption, what will be the 50th value?
Data: https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption#
In [31]:
### prevent the dying jupyter notebook
import sys
stdout = sys.stdout
#sys.stdout = sys.__stdout__ # did not work to restore print -> console
#sys.stdout = open('keras_output.txt', 'a+')
#sys.stdout = stdout
import matplotlib.pyplot as plt
%matplotlib inline
dpath = 'data/power/'
path_to_dataset = dpath + 'household_power_consumption.txt'
In [ ]:
%mkdir -p $dpath
!wget -P $dpath https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip
!unzip -d $dpath data/power/household_power_consumption.zip
In [2]:
import numpy as np
import time
import csv
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from keras.models import Sequential
np.random.seed(1234)
The initial file contains lots of different pieces of data. We focus here on a single value: a house's Global_active_power history, minute by minute, for almost 4 years. This means roughly 2 million points. Some values are missing, which is why we try to load each value as a float and simply skip it if it is not a number (missing values are marked with a ?).
Also, if we do not want to load the entire dataset, there is a condition that stops loading once a certain ratio of the data has been read.
Once all the datapoints are loaded as one long time series, we have to split it into examples. One example is made of a sequence of 50 values: using the first 49, we try to predict the 50th. We do this for every minute given the 49 previous ones, so we use a sliding buffer of size 50.
Neural networks usually learn much better when the data is pre-processed (cf. Y. LeCun's 1995 paper, section 4.3). However, for time series we do not want the network to learn on data too far from the real world, so here we keep it simple and only center the data to have zero mean.
Now that the examples are formatted, we need to split them into train and test, input and target. Here we select 10% of the data as test and 90% for training. We also select the last value of each example as the target, the rest being the input sequence.
We shuffle the training examples so that we train in no particular order and the distribution is uniform (for the batch computation of the loss), but not the test set, so that we can compare our predictions with the real signal.
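As a toy illustration (not part of the original notebook), here is what the sliding buffer produces on a short series with a window of 3 instead of 50; the real code below does exactly this with sequence_length = 50 and then centers the result.
In [ ]:
import numpy as np

series = [1.0, 2.0, 3.0, 4.0, 5.0]
window = 3
examples = np.array([series[i:i + window] for i in range(len(series) - window)])
# examples == [[1., 2., 3.],
#              [2., 3., 4.]]  -> the first 2 values are the inputs, the last value is the target
examples -= examples.mean()   # center the data to zero mean, as described above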
In [20]:
path_to_dataset
sequence_length = 50
h5Path = '{}power_TF_s{}.h5'.format(dpath, sequence_length)
h5Path
Out[20]:
In [21]:
def data_power_consumption(path_to_dataset,
                           sequence_length=50,
                           ratio=1.0):
    max_values = ratio * 2049280
    with open(path_to_dataset, encoding="ISO-8859-1") as f:
        data = csv.reader(f, delimiter=";")
        power = []
        nb_of_values = 0
        for line in data:
            try:
                power.append(float(line[2]))
                nb_of_values += 1
            except ValueError:
                pass
            # 2049280.0 is the total number of valid values, i.e. ratio = 1.0
            if nb_of_values >= max_values:
                break

    print("Data loaded from csv. Formatting...")
    result = []
    for index in range(len(power) - sequence_length):
        result.append(power[index: index + sequence_length])
    result = np.array(result)  # shape (2049230, 50)

    result_mean = result.mean()
    result -= result_mean
    print("Shift : ", result_mean)
    print("Data : ", result.shape)

    row = round(0.9 * result.shape[0])
    train = result[:row, :]
    np.random.shuffle(train)
    X_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = result[row:, :-1]
    y_test = result[row:, -1]

    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

    return [X_train, y_train, X_test, y_test]

X_train, y_train, X_test, y_test = data_power_consumption(path_to_dataset, sequence_length)
In [37]:
fig = plt.figure(facecolor='white', figsize=(20,10))
ax = fig.add_subplot(111)
ax.plot(X_train[:100,0,0], label='X_train')
Out[37]:
In [22]:
X_train.shape, y_train.shape
X_train[0,:,0]
y_train[0]
Out[22]:
The list [ X_test[i][0] ] is the entire signal (minus the last 49 values) from which it was built, since we've used a 1-timestep sliding buffer.
predict(X_test[0]) is therefore the prediction for the 50th value, and its associated target is y_test[0]. Moreover, by construction, y_test[0] = X_test[1][48] = X_test[2][47] = ...
Then predict(X_test[1]) is the prediction of the 51st value, associated with y_test[1] as its target.
Therefore predict(X_test) is the predicted signal, one step ahead, and y_test is its target.
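A quick sanity check of this overlap (illustrative, not from the original notebook): the same value reappears one position earlier in each subsequent test example, since the test set is not shuffled.
In [ ]:
# each test window is the previous one shifted by one minute, so targets and inputs overlap
assert y_test[0] == X_test[1][48, 0] == X_test[2][47, 0]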
In [23]:
X_test.shape, y_test.shape
X_test[0,:10,0]
X_test[1,:10,0]
X_test[0][0]
Out[23]:
In [24]:
data = [X_train, y_train, X_test, y_test]
The last thing regards input formats. Read through the recurrent post to get more familiar with data dimensions. We reshape the inputs to have dimensions (#examples, #values in sequence, dim. of each value). Here each value is 1-dimensional: it is only one measure (the power consumption at time t). However, if we were to predict speed vectors, they could be 3-dimensional, for instance. Finally, we return X_train, y_train, X_test, y_test in a list (to be able to feed it as one single object to our run function).
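As a toy illustration of that layout (my addition, not from the original notebook): two flat sequences of 49 values become a 3-dimensional array with one feature per timestep.
In [ ]:
flat = np.zeros((2, 49))                   # 2 examples, 49 values each
lstm_input = np.reshape(flat, (2, 49, 1))  # add the per-timestep feature dimension
lstm_input.shape                           # (2, 49, 1) = (#examples, #values in sequence, dim. of each value)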
So here we are going to build our Sequential model. This means we are going to stack layers in this object. layers is the list containing the sizes of each layer. We therefore have a network with a 1-dimensional input, two hidden layers of sizes 50 and 100, and a 1-dimensional output layer.
After the model is initialized, we create a first layer, in this case an LSTM layer. Here we use the default parameters, so it behaves as a standard recurrent layer. Since our input is 1-dimensional, we declare that it should expect an input_dim of 1. Then we say we want layers[1] units in this layer. We also add 20% Dropout after this layer.
The last layer we use is a Dense layer (= feedforward). Since we are doing a regression, its activation is linear.
Lastly, we compile the model using mean squared error (again, standard for regression) and the RMSprop optimizer. See the MNIST example to learn more about RMSprop.
To conclude, the fact that return_sequences=True for the first layer means that its output is always fed to the second layer. Over time, all its activations can be seen as the sequence of predictions this first layer has made from the input sequence.
On the other hand, return_sequences=False for the second layer because its output is only fed to the next layer at the end of the sequence. It does not output a prediction for every timestep but one single prediction vector (of size layers[2]) for the whole input sequence.
The linear Dense layer is used to aggregate all the information from this prediction vector into one single value: the predicted 50th timestep of the sequence.
In our case, the first LSTM layer returns sequences because we want it to transfer its information both to the next layer (upwards in the chart) and to itself for the next timestep (arrow to the right).
For the second one, however, we only expect its last prediction in the sequence to be compared to the target.
This means that for inputs 0 to sequence_length - 2, the prediction is only passed to the layer itself for the next timestep and not as an input to the next (= output) layer; only the prediction for the (sequence_length - 1)-th input is passed forward to the Dense layer for the loss computation against the target.
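As a small illustration (my addition; it assumes the same Keras 1.x API used in this notebook), the effect of return_sequences on the output shapes can be checked directly:
In [ ]:
from keras.models import Sequential
from keras.layers.recurrent import LSTM

seq = Sequential()
seq.add(LSTM(50, input_shape=(49, 1), return_sequences=True))
seq.output_shape    # (None, 49, 50) -> one 50-vector per timestep

last = Sequential()
last.add(LSTM(100, input_shape=(49, 1), return_sequences=False))
last.output_shape   # (None, 100)    -> one single vector for the whole input sequence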
In [25]:
def build_model():
    model = Sequential()
    layers = [1, sequence_length, sequence_length * 2, 1]  # input dim, LSTM-1 units, LSTM-2 units, output dim
    model.add(LSTM(
        input_dim=layers[0],
        output_dim=layers[1],
        return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(
        layers[2],
        return_sequences=False))
    model.add(Dropout(0.2))
    model.add(Dense(
        output_dim=layers[3]))
    model.add(Activation("linear"))

    start = time.time()
    model.compile(loss="mse", optimizer="rmsprop")
    print("Compilation Time : ", time.time() - start)
    return model

model = build_model()
model.summary()
In [49]:
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(filepath=dpath + 'checkpoint-{epoch:02d}-{val_loss:.2f}.hdf5')

def run_network(model=None, data=None, epochs=1):
    global_start_time = time.time()
    ratio = 0.5
    sequence_length = 50
    path_to_dataset = 'household_power_consumption.txt'

    if data is None:
        print('Loading data... ')
        X_train, y_train, X_test, y_test = data_power_consumption(
            path_to_dataset, sequence_length, ratio)
    else:
        X_train, y_train, X_test, y_test = data

    print('\nData Loaded. Compiling...\n')
    if model is None:
        model = build_model()

    try:
        history = model.fit(
            X_train, y_train,
            batch_size=512, nb_epoch=epochs, validation_split=0.05,
            verbose=1, callbacks=[checkpoint])
    except KeyboardInterrupt:
        print('Training duration (s) : ', time.time() - global_start_time)
        return model, y_test, 0

    print('Training duration (s) : ', time.time() - global_start_time)
    return model, y_test
In [39]:
from IPython.core.debugger import Tracer

def predict_sequence_full(model, data, window_size):
    # shift the window by 1 new prediction each time, re-run predictions on the new window
    curr_frame = data[0]
    print("-D-", curr_frame.shape)
    predicted = []
    # loop over the entire test data
    for i in range(len(data)):
        predicted.append(model.predict(curr_frame[np.newaxis, :, :])[0, 0])  # get the scalar from shape (1, 1)
        curr_frame = curr_frame[1:]  # move the window
        # Tracer()()
        curr_frame = np.insert(curr_frame, [window_size - 2], predicted[-1], axis=0)  # fill the frame with the prediction
    return predicted
In [40]:
def plot_result(y_test, predicted, seq_len=100):
    try:
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.plot(y_test[:seq_len])
        plt.plot(predicted[:seq_len])
        plt.show()
    except Exception as e:
        print(str(e))
In [51]:
### prevent the dying jupyter notebook
sys.stdout = open('keras_output.txt', 'a+')
run_network(model, data, 2)
sys.stdout = stdout
Out[51]:
In [52]:
model.save_weights(h5Path)
In [53]:
print("asdf")
In [56]:
%%time
length = 150
predicted = predict_sequence_full(model, X_test[:length], sequence_length)
predicted[:5]
plot_result(y_test, predicted, length)
In [11]:
#model.load_weights(h5Path)
model.load_weights(dpath + 'power_TF_s150.h5')
In [ ]:
sys.stdout = open('keras_output.txt', 'a+')
run_network(model, data, 2)
sys.stdout = stdout
model.save_weights(h5Path)
In [16]:
%%time
length=1500
X_test[:length].shape
predicted = model.predict(X_test[:length])
plot_result(y_test, predicted, length)
In [17]:
%%time
predicted = predict_sequence_full(model, X_test[:length], sequence_length)
predicted[:5]
plot_result(y_test, predicted, length)
Out[17]:
In [19]:
predicted[:100]
Out[19]:
In [ ]:
model = Sequential()
model.add(Dense(500, input_shape=(TRAIN_SIZE, )))  # TRAIN_SIZE: length of the flat input window (not defined in this notebook)
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(250))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(optimizer='adam', loss='mse')
In [ ]:
model.save_weights(dpath+'power_TF_e1.h5')
In [ ]:
from keras.layers import Convolution1D, MaxPooling1D, Flatten  # not imported above

model = Sequential()
model.add(Convolution1D(input_shape=(TRAIN_SIZE, EMB_SIZE),  # TRAIN_SIZE timesteps of EMB_SIZE features (not defined in this notebook)
                        nb_filter=64,
                        filter_length=2,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
model.add(MaxPooling1D(pool_length=2))
model.add(Convolution1D(nb_filter=64,
                        filter_length=2,
                        border_mode='valid',
                        activation='relu',
                        subsample_length=1))
model.add(MaxPooling1D(pool_length=2))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(250))
model.add(Dropout(0.25))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('linear'))
In [ ]:
model = Sequential()
model.add(LSTM(output_dim=HIDDEN_RNN, return_sequences=True,
               input_shape=(TRAIN_SIZE, EMB_SIZE)))  # assumes TRAIN_SIZE timesteps of EMB_SIZE features, as in the conv model above
model.add(LSTM(output_dim=HIDDEN_RNN, return_sequences=False))
model.add(Dense(1))
model.add(Activation('linear'))