In [1]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
In [2]:
# fix random seed for reproducibility
numpy.random.seed(7)
In [3]:
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
In [17]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
def create_XY(seq_length, alphabet, dataX, dataY):
    for i in range(0, len(alphabet) - seq_length, 1):
        seq_in = alphabet[i:i + seq_length]
        seq_out = alphabet[i + seq_length]
        dataX.append([char_to_int[char] for char in seq_in])
        dataY.append(char_to_int[seq_out])
        print(seq_in, '->', seq_out)
create_XY(seq_length, alphabet, dataX, dataY)
In [18]:
dataX, dataY
len(dataX)
Out[18]:
In [32]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
X.shape
X[0:3]
Out[32]:
In [7]:
# normalize
X = X / float(len(alphabet))
In [8]:
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
Let’s start off by designing a simple LSTM to learn how to predict the next character in the alphabet given the context of just one character.
We will frame the problem as a random collection of one-letter input to one-letter output pairs. As we will see, this is a difficult framing of the problem for the LSTM to learn.
Let’s define an LSTM network with 32 units and an output layer with one neuron per character and a softmax activation function for making predictions. Because this is a multi-class classification problem, we can use the log loss function (called “categorical_crossentropy” in Keras) and optimize the network using the Adam optimization algorithm.
The model is fit over 500 epochs with a batch size of 1.
In [10]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, nb_epoch=500, batch_size=1, verbose=2)
Out[10]:
In [11]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
In [31]:
# demonstrate some model predictions
def predict(dataX):
    for pattern in dataX:
        x = numpy.reshape(pattern, (1, len(pattern), 1))
        print(x.shape)
        x = x / float(len(alphabet))
        prediction = model.predict(x, verbose=0)
        index = numpy.argmax(prediction)
        result = int_to_char[index]
        seq_in = [int_to_char[value] for value in pattern]
        print(seq_in, "->", result)
predict(dataX)
We can see that this problem is indeed difficult for the network to learn.
The reason is that the LSTM units have no context to work with: each input-output pattern is shown to the network in a random order, and the state of the network is reset after each pattern (each batch, where a batch contains a single pattern).
This is an abuse of the LSTM network architecture, treating it like a standard multilayer Perceptron.
A popular approach to adding more context to data for multilayer Perceptrons is to use the window method.
This is where previous steps in the sequence are provided as additional input features to the network. We can try the same trick to provide more context to the LSTM network.
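Note that the window can be presented to the LSTM in two different ways: as one time step carrying several features (the classic multilayer Perceptron style window) or as several time steps of one feature each. The cells below use the time-step framing; the following sketch merely contrasts the two reshapes, assuming dataX already holds the 3-character windows built in the cells below (the names X_window and X_steps are illustrative only).
# Illustration only, assuming dataX holds 3-character windows (seq_length = 3).
# Feature-window framing: one time step with three features.
X_window = numpy.reshape(dataX, (len(dataX), 1, 3))
# Time-step framing (used in the cells below): three time steps, one feature each.
X_steps = numpy.reshape(dataX, (len(dataX), 3, 1))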
In [20]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
In [21]:
dataX, dataY = [], []
create_XY(seq_length, alphabet, dataX, dataY)
In [22]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
X.shape
X[0:3]
Out[22]:
In [23]:
# normalize
X = X / float(len(alphabet))
In [24]:
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
In [26]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, nb_epoch=500, batch_size=1, verbose=2)
Out[26]:
In [29]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
In [30]:
predict(dataX)
We have seen that we can break up our raw data into fixed-size sequences and that the LSTM can learn this representation, but only as random mappings of 3 characters to 1 character.
We have also seen that we can pervert the batch size to offer more of the sequence to the network, but only during training.
Ideally, we want to expose the network to the entire sequence and let it learn the inter-dependencies, rather than us define those dependencies explicitly in the framing of the problem.
We can do this in Keras by making the LSTM layers stateful and manually resetting the state of the network at the end of the epoch, which is also the end of the training sequence.
This is truly how LSTM networks are intended to be used. We find that, by allowing the network itself to learn the dependencies between the characters, we need a smaller network (half the number of units) and fewer training epochs (almost half).
We first need to define our LSTM layer as stateful. In so doing, we must explicitly specify the batch size as a dimension of the input shape. This also means that when we evaluate the network or make predictions, we must specify and adhere to this same batch size. This is not a problem here, as we are using a batch size of 1. It could introduce difficulties when the batch size is not one, as predictions would need to be made in batches and in sequence.
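As an illustration only (not part of this experiment), a stateful layer declared with a larger, hypothetical batch size of 3 would force every later fit, evaluate, and predict call to supply samples in multiples of 3; the model name demo below is purely illustrative.
# Hedged sketch only: a hypothetical stateful LSTM fixed to a batch size of 3.
# Single patterns could then no longer be predicted on their own.
demo = Sequential()
demo.add(LSTM(16, batch_input_shape=(3, 1, 1), stateful=True))
demo.add(Dense(26, activation='softmax'))
demo.compile(loss='categorical_crossentropy', optimizer='adam')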
In [38]:
seq_length = 1
dataX = []
dataY = []
create_XY(seq_length, alphabet, dataX, dataY)
In [42]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
In [44]:
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(16, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [45]:
for i in range(300):
    model.fit(X, y, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
Out[45]:
In [46]:
# summarize performance of the model
scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
model.reset_states()
print("Model Accuracy: %.2f%%" % (scores[1]*100))
In [47]:
# demonstrate some model predictions
seed = [char_to_int[alphabet[0]]]
for i in range(0, len(alphabet)-1):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
In [49]:
# demonstrate a random starting point
letter = "K"
seed = [char_to_int[letter]]
print("New start: ", letter)
for i in range(0, 5):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
In the previous section, we discovered that the Keras “stateful” LSTM was really only a shortcut to replaying the first n sequences, but did not really help us learn a generic model of the alphabet.
In this section we explore a variation of the “stateless” LSTM that learns random subsequences of the alphabet, in an effort to build a model that can be given arbitrary letters or subsequences of letters and predict the next letter of the alphabet.
Firstly, we change the framing of the problem. To keep things simple, we define a maximum input sequence length and set it to a small value like 5 to speed up training. This defines the maximum length of the alphabet subsequences that will be drawn for training. In extensions, it could just as easily be set to the full alphabet (26) or longer, if we allow looping back to the start of the sequence.
We also need to define the number of random sequences to create, in this case 1000. This too could be more or less; I expect fewer patterns are actually required.
In [51]:
# prepare the dataset of input to output pairs encoded as integers
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = numpy.random.randint(len(alphabet)-2)
    end = numpy.random.randint(start, min(start+max_len, len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print(sequence_in, '->', sequence_out)
In [55]:
from keras.preprocessing.sequence import pad_sequences
X = pad_sequences(dataX, max_len)
X.shape
X[:3]
Out[55]:
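Note that pad_sequences left-pads the shorter subsequences with zeros so that every input reaches max_len time steps. A small illustration, assuming the variables defined above are in scope:
# Illustration only: pad_sequences left-pads shorter inputs with zeros to max_len.
print(pad_sequences([[char_to_int[c] for c in "KL"]], maxlen=max_len))
# expected: [[ 0  0  0 10 11]]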
In [56]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(X, (X.shape[0], max_len, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
In [57]:
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1)))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [59]:
model.fit(X, y, nb_epoch=50, batch_size=batch_size, verbose=2)
Out[59]:
In [60]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
In [61]:
# demonstrate some model predictions
for i in range(20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = pad_sequences([pattern], maxlen=max_len)
    x = numpy.reshape(x, (1, max_len, 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
We can see that although the model did not learn the alphabet perfectly from the randomly generated subsequences, it did very well. The model was not tuned and may require more training or a larger network, or both (an exercise for the reader).
This is a natural extension of the “all sequential input examples in each batch” alphabet model learned above, in that it can handle ad hoc queries, but this time of arbitrary sequence length (up to the maximum length).
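As a final usage sketch (assuming the trained model and the helpers char_to_int, int_to_char, max_len, and pad_sequences from above are still in scope), the model can be queried with ad hoc subsequences of different lengths:
# Hedged usage sketch: query the padded-input model with ad hoc subsequences.
for query in ["K", "KLM", "ABCDE"]:
    pattern = [char_to_int[c] for c in query]
    x = pad_sequences([pattern], maxlen=max_len)
    x = numpy.reshape(x, (1, max_len, 1)) / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    print(query, "->", int_to_char[numpy.argmax(prediction)])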