Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras

Source: http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time and the task is to predict a category for the sequence.

What makes this problem difficult is that the sequences can vary in length, can be composed of a very large vocabulary of input symbols, and may require the model to learn the long-term context or dependencies between symbols in the input sequence.

We will map each word onto a 32-dimensional real-valued vector. We will also limit the total number of words that we are interested in modeling to the 5000 most frequent words, and zero out the rest. Finally, the sequence length (number of words) in each review varies, so we will constrain each review to 500 words, truncating longer reviews and padding shorter reviews with zero values.


In [1]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)


Using TensorFlow backend.

In [2]:
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)
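
A quick look at what load_data returns (a small inspection cell of my own, not part of the original tutorial): each review is a variable-length list of word indices and each label is 0 (negative) or 1 (positive).


In [ ]:
# inspect the loaded data: reviews are variable-length lists of word indices
print(X_train.shape, y_train.shape)
print(len(X_train[0]), len(X_train[1]))  # review lengths vary
print(X_train[0][:10], y_train[0])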

In [3]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
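
To see exactly what pad_sequences does (a toy example of my own, not from the original post): with the default settings it zero-pads short sequences on the left and truncates long sequences from the front, so both toy sequences below come back with length 5.


In [ ]:
# toy illustration of pad_sequences: left zero-padding and front truncation
toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(sequence.pad_sequences(toy, maxlen=5))
# expected:
# [[0 0 1 2 3]
#  [5 6 7 8 9]]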

In [4]:
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 100)           53200       embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             101         lstm_1[0][0]                     
====================================================================================================
Total params: 213301
____________________________________________________________________________________________________
None
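
As a check on the summary above (my own arithmetic, not from the original post): the Embedding layer has 5000 x 32 weights, the LSTM has 4 gates each with (32 + 100 + 1) x 100 weights, and the Dense layer has 100 weights plus a bias.


In [ ]:
# sanity check: reproduce the parameter counts reported by model.summary()
embedding_params = top_words * embedding_vecor_length       # 5000 * 32 = 160000
lstm_params = 4 * (embedding_vecor_length + 100 + 1) * 100  # 4 gates * (input + recurrent + bias) * units
dense_params = 100 * 1 + 1                                  # weights + bias
print(embedding_params, lstm_params, dense_params,
      embedding_params + lstm_params + dense_params)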

In [5]:
%%capture output
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=1, batch_size=64)


Out[5]:
<keras.callbacks.History at 0x11a837c50>

In [7]:
output.show()


Train on 25000 samples, validate on 25000 samples
Epoch 1/1
25000/25000 [==============================] - 273s - loss: 0.4669 - acc: 0.7865 - val_loss: 0.6816 - val_acc: 0.7006

In [8]:
# this step is redundant; the validation accuracy above was already computed during training
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Accuracy: 70.06%

You can see that this simple LSTM with little tuning achieves respectable results on the IMDB problem after only a single epoch of training; with a few more epochs it approaches the near state-of-the-art results reported in the original post. Importantly, this is a template that you can use to apply LSTM networks to your own sequence classification problems.
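
As a quick usage example (not part of the original post), the fitted model can score individual reviews directly; probabilities near 1.0 indicate predicted positive sentiment and probabilities near 0.0 negative sentiment.


In [ ]:
# hypothetical usage: predict sentiment probabilities for the first few test reviews
probs = model.predict(X_test[:3])
print(probs)  # values near 1.0 => positive, near 0.0 => negative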

LSTM For Sequence Classification With Dropout

Recurrent neural networks like the LSTM generally have the problem of overfitting.

Dropout can be applied between layers using the Dropout Keras layer. We can do this easily by adding new Dropout layers between the Embedding and LSTM layers and between the LSTM and Dense output layers. We can also add dropout to the input of the Embedding layer by using its dropout parameter. For example:


In [12]:
from keras.layers import Dropout
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length, dropout=0.2))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_4 (Embedding)          (None, 500, 32)       160000      embedding_input_4[0][0]          
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 500, 32)       0           embedding_4[0][0]                
____________________________________________________________________________________________________
lstm_3 (LSTM)                    (None, 100)           53200       dropout_3[0][0]                  
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 100)           0           lstm_3[0][0]                     
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 1)             101         dropout_4[0][0]                  
====================================================================================================
Total params: 213301
____________________________________________________________________________________________________
None

In [13]:
%%capture output
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=1, batch_size=64)


Out[13]:
<keras.callbacks.History at 0x7f04a261d908>

In [15]:
output.show()


Train on 25000 samples, validate on 25000 samples
Epoch 1/1
25000/25000 [==============================] - 274s - loss: 0.6491 - acc: 0.6177 - val_loss: 0.5711 - val_acc: 0.7223

In [ ]:
# Final evaluation of the model (left commented out; the validation accuracy is already reported during training above)
#scores = model.evaluate(X_test, y_test, verbose=0)
#print("Accuracy: %.2f%%" % (scores[1]*100))

We can see dropout having the desired impact on training, with a slightly slower trend in convergence and, in this case, a lower final accuracy. The model could probably use a few more epochs of training and may achieve higher skill (try it and see; a sketch follows below).
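
A minimal sketch of that suggestion (my own, not run in the original notebook): simply increase nb_epoch in the fit call above.


In [ ]:
# hypothetical: train the dropout model for a few more epochs
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=3, batch_size=64)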

Alternately, dropout can be applied to the input and recurrent connections of the memory units of the LSTM precisely and separately.

Keras provides this capability with parameters on the LSTM layer: dropout_W for configuring the input dropout and dropout_U for configuring the recurrent dropout. For example, we can modify the first example to add dropout to the input and recurrent connections as follows:


In [ ]:
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length, dropout=0.2))
model.add(LSTM(100, dropout_W=0.2, dropout_U=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [ ]:
%%capture output
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=1, batch_size=64)

In [ ]:
output.show()

Running this, you should see that the LSTM-specific dropout has a more pronounced effect on the convergence of the network than the layer-wise dropout. As above, the number of epochs was kept constant and could be increased to see if the skill of the model can be lifted further.

Dropout is a powerful technique for combating overfitting in your LSTM models, and it is a good idea to try both methods (or even combine them, as sketched below), but you may get better results with the gate-specific dropout provided in Keras.
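
As a sketch of combining the two approaches (my own variation, not from the original post), the layer-wise Dropout layer and the LSTM's gate dropout can be used together in one model.


In [ ]:
# hypothetical sketch: layer-wise Dropout combined with the LSTM's input/recurrent dropout
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length, dropout=0.2))
model.add(LSTM(100, dropout_W=0.2, dropout_U=0.2))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])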

LSTM and Convolutional Neural Network For Sequence Classification

Convolutional neural networks excel at learning the spatial structure in input data.

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be processed as sequences by an LSTM layer.

We can easily add a one-dimensional CNN and max pooling layers after the Embedding layer, which then feed the consolidated features to the LSTM. We can use a smallish set of 32 features with a small filter length of 3. The pooling layer can use the standard length of 2 to halve the feature map size.

For example, we would create the model as follows:


In [17]:
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_6 (Embedding)          (None, 500, 32)       160000      embedding_input_6[0][0]          
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 500, 32)       3104        embedding_6[0][0]                
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 250, 32)       0           convolution1d_1[0][0]            
____________________________________________________________________________________________________
lstm_4 (LSTM)                    (None, 100)           53200       maxpooling1d_1[0][0]             
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 1)             101         lstm_4[0][0]                     
====================================================================================================
Total params: 216405
____________________________________________________________________________________________________
None
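
As a check on the summary above (my own arithmetic, not from the original post): the Convolution1D layer has 3 x 32 x 32 weights plus 32 biases, giving 3104 parameters; MaxPooling1D halves the 500-step sequence to 250 steps with no parameters; and the LSTM's parameter count is unchanged because the pooled feature maps still have 32 channels per step. The shorter sequence into the LSTM is also why this model trains noticeably faster than the first one.


In [ ]:
# sanity check on the Convolution1D parameter count reported above
filter_length, input_channels, nb_filter = 3, 32, 32
print(filter_length * input_channels * nb_filter + nb_filter)  # 3104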

In [18]:
%%capture output
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=1, batch_size=64)


Out[18]:
<keras.callbacks.History at 0x7f049b2af9b0>

In [19]:
output.show()


Train on 25000 samples, validate on 25000 samples
Epoch 1/1
25000/25000 [==============================] - 154s - loss: 0.4286 - acc: 0.7804 - val_loss: 0.3324 - val_acc: 0.8626

In [ ]:
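# A hedged completion for this empty cell, mirroring the first example:
# final evaluation of the CNN+LSTM model (optional, since the validation
# accuracy is already reported during training above)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))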