Bidirectional LSTM - IMDB sentiment classification

see https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py


In [1]:
WEIGHTS_FILEPATH = 'imdb_bidirectional_lstm.hdf5'
MODEL_ARCH_FILEPATH = 'imdb_bidirectional_lstm.json'

In [2]:
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Input, Bidirectional
from keras.datasets import imdb
from keras.callbacks import EarlyStopping, ModelCheckpoint

import json


Using TensorFlow backend.

In [3]:
max_features = 20000
maxlen = 200  # truncate/pad reviews to this many words (vocabulary limited to the top max_features most common words)

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
y_train = np.array(y_train)
y_test = np.array(y_test)


Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
X_train shape: (25000, 200)
X_test shape: (25000, 200)
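
The padded arrays are plain integer matrices, 0-padded on the left by default. A quick sanity check (a minimal sketch, not part of the original notebook):

In [ ]:
print(X_train[0][:10])                   # leading zeros where a review is shorter than maxlen
print('class balance:', y_train.mean())  # IMDB is balanced, so this should be ~0.5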

In [6]:
model = Sequential()
model.add(Embedding(max_features, 64, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
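
The Sequential stack maps each review to a sequence of 64-dim embeddings, runs an LSTM over it in both directions (the default merge mode concatenates the two 32-unit outputs into a 64-dim vector), and ends in a single sigmoid unit. The same architecture in the functional API, as a minimal sketch using the otherwise unused Input import (not needed for the rest of the notebook):

In [ ]:
from keras.models import Model

inputs = Input(shape=(maxlen,), dtype='int32')
x = Embedding(max_features, 64)(inputs)
x = Bidirectional(LSTM(32))(x)  # concat merge mode: 2 * 32 = 64 dims
x = Dropout(0.5)(x)
outputs = Dense(1, activation='sigmoid')(x)
functional_model = Model(inputs, outputs)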

In [7]:
# Model saving callback
checkpointer = ModelCheckpoint(filepath=WEIGHTS_FILEPATH, monitor='val_acc', verbose=1, save_best_only=True)

# Early stopping
early_stopping = EarlyStopping(monitor='val_acc', verbose=1, patience=2)

# train
batch_size = 128
epochs = 10
model.fit(X_train, y_train, 
          validation_data=(X_test, y_test),
          batch_size=batch_size, epochs=epochs, verbose=2,
          callbacks=[checkpointer, early_stopping])


Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 00000: val_acc improved from -inf to 0.87036, saving model to imdb_bidirectional_lstm.hdf5
120s - loss: 0.4683 - acc: 0.7663 - val_loss: 0.3254 - val_acc: 0.8704
Epoch 2/10
Epoch 00001: val_acc improved from 0.87036 to 0.87388, saving model to imdb_bidirectional_lstm.hdf5
120s - loss: 0.2298 - acc: 0.9170 - val_loss: 0.3051 - val_acc: 0.8739
Epoch 3/10
Epoch 00002: val_acc did not improve
121s - loss: 0.1453 - acc: 0.9518 - val_loss: 0.3638 - val_acc: 0.8592
Epoch 4/10
Epoch 00003: val_acc did not improve
118s - loss: 0.1109 - acc: 0.9657 - val_loss: 0.4408 - val_acc: 0.8518
Epoch 5/10
Epoch 00004: val_acc did not improve
120s - loss: 0.0761 - acc: 0.9775 - val_loss: 0.4541 - val_acc: 0.8577
Epoch 00004: early stopping
Out[7]:
<keras.callbacks.History at 0x7fbe0ed09eb8>
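
Since save_best_only=True, the weights on disk are from epoch 2 (val_acc 0.8739), not from the last epoch run before early stopping. A minimal sketch to restore the best checkpoint and re-score it:

In [ ]:
# Reload the best weights saved by the checkpointer and re-evaluate.
model.load_weights(WEIGHTS_FILEPATH)
loss, acc = model.evaluate(X_test, y_test, batch_size=batch_size, verbose=0)
print('restored val_loss: %.4f, val_acc: %.4f' % (loss, acc))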

In [8]:
with open(MODEL_ARCH_FILEPATH, 'w') as f:
    f.write(model.to_json())
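
To verify the exported files, the model can be rebuilt from the JSON architecture plus the HDF5 weights alone (a minimal sketch):

In [ ]:
from keras.models import model_from_json

with open(MODEL_ARCH_FILEPATH) as f:
    restored_model = model_from_json(f.read())
restored_model.load_weights(WEIGHTS_FILEPATH)
restored_model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])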

Sample data


In [9]:
word_index = imdb.get_word_index()


Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json

In [10]:
word_dict = {idx: word for word, idx in word_index.items()}  # invert the word -> index mapping

In [11]:
sample = []
for idx in X_train[0]:
    # indices 0, 1, 2 are reserved (padding, start-of-sequence, out-of-vocabulary);
    # real words are shifted by index_from=3 in imdb.load_data
    if idx >= 3:
        sample.append(word_dict[idx - 3])
    elif idx == 2:
        sample.append('-')  # placeholder for out-of-vocabulary words
' '.join(sample)


Out[11]:
"and you could just imagine being there robert - is an amazing actor and now the same being director - father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the - of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

In [12]:
with open('imdb_dataset_word_index_top20000.json', 'w') as f:
    f.write(json.dumps({word: idx for word, idx in word_index.items() if idx < max_features}))

In [13]:
with open('imdb_dataset_word_dict_top20000.json', 'w') as f:
    f.write(json.dumps({idx: word for word, idx in word_index.items() if idx < max_features}))
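
One caveat for consumers of these files: JSON object keys are always strings, so the index-keyed dict comes back with string keys (a minimal check):

In [ ]:
with open('imdb_dataset_word_dict_top20000.json') as f:
    word_dict_loaded = json.load(f)
print(word_dict_loaded['1'])  # keys are strings after the JSON round trip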

In [14]:
sample_test_data = []
for i in np.random.choice(range(X_test.shape[0]), size=1000, replace=False):
    sample_test_data.append({'values': X_test[i].tolist(), 'label': y_test[i].tolist()})
    
with open('imdb_dataset_test.json', 'w') as f:
    f.write(json.dumps(sample_test_data))
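
A consumer of imdb_dataset_test.json can round-trip the 1000-sample subset and score it like so (a minimal sketch):

In [ ]:
with open('imdb_dataset_test.json') as f:
    sample_test = json.load(f)

X_sample = np.array([s['values'] for s in sample_test])
y_sample = np.array([s['label'] for s in sample_test])
print(model.evaluate(X_sample, y_sample, verbose=0))  # [loss, accuracy]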
