Example adapted from: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
Required installs:
My install process was:
In [1]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, LSTM, GRU, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard
from keras import backend
# fix random seed for reproducibility
np.random.seed(7)
import shutil
import os
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:97% !important; }</style>")) #Set width of iPython cells
In [81]:
# load the dataset but only keep the top n words, zero the rest
# docs at: https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data
top_words = 5000
start_char=1
oov_char=2
index_from=3
(X_train, y_train), (X_test, y_test) = imdb.load_data(
    num_words=top_words, start_char=start_char, oov_char=oov_char, index_from=index_from)
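With these arguments, every encoded review begins with start_char, out-of-vocabulary words become oov_char, and real word indices are shifted up by index_from. A quick sanity check of those loader conventions (an added assumption check, not part of the original run):
In [ ]:
# every review should begin with the start marker, and num_words should cap the indices
assert all(x[0] == start_char for x in X_train)
assert max(max(x) for x in X_train) < top_words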
In [82]:
print(X_train.shape)
print(y_train.shape)
In [83]:
print(len(X_train[0]))
print(len(X_train[1]))
In [84]:
print(X_test.shape)
print(y_test.shape)
In [85]:
X_train[0]
Out[85]:
In [86]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
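pad_sequences pre-pads and pre-truncates by default, so the zeros land at the front of short reviews and long reviews lose their beginning. A quick sketch of that assumption:
In [ ]:
sequence.pad_sequences([[1, 2, 3]], maxlen=5)           # -> [[0, 0, 1, 2, 3]]
sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5)  # -> [[2, 3, 4, 5, 6]]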
In [87]:
print(X_train.shape)
print(y_train.shape)
In [88]:
print(len(X_train[0]))
print(len(X_train[1]))
In [89]:
print(X_test.shape)
print(y_test.shape)
In [90]:
X_train[0]
Out[90]:
In [91]:
y_train[0:20] # first 20 sentiment labels
Out[91]:
In [92]:
word_index = imdb.get_word_index()
inv_word_index = np.empty(len(word_index)+index_from+3, dtype=object)  # np.object is removed in recent NumPy
for k, v in word_index.items():
    inv_word_index[v+index_from] = k
inv_word_index[0]='<pad>'
inv_word_index[1]='<start>'
inv_word_index[2]='<oov>'
In [93]:
word_index['ai']
Out[93]:
In [94]:
inv_word_index[16942+index_from]
Out[94]:
In [95]:
inv_word_index[:50]
Out[95]:
In [96]:
def toText(wordIDs):
    s = ''
    for i in range(len(wordIDs)):
        if wordIDs[i] != 0:
            w = str(inv_word_index[wordIDs[i]])
            s += w + ' '
    return s
In [97]:
for i in range(5):
    print()
    print(str(i) + ') sentiment = ' + ('negative' if y_train[i]==0 else 'positive'))
    print(toText(X_train[i]))
Sequential guide, compile() and fit()
Embedding The embedding layer works like an efficient one-hot encoding of the word index followed by a dense layer of size embedding_vector_length, as the sketch below illustrates.
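A minimal sketch of that equivalence in plain NumPy (illustrative names only, not the Keras internals):
In [ ]:
import numpy as np
# looking up row word_id in the embedding matrix gives the same result as
# multiplying a one-hot vector by that matrix (one-hot encoding + dense layer)
vocab_size, embed_dim = 10, 4
E = np.random.rand(vocab_size, embed_dim)  # the layer's weight matrix
word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
assert np.allclose(one_hot @ E, E[word_id])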
Dropout (1/3 down the page)
"model.compile(...) sets up the "adam" optimizer, similar to SGD but with some gradient averaging that works like a larger batch size to reduce the variability in the gradient from one small batch to the next. Each SGD step is of batch_size training records. Adam is also a variant of momentum optimizers.
'binary_crossentropy' is the loss function used most often with logistic regression and, for only two classes, is equivalent to a softmax with categorical cross-entropy (a quick numeric check follows).
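A quick numeric check of that claim in plain NumPy: a sigmoid over one logit z is the same as a softmax over the logits [0, z], so the two losses agree.
In [ ]:
import numpy as np
z, y = 1.3, 1                                      # one logit, true label 1
p = 1 / (1 + np.exp(-z))                           # sigmoid
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
logits = np.array([0.0, z])                        # equivalent two-class logits
softmax = np.exp(logits) / np.exp(logits).sum()
cce = -np.log(softmax[y])                          # categorical cross-entropy
assert np.isclose(bce, cce)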
In the "Output Shape" column, None is a placeholder for the variable number of training records to be supplied later.
In [98]:
backend.clear_session()
embedding_vector_length = 5
rnn_vector_length = 150
#activation = 'relu'
activation = 'sigmoid'
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Dropout(0.2))
#model.add(LSTM(rnn_vector_length, activation=activation))
model.add(GRU(rnn_vector_length, activation=activation))
model.add(Dropout(0.2))
model.add(Dense(1, activation=activation))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
To monitor training, run TensorBoard from a terminal: tensorboard --logdir=/data/kaggle-tensorboard
In [99]:
log_dir = '/data/kaggle-tensorboard'
shutil.rmtree(log_dir, ignore_errors=True)
os.makedirs(log_dir)
tbCallBack = TensorBoard(log_dir=log_dir, histogram_freq=0, write_graph=True, write_images=True)
full_history=[]
In [100]:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=8, batch_size=64, callbacks=[tbCallBack])
full_history += history.history['loss']
In [101]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
print( 'embedding_vector_length = ' + str( embedding_vector_length ))
print( 'rnn_vector_length = ' + str( rnn_vector_length ))
Accuracy % | Type | max val acc epoch | embedding_vector_length | RNN state size | Dropout |
---|---|---|---|---|---|
88.46 * | GRU | 6 | 5 | 150 | 0.2 (after Embedding and LSTM) |
88.4 | GRU | 4 | 5 | 100 | no dropout |
88.32 | GRU | 7 | 32 | 100 | |
88.29 | GRU | 8 | 5 | 200 | no dropout |
88.03 | GRU | >6 | 20 | 40 | 0.3 (after Embedding and LSTM) |
87.93 | GRU | 4 | 32 | 50 | 0.2 (after LSTM) |
87.66 | GRU | 5 | 32 | 50 | 0.3 (after Embedding and LSTM) |
87.60 | GRU | 5 | 5 | 50 | no dropout |
87.5 | GRU | 8 | 10 | 20 | no dropout |
87.5 | GRU | 5 | 32 | 50 | |
87.46 | GRU | 8 | 16 | 100 | |
< 87 | LSTM | 9-11 | 32 | 100 | |
86.5 | GRU | >10 | 5 | 10 | no dropout |
In [60]:
history.history # todo: add graph of all 4 values with history
Out[60]:
In [61]:
plt.plot(history.history['loss'])  # training loss for the most recent fit() call
plt.yscale('log')
plt.show()
plt.plot(full_history)             # training loss accumulated across all fit() calls
plt.yscale('log')
plt.show()
In [62]:
import re
words_only = r'[^\s!,.?\-":;0-9]+'
re.findall(words_only, "Some text to, tokenize. something's.Something-else?".lower())
Out[62]:
In [63]:
def encode(reviewText):
    words = re.findall(words_only, reviewText.lower())
    reviewIDs = [start_char]
    for word in words:
        index = word_index.get(word, oov_char - index_from) + index_from  # defaults to oov_char for missing words
        if index >= top_words:  # load_data only kept indices < num_words
            index = oov_char
        reviewIDs.append(index)
    return reviewIDs
toText(encode('To code and back again. ikkyikyptangzooboing ni !!'))
Out[63]:
In [106]:
# reviews from:
# https://www.pluggedin.com/movie-reviews/solo-a-star-wars-story
# http://badmovie-badreview.com/category/bad-reviews/
user_reviews = ["This movie is horrible",
"This wasn't a horrible movie and I liked it actually",
"This movie was great.",
"What a waste of time. It was too long and didn't make any sense.",
"This was boring and drab.",
"I liked the movie.",
"I didn't like the movie.",
"I like the lead actor but the movie as a whole fell flat",
"I don't know. It was ok, some good and some bad. Some will like it, some will not like it.",
"There are definitely heroic seeds at our favorite space scoundrel's core, though, seeds that simply need a little life experience to nurture them to growth. And that's exactly what this swooping heist tale is all about. You get a yarn filled with romance, high-stakes gambits, flashy sidekicks, a spunky robot and a whole lot of who's-going-to-outfox-who intrigue. Ultimately, it's the kind of colorful adventure that one could imagine Harrison Ford's version of Han recalling with a great deal of flourish … and a twinkle in his eye.",
"There are times to be politically correct and there are times to write things about midget movies, and I’m afraid that sharing Ankle Biters with the wider world is an impossible task without taking the low road, so to speak. There are horrible reasons for this, all of them the direct result of the midgets that this film contains, which makes it sound like I am blaming midgets for my inability to regulate my own moral temperament but I like to think I am a…big…enough person (geddit?) to admit that the problem rests with me, and not the disabled.",
"While Beowulf didn’t really remind me much of Beowulf, it did reminded me of something else. At first I thought it was Van Helsing, but that just wasn’t it. It only hit me when Beowulf finally told his backstory and suddenly even the dumbest of the dumb will realise that this is a simple ripoff of Blade. The badass hero, who is actually born from evil, now wants to destroy it, while he apparently has to fight his urges to become evil himself (not that it is mentioned beyond a single reference at the end of Beowulf) and even the music fits into the same range. Sadly Beowulf is not even nearly as interesting or entertaining as its role model. The only good aspects I can see in Beowulf would be the stupid beginning and Christopher Lamberts hair. But after those first 10 minutes, the movie becomes just boring and you don’t care much anymore.",
"You don't frighten us, English pig-dogs! Go and boil your bottoms, son of a silly person! I blow my nose at you, so-called Arthur King! You and all your silly English Knnnnnnnn-ighuts!!!"
]
X_user = np.array([encode(review) for review in user_reviews], dtype=object)  # ragged lists need dtype=object in recent NumPy
X_user
Out[106]:
In [107]:
X_user_pad = sequence.pad_sequences(X_user, maxlen=max_review_length)
X_user_pad
Out[107]:
In [108]:
for row in X_user_pad:
    print()
    print(toText(row))
In [109]:
user_scores = model.predict(X_user_pad)
is_positive = user_scores >= 0.5 # I'm an optimist
for i in range(len(user_reviews)):
    print('\n%.2f %s:' % (user_scores[i][0], 'positive' if is_positive[i] else 'negative') + ' ' + user_reviews[i])