In [1]:
from __future__ import division, print_function
%matplotlib inline
from importlib import reload # Python 3
import utils; reload(utils)
from utils import *
In [2]:
path = "data/imdb/"
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)
We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.
In [3]:
from keras.datasets import imdb
idx = imdb.get_word_index()
This is the word list:
In [4]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]
Out[4]:
...and this is the mapping from id to word:
In [5]:
idx2word = {v: k for k, v in idx.items()}
We download the reviews using code copied from keras.datasets:
In [6]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)
In [7]:
len(x_train)
Out[7]:
Here's the first review. As you can see, the words have been replaced by ids. The ids can be looked up in idx2word.
In [8]:
', '.join(map(str, x_train[0]))
Out[8]:
The first word of the first review is 23022. Let's see what that is.
In [9]:
idx2word[23022]
Out[9]:
Here's the whole review, mapped from ids to words.
In [10]:
' '.join([idx2word[o] for o in x_train[0]])
Out[10]:
The labels are 1 for positive, 0 for negative.
In [11]:
labels_train[:10]
Out[11]:
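As a quick sanity check on the class balance (the IMDB training set contains equal numbers of positive and negative reviews, so 0.5 accuracy is the chance baseline):
np.array(labels_train).mean()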
Reduce the vocab size by setting rare words to the max index.
In [12]:
vocab_size = 5000
trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
# check the "max index" word out of curiosity
idx2word[vocab_size-1]
Out[12]:
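As an aside, the same truncation can be written more concisely with np.clip; this sketch is equivalent to the comprehensions above:
trn = [np.clip(np.array(s), 0, vocab_size - 1) for s in x_train]
test = [np.clip(np.array(s), 0, vocab_size - 1) for s in x_test]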
Look at the distribution of sentence lengths.
In [13]:
lens = np.array([*map(len, trn)])
(lens.max(), lens.min(), lens.mean())
Out[13]:
Pad (with zeros) or truncate each sentence to a consistent length.
In [14]:
seq_len = 500
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros, and longer ones are truncated (also from the start, which is Keras' default).
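To see exactly what that does, here's a toy example (not the real data) using Keras' defaults of padding='pre' and truncating='pre':
sequence.pad_sequences([[1, 2], [1, 2, 3, 4, 5]], maxlen=4, value=0)
# expected output:
# array([[0, 0, 1, 2],
#        [2, 3, 4, 5]], dtype=int32)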
In [15]:
trn.shape
Out[15]:
In [16]:
batch_size = 64
The simplest model that tends to give reasonable results is a single hidden layer net, so let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net; instead we use an embedding to replace each id with a vector of 32 (initially random) floats.
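Conceptually, the embedding is just a trainable lookup table with one row of floats per word id. Here's a rough numpy sketch (using a hypothetical random table, not the layer's actual weights):
lookup = np.random.normal(size=(vocab_size, 32))  # one 32-float row per word id
review_vectors = lookup[trn[0]]                   # shape (seq_len, 32): one vector per word
The Embedding layer below does the same lookup, but learns the table's values during training.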
In [17]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
In [18]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()
In [19]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=2, batch_size=batch_size)
Out[19]:
The Stanford paper that this dataset comes from cites a state-of-the-art accuracy (without unlabelled data) of 0.883. So we're short of that, but on the right track.
A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.
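For reference, here's roughly how the shapes flow through the conv model defined below (batch dimension omitted; MaxPooling1D's default pool size is 2):
# word ids         (500,)      seq_len integer ids per review
# Embedding        (500, 32)   one 32-float vector per word
# Conv1D(64, 5)    (500, 64)   64 filters over 5-word windows; 'same' padding keeps the length
# MaxPooling1D()   (250, 64)   pooling halves the sequence length
# Flatten          (16000,)
# Dense(100)       (100,)
# Dense(1)         (1,)        sigmoid probability that the review is positive
You can confirm these shapes with conv1.summary() once the model is defined.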
In [20]:
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    SpatialDropout1D(0.2),
    Dropout(0.2),
    Conv1D(64, 5, padding='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
In [21]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [22]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), epochs=4, batch_size=batch_size)
Out[22]:
That's well past the Stanford paper's accuracy - another win for CNNs!
In [23]:
conv1.save_weights(model_path + 'conv1.h5')
In [24]:
conv1.load_weights(model_path + 'conv1.h5')
You may want to look at wordvectors.ipynb before moving on.
In this section, we replicate the previous CNN, but using pre-trained embeddings.
In [27]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
            pickle.load(open(loc+'_words.pkl','rb')),
            pickle.load(open(loc+'_idx.pkl','rb')))
In [30]:
vecs, words, wordidx = load_vectors('data/glove/results/6B.50d')
The GloVe word ids and the IMDB word ids use different indexes. So we create a simple function that builds an embedding matrix using the indexes from IMDB, and the embeddings from GloVe (where they exist).
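For example (any common word will do), the same word maps to a different number in each index:
word = 'great'
idx[word], wordidx[word]  # IMDB id vs. GloVe row index - generally not the same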
In [31]:
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1, len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb /= 3
    return emb
In [32]:
emb = create_emb()
We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.
In [33]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len,
              weights=[emb], trainable=False),
    SpatialDropout1D(0.2),
    Dropout(0.25),
    Conv1D(64, 5, padding='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
In [34]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [35]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=2, batch_size=batch_size)
Out[35]:
We've already beaten our previous model! But let's fine-tune the embedding weights, especially since the words we couldn't find in GloVe just have random embeddings.
In [36]:
model.layers[0].trainable=True
In [37]:
model.optimizer.lr=1e-4
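Note: depending on your Keras version, changes to trainable only take effect after re-compiling, and assigning a plain float to model.optimizer.lr may not affect the already-compiled training step. A more robust sketch (bearing in mind that re-compiling resets the optimizer state):
import keras.backend as K
model.layers[0].trainable = True
# re-compile so the embedding weights are actually included in training, at a lower learning rate
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])
# alternatively, to change just the learning rate of an already-compiled model:
# K.set_value(model.optimizer.lr, 1e-4)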
In [38]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=1, batch_size=batch_size)
Out[38]:
As expected, that's given us a nice little boost. :)
In [39]:
model.save_weights(model_path+'glove50.h5')
This is an implementation of a multi-size CNN as shown in Ben Bowles' excellent blog post.
In [40]:
from keras.layers import Input, Concatenate
We use the functional API to create multiple conv layers of different sizes, and then concatenate them.
In [41]:
graph_in = Input((seq_len, 50))  # matches the output shape of the embedding layer
convs = []
for fsz in range(3, 6):
    x = Conv1D(64, fsz, padding='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Concatenate(axis=-1)(convs)
graph = Model(graph_in, out)
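With seq_len=500, 'same' padding and the default pool size of 2, each branch contributes a flattened 250*64 = 16,000-wide vector, so the concatenation feeds a 48,000-wide vector into the dense layers of the full model below. You can verify this with:
graph.summary()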
In [42]:
emb = create_emb()
We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.
In [43]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]),
    SpatialDropout1D(0.2),
    Dropout(0.2),
    graph,
    Dropout(0.5),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
])
In [44]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [45]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=2, batch_size=batch_size)
Out[45]:
Interestingly, I found that in this case I got the best results when I started with the embedding layer trainable, and then set it to non-trainable after a couple of epochs. I have no idea why!
In [46]:
model.layers[0].trainable=False
In [47]:
model.optimizer.lr=1e-5
In [48]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=2, batch_size=batch_size)
Out[48]:
This more complex architecture has given us another boost in accuracy.
We haven't covered this bit yet!
In [49]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
              embeddings_regularizer=l2(1e-6)),
    SpatialDropout1D(0.2),
    LSTM(100, implementation=2),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
In [50]:
model.fit(trn, labels_train, validation_data=(test, labels_test), epochs=5, batch_size=batch_size)
Out[50]: