In [1]:
from theano.sandbox import cuda
In [2]:
%matplotlib inline
import utils #; reload(utils)
from utils import *
from __future__ import division, print_function
import pickle
import utils_MDR
In [3]:
model_path = 'data/imdb/models/'
MDR: and needs by GPU-fan code, too...
In [4]:
import utils_MDR
from utils_MDR import *
We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.
In [5]:
from keras.datasets import imdb
idx = imdb.get_word_index()
This is the word list:
In [6]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]
Out[6]:
...and this is the mapping from id to word
In [7]:
## idx2word = {v: k for k, v in idx.iteritems()} ## Py 2.7
idx2word = {v: k for k, v in idx.items()} ## Py 3.x
We download the reviews using code copied from keras.datasets:
In [8]:
path = get_file('imdb_full.pkl',
origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)
In [9]:
len(x_train)
Out[9]:
Here's the 1st review. As you see, the words have been replaced by ids. The ids can be looked up in idx2word.
In [10]:
', '.join(map(str, x_train[0]))
Out[10]:
The first word of the first review is 23022. Let's see what that is.
In [11]:
idx2word[23022]
Out[11]:
Here's the whole review, mapped from ids to words.
In [12]:
' '.join([idx2word[o] for o in x_train[0]])
Out[12]:
The labels are 1 for positive, 0 for negative.
In [13]:
labels_train[:10]
Out[13]:
Reduce vocab size by setting rare words to max index.
In [12]:
vocab_size = 5000
trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
Look at distribution of lengths of sentences.
In [13]:
## create an array of 'len'...gths
# lens = np.array(map(len, trn)) ## only works in Py2.x, not 3.x ...
## 'map in Python 3 return an iterator, while map in Python 2 returns a list'
## (https://stackoverflow.com/questions/35691489/error-in-python-3-5-cant-add-map-results-together)
# This is a quick fix - not really a proper P3x approach.
lens = np.array(list(map(len, trn))) ## wrapped a list around it
(lens.max(), lens.min(), lens.mean())
Out[13]:
Pad (with zero) or truncate each sentence to make consistent length.
In [14]:
seq_len = 500
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros, those greater are truncated.
In [15]:
trn.shape
Out[15]:
The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.
In [17]:
model = Sequential([
Embedding(vocab_size, 32, input_length=seq_len),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')])
In [18]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()
In [19]:
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
set_gpu_fan_speed(0)
The stanford paper that this dataset is from cites a state of the art accuracy (without unlabelled data) of 0.883. So we're short of that, but on the right track.
A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.
In [20]:
conv1 = Sequential([
Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
Dropout(0.2),
Convolution1D(64, 5, border_mode='same', activation='relu'),
Dropout(0.2),
MaxPooling1D(),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.7),
Dense(1, activation='sigmoid')])
In [21]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [22]:
set_gpu_fan_speed(90)
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
set_gpu_fan_speed(0)
That's well past the Stanford paper's accuracy - another win for CNNs!
In [23]:
conv1.save_weights(model_path + 'conv1.h5')
In [32]:
conv1.load_weights(model_path + 'conv1.h5')
You may want to look at wordvectors.ipynb before moving on.
In this section, we replicate the previous CNN, but using pre-trained embeddings.
In [104]:
def load_vectors(loc):
return (load_array(loc+'.dat'),
pickle.load(open(loc+'_words.pkl','rb')),
pickle.load(open(loc+'_idx.pkl','rb')))
In [119]:
#vecs, words, wordidx = load_vectors('data/glove/results/6B.50d') ## JH's original
vecs, words, wordidx = load_vectors('data/glove/results/6B.100d') ## MDR's experiment
The glove word ids and imdb word ids use different indexes. So we create a simple function that creates an embedding matrix using the indexes from imdb, and the embeddings from glove (where they exist).
In [120]:
def create_emb():
n_fact = vecs.shape[1]
emb = np.zeros((vocab_size, n_fact))
for i in range(1,len(emb)):
word = idx2word[i]
if word and re.match(r"^[a-zA-Z0-9\-]*$", word):
src_idx = wordidx[word]
emb[i] = vecs[src_idx]
else:
# If we can't find the word in glove, randomly initialize
emb[i] = normal(scale=0.6, size=(n_fact,))
# This is our "rare word" id - we want to randomly initialize
emb[-1] = normal(scale=0.6, size=(n_fact,))
emb/=3
return emb
In [121]:
emb = create_emb()
We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.
In [122]:
model = Sequential([
#Embedding(vocab_size, 50,
Embedding(vocab_size, 100,
input_length=seq_len, dropout=0.2, weights=[emb], trainable=False),
Dropout(0.25), ## JH (0.25)
Convolution1D(64, 5, border_mode='same', activation='relu'),
Dropout(0.25), ## JH (0.25)
MaxPooling1D(),
Flatten(),
Dense(100, activation='relu'),
Dropout(0.3), ## JH (0.7)
Dense(1, activation='sigmoid')])
In [123]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
I get better results with the 100d embedding than I do with the 50d embedding, after 4 epochs. - MDR
In [124]:
# model.optimizer.lr = 1e-3 ## MDR: added to the 50d for marginally faster training than I was getting
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
set_gpu_fan_speed(0)
model.save_weights(model_path+'glove100_wt1.h5') ## care, with the weight count!
In [70]:
model.load_weights(model_path+'glove50_wt1.h5')
In [129]:
model.load_weights(model_path+'glove100_wt1.h5')
MDR: so my initial results were nowhere near as good, but we're not overfitting yet.
MDR: my results are nowhere near JH's! [] Investigate this!
We already have beaten our previous model! But let's fine-tune the embedding weights - especially since the words we couldn't find in glove just have random embeddings.
In [126]:
model.layers[0].trainable=True
In [127]:
model.optimizer.lr=1e-4
In [128]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
Out[128]:
"As expected, that's given us a nice little boost. :)" - MDR: actually made it worse! For both 50d and 100d cases!
In [75]:
model.save_weights(model_path+'glove50.h5')
This is an implementation of a multi-size CNN as shown in Ben Bowles' excellent blog post.
In [130]:
from keras.layers import Merge
We use the functional API to create multiple conv layers of different sizes, and then concatenate them.
In [136]:
#graph_in = Input ((vocab_size, 50))
graph_in = Input ((vocab_size, 100)) ## MDR - for 100d embedding
convs = [ ]
for fsz in range (3, 6):
x = Convolution1D(64, fsz, border_mode='same', activation="relu")(graph_in)
x = MaxPooling1D()(x)
x = Flatten()(x)
convs.append(x)
out = Merge(mode="concat")(convs)
graph = Model(graph_in, out)
In [137]:
emb = create_emb()
We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.
In [138]:
model = Sequential ([
#Embedding(vocab_size, 50,
Embedding(vocab_size, 100,
input_length=seq_len, dropout=0.2, weights=[emb]),
Dropout (0.2),
graph,
Dropout (0.5),
Dense (100, activation="relu"),
Dropout (0.7),
Dense (1, activation='sigmoid')
])
In [139]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
MDR: it turns out that there's no improvement, in this expt, for using the 100d embedding over the 50d.
In [140]:
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
set_gpu_fan_speed(0)
Interestingly, I found that in this case I got best results when I started the embedding layer as being trainable, and then set it to non-trainable after a couple of epochs. I have no idea why!
MDR: (does it limit overfitting, maybe?) ... anyway, my running of the same code achieved nearly the same results, so much happier.
In [82]:
model.save_weights(model_path+'glove50_conv2_wt1.h5')
In [88]:
model.load_weights(model_path+'glove50_conv2_wt1.h5')
MDR: I want to test this statement from JH, above, by running another couple of epochs. First let's reduce the LR.
In [89]:
model.optimizer.lr = 1e-5
In [90]:
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
set_gpu_fan_speed(0)
Okay, so that didn't help. Reload the weights from before.
In [95]:
model.load_weights(model_path+'glove50_conv2_wt1.h5')
MDR: following JH's plan, from this point.
In [96]:
model.layers[0].trainable=False
In [97]:
model.optimizer.lr=1e-5
In [98]:
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
set_gpu_fan_speed(0)
Out[98]:
This more complex architecture has given us another boost in accuracy.
MDR: although I didn't see a huge advantage, personally.
We haven't covered this bit yet!
MDR: so, there's no preloaded embedding, here - it's a fresh, random set?
In [ ]:
model = Sequential([
Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
W_regularizer=l2(1e-6), dropout=0.2),
LSTM(100, consume_less='gpu'),
Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
MDR: hang on! These summary() outputs look quite different, to me! Not least that this is apparently the 13th lstm he's produced (in this session?) - and yet I've fot a higher numbered dense layer than him. Eh?
But then I reach better results in fewer epochs than he does, this time around. Compare the times, and the more stable convergence in my results. Weird. Still, that's my first LSTM!!
In [100]:
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)
set_gpu_fan_speed(0)
In [101]:
model.save_weights(model_path+'glove50_lstm1_wt1.h5')
MDR: let's see if it's possible to improve on that.
In [102]:
model.optimizer.lr = 1e-5
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)
Out[102]:
MDR: Conclusion: that may be all that's achievable with this dataset, of course. It's sentiment, after all!
God knows whether this will work. Let's see if I can create an LSTM layer on top of pretrained embeddings...
In [150]:
model2 = Sequential([
Embedding(vocab_size, 100, input_length = seq_len,
#mask_zero=True, W_regularizer=l2(1e-6), ## used in lstm above - not needed?
dropout=0.2, weights=[emb], trainable = False),
LSTM(100, consume_less = 'gpu'),
Dense(100, activation = 'sigmoid')
])
In [151]:
model2.summary()
In [152]:
model2.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
In [153]:
set_gpu_fan_speed(90)
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
set_gpu_fan_speed(0)
MDR: OMFG. It needs one epoch to be 90% accurate.
In [ ]: