Wayne Nixalo - 2017-Jun-12 17:27
Code-Along of Lesson 5 JNB.
Lesson 5 NB: https://github.com/fastai/courses/blob/master/deeplearning1/nbs/lesson5.ipynb
In [1]:
import theano
In [2]:
%matplotlib inline
import sys, os
sys.path.insert(1, os.path.join('utils'))
import utils; reload(utils)
from utils import *
from __future__ import division, print_function
In [3]:
model_path = 'data/imdb/models/'
%mkdir -p $model_path # -p : make intermediate directories as needed
In [4]:
from keras.datasets import imdb
idx = imdb.get_word_index()
This is the word list:
In [5]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]
Out[5]:
...and this is the mapping from id to word:
In [6]:
idx2word = {v: k for k, v in idx.iteritems()}
We download the reviews using code copied from keras.datasets:
In [7]:
# getting the dataset directly because keras's version makes some changes
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)
In [8]:
# apparently cpickle can be x1000 faster than pickle? hmm
len(x_train)
Out[8]:
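Quick aside on that cPickle note (my own sketch, not from the lesson): in Python 2 the usual trick is to import the C-accelerated pickler as a drop-in replacement for pickle.
In [ ]:
# try the C implementation first; fall back to the pure-Python module
try:
    import cPickle as pickle
except ImportError:
    import pickle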
Here's the 1st review. As you see, the words have been replaced by ids. The ids can be looked up in idx2word.
In [9]:
', '.join(map(str, x_train[0]))
Out[9]:
The first word of the first review is 23022. Let's see what that is.
In [10]:
idx2word[23022]
Out[10]:
In [11]:
x_train[0]
Out[11]:
Here's the whole review, mapped from ids to words.
In [12]:
' '.join([idx2word[o] for o in x_train[0]])
Out[12]:
The labels are 1 for positive, 0 for negative
In [13]:
labels_train[:10]
Out[13]:
Reduce vocabulary size by setting rare words to max index.
In [14]:
vocab_size = 5000
trn = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
Look at the distribution of sentence lengths.
In [15]:
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())
Out[15]:
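To actually see the distribution rather than just the summary stats, a quick histogram sketch (assuming plt is available from the utils import above):
In [ ]:
# histogram of review lengths - most reviews are a few hundred words long
plt.hist(lens, bins=50)
plt.xlabel('review length (words)')
plt.ylabel('count')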
Pad (with zero) or truncate each sentence to make consistent length.
In [16]:
seq_len = 500
# keras.preprocessing.sequence
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer ones are truncated.
In [17]:
trn.shape
Out[17]:
In [ ]:
trn[0]
The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.
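To make the Embedding idea concrete: it's just a trainable lookup table - each word id indexes a row of a (vocab_size, 32) matrix. A tiny numpy sketch of the lookup (illustrative only, not the Keras internals):
In [ ]:
# toy embedding lookup: each word id selects one row of a random matrix
toy_emb = np.random.normal(size=(vocab_size, 32))  # (vocab, n_factors)
toy_ids = trn[0]                                   # one padded review: 500 ids
toy_vectors = toy_emb[toy_ids]                     # -> shape (500, 32)
toy_vectors.shape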
In [18]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
In [19]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()
In [21]:
# model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[21]:
In [21]:
# redoing on Linux
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[21]:
The Stanford paper that this dataset is from cites a state of the art accuracy (without unlabelled data) of 0.883. So we're short of that, but on the right track.
In [22]:
# the embedding layer is always the first step in every NLP model
# --> after that layer you don't have words anymore, just vectors
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
In [31]:
conv1.summary()
In [23]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [24]:
# conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
Out[24]:
In [24]:
# redoing on Linux w/ GPU
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)
Out[24]:
That's well past the Stanford paper's accuracy - another win for CNNs!
Heh, the above takes a lot longer than 4s on my Mac.
In [25]:
conv1.save_weights(model_path + 'conv1.h5')
# conv1.load_weights(model_path + 'conv1.h5')
In [26]:
def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d' : '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove.6B/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'https://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)
# not able to download from above, so using code from wordvectors_CodeAlong.ipynb to load
def get_glove(name):
    with open(path + 'glove.' + name + '.txt', 'r') as f:
        lines = [line.split() for line in f]
    words = [d[0] for d in lines]
    vecs = np.stack([np.array(d[1:], dtype=np.float32) for d in lines])
    wordidx = {o: i for i, o in enumerate(words)}
    save_array(res_path + name + '.dat', vecs)
    pickle.dump(words, open(res_path + name + '_words.pkl', 'wb'))
    pickle.dump(wordidx, open(res_path + name + '_idx.pkl', 'wb'))
    # # adding return filename
    # return res_path + name + '.dat'

def load_glove(loc):
    return (load_array(loc + '.dat'),
            pickle.load(open(loc + '_words.pkl', 'rb')),
            pickle.load(open(loc + '_idx.pkl', 'rb')))
In [27]:
def load_vectors(loc):
    return (load_array(loc + '.dat'),
            pickle.load(open(loc + '_words.pkl', 'rb')),
            pickle.load(open(loc + '_idx.pkl', 'rb')))
# pickle is a binary (byte-stream) serializer for Python objects
In [32]:
# this isn't working, so instead..
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))
In [ ]:
# trying to load the glove data I downloaded directly, before:
vecs, words, wordix = load_vectors('data/glove.6B/' + 'glove.' + '6B.50d' + '.txt')
# vecs, words, wordix = load_vectors('data/glove.6B/' + 'glove.' + '6B.50d' + '.tgz')
# not successful. get_file(..) returns filepath as '.tar' ? as .tgz doesn't work.
# ??get_file # keras.utils.data_utils.get_file(..)
In [28]:
# that doesn't work either, but the method from the wordvectors JNB worked, so:
path = 'data/glove.6B/'
# res_path = path + 'results/'
res_path = 'data/imdb/results/'
%mkdir -p $res_path
# this way not working; so will pull vecs,words,wordidx manually:
# vecs, words, wordidx = load_vectors(get_glove('6B.50d'))
get_glove('6B.50d')
vecs, words, wordidx = load_glove(res_path + '6B.50d')
# NOTE: yay it worked..!..
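For reference, skipping the save/load round-trip entirely: loading the raw GloVe text file straight into the three objects looks roughly like this (my own sketch - assumes data/glove.6B/glove.6B.50d.txt is already on disk; the result is equivalent to what load_glove returns).
In [ ]:
# each line of the raw file is: word, then 50 floats
lines = [l.split() for l in open(path + 'glove.6B.50d.txt')]
words = [l[0] for l in lines]
vecs = np.array([l[1:] for l in lines], dtype=np.float32)
wordidx = {w: i for i, w in enumerate(words)}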
In [29]:
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))
    for i in xrange(1, len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))
    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb /= 3
    return emb
In [30]:
emb = create_emb()
# this embedding matrix is now the glove word vectors, indexed according to
# the imdb dataset.
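A quick sanity check of my own (assuming 'good' has an imdb id below vocab_size and appears in the GloVe vocab): a common word's row in emb should just be its GloVe vector scaled by the /3 above.
In [ ]:
i = idx['good']  # imdb id for 'good'
np.allclose(emb[i], vecs[wordidx['good']] / 3)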
We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.
In [31]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2,
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
# this is a copy of the previous code, with the addition of the pre-trained
# GloVe embeddings as the Embedding layer's weights.
# We figure those weights are pretty good, so we'll initially set trainable
# to False; we'll fine-tune later since some words are missing from glove, etc.
In [32]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [60]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[60]:
In [33]:
# running on GPU
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[33]:
We've already beaten our previous model! But let's fine-tune the embedding weights - especially since the words we couldn't find in glove just have random embeddings.
In [34]:
model.layers[0].trainable=True
In [63]:
model.optimizer.lr=1e-4
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[63]:
In [35]:
# running on GPU
model.optimizer.lr=1e-4
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[35]:
In [36]:
# the above was supposed to be 3 total epochs but I did 4 by mistake
model.save_weights(model_path+'glove50.h5')
This is an implementation of a multi-size CNN as shown in Ben Bowles' blog post.
In [37]:
from keras.layers import Merge
We use the functional API to create multiple conv layers of different sizes, and then concatenate them.
In [38]:
graph_in = Input((vocab_size, 50))
convs = []
for fsz in xrange(3, 6):
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Merge(mode='concat')(convs)
graph = Model(graph_in, out)
In [39]:
emb = create_emb()
We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.
In [40]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout(0.2),
    graph,
    Dropout(0.5),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
In [41]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
In [70]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[70]:
In [42]:
# on GPU
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[42]:
Interestingly, I found that in this case I got best results when I started the embedding layer as being trainable, and then set it to non-trainable after a couple of epochs. I have no idea why! hmmm
In [43]:
model.layers[0].trainable=False
In [44]:
model.optimizer.lr=1e-5
In [74]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[74]:
In [45]:
# on gpu
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)
Out[45]:
In [46]:
# save the multi-size CNN we just trained
model.save_weights(model_path + 'conv1_1.h5')
# conv1.load_weights(model_path + 'conv1.h5')
In [48]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
              W_regularizer=l2(1e-6), dropout=0.2),
    LSTM(100, consume_less='gpu'),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
In [49]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)
# NOTE: if this took ~100s/epoch even on Titan Xs or Tesla K80s ... definitely use the Linux machine for this
Out[49]:
In [50]:
# save the LSTM we just trained
model.save_weights(model_path + 'LSTM_1.h5')