In [1]:
from theano.sandbox import cuda


Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)

In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function


Using Theano backend.

In [3]:
model_path = 'data/imdb/models/'

Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.


In [4]:
from keras.datasets import imdb
idx = imdb.get_word_index()

This is the word list:


In [5]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]


Out[5]:
['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

...and this is the mapping from id to word


In [6]:
idx2word = {v: k for k, v in idx.iteritems()}

We download the reviews using code copied from keras.datasets:


In [ ]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)


Downloading data from https://s3.amazonaws.com/text-datasets/imdb_full.pkl
65298432/65552540 [============================>.] - ETA: 0s

In [ ]:
len(x_train)

Here's the first review. As you can see, the words have been replaced by ids, which can be looked up in idx2word.


In [ ]:
', '.join(map(str, x_train[0]))

The first word of the first review is 23022. Let's see what that is.


In [ ]:
idx2word[23022]

Here's the whole review, mapped from ids to words.


In [ ]:
' '.join([idx2word[o] for o in x_train[0]])

The labels are 1 for positive, 0 for negative.


In [26]:
labels_train[:10]


Out[26]:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
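
The training set is balanced between positive and negative reviews; a quick way to check the class counts (illustrative, not run here):


In [ ]:
np.unique(labels_train, return_counts=True)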

Reduce the vocab size by mapping rare words to the maximum index (vocab_size - 1).


In [27]:
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]

Look at the distribution of review lengths.


In [29]:
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())


Out[29]:
(2493, 10, 237.71364)
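
We could also plot a histogram to get a fuller picture of the distribution (illustrative, output not shown):


In [ ]:
import matplotlib.pyplot as plt  # may already be available via the utils import; explicit here for clarity
plt.hist(lens, bins=50);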

Pad (with zeros) or truncate each review to a consistent length.


In [30]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)

This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; those longer are truncated.


In [32]:
trn.shape


Out[32]:
(25000, 500)
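
One way to sanity-check the padding (illustrative, not run here) is to look at how much of the matrix is zero-padding, and at the start of a pre-padded row:


In [ ]:
# Fraction of entries that are padding, plus the first few entries of the first review
((trn == 0).mean(), trn[0, :10])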

Create simple models

Single hidden layer NN

The simplest model that tends to give reasonable results is a neural net with a single hidden layer, so let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net; instead we use an embedding, which replaces each word id with a vector of 32 (initially random) floats.


In [35]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [36]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           1600100     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 1)             101         dropout_1[0][0]                  
====================================================================================================
Total params: 1760201
____________________________________________________________________________________________________
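
The parameter counts are easy to verify by hand: the embedding is vocab_size x 32, the first dense layer is 16000 x 100 weights plus 100 biases, and the output layer is 100 weights plus 1 bias (a quick check, not run here):


In [ ]:
# 160000 + 1600100 + 101 = 1760201, matching the summary above
vocab_size*32 + (seq_len*32*100 + 100) + (100 + 1)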

In [19]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 1s - loss: 0.4651 - acc: 0.7495 - val_loss: 0.2830 - val_acc: 0.8804
Epoch 2/2
25000/25000 [==============================] - 1s - loss: 0.1969 - acc: 0.9265 - val_loss: 0.3195 - val_acc: 0.8694
Out[19]:
<keras.callbacks.History at 0x7f0e084f4210>

The Stanford paper that this dataset comes from cites a state-of-the-art accuracy (without using unlabelled data) of 0.883, so we're a little short of that, but on the right track.

Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.


In [37]:
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [45]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [278]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 4s - loss: 0.4984 - acc: 0.7250 - val_loss: 0.2922 - val_acc: 0.8816
Epoch 2/4
25000/25000 [==============================] - 4s - loss: 0.2971 - acc: 0.8836 - val_loss: 0.2681 - val_acc: 0.8911
Epoch 3/4
25000/25000 [==============================] - 4s - loss: 0.2568 - acc: 0.8983 - val_loss: 0.2551 - val_acc: 0.8947
Epoch 4/4
25000/25000 [==============================] - 4s - loss: 0.2427 - acc: 0.9029 - val_loss: 0.2558 - val_acc: 0.8947
Out[278]:
<keras.callbacks.History at 0x7f99cfa785d0>

That's well past the Stanford paper's accuracy - another win for CNNs!


In [281]:
conv1.save_weights(model_path + 'conv1.h5')

In [46]:
conv1.load_weights(model_path + 'conv1.h5')

Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.


In [17]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [18]:
vecs, words, wordidx = load_vectors('data/glove/results/6B.50d')
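
Here vecs is the matrix of 50-dimensional GloVe vectors, words is the corresponding word list, and wordidx maps each word to its row in vecs. A quick look at what we loaded (illustrative, not run here):


In [ ]:
vecs.shape, len(words), len(wordidx)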

The GloVe word ids and the IMDB word ids use different indexes, so we create a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).


In [73]:
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1, len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove (or it isn't a regular token), randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    # Scale all the embeddings down by a factor of 3 to keep the initial weights small
    emb /= 3
    return emb

In [21]:
emb = create_emb()
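
Most of our 5,000-word vocabulary should get a GloVe vector; a rough coverage check (illustrative, using the same test as create_emb):


In [ ]:
# How many of our vocab words (excluding the padding and rare-word slots) have a GloVe vector?
sum(1 for i in range(1, vocab_size-1)
    if idx2word[i] and re.match(r"^[a-zA-Z0-9\-]*$", idx2word[i]) and idx2word[i] in wordidx)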

We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.


In [87]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [88]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [90]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 4s - loss: 0.5217 - acc: 0.7172 - val_loss: 0.2942 - val_acc: 0.8815
Epoch 2/2
25000/25000 [==============================] - 4s - loss: 0.3169 - acc: 0.8719 - val_loss: 0.2662 - val_acc: 0.8978
Out[90]:
<keras.callbacks.History at 0x7f0de0f2d910>

We've already beaten our previous model! But let's fine-tune the embedding weights, especially since the words we couldn't find in GloVe only have random embeddings.


In [91]:
model.layers[0].trainable=True

In [92]:
model.optimizer.lr=1e-4
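
Note: depending on the Keras version, assigning a plain float to model.optimizer.lr may not affect the already-compiled training function. If the learning rate doesn't appear to change, setting the value of the existing backend variable is a safer alternative (a sketch, not necessarily what was run here):


In [ ]:
import keras.backend as K
K.set_value(model.optimizer.lr, 1e-4)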

In [93]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/1
25000/25000 [==============================] - 4s - loss: 0.2751 - acc: 0.8911 - val_loss: 0.2500 - val_acc: 0.9008
Out[93]:
<keras.callbacks.History at 0x7f0de0c4e0d0>

As expected, that's given us a nice little boost. :)


In [94]:
model.save_weights(model_path+'glove50.h5')

Multi-size CNN

This is an implementation of a multi-size CNN as shown in Ben Bowles' excellent blog post.


In [23]:
from keras.layers import Merge

We use the functional API to create multiple conv layers of different sizes, and then concatenate them.


In [132]:
graph_in = Input((seq_len, 50))   # the sub-model's input is the embedding output: (seq_len, 50)
convs = []
for fsz in range(3, 6):
    x = Convolution1D(64, fsz, border_mode='same', activation="relu")(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Merge(mode="concat")(convs)
graph = Model(graph_in, out)
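
Each branch flattens to (seq_len/2) * 64 = 16000 features after max pooling, so the concatenated output has 48000 features; we can confirm this (illustrative, not run here):


In [ ]:
graph.output_shape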

In [174]:
emb = create_emb()

We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.


In [175]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout(0.2),
    graph,
    Dropout(0.5),
    Dense(100, activation="relu"),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
    ])

In [176]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [177]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 11s - loss: 0.3997 - acc: 0.8207 - val_loss: 0.3032 - val_acc: 0.8943
Epoch 2/2
25000/25000 [==============================] - 11s - loss: 0.2882 - acc: 0.8832 - val_loss: 0.2646 - val_acc: 0.9029
Out[177]:
<keras.callbacks.History at 0x7f55b79b7990>

Interestingly, I found that in this case I got the best results when I started with the embedding layer trainable, and then set it to non-trainable after a couple of epochs. I have no idea why!


In [178]:
model.layers[0].trainable=False

In [179]:
model.optimizer.lr=1e-5

In [180]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 11s - loss: 0.2556 - acc: 0.8949 - val_loss: 0.2534 - val_acc: 0.9024
Epoch 2/2
25000/25000 [==============================] - 11s - loss: 0.2360 - acc: 0.9057 - val_loss: 0.2577 - val_acc: 0.9036
Out[180]:
<keras.callbacks.History at 0x7f55b74de110>

This more complex architecture has given us another boost in accuracy.

LSTM

We haven't covered this bit yet!


In [79]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
              W_regularizer=l2(1e-6), dropout=0.2),
    LSTM(100, consume_less='gpu'),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_13 (Embedding)         (None, 500, 32)       160064      embedding_input_13[0][0]         
____________________________________________________________________________________________________
lstm_13 (LSTM)                   (None, 100)           53200       embedding_13[0][0]               
____________________________________________________________________________________________________
dense_18 (Dense)                 (None, 1)             101         lstm_13[0][0]                    
====================================================================================================
Total params: 213365
____________________________________________________________________________________________________

In [80]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 100s - loss: 0.5007 - acc: 0.7446 - val_loss: 0.3475 - val_acc: 0.8531
Epoch 2/5
25000/25000 [==============================] - 100s - loss: 0.3524 - acc: 0.8507 - val_loss: 0.3602 - val_acc: 0.8453
Epoch 3/5
25000/25000 [==============================] - 99s - loss: 0.3750 - acc: 0.8342 - val_loss: 0.4758 - val_acc: 0.7710
Epoch 4/5
25000/25000 [==============================] - 99s - loss: 0.3238 - acc: 0.8652 - val_loss: 0.3094 - val_acc: 0.8725
Epoch 5/5
25000/25000 [==============================] - 99s - loss: 0.2681 - acc: 0.8920 - val_loss: 0.3018 - val_acc: 0.8776
Out[80]:
<keras.callbacks.History at 0x7f9a16b12c50>

In [ ]: