In [21]:
from theano.sandbox import cuda

In [22]:
%matplotlib inline
import utils
from utils import *

In [23]:
model_path = 'data/imdb/models/'
%mkdir -p $model_path

Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.


In [24]:
from keras.datasets import imdb
idx = imdb.get_word_index()
type(idx)


Out[24]:
dict

In [25]:
# Let's look at the word list
"""
sorted(iterable, *, key=None, reverse=False):
    built-in function; Return a new sorted list from the items in iterable.
""" 
idx_list = sorted(idx, key=idx.get)
print(idx_list[:5])

from itertools import islice
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
print(take(5, idx.items()))


['the', 'and', 'a', 'of', 'to']
[('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('woodman', 39925), ('sonja', 16816)]

Create a mapping dict from id to word


In [26]:
idx2word = {v:k for k, v in idx.items()}

Get the reviews file


In [27]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
"""
get_file(fname, origin,...):
    keras function; downloads a file from a URL if it not already in the cache.

"""
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)

In [28]:
print(type(x_train))
print(len(x_train))
# print the 1st review
', '.join(map(str, x_train[0]))


<class 'list'>
25000
Out[28]:
'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

In [29]:
# Let's map the idx to words
' '.join(idx2word[o] for o in x_train[0])


Out[29]:
"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

The labels are 1 for positive, 0 for negative


In [30]:
labels_train[:10]


Out[30]:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Reduce vocab size by setting rare words to max index


In [31]:
vocab_size = 5000
trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]

Let's look at the distribution of the sentences length


In [32]:
lens = np.array(list(map(len, trn)))
(lens.max(), lens.min(), lens.mean())


Out[32]:
(2493, 10, 237.71364)

Pad or truncate each sentence to make consistent length of 500


In [33]:
seq_len = 500
"""
keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32',
    padding='pre', truncating='pre', value=0.)
    Transform a list of num_samples sequences (lists of scalars) into a 2D Numpy array of shape
    (num_samples, num_timesteps). num_timesteps is either the maxlen argument if provided, 
    or the length of the longest sequence otherwise. Sequences that are shorter than 
    num_timesteps are padded with value at the end. Sequences longer than num_timesteps are 
    truncated so that it fits the desired length. Position where padding or truncation happens 
    is determined by padding or truncating, respectively.
"""
trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
trn.shape


Out[33]:
(25000, 500)

Create simple models

Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so intead we use an embedding to replace them with a vector of 32 floating numbers for each word in the vocab

Note here that the final sigmoid function is the same as softmax becuase out output is binary. Whenver we use 'binary_crossentryop', we use 'sigmoid' as activation


In [34]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
    ])
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()


/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           1600100     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 1)             101         dropout_1[0][0]                  
====================================================================================================
Total params: 1,760,201
Trainable params: 1,760,201
Non-trainable params: 0
____________________________________________________________________________________________________

In [35]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 4s - loss: 0.4924 - acc: 0.7200 - val_loss: 0.3204 - val_acc: 0.8603
Epoch 2/2
25000/25000 [==============================] - 4s - loss: 0.2122 - acc: 0.9204 - val_loss: 0.2964 - val_acc: 0.8749
Out[35]:
<keras.callbacks.History at 0x7fc07c311be0>

Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN since a sequence of word is 1D


In [36]:
conv1 = Sequential([
        Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
        Dropout(0.2),
        # look at 5 words at a time
        Convolution1D(64, 5, border_mode='same', activation='relu'),
        Dropout(0.2),
        MaxPooling1D(),
        Flatten(),
        Dense(100, activation='relu'),
        Dropout(0.7),
        Dense(1, activation='sigmoid')
    ])
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)


/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 12s - loss: 0.4960 - acc: 0.7296 - val_loss: 0.2897 - val_acc: 0.8827
Epoch 2/4
25000/25000 [==============================] - 12s - loss: 0.2935 - acc: 0.8818 - val_loss: 0.2660 - val_acc: 0.8935
Epoch 3/4
25000/25000 [==============================] - 12s - loss: 0.2559 - acc: 0.9019 - val_loss: 0.2660 - val_acc: 0.8902
Epoch 4/4
25000/25000 [==============================] - 12s - loss: 0.2423 - acc: 0.9038 - val_loss: 0.2640 - val_acc: 0.8939
Out[36]:
<keras.callbacks.History at 0x7fc07678e630>

In [37]:
conv1.summary()


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_2 (Embedding)          (None, 500, 32)       160000      embedding_input_2[0][0]          
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 500, 32)       0           embedding_2[0][0]                
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 500, 64)       10304       dropout_2[0][0]                  
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 500, 64)       0           convolution1d_1[0][0]            
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 250, 64)       0           dropout_3[0][0]                  
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 16000)         0           maxpooling1d_1[0][0]             
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 100)           1600100     flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 100)           0           dense_3[0][0]                    
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 1)             101         dropout_4[0][0]                  
====================================================================================================
Total params: 1,770,505
Trainable params: 1,770,505
Non-trainable params: 0
____________________________________________________________________________________________________

$10304 = 5*32*64 + 64$

Each filter is a 5x32 matrix


In [38]:
conv1.save_weights(model_path + 'conv1.h5')

In [39]:
conv1.load_weights(model_path + 'conv1.h5')

Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on. In this section, we replicate the previous CNN, but using pre-trained embeddings. You should always use pre-trained vectors


In [40]:
def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [41]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb'), encoding='latin1'),
        pickle.load(open(loc+'_idx.pkl','rb'), encoding='latin1'))

In [42]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))


Untaring file...

The glove word ids and imdb word ids use different indexes. So we create a simple function that creates an embedding matrix using the indexes from imdb, and the embeddings from glove (where they exist).


In [43]:
def create_emb(vecs, vocab_size):
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))
    for i in range(1, len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word):
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb

In [44]:
emb = create_emb(vecs, vocab_size)

We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.


In [47]:
model = Sequential([
        Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, 
              weights=[emb]),
        Dropout(0.25),
        Convolution1D(64, 5, border_mode='same', activation='relu'),
        Dropout(0.25),
        MaxPooling1D(),
        Flatten(),
        Dense(100, activation='relu'),
        Dropout(0.7),
        Dense(1, activation='sigmoid')])

model.layers[1].trainable=False

In [48]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)


/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
25000/25000 [==============================] - 13s - loss: 0.5128 - acc: 0.7245 - val_loss: 0.3054 - val_acc: 0.8819
Epoch 2/2
25000/25000 [==============================] - 13s - loss: 0.3159 - acc: 0.8715 - val_loss: 0.2744 - val_acc: 0.8956
Out[48]:
<keras.callbacks.History at 0x7fc09e650c50>

We already have beaten our previous model! But let's fine-tune the embedding weights - especially since the words we couldn't find in glove just have random embeddings


In [49]:
model.layers[0].trainable = True

In [51]:
model.optimizer.lr = 1e-4
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 13s - loss: 0.2592 - acc: 0.8979 - val_loss: 0.2521 - val_acc: 0.8971
Epoch 2/4
25000/25000 [==============================] - 13s - loss: 0.2370 - acc: 0.9054 - val_loss: 0.2535 - val_acc: 0.8955
Epoch 3/4
25000/25000 [==============================] - 13s - loss: 0.2256 - acc: 0.9115 - val_loss: 0.2531 - val_acc: 0.8978
Epoch 4/4
25000/25000 [==============================] - 13s - loss: 0.2184 - acc: 0.9142 - val_loss: 0.2531 - val_acc: 0.8996
Out[51]:
<keras.callbacks.History at 0x7fc09e650f98>

In [52]:
model.save_weights(model_path+'glove50.h5')

Multi-size CNN


In [53]:
from keras.layers import Merge

How can we further improve?

Well, let's try not just using one size of convolution, but a few sizes of convolution layers.

We use the functional API to create multiple conv layer of different sizes, and then concatenate them


In [55]:
graph_in = Input((vocab_size, 50))
convs = []
for fsz in range(3, 6):
    x = Convolution1D(64, fsz, border_mode='same', activation='relu')(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Merge(mode='concat')(convs)
graph = Model(graph_in, out)

In [57]:
emb = create_emb(vecs, vocab_size)

We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers


In [61]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout (0.2),
    graph,
    Dropout (0.5),
    Dense (100, activation="relu"),
    Dropout (0.7),
    Dense (1, activation='sigmoid')
    ])
model.layers[1].trainable=False

In [62]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)


/home/ubuntu/anaconda2/envs/py36/lib/python3.6/site-packages/theano/tensor/basic.py:2146: UserWarning: theano.tensor.round() changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy. Use the Theano flag `warn.round=False` to disable this warning.
  "theano.tensor.round() changed its default from"
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 30s - loss: 0.4126 - acc: 0.8139 - val_loss: 0.3025 - val_acc: 0.8824
Epoch 2/4
25000/25000 [==============================] - 30s - loss: 0.3062 - acc: 0.8742 - val_loss: 0.2947 - val_acc: 0.8749
Epoch 3/4
25000/25000 [==============================] - 30s - loss: 0.2736 - acc: 0.8890 - val_loss: 0.2844 - val_acc: 0.8794
Epoch 4/4
25000/25000 [==============================] - 30s - loss: 0.2543 - acc: 0.8981 - val_loss: 0.2522 - val_acc: 0.8999
Out[62]:
<keras.callbacks.History at 0x7fc095d542b0>

In [65]:
model.layers[0].trainable=True
model.optimizer.lr=1e-5
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 29s - loss: 0.1866 - acc: 0.9269 - val_loss: 0.2570 - val_acc: 0.8961
Epoch 2/4
25000/25000 [==============================] - 29s - loss: 0.1888 - acc: 0.9249 - val_loss: 0.2655 - val_acc: 0.8899
Epoch 3/4
25000/25000 [==============================] - 29s - loss: 0.1761 - acc: 0.9315 - val_loss: 0.2739 - val_acc: 0.8863
Epoch 4/4
25000/25000 [==============================] - 29s - loss: 0.1712 - acc: 0.9335 - val_loss: 0.2911 - val_acc: 0.8752
Out[65]:
<keras.callbacks.History at 0x7fc09589b860>

LSTM