Text Models with Keras

Dense vector word embeddings

A dense vector word embedding represents each word with a numerical vector whose components are mostly nonzero. This is in contrast to sparse vector, or bag-of-words, embeddings, which use very high-dimensional vectors (the size of the vocabulary) with most components zero.

Dense vector models also capture word meaning, such that similar words (car and automobile) have similar numerical vectors. In a sparse vector representation, similar words probably have completely different numerical vectors. Dense vectors are formed as a by-product of some prediction task. The quality of the embedding depends on both the prediction task and the data set upon which the prediction task was trained.
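
For a toy contrast (the numbers here are made up, purely for illustration):

import numpy as np

VOCAB = ["car", "automobile", "banana"]          # toy 3-word vocabulary

# Sparse / one-hot: one dimension per vocabulary word, a single 1, the rest zeros
one_hot_car = np.zeros(len(VOCAB))
one_hot_car[VOCAB.index("car")] = 1.0            # array([1., 0., 0.])

# Dense: low-dimensional, mostly nonzero, learned so similar words end up close
dense_car        = np.array([0.21, -0.43, 0.77, 0.05])   # made-up values
dense_automobile = np.array([0.19, -0.40, 0.80, 0.02])

# Cosine similarity of the dense vectors for "car" and "automobile" is close to 1
cosine = dense_car.dot(dense_automobile) / (
    np.linalg.norm(dense_car) * np.linalg.norm(dense_automobile))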

When we use word embeddings in our deep learning models, we refer to their birthplace as the embedding layer. Sometimes we don't actually care about the trained predictor (the skip-gram and CBOW models); we're just interested in the embedding by-product for use elsewhere. Other times, we need an embedding layer to represent words in a larger model, such as a sentiment classifier; there, we may opt for pre-trained dense vectors.
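
When we go the pre-trained route, the vectors are loaded into the embedding layer as initial weights. A minimal sketch, assuming we have already built embedding_matrix, a numpy array of shape (vocab_size + 1, embedding_dim) with one pre-trained vector per word id (embedding_matrix, vocab_size, and embedding_dim are hypothetical names, not defined anywhere in this notebook):

from keras.layers.embeddings import Embedding

pretrained_layer = Embedding(
    input_dim=vocab_size + 1,         # hypothetical; word ids start at 1 (see below)
    output_dim=embedding_dim,
    weights=[embedding_matrix],       # initialize with the pre-trained vectors
    trainable=False)                  # freeze them; set True to fine-tune instead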

When we don't care about the trained model and just want to create meaningful, dense word vectors, there are two popular prediction models: skip-gram and CBOW (continuous bag of words). Word embeddings constructed in this manner are termed word2vec or w2v. We will also look at another more recent method, fastText. In any case, we've first got to construct training data from our corpus. The exact procedure depends on the model.

Keras Models

Let's have a look at the Keras models we'll use in this section. (I'm keeping the code as markup since we haven't defined any of the parameters yet. We'll run this code after we develop input data and parameters.)


In [4]:
from IPython.display import Image

skip-gram


In [5]:
Image('diagrams/skip-gram.png')


Out[5]:
word1 = Input(shape=(1,), dtype='int64', name='word1')
word2 = Input(shape=(1,), dtype='int64', name='word2')

shared_embedding = Embedding(
    input_dim=VOCAB_SIZE+1, 
    output_dim=DENSEVEC_DIM, 
    input_length=1, 
    embeddings_constraint = unit_norm(),
    name='shared_embedding')

embedded_w1 = shared_embedding(word1)
embedded_w2 = shared_embedding(word2)

w1 = Flatten()(embedded_w1)
w2 = Flatten()(embedded_w2)

dotted = Dot(axes=1, name='dot_product')([w1, w2])

prediction = Dense(1, activation='sigmoid', name='output_layer')(dotted)

sg_model = Model(inputs=[word1, word2], outputs=prediction)

fastText

ft_model = Sequential()

ft_model.add(Embedding(
    input_dim = MAX_FEATURES,
    output_dim = EMBEDDING_DIMS,
    input_length= MAXLEN))

ft_model.add(GlobalAveragePooling1D())

ft_model.add(Dense(1, activation='sigmoid'))

Models and Training data construction

The first step for CBOW and skip-gram

Our training corpus is a collection of sentences, Tweets, emails, comments, or even longer documents; in short, anything composed of words. Each word takes its turn being the "target" word, and we collect the n words before it and the n words that follow it. This n is referred to as the window size. If our example document is the sentence "I love deep learning" and the window size is 1, we'd get:

  • [I], love
  • I, [love], deep
  • love, [deep], learning
  • deep, [learning]

The target word is shown in brackets.
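
In plain Python, this windowing looks roughly like the following sketch (a hypothetical helper, not part of Keras):

def context_windows(words, window_size=1):
    """For each target word, collect the window_size words before and after it."""
    windows = []
    for i, target in enumerate(words):
        before = words[max(0, i - window_size):i]
        after = words[i + 1:i + 1 + window_size]
        windows.append((target, before + after))
    return windows

context_windows("I love deep learning".split(), window_size=1)
# [('I', ['love']),
#  ('love', ['I', 'deep']),
#  ('deep', ['love', 'learning']),
#  ('learning', ['deep'])]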

Skip-gram model training data

Skip-gram means forming word pairs from the target word and each word in its window. These become the "positive" (1) samples for the skip-gram algorithm. In our "I love deep learning" example, we'd get (eliminating repeated pairs):

  • (I, love) = 1
  • (love, deep) = 1
  • (deep, learning) = 1

To create negative samples (0), we pair random vocabulary words with the target word. Yes, it's possible to unluckily pick a negative sample that usually appears around the target word.

For our prediction task, we'll take the dot product of the two word vectors in each pair (a small step away from cosine similarity). The training will keep tweaking the word vectors to push this product as close to unity as possible for our positive samples, and toward zero for our negative samples.
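
As a toy illustration of the objective (made-up, roughly unit-length vectors; the real model below adds a trainable scale and bias in its output layer):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w_deep     = np.array([0.6, 0.8, 0.00])    # made-up word vectors
w_learning = np.array([0.7, 0.7, 0.14])
w_banana   = np.array([-0.9, 0.1, 0.40])

sigmoid(w_deep.dot(w_learning))   # positive pair: training pushes this toward 1
sigmoid(w_deep.dot(w_banana))     # negative (random) pair: training pushes this toward 0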

Happily, Keras includes a function for creating skip-grams from text. It even does the negative sampling for us.


In [1]:
from keras.preprocessing.sequence import skipgrams
from keras.preprocessing.text import Tokenizer, text_to_word_sequence


Using TensorFlow backend.

In [2]:
text1 = "I love deep learning."
text2 = "Read Douglas Adams as much as possible."

In [3]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])

In [4]:
word2id = tokenizer.word_index
word2id.items()


Out[4]:
[('love', 2),
 ('adams', 3),
 ('i', 4),
 ('possible', 5),
 ('deep', 6),
 ('read', 7),
 ('as', 1),
 ('much', 8),
 ('douglas', 9),
 ('learning', 10)]

Note that word IDs are numbered from 1, not 0.


In [5]:
id2word = { wordid: word for word, wordid in word2id.items()}
id2word


Out[5]:
{1: 'as',
 2: 'love',
 3: 'adams',
 4: 'i',
 5: 'possible',
 6: 'deep',
 7: 'read',
 8: 'much',
 9: 'douglas',
 10: 'learning'}

In [6]:
encoded_text = [word2id[word] for word in text_to_word_sequence(text1)]
encoded_text


Out[6]:
[4, 2, 6, 10]

In [9]:
[word2id[word] for word in text_to_word_sequence(text2)]


Out[9]:
[7, 9, 3, 1, 8, 1, 5]

In [10]:
sg = skipgrams(encoded_text, vocabulary_size=len(word2id.keys()), window_size=1)
sg


Out[10]:
([[2, 6],
  [2, 6],
  [2, 9],
  [6, 2],
  [6, 3],
  [6, 10],
  [6, 1],
  [2, 4],
  [4, 2],
  [10, 6],
  [10, 8],
  [4, 2]],
 [1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0])

In [11]:
for i in range(len(sg[0])):
    print "({0},{1})={2}".format(id2word[sg[0][i][0]], id2word[sg[0][i][1]], sg[1][i])


(love,deep)=1
(love,deep)=0
(love,douglas)=0
(deep,love)=1
(deep,adams)=0
(deep,learning)=1
(deep,as)=0
(love,i)=1
(i,love)=1
(learning,deep)=1
(learning,much)=0
(i,love)=0

Model parameters


In [12]:
VOCAB_SIZE = len(word2id.keys())
VOCAB_SIZE


Out[12]:
10

In [13]:
DENSEVEC_DIM = 50

Model build


In [14]:
import keras

In [18]:
from keras.layers.embeddings import Embedding
from keras.constraints import unit_norm
from keras.layers.merge import Dot
from keras.layers.core import Activation
from keras.layers.core import Flatten

from keras.layers import Input, Dense
from keras.models import Model

Create a dense vector for each word in the pair. The output of Embedding has shape (batch_size, sequence_length, output_dim) which in our case is (batch_size, 1, DENSEVEC_DIM). We'll use Flatten to get rid of that pesky middle dimension (1), so going into the dot product we'll have shape (batch_size, DENSEVEC_DIM).


In [16]:
word1 = Input(shape=(1,), dtype='int64', name='word1')
word2 = Input(shape=(1,), dtype='int64', name='word2')

In [19]:
shared_embedding = Embedding(
    input_dim=VOCAB_SIZE+1, 
    output_dim=DENSEVEC_DIM, 
    input_length=1, 
    embeddings_constraint = unit_norm(),
    name='shared_embedding')

embedded_w1 = shared_embedding(word1)
embedded_w2 = shared_embedding(word2)

w1 = Flatten()(embedded_w1)
w2 = Flatten()(embedded_w2)

dotted = Dot(axes=1, name='dot_product')([w1, w2])

prediction = Dense(1, activation='sigmoid', name='output_layer')(dotted)

In [20]:
sg_model = Model(inputs=[word1, word2], outputs=prediction)

In [21]:
sg_model.compile(optimizer='adam', loss='mean_squared_error')

At this point you can check out how the data flows through your compiled model.


In [22]:
sg_model.layers


Out[22]:
[<keras.engine.topology.InputLayer at 0x116536310>,
 <keras.engine.topology.InputLayer at 0x11650ebd0>,
 <keras.layers.embeddings.Embedding at 0x1165d7e50>,
 <keras.layers.core.Flatten at 0x1165a7310>,
 <keras.layers.core.Flatten at 0x1165a73d0>,
 <keras.layers.merge.Dot at 0x1165d7f90>,
 <keras.layers.core.Dense at 0x1115a1810>]

In [108]:
def print_layer(model, num):
    print model.layers[num]
    print model.layers[num].input_shape
    print model.layers[num].output_shape

In [27]:
print_layer(sg_model,3)


<keras.layers.core.Flatten object at 0x1165a7310>
(None, 1, 50)
(None, 50)

Let's try training it with our toy data set!


In [28]:
import numpy as np

In [29]:
pairs = np.array(sg[0])
targets = np.array(sg[1])

In [30]:
targets


Out[30]:
array([1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0])

In [31]:
pairs


Out[31]:
array([[ 2,  6],
       [ 2,  6],
       [ 2,  9],
       [ 6,  2],
       [ 6,  3],
       [ 6, 10],
       [ 6,  1],
       [ 2,  4],
       [ 4,  2],
       [10,  6],
       [10,  8],
       [ 4,  2]])

In [32]:
w1_list = np.reshape(pairs[:, 0], (len(pairs), 1))
w1_list


Out[32]:
array([[ 2],
       [ 2],
       [ 2],
       [ 6],
       [ 6],
       [ 6],
       [ 6],
       [ 2],
       [ 4],
       [10],
       [10],
       [ 4]])

In [33]:
w2_list = np.reshape(pairs[:, 1], (len(pairs), 1))
w2_list


Out[33]:
array([[ 6],
       [ 6],
       [ 9],
       [ 2],
       [ 3],
       [10],
       [ 1],
       [ 4],
       [ 2],
       [ 6],
       [ 8],
       [ 2]])

In [34]:
w2_list.shape


Out[34]:
(12, 1)

In [35]:
w2_list.dtype


Out[35]:
dtype('int64')

In [45]:
sg_model.fit(x=[w1_list, w2_list], y=targets,  epochs=10)


Epoch 1/10
12/12 [==============================] - 0s - loss: 0.1396
Epoch 2/10
12/12 [==============================] - 0s - loss: 0.1390
Epoch 3/10
12/12 [==============================] - 0s - loss: 0.1383
Epoch 4/10
12/12 [==============================] - 0s - loss: 0.1377
Epoch 5/10
12/12 [==============================] - 0s - loss: 0.1371
Epoch 6/10
12/12 [==============================] - 0s - loss: 0.1365
Epoch 7/10
12/12 [==============================] - 0s - loss: 0.1360
Epoch 8/10
12/12 [==============================] - 0s - loss: 0.1354
Epoch 9/10
12/12 [==============================] - 0s - loss: 0.1349
Epoch 10/10
12/12 [==============================] - 0s - loss: 0.1344
Out[45]:
<keras.callbacks.History at 0x116b8f850>

In [47]:
sg_model.layers[2].weights


Out[47]:
[<tf.Variable 'shared_embedding_2/embeddings:0' shape=(11, 50) dtype=float32_ref>]

Continuous Bag of Words (CBOW) model

CBOW means we take all the words in the window and use them to predict the target word. Note we are trying to predict an actual word (or a probability distribution over words) with CBOW, whereas in skip-gram we are trying to predict a similarity score.
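
We won't train CBOW in this notebook, but a minimal Keras sketch (reusing VOCAB_SIZE and DENSEVEC_DIM from above; CONTEXT_SIZE, the number of context word ids per sample, is an assumption equal to 2 * window size) could look like:

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalAveragePooling1D
from keras.layers import Dense

CONTEXT_SIZE = 2   # hypothetical: 2 * window_size context words per target

cbow_model = Sequential()
cbow_model.add(Embedding(
    input_dim=VOCAB_SIZE + 1,
    output_dim=DENSEVEC_DIM,
    input_length=CONTEXT_SIZE))                  # input: the context word ids
cbow_model.add(GlobalAveragePooling1D())         # average the context word vectors
cbow_model.add(Dense(VOCAB_SIZE + 1, activation='softmax'))  # distribution over target words

cbow_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')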

FastText Model

FastText creates dense document vectors from the words in the document, augmented with n-grams. These are embedded, averaged, and fed through a dense output layer with a sigmoid activation. The prediction task is some binary classification of the documents. As usual, after training we can extract the dense vectors from the model.
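
For instance, with the skip-gram model and tokenizer trained earlier, pulling word vectors out of the embedding layer is just a matter of reading its weights (a sketch, not run above):

# get_weights() returns a list; its single entry is the embedding matrix,
# shape (VOCAB_SIZE + 1, DENSEVEC_DIM), where row i is the vector for word id i
embedding_matrix = sg_model.get_layer('shared_embedding').get_weights()[0]

love_vector = embedding_matrix[word2id['love']]   # the DENSEVEC_DIM-dim vector for "love"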

FastText Model Data Prep


In [48]:
MAX_FEATURES = 20000  # number of unique words in the dataset
MAXLEN = 400  # max word (feature) length of a review  
EMBEDDING_DIMS = 50
NGRAM_RANGE = 2

Some data prep functions lifted from the Keras fastText example


In [49]:
def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))

In [50]:
create_ngram_set([1, 2, 3, 4, 5], ngram_value=2)


Out[50]:
{(1, 2), (2, 3), (3, 4), (4, 5)}

In [51]:
create_ngram_set([1, 2, 3, 4, 5], ngram_value=3)


Out[51]:
{(1, 2, 3), (2, 3, 4), (3, 4, 5)}

In [52]:
def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of list (sequences) by appending n-grams values.
    """
    new_sequences = []
    for input_list in sequences:
        new_list = input_list[:]
        for i in range(len(new_list) - ngram_range + 1):
            for ngram_value in range(2, ngram_range + 1):
                ngram = tuple(new_list[i:i + ngram_value])
                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)

    return new_sequences

In [60]:
sequences = [[1,2,3,4,5, 6], [6,7,8]]
token_indice = {(1,2): 20000, (4,5): 20001, (6,7,8): 20002}

In [61]:
add_ngram(sequences, token_indice, ngram_range=2)


Out[61]:
[[1, 2, 3, 4, 5, 6, 20000, 20001], [6, 7, 8]]

In [62]:
add_ngram(sequences, token_indice, ngram_range=3)


Out[62]:
[[1, 2, 3, 4, 5, 6, 20000, 20001], [6, 7, 8, 20002]]

Load canned training data


In [63]:
from keras.datasets import imdb

In [64]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)

In [65]:
x_train[0:2]


Out[65]:
array([ list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 10156, 4, 1153, 9, 194, 775, 7, 8255, 11596, 349, 2637, 148, 605, 15358, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])], dtype=object)

In [66]:
y_train[0:2]


Out[66]:
array([1, 0])

Add n-gram features


In [67]:
ngram_set = set()
for input_list in x_train:
    for i in range(2, NGRAM_RANGE + 1):
        set_of_ngram = create_ngram_set(input_list, ngram_value=i)
        ngram_set.update(set_of_ngram)

In [68]:
len(ngram_set)


Out[68]:
1185229

Assign IDs to the new features


In [70]:
ngram_set.pop()


Out[70]:
(2561, 3221)

In [71]:
start_index = MAX_FEATURES + 1
token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
indice_token = {token_indice[k]: k for k in token_indice}

Update MAX_FEATURES


In [73]:
import numpy as np

In [74]:
MAX_FEATURES = np.max(list(indice_token.keys())) + 1
MAX_FEATURES


Out[74]:
1205229

Add n-grams to the input data


In [75]:
x_train = add_ngram(x_train, token_indice, NGRAM_RANGE)
x_test = add_ngram(x_test, token_indice, NGRAM_RANGE)

Make all input sequences the same length by padding with zeros


In [76]:
from keras.preprocessing import sequence

In [77]:
sequence.pad_sequences([[1,2,3,4,5], [6,7,8]], maxlen=10)


Out[77]:
array([[0, 0, 0, 0, 0, 1, 2, 3, 4, 5],
       [0, 0, 0, 0, 0, 0, 0, 6, 7, 8]], dtype=int32)

In [78]:
x_train = sequence.pad_sequences(x_train, maxlen=MAXLEN)
x_test = sequence.pad_sequences(x_test, maxlen=MAXLEN)

In [79]:
x_train.shape


Out[79]:
(25000, 400)

In [80]:
x_test.shape


Out[80]:
(25000, 400)

FastText Model


In [4]:
Image('diagrams/fasttext.png')


Out[4]:

In [82]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalAveragePooling1D
from keras.layers import Dense

In [83]:
ft_model = Sequential()

ft_model.add(Embedding(
    input_dim = MAX_FEATURES,
    output_dim = EMBEDDING_DIMS,
    input_length= MAXLEN))

ft_model.add(GlobalAveragePooling1D())

ft_model.add(Dense(1, activation='sigmoid'))

In [84]:
ft_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [302]:
ft_model.layers


Out[302]:
[<keras.layers.embeddings.Embedding at 0x117f12650>,
 <keras.layers.pooling.GlobalAveragePooling1D at 0x117bc51d0>,
 <keras.layers.core.Dense at 0x1190c3dd0>]

In [306]:
print_layer(ft_model, 0)


<keras.layers.embeddings.Embedding object at 0x117f12650>
(None, 400)
(None, 400, 50)

In [307]:
print_layer(ft_model, 1)


<keras.layers.pooling.GlobalAveragePooling1D object at 0x117bc51d0>
(None, 400, 50)
(None, 50)

In [308]:
print_layer(ft_model, 2)


<keras.layers.core.Dense object at 0x1190c3dd0>
(None, 50)
(None, 1)

In [85]:
ft_model.fit(x_train, y_train, batch_size=100, epochs=3, validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/3
25000/25000 [==============================] - 187s - loss: 0.6599 - acc: 0.7460 - val_loss: 0.6032 - val_acc: 0.8109
Epoch 2/3
25000/25000 [==============================] - 183s - loss: 0.4989 - acc: 0.8840 - val_loss: 0.4610 - val_acc: 0.8578
Epoch 3/3
25000/25000 [==============================] - 184s - loss: 0.3338 - acc: 0.9333 - val_loss: 0.3725 - val_acc: 0.8797
Out[85]:
<keras.callbacks.History at 0x1167c5b10>

fastText classifier vs. convolutional neural network (CNN) vs. long short-term memory (LSTM) classifier: Fight!

A CNN takes the dot product of various "filters" (each a new vector) with each word window down the sentence. For each convolutional layer in your model, you can choose the size of the filters (for example, 3 word vectors long) and the number of filters in the layer (for example, ten 3-word filters).

Add a bias to each dot product of the filter and word window, and run it through an activation function. This produces a number.

Running a single filter down a sentence produces a series of numbers. Generally the maximum value is taken to represent the alignment of the sentence with that particular filter. All of this is just another way of extracting features from a sentence. In fastText, we extracted features in a human-readable way (n-grams) and tacked them onto the input data. With a CNN we take a different approach, letting the algorithm figure out what makes good features for the dataset.
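
A toy numpy sketch of a single filter sliding down a sentence (made-up numbers; ReLU activation, then max pooling over positions):

import numpy as np

np.random.seed(0)
sentence = np.random.randn(7, 5)   # 7 words, each represented by a 5-dim vector (made up)
filt = np.random.randn(3, 5)       # one filter spanning 3-word windows
bias = 0.1

activations = []
for i in range(sentence.shape[0] - filt.shape[0] + 1):          # slide down the sentence
    window = sentence[i:i + filt.shape[0]]                      # one 3-word window
    activations.append(max(0.0, np.sum(window * filt) + bias))  # dot product + bias, ReLU

feature = max(activations)   # max pooling: one number summarizing this filter's response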

[insert filter operating on sentence image here]


In [7]:
Image('diagrams/text-cnn-classifier.png')


Out[7]:

Diagram from Convolutional Neural Networks for Sentence Classification, Yoon Kim (2014)

A CNN sentence classifier


In [114]:
embedding_dim = 50  # we'll get a vector representation of words as a by-product
filter_sizes = (2, 3, 4)  # we'll make one convolutional layer for each filter we specify here
num_filters = 10  # each layer will contain this many filters

In [115]:
dropout_prob = (0.2, 0.2)
hidden_dims = 50

# Preprocessing parameters
sequence_length = 400
max_words = 5000

Canned input data


In [88]:
from keras.datasets import imdb

In [89]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words)  # limits vocab to num_words

In [97]:
?imdb.load_data

In [90]:
from keras.preprocessing import sequence

In [91]:
x_train = sequence.pad_sequences(x_train, maxlen=sequence_length, padding="post", truncating="post")
x_test = sequence.pad_sequences(x_test, maxlen=sequence_length, padding="post", truncating="post")

In [92]:
x_train[0]


Out[92]:
array([   1,   14,   22,   16,   43,  530,  973, 1622, 1385,   65,  458,
       4468,   66, 3941,    4,  173,   36,  256,    5,   25,  100,   43,
        838,  112,   50,  670,    2,    9,   35,  480,  284,    5,  150,
          4,  172,  112,  167,    2,  336,  385,   39,    4,  172, 4536,
       1111,   17,  546,   38,   13,  447,    4,  192,   50,   16,    6,
        147, 2025,   19,   14,   22,    4, 1920, 4613,  469,    4,   22,
         71,   87,   12,   16,   43,  530,   38,   76,   15,   13, 1247,
          4,   22,   17,  515,   17,   12,   16,  626,   18,    2,    5,
         62,  386,   12,    8,  316,    8,  106,    5,    4, 2223,    2,
         16,  480,   66, 3785,   33,    4,  130,   12,   16,   38,  619,
          5,   25,  124,   51,   36,  135,   48,   25, 1415,   33,    6,
         22,   12,  215,   28,   77,   52,    5,   14,  407,   16,   82,
          2,    8,    4,  107,  117,    2,   15,  256,    4,    2,    7,
       3766,    5,  723,   36,   71,   43,  530,  476,   26,  400,  317,
         46,    7,    4,    2, 1029,   13,  104,   88,    4,  381,   15,
        297,   98,   32, 2071,   56,   26,  141,    6,  194,    2,   18,
          4,  226,   22,   21,  134,  476,   26,  480,    5,  144,   30,
          2,   18,   51,   36,   28,  224,   92,   25,  104,    4,  226,
         65,   16,   38, 1334,   88,   12,   16,  283,    5,   16, 4472,
        113,  103,   32,   15,   16,    2,   19,  178,   32,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0], dtype=int32)

In [93]:
vocabulary = imdb.get_word_index()  # word to integer map

In [94]:
vocabulary['good']


Out[94]:
49

In [98]:
len(vocabulary)


Out[98]:
88584

Model build


In [96]:
from keras.models import Model
from keras.layers import Input
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers.merge import Concatenate

In [116]:
# Input, embedding, and dropout layers
input_shape = (sequence_length,)
model_input = Input(shape=input_shape)
z = Embedding(
        input_dim=len(vocabulary) + 1, 
        output_dim=embedding_dim, 
        input_length=sequence_length, 
        name="embedding")(model_input)
z = Dropout(dropout_prob[0])(z)

# Convolutional block
# parallel set of n convolutions; output of all n are
# concatenated into one vector
conv_blocks = []
for sz in filter_sizes:
    conv = Conv1D(filters=num_filters, kernel_size=sz, activation="relu" )(z)
    conv = MaxPooling1D(pool_size=2)(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
    
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(dropout_prob[1])(z)

# Hidden dense layer and output layer
z = Dense(hidden_dims, activation="relu")(z)
model_output = Dense(1, activation="sigmoid")(z)

cnn_model = Model(model_input, model_output)
cnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [121]:
cnn_model.layers


Out[121]:
[<keras.engine.topology.InputLayer at 0x126639c10>,
 <keras.layers.embeddings.Embedding at 0x126639d10>,
 <keras.layers.core.Dropout at 0x11f1039d0>,
 <keras.layers.convolutional.Conv1D at 0x10defda50>,
 <keras.layers.convolutional.Conv1D at 0x126639c90>,
 <keras.layers.convolutional.Conv1D at 0x1121d3d90>,
 <keras.layers.pooling.MaxPooling1D at 0x126abe790>,
 <keras.layers.pooling.MaxPooling1D at 0x112185dd0>,
 <keras.layers.pooling.MaxPooling1D at 0x1121c6cd0>,
 <keras.layers.core.Flatten at 0x126676850>,
 <keras.layers.core.Flatten at 0x112197610>,
 <keras.layers.core.Flatten at 0x11cc280d0>,
 <keras.layers.merge.Concatenate at 0x10b926e90>,
 <keras.layers.core.Dropout at 0x126aaf9d0>,
 <keras.layers.core.Dense at 0x11cc51e50>,
 <keras.layers.core.Dense at 0x11cd6bf50>]

In [122]:
print_layer(cnn_model, 12)


<keras.layers.merge.Concatenate object at 0x10b926e90>
[(None, 1990), (None, 1990), (None, 1980)]
(None, 5960)


In [123]:
cnn_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/3
25000/25000 [==============================] - 50s - loss: 0.4335 - acc: 0.7704 - val_loss: 0.2938 - val_acc: 0.8758
Epoch 2/3
25000/25000 [==============================] - 50s - loss: 0.2309 - acc: 0.9071 - val_loss: 0.2859 - val_acc: 0.8802
Epoch 3/3
25000/25000 [==============================] - 50s - loss: 0.1893 - acc: 0.9257 - val_loss: 0.2995 - val_acc: 0.8766
Out[123]:
<keras.callbacks.History at 0x126751c50>

In [57]:
cnn_model.layers[1].weights


Out[57]:
[<tf.Variable 'embedding_5/embeddings:0' shape=(88585, 50) dtype=float32_ref>]

In [51]:
cnn_model.layers[1].get_weights()


Out[51]:
[array([[-0.00537731,  0.01004505, -0.01243093, ..., -0.000989  ,
          0.00684546, -0.00744937],
        [-0.05866501, -0.00139329,  0.01262602, ...,  0.01466062,
          0.01777977, -0.04964167],
        [ 0.0226872 , -0.00739046, -0.01942088, ...,  0.00778489,
          0.02367541, -0.02095466],
        ..., 
        [ 0.00730095,  0.03500965, -0.02484518, ..., -0.04528098,
          0.03952632, -0.00274396],
        [ 0.00923758, -0.03889309, -0.00641484, ..., -0.00164293,
         -0.02593929,  0.01862602],
        [-0.04924661,  0.02040339,  0.00640279, ...,  0.01267619,
          0.04790827,  0.00526398]], dtype=float32)]

In [55]:
cnn_model.layers[3].weights


Out[55]:
[<tf.Variable 'conv1d_11/kernel:0' shape=(3, 50, 100) dtype=float32_ref>,
 <tf.Variable 'conv1d_11/bias:0' shape=(100,) dtype=float32_ref>]

An LSTM sentence classifier


In [43]:
Image('diagrams/LSTM.png')


Out[43]:

In [37]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import SpatialDropout1D
from keras.layers.core import Dropout
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense

In [38]:
hidden_dims = 50
embedding_dim = 50

In [39]:
lstm_model = Sequential()
lstm_model.add(Embedding(len(vocabulary) + 1, embedding_dim, input_length=sequence_length, name="embedding"))
lstm_model.add(SpatialDropout1D(0.2))
lstm_model.add(LSTM(hidden_dims, dropout=0.2, recurrent_dropout=0.2))  # first arg, like Dense, is dim of output
lstm_model.add(Dense(1, activation='sigmoid'))

In [40]:
lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [41]:
lstm_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test))


Train on 25000 samples, validate on 25000 samples
Epoch 1/3
25000/25000 [==============================] - 482s - loss: 0.6939 - acc: 0.5040 - val_loss: 0.6946 - val_acc: 0.5006
Epoch 2/3
25000/25000 [==============================] - 462s - loss: 0.6896 - acc: 0.5157 - val_loss: 0.6911 - val_acc: 0.5069
Epoch 3/3
25000/25000 [==============================] - 463s - loss: 0.6758 - acc: 0.5354 - val_loss: 0.6948 - val_acc: 0.5100
Out[41]:
<keras.callbacks.History at 0x12016a510>

In [44]:
lstm_model.layers


Out[44]:
[<keras.layers.embeddings.Embedding at 0x11e953950>,
 <keras.layers.core.SpatialDropout1D at 0x111fed590>,
 <keras.layers.recurrent.LSTM at 0x11e975690>,
 <keras.layers.core.Dense at 0x10ef82590>]

In [47]:
lstm_model.layers[2].input_shape


Out[47]:
(None, 400, 50)

In [46]:
lstm_model.layers[2].output_shape


Out[46]:
(None, 50)

Appendix: Our own data download and preparation

We'll use the Large Movie Review Dataset v1.0 for our corpus. While Keras has its own data samples you can import for modeling (including this one), I think it's very important to get and process your own data. Otherwise, the results appear to materialize out of thin air and it's more difficult to get on with your own research.


In [42]:
%matplotlib inline
import pandas as pd

In [14]:
import glob

In [51]:
datapath = "/Users/pfigliozzi/aclImdb/train/unsup"
files = glob.glob(datapath+"/*.txt")[:1000] #first 1000 (there are 50k)

In [52]:
df = pd.concat([pd.read_table(filename, header=None, names=['raw']) for filename in files], ignore_index=True)

In [53]:
df.raw.map(lambda x: len(x)).plot.hist()


Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x107ee6710>
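
Presumably a quick back-of-envelope estimate of the full corpus size: 50,000 reviews at roughly 2,000 characters each (the bulk of the histogram above) is about 100 million characters, on the order of 100 MB of text.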

In [47]:
50000. * 2000. / 10**6


Out[47]:
100.0
