A dense vector word embedding represents each word with a numerical vector in which most components are nonzero. This is in contrast to sparse, or bag-of-words, embeddings, whose vectors are very high-dimensional (the size of the vocabulary) but mostly zero.
Dense vector models also capture word meaning: similar words (car and automobile) have similar numerical vectors. In a sparse representation, similar words will likely have completely different vectors. Dense vectors are formed as a by-product of some prediction task, and the quality of the embedding depends on both that task and the data set on which it was trained.
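To make the contrast concrete, here's a tiny sketch (the vocabulary size, indices, and vector values are all made up for illustration):

import numpy as np

VOCAB = 10000  # hypothetical vocabulary size

# Sparse / bag-of-words: one-hot vector, all zeros except one component
car_sparse = np.zeros(VOCAB)
car_sparse[4242] = 1.0  # arbitrary index assigned to "car"

# Dense embedding: short vector, most components nonzero
car_dense = np.array([0.21, -0.47, 0.05, 0.88, -0.13])         # e.g. 5 dimensions
automobile_dense = np.array([0.19, -0.45, 0.07, 0.91, -0.10])  # similar word, similar vector

# cosine similarity is high for the dense vectors of similar words
cos = car_dense @ automobile_dense / (np.linalg.norm(car_dense) * np.linalg.norm(automobile_dense))
print(cos)  # close to 1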
When we use word embeddings in our deep learning models, we refer to their birthplace as the embedding layer. Sometimes we don't actually care about the trained predictor (the skip-gram and CBOW models); we're just interested in the embedding by-product for use elsewhere. Other times we need an embedding layer to represent words in a larger model, such as a sentiment classifier; there, we may opt for pre-trained dense vectors.
When we don't care about the trained model and just want to create meaningful, dense word vectors, there are two popular prediction models: skip-gram and CBOW (continuous bag of words). Word embeddings constructed in this manner are termed word2vec or w2v. We will also look at another more recent method, fastText. In any case, we've first got to construct training data from our corpus. The exact procedure depends on the model.
Let's have a look at the Keras models we'll use in this section. (I'm keeping the code as Markdown since we haven't defined any of the parameters yet. We'll run this code after we develop the input data and parameters.)
In [4]:
from IPython.display import Image
In [5]:
Image('diagrams/skip-gram.png')
Out[5]:
word1 = Input(shape=(1,), dtype='int64', name='word1')
word2 = Input(shape=(1,), dtype='int64', name='word2')
shared_embedding = Embedding(
    input_dim=VOCAB_SIZE+1,
    output_dim=DENSEVEC_DIM,
    input_length=1,
    embeddings_constraint=unit_norm(),
    name='shared_embedding')
embedded_w1 = shared_embedding(word1)
embedded_w2 = shared_embedding(word2)
w1 = Flatten()(embedded_w1)
w2 = Flatten()(embedded_w2)
dotted = Dot(axes=1, name='dot_product')([w1, w2])
prediction = Dense(1, activation='sigmoid', name='output_layer')(dotted)
sg_model = Model(inputs=[word1, word2], outputs=prediction)
ft_model = Sequential()
ft_model.add(Embedding(
    input_dim=MAX_FEATURES,
    output_dim=EMBEDDING_DIMS,
    input_length=MAXLEN))
ft_model.add(GlobalAveragePooling1D())
ft_model.add(Dense(1, activation='sigmoid'))
The first step for CBOW and skip-gram
Our training corpus is a collection of sentences, Tweets, emails, comments, or even longer documents; it is something composed of words. Each word takes its turn being the "target" word, and we collect the n words before it and the n words that follow it. This n is referred to as the window size. If our example document is the sentence "I love deep learning" and the window size is 1, we'd get:
The target word is bold:

**I** love
I **love** deep
love **deep** learning
deep **learning**
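In code, collecting these windows is just a little list slicing (plain Python, not a Keras call):

sentence = "i love deep learning".split()
window = 1
for pos, target in enumerate(sentence):
    context = sentence[max(0, pos - window):pos] + sentence[pos + 1:pos + 1 + window]
    print(target, "->", context)
# i -> ['love']
# love -> ['i', 'deep']
# deep -> ['love', 'learning']
# learning -> ['deep']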
Skip-gram means forming word pairs from the target word and each word in its window. These become the "positive" (1) samples for the skip-gram algorithm. In our "I love deep learning" example, eliminating repeated pairs, we'd get:

(I, love), (love, deep), (deep, learning)
To create negative samples (0), we pair random vocabulary words with the target word. Yes, it's possible to unluckily pick a negative sample that actually co-occurs with the target word elsewhere in the corpus.
For our prediction task, we'll take the dot product of the two word vectors in each pair (a small step away from the cosine similarity). Training keeps tweaking the word vectors to push this product as close to one as possible for our positive samples, and to zero for our negative samples.
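As a rough numerical sketch of that objective (made-up two-dimensional unit vectors; the real model below also learns a weight and bias in its sigmoid output layer):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# unit-norm toy vectors for a positive pair ("deep", "learning") and a random negative word
deep     = np.array([0.98, 0.20])
learning = np.array([0.95, 0.31])
banana   = np.array([-0.60, 0.80])

print(sigmoid(deep @ learning))  # positive pair: training pushes this toward 1
print(sigmoid(deep @ banana))    # negative pair: training pushes this toward 0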
Happily, Keras includes a function for creating skip-grams from text. It even does the negative sampling for us.
In [1]:
from keras.preprocessing.sequence import skipgrams
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
In [2]:
text1 = "I love deep learning."
text2 = "Read Douglas Adams as much as possible."
In [3]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])
In [4]:
word2id = tokenizer.word_index
word2id.items()
Out[4]:
Note that word IDs are numbered from 1, not 0.
In [5]:
id2word = { wordid: word for word, wordid in word2id.items()}
id2word
Out[5]:
In [6]:
encoded_text = [word2id[word] for word in text_to_word_sequence(text1)]
encoded_text
Out[6]:
In [9]:
[word2id[word] for word in text_to_word_sequence(text2)]
Out[9]:
In [10]:
sg = skipgrams(encoded_text, vocabulary_size=len(word2id.keys()), window_size=1)
sg
Out[10]:
In [11]:
for i in range(len(sg[0])):
    print("({0},{1})={2}".format(id2word[sg[0][i][0]], id2word[sg[0][i][1]], sg[1][i]))
Model parameters
In [12]:
VOCAB_SIZE = len(word2id.keys())
VOCAB_SIZE
Out[12]:
In [13]:
DENSEVEC_DIM = 50
Model build
In [14]:
import keras
In [18]:
from keras.layers.embeddings import Embedding
from keras.constraints import unit_norm
from keras.layers.merge import Dot
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers import Input, Dense
from keras.models import Model
Create a dense vector for each word in the pair. The output of Embedding has shape (batch_size, sequence_length, output_dim), which in our case is (batch_size, 1, DENSEVEC_DIM). We'll use Flatten to get rid of that pesky middle dimension (1), so going into the dot product we'll have shape (batch_size, DENSEVEC_DIM).
In [16]:
word1 = Input(shape=(1,), dtype='int64', name='word1')
word2 = Input(shape=(1,), dtype='int64', name='word2')
In [19]:
shared_embedding = Embedding(
    input_dim=VOCAB_SIZE+1,
    output_dim=DENSEVEC_DIM,
    input_length=1,
    embeddings_constraint=unit_norm(),
    name='shared_embedding')
embedded_w1 = shared_embedding(word1)
embedded_w2 = shared_embedding(word2)
w1 = Flatten()(embedded_w1)
w2 = Flatten()(embedded_w2)
dotted = Dot(axes=1, name='dot_product')([w1, w2])
prediction = Dense(1, activation='sigmoid', name='output_layer')(dotted)
In [20]:
sg_model = Model(inputs=[word1, word2], outputs=prediction)
In [21]:
sg_model.compile(optimizer='adam', loss='mean_squared_error')
At this point you can check out how the data flows through your compiled model.
In [22]:
sg_model.layers
Out[22]:
In [108]:
def print_layer(model, num):
    print(model.layers[num])
    print(model.layers[num].input_shape)
    print(model.layers[num].output_shape)
In [27]:
print_layer(sg_model,3)
Let's try training it with our toy data set!
In [28]:
import numpy as np
In [29]:
pairs = np.array(sg[0])
targets = np.array(sg[1])
In [30]:
targets
Out[30]:
In [31]:
pairs
Out[31]:
In [32]:
w1_list = np.reshape(pairs[:, 0], (len(pairs), 1))
w1_list
Out[32]:
In [33]:
w2_list = np.reshape(pairs[:, 1], (len(pairs), 1))
w2_list
Out[33]:
In [34]:
w2_list.shape
Out[34]:
In [35]:
w2_list.dtype
Out[35]:
In [45]:
sg_model.fit(x=[w1_list, w2_list], y=targets, epochs=10)
Out[45]:
In [47]:
sg_model.layers[2].weights
Out[47]:
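The trained vectors live in the shared embedding layer's weight matrix. Here's a sketch of pulling them out by layer name and comparing two words (row 0 is the unused padding index, since our word IDs start at 1):

import numpy as np

# (VOCAB_SIZE + 1, DENSEVEC_DIM) weight matrix of the trained embedding layer
vectors = sg_model.get_layer('shared_embedding').get_weights()[0]

def word_vector(word):
    return vectors[word2id[word]]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(word_vector('deep'), word_vector('learning')))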
CBOW means we take all the words in the window and use them to predict the target word. Note that with CBOW we are trying to predict an actual word (or rather a probability distribution over words), whereas with skip-gram we are trying to predict a similarity score.
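Keras doesn't ship a CBOW data generator the way it ships skipgrams, but the model itself is small. A minimal sketch, reusing VOCAB_SIZE and DENSEVEC_DIM from above and assuming a window size of 1 (two context-word IDs in, the target-word ID as the label):

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalAveragePooling1D
from keras.layers import Dense

CONTEXT_LEN = 2  # window_size=1 on each side of the target

cbow_model = Sequential()
cbow_model.add(Embedding(
    input_dim=VOCAB_SIZE + 1,
    output_dim=DENSEVEC_DIM,
    input_length=CONTEXT_LEN,
    name='cbow_embedding'))
cbow_model.add(GlobalAveragePooling1D())                       # average the context-word vectors
cbow_model.add(Dense(VOCAB_SIZE + 1, activation='softmax'))    # probability distribution over the vocabulary
cbow_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')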
FastText creates dense document vectors using the words in the document, enhanced with n-grams. These are embedded, averaged, and fed through a dense output layer with a sigmoid activation. The prediction task is some binary classification of the documents. As usual, after training we can extract the dense vectors from the model.
In [48]:
MAX_FEATURES = 20000 # number of unique words in the dataset
MAXLEN = 400 # max word (feature) length of a review
EMBEDDING_DIMS = 50
NGRAM_RANGE = 2
Some data prep functions, lifted from the Keras fastText example
In [49]:
def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))
In [50]:
create_ngram_set([1, 2, 3, 4, 5], ngram_value=2)
Out[50]:
In [51]:
create_ngram_set([1, 2, 3, 4, 5], ngram_value=3)
Out[51]:
In [52]:
def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of lists (sequences) by appending n-gram values.
    """
    new_sequences = []
    for input_list in sequences:
        new_list = input_list[:]
        for i in range(len(new_list) - ngram_range + 1):
            for ngram_value in range(2, ngram_range + 1):
                ngram = tuple(new_list[i:i + ngram_value])
                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)
    return new_sequences
In [60]:
sequences = [[1,2,3,4,5, 6], [6,7,8]]
token_indice = {(1,2): 20000, (4,5): 20001, (6,7,8): 20002}
In [61]:
add_ngram(sequences, token_indice, ngram_range=2)
Out[61]:
In [62]:
add_ngram(sequences, token_indice, ngram_range=3)
Out[62]:
Load canned training data
In [63]:
from keras.datasets import imdb
In [64]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)
In [65]:
x_train[0:2]
Out[65]:
In [66]:
y_train[0:2]
Out[66]:
Add n-gram features
In [67]:
ngram_set = set()
for input_list in x_train:
    for i in range(2, NGRAM_RANGE + 1):
        set_of_ngram = create_ngram_set(input_list, ngram_value=i)
        ngram_set.update(set_of_ngram)
In [68]:
len(ngram_set)
Out[68]:
Assign IDs to the new features
In [70]:
ngram_set.pop()  # peek at one n-gram (note: pop() also removes it from the set)
Out[70]:
In [71]:
start_index = MAX_FEATURES + 1
token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
indice_token = {token_indice[k]: k for k in token_indice}
Update MAX_FEATURES
In [73]:
import numpy as np
In [74]:
MAX_FEATURES = np.max(list(indice_token.keys())) + 1
MAX_FEATURES
Out[74]:
Add n-grams to the input data
In [75]:
x_train = add_ngram(x_train, token_indice, NGRAM_RANGE)
x_test = add_ngram(x_test, token_indice, NGRAM_RANGE)
Make all input sequences the same length by padding with zeros
In [76]:
from keras.preprocessing import sequence
In [77]:
sequence.pad_sequences([[1,2,3,4,5], [6,7,8]], maxlen=10)
Out[77]:
In [78]:
x_train = sequence.pad_sequences(x_train, maxlen=MAXLEN)
x_test = sequence.pad_sequences(x_test, maxlen=MAXLEN)
In [79]:
x_train.shape
Out[79]:
In [80]:
x_test.shape
Out[80]:
In [4]:
Image('diagrams/fasttext.png')
Out[4]:
In [82]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalAveragePooling1D
from keras.layers import Dense
In [83]:
ft_model = Sequential()
ft_model.add(Embedding(
    input_dim=MAX_FEATURES,
    output_dim=EMBEDDING_DIMS,
    input_length=MAXLEN))
ft_model.add(GlobalAveragePooling1D())
ft_model.add(Dense(1, activation='sigmoid'))
In [84]:
ft_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
In [302]:
ft_model.layers
Out[302]:
In [306]:
print_layer(ft_model, 0)
In [307]:
print_layer(ft_model, 1)
In [308]:
print_layer(ft_model, 2)
In [85]:
ft_model.fit(x_train, y_train, batch_size=100, epochs=3, validation_data=(x_test, y_test))
Out[85]:
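After training, one way to extract the dense document vectors is to wrap the pooling layer's output in a new Model (a sketch, reusing ft_model and x_train from above):

from keras.models import Model

# output of the GlobalAveragePooling1D layer: one EMBEDDING_DIMS-long vector per document
doc_vector_model = Model(inputs=ft_model.input, outputs=ft_model.layers[1].output)
doc_vectors = doc_vector_model.predict(x_train[:5])
doc_vectors.shape  # (5, EMBEDDING_DIMS)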
A CNN takes the dot product of various "filters" (each a new vector) with each word window down the sentence. For each convolutional layer in your model, you choose the length of the filter (for example, 3 word vectors long) and the number of filters in the layer (for example, ten 3-word filters).
Add a bias to each dot product of filter and word window, and run it through an activation function. This produces a single number for each window position.
Running a single filter down a sentence thus produces a series of numbers. Generally the maximum value is taken to represent how well the sentence aligns with that particular filter. All of this is just another way of extracting features from a sentence. In fastText, we extracted features in a human-readable way (n-grams) and tacked them onto the input data. With a CNN we take a different approach, letting the algorithm figure out what makes good features for the dataset.
insert filter operating on sentence image here
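Here's a small numerical sketch of one filter doing exactly that, with made-up 2-dimensional word vectors and a filter spanning 2 words (the Kim-style architecture takes the max over all positions; the model we build below uses MaxPooling1D with pool_size=2, a local version of the same idea):

import numpy as np

# sentence as a (num_words, embedding_dim) matrix of made-up word vectors
sentence = np.array([[0.2, 0.7],
                     [0.9, 0.1],
                     [0.4, 0.4],
                     [0.8, 0.3]])
filt = np.array([0.5, -0.2,   # weights applied to the first word in the window
                 0.1,  0.9])  # weights applied to the second word in the window
bias = 0.05
filter_size = 2

activations = []
for i in range(len(sentence) - filter_size + 1):
    window = sentence[i:i + filter_size].flatten()       # 2 word vectors -> one length-4 vector
    activations.append(max(0.0, window @ filt + bias))   # dot product + bias, then ReLU

print(activations)       # one number per window position
print(max(activations))  # max pooling: the sentence's alignment with this filter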
In [7]:
Image('diagrams/text-cnn-classifier.png')
Out[7]:
Diagram from "Convolutional Neural Networks for Sentence Classification," Yoon Kim (2014)
In [114]:
embedding_dim = 50 # we'll get a vector representation of words as a by-product
filter_sizes = (2, 3, 4) # we'll make one convolutional layer for each filter we specify here
num_filters = 10 # each layer will contain this many filters
In [115]:
dropout_prob = (0.2, 0.2)
hidden_dims = 50
# Preprocessing parameters
sequence_length = 400
max_words = 5000
In [88]:
from keras.datasets import imdb
In [89]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words) # limits vocab to num_words
In [97]:
?imdb.load_data
In [90]:
from keras.preprocessing import sequence
In [91]:
x_train = sequence.pad_sequences(x_train, maxlen=sequence_length, padding="post", truncating="post")
x_test = sequence.pad_sequences(x_test, maxlen=sequence_length, padding="post", truncating="post")
In [92]:
x_train[0]
Out[92]:
In [93]:
vocabulary = imdb.get_word_index() # word to integer map
In [94]:
vocabulary['good']
Out[94]:
In [98]:
len(vocabulary)
Out[98]:
In [96]:
from keras.models import Model
from keras.layers import Input
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers.merge import Concatenate
In [116]:
# Input, embedding, and dropout layers
input_shape = (sequence_length,)
model_input = Input(shape=input_shape)
z = Embedding(
    input_dim=len(vocabulary) + 1,
    output_dim=embedding_dim,
    input_length=sequence_length,
    name="embedding")(model_input)
z = Dropout(dropout_prob[0])(z)
# Convolutional block
# parallel set of n convolutions; output of all n are
# concatenated into one vector
conv_blocks = []
for sz in filter_sizes:
    conv = Conv1D(filters=num_filters, kernel_size=sz, activation="relu")(z)
    conv = MaxPooling1D(pool_size=2)(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(dropout_prob[1])(z)
# Hidden dense layer and output layer
z = Dense(hidden_dims, activation="relu")(z)
model_output = Dense(1, activation="sigmoid")(z)
cnn_model = Model(model_input, model_output)
cnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
In [121]:
cnn_model.layers
Out[121]:
In [122]:
print_layer(cnn_model, 12)
In [123]:
cnn_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test))
Out[123]:
In [57]:
cnn_model.layers[1].weights
Out[57]:
In [51]:
cnn_model.layers[1].get_weights()
Out[51]:
In [55]:
cnn_model.layers[3].weights
Out[55]:
In [43]:
Image('diagrams/LSTM.png')
Out[43]:
In [37]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import SpatialDropout1D
from keras.layers.core import Dropout
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense
In [38]:
hidden_dims = 50
embedding_dim = 50
In [39]:
lstm_model = Sequential()
lstm_model.add(Embedding(len(vocabulary) + 1, embedding_dim, input_length=sequence_length, name="embedding"))
lstm_model.add(SpatialDropout1D(0.2))
lstm_model.add(LSTM(hidden_dims, dropout=0.2, recurrent_dropout=0.2)) # first arg, like Dense, is dim of output
lstm_model.add(Dense(1, activation='sigmoid'))
In [40]:
lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
In [41]:
lstm_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test))
Out[41]:
In [44]:
lstm_model.layers
Out[44]:
In [47]:
lstm_model.layers[2].input_shape
Out[47]:
In [46]:
lstm_model.layers[2].output_shape
Out[46]:
We'll use the Large Movie Review Dataset v1.0 for our corpus. While Keras has its own data samples you can import for modeling (including this one), I think it's very important to get and process your own data. Otherwise, the results appear to materialize out of thin air and it's more difficult to get on with your own research.
In [42]:
%matplotlib inline
import pandas as pd
In [14]:
import glob
In [51]:
datapath = "/Users/pfigliozzi/aclImdb/train/unsup"
files = glob.glob(datapath+"/*.txt")[:1000] #first 1000 (there are 50k)
In [52]:
df = pd.concat([pd.read_table(filename, header=None, names=['raw']) for filename in files], ignore_index=True)
In [53]:
df.raw.map(lambda x: len(x)).plot.hist()
Out[53]:
In [47]:
50000. * 2000. / 10**6  # rough corpus size estimate: 50k reviews at ~2,000 characters each, in megabytes
Out[47]:
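From here, a sketch of turning the raw reviews into the same kind of padded integer sequences the models above expect (x_own is just a name introduced here; the 5000-word cap and length 400 mirror max_words and sequence_length from earlier):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokenizer = Tokenizer(num_words=5000)           # cap the vocabulary, as with max_words above
tokenizer.fit_on_texts(df.raw)
encoded = tokenizer.texts_to_sequences(df.raw)  # list of integer sequences, one per review
x_own = sequence.pad_sequences(encoded, maxlen=400, padding="post", truncating="post")
x_own.shape  # one padded row of 400 word IDs per review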