A dense vector word embedding represents each word with a numerical vector in which most components are nonzero. This is in contrast to sparse, or bag-of-words, embeddings, whose vectors are very high-dimensional (the size of the vocabulary) but mostly zero.
Dense vector models also capture word meaning: similar words (car and automobile) have similar numerical vectors. In a sparse representation, similar words will likely have completely different vectors. Dense vectors are formed as a by-product of some prediction task, and the quality of the embedding depends on both that task and the data set on which it was trained.
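To make the contrast concrete, here's a tiny sketch (the vocabulary size, indices, and vector values are all made up for illustration):

import numpy as np

VOCAB = 10000  # hypothetical vocabulary size

# Sparse / bag-of-words: one-hot vector, all zeros except one component
car_sparse = np.zeros(VOCAB)
car_sparse[4242] = 1.0  # arbitrary index assigned to "car"

# Dense embedding: short vector, most components nonzero
car_dense = np.array([0.21, -0.47, 0.05, 0.88, -0.13])         # e.g. 5 dimensions
automobile_dense = np.array([0.19, -0.45, 0.07, 0.91, -0.10])  # similar word, similar vector

# cosine similarity is high for the dense vectors of similar words
cos = car_dense @ automobile_dense / (np.linalg.norm(car_dense) * np.linalg.norm(automobile_dense))
print(cos)  # close to 1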
When we use word embeddings in our deep learning models, we refer to their birthplace as the embedding layer. Sometimes we don't actually care about the trained predictor (the skip-gram and CBOW models); we're just interested in the embedding by-product for use elsewhere. Other times we need an embedding layer to represent words in a larger model, such as a sentiment classifier; there, we may opt for pre-trained dense vectors.
When we don't care about the trained model and just want to create meaningful, dense word vectors, there are two popular prediction models: skip-gram and CBOW (continuous bag of words). Word embeddings constructed in this manner are termed word2vec or w2v. We will also look at another more recent method, fastText. In any case, we've first got to construct training data from our corpus. The exact procedure depends on the model.
Let's have a look at the Keras models we'll use in this section. (I'm keeping the code as Markdown since we haven't defined any of the parameters yet. We'll run this code after we develop the input data and parameters.)
In [4]:
from IPython.display import Image
In [5]:
Image('diagrams/skip-gram.png')
Out[5]:
word1 = Input(shape=(1,), dtype='int64', name='word1')
word2 = Input(shape=(1,), dtype='int64', name='word2')
shared_embedding = Embedding(
    input_dim=VOCAB_SIZE+1,
    output_dim=DENSEVEC_DIM,
    input_length=1,
    embeddings_constraint=unit_norm(),
    name='shared_embedding')
embedded_w1 = shared_embedding(word1)
embedded_w2 = shared_embedding(word2)
w1 = Flatten()(embedded_w1)
w2 = Flatten()(embedded_w2)
dotted = Dot(axes=1, name='dot_product')([w1, w2])
prediction = Dense(1, activation='sigmoid', name='output_layer')(dotted)
sg_model = Model(inputs=[word1, word2], outputs=prediction)
ft_model = Sequential()
ft_model.add(Embedding(
    input_dim=MAX_FEATURES,
    output_dim=EMBEDDING_DIMS,
    input_length=MAXLEN))
ft_model.add(GlobalAveragePooling1D())
ft_model.add(Dense(1, activation='sigmoid'))
The first step for CBOW and skip-gram
Our training corpus is a collection of sentences, Tweets, emails, comments, or even longer documents; it is something composed of words. Each word takes its turn being the "target" word, and we collect the n words before it and the n words that follow it. This n is referred to as the window size. If our example document is the sentence "I love deep learning" and the window size is 1, we'd get:
The target word is bold:

**I** love
I **love** deep
love **deep** learning
deep **learning**
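In code, collecting these windows is just a little list slicing (plain Python, not a Keras call):

sentence = "i love deep learning".split()
window = 1
for pos, target in enumerate(sentence):
    context = sentence[max(0, pos - window):pos] + sentence[pos + 1:pos + 1 + window]
    print(target, "->", context)
# i -> ['love']
# love -> ['i', 'deep']
# deep -> ['love', 'learning']
# learning -> ['deep']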
Skip-gram means forming word pairs from the target word and each word in its window. These become the "positive" (1) samples for the skip-gram algorithm. In our "I love deep learning" example, eliminating repeated pairs, we'd get:

(I, love), (love, deep), (deep, learning)
To create negative samples (0), we pair random vocabulary words with the target word. Yes, it's possible to unluckily pick a negative sample that actually co-occurs with the target word elsewhere in the corpus.
For our prediction task, we'll take the dot product of the two word vectors in each pair (a small step away from the cosine similarity). Training keeps tweaking the word vectors to push this product as close to one as possible for our positive samples, and to zero for our negative samples.
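As a rough numerical sketch of that objective (made-up two-dimensional unit vectors; the real model below also learns a weight and bias in its sigmoid output layer):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# unit-norm toy vectors for a positive pair ("deep", "learning") and a random negative word
deep     = np.array([0.98, 0.20])
learning = np.array([0.95, 0.31])
banana   = np.array([-0.60, 0.80])

print(sigmoid(deep @ learning))  # positive pair: training pushes this toward 1
print(sigmoid(deep @ banana))    # negative pair: training pushes this toward 0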
Happily, Keras includes a function for creating skip-grams from text. It even does the negative sampling for us.
In [1]:
from keras.preprocessing.sequence import skipgrams
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
In [2]:
text1 = "I love deep learning."
text2 = "Read Douglas Adams as much as possible."
In [3]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text1, text2])
In [4]:
word2id = tokenizer.word_index
word2id.items()
Out[4]:
Note that word IDs are numbered from 1, not 0.
In [5]:
id2word = { wordid: word for word, wordid in word2id.items()}
id2word
Out[5]:
In [6]:
encoded_text = [word2id[word] for word in text_to_word_sequence(text1)]
encoded_text
Out[6]:
In [9]:
[word2id[word] for word in text_to_word_sequence(text2)]
Out[9]:
In [10]:
sg = skipgrams(encoded_text, vocabulary_size=len(word2id.keys()), window_size=1)
sg
Out[10]:
In [11]:
for i in range(len(sg[0])):
    print("({0},{1})={2}".format(id2word[sg[0][i][0]], id2word[sg[0][i][1]], sg[1][i]))
Model parameters
In [12]:
VOCAB_SIZE = len(word2id.keys())
VOCAB_SIZE
Out[12]:
In [13]:
DENSEVEC_DIM = 50
Model build
In [14]:
import keras
In [18]:
from keras.layers.embeddings import Embedding
from keras.constraints import unit_norm
from keras.layers.merge import Dot
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers import Input, Dense
from keras.models import Model
Create a dense vector for each word in the pair. The output of Embedding has shape (batch_size, sequence_length, output_dim), which in our case is (batch_size, 1, DENSEVEC_DIM). We'll use Flatten to get rid of that pesky middle dimension (1), so going into the dot product we'll have shape (batch_size, DENSEVEC_DIM).
In [16]:
word1 = Input(shape=(1,), dtype='int64', name='word1')
word2 = Input(shape=(1,), dtype='int64', name='word2')
In [19]:
shared_embedding = Embedding(
    input_dim=VOCAB_SIZE+1,
    output_dim=DENSEVEC_DIM,
    input_length=1,
    embeddings_constraint=unit_norm(),
    name='shared_embedding')
embedded_w1 = shared_embedding(word1)
embedded_w2 = shared_embedding(word2)
w1 = Flatten()(embedded_w1)
w2 = Flatten()(embedded_w2)
dotted = Dot(axes=1, name='dot_product')([w1, w2])
prediction = Dense(1, activation='sigmoid', name='output_layer')(dotted)
In [20]:
sg_model = Model(inputs=[word1, word2], outputs=prediction)
In [21]:
sg_model.compile(optimizer='adam', loss='mean_squared_error')
At this point you can check out how the data flows through your compiled model.
In [22]:
sg_model.layers
Out[22]:
In [108]:
def print_layer(model, num):
    print(model.layers[num])
    print(model.layers[num].input_shape)
    print(model.layers[num].output_shape)
In [27]:
print_layer(sg_model,3)
Let's try training it with our toy data set!
In [28]:
import numpy as np
In [29]:
pairs = np.array(sg[0])
targets = np.array(sg[1])
In [30]:
targets
Out[30]:
In [31]:
pairs
Out[31]:
In [32]:
w1_list = np.reshape(pairs[:, 0], (len(pairs), 1))
w1_list
Out[32]:
In [33]:
w2_list = np.reshape(pairs[:, 1], (len(pairs), 1))
w2_list
Out[33]:
In [34]:
w2_list.shape
Out[34]:
In [35]:
w2_list.dtype
Out[35]:
In [45]:
sg_model.fit(x=[w1_list, w2_list], y=targets, epochs=10)
Out[45]:
In [47]:
sg_model.layers[2].weights
Out[47]:
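The trained vectors live in the shared embedding layer's weight matrix. Here's a sketch of pulling them out by layer name and comparing two words (row 0 is the unused padding index, since our word IDs start at 1):

import numpy as np

# (VOCAB_SIZE + 1, DENSEVEC_DIM) weight matrix of the trained embedding layer
vectors = sg_model.get_layer('shared_embedding').get_weights()[0]

def word_vector(word):
    return vectors[word2id[word]]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(word_vector('deep'), word_vector('learning')))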
CBOW means we take all the words in the window and use them to predict the target word. Note that with CBOW we are trying to predict an actual word (or rather a probability distribution over words), whereas with skip-gram we are trying to predict a similarity score.
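Keras doesn't ship a CBOW data generator the way it ships skipgrams, but the model itself is small. A minimal sketch, reusing VOCAB_SIZE and DENSEVEC_DIM from above and assuming a window size of 1 (two context-word IDs in, the target-word ID as the label):

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalAveragePooling1D
from keras.layers import Dense

CONTEXT_LEN = 2  # window_size=1 on each side of the target

cbow_model = Sequential()
cbow_model.add(Embedding(
    input_dim=VOCAB_SIZE + 1,
    output_dim=DENSEVEC_DIM,
    input_length=CONTEXT_LEN,
    name='cbow_embedding'))
cbow_model.add(GlobalAveragePooling1D())                       # average the context-word vectors
cbow_model.add(Dense(VOCAB_SIZE + 1, activation='softmax'))    # probability distribution over the vocabulary
cbow_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')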
FastText creates dense document vectors using the words in the document, enhanced with n-grams. These are embedded, averaged, and fed through a dense output layer with a sigmoid activation. The prediction task is some binary classification of the documents. As usual, after training we can extract the dense vectors from the model.
In [48]:
MAX_FEATURES = 20000 # number of unique words in the dataset
MAXLEN = 400 # max word (feature) length of a review
EMBEDDING_DIMS = 50
NGRAM_RANGE = 2
Some data prep functions, lifted from the Keras fastText example
In [49]:
def create_ngram_set(input_list, ngram_value=2):
    """
    Extract a set of n-grams from a list of integers.
    """
    return set(zip(*[input_list[i:] for i in range(ngram_value)]))
In [50]:
create_ngram_set([1, 2, 3, 4, 5], ngram_value=2)
Out[50]:
In [51]:
create_ngram_set([1, 2, 3, 4, 5], ngram_value=3)
Out[51]:
In [52]:
def add_ngram(sequences, token_indice, ngram_range=2):
    """
    Augment the input list of lists (sequences) by appending n-gram values.
    """
    new_sequences = []
    for input_list in sequences:
        new_list = input_list[:]
        for i in range(len(new_list) - ngram_range + 1):
            for ngram_value in range(2, ngram_range + 1):
                ngram = tuple(new_list[i:i + ngram_value])
                if ngram in token_indice:
                    new_list.append(token_indice[ngram])
        new_sequences.append(new_list)
    return new_sequences
In [60]:
sequences = [[1,2,3,4,5, 6], [6,7,8]]
token_indice = {(1,2): 20000, (4,5): 20001, (6,7,8): 20002}
In [61]:
add_ngram(sequences, token_indice, ngram_range=2)
Out[61]:
In [62]:
add_ngram(sequences, token_indice, ngram_range=3)
Out[62]:
Load canned training data
In [63]:
from keras.datasets import imdb
In [64]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=MAX_FEATURES)
In [65]:
x_train[0:2]
Out[65]:
In [66]:
y_train[0:2]
Out[66]:
Add n-gram features
In [67]:
ngram_set = set()
for input_list in x_train:
    for i in range(2, NGRAM_RANGE + 1):
        set_of_ngram = create_ngram_set(input_list, ngram_value=i)
        ngram_set.update(set_of_ngram)
In [68]:
len(ngram_set)
Out[68]:
Assign IDs to the new features
In [70]:
ngram_set.pop()  # peek at one n-gram (note: pop() also removes it from the set)
Out[70]:
In [71]:
start_index = MAX_FEATURES + 1
token_indice = {v: k + start_index for k, v in enumerate(ngram_set)}
indice_token = {token_indice[k]: k for k in token_indice}
Update MAX_FEATURES
In [73]:
import numpy as np
In [74]:
MAX_FEATURES = np.max(list(indice_token.keys())) + 1
MAX_FEATURES
Out[74]:
Add n-grams to the input data
In [75]:
x_train = add_ngram(x_train, token_indice, NGRAM_RANGE)
x_test = add_ngram(x_test, token_indice, NGRAM_RANGE)
Make all input sequences the same length by padding with zeros
In [76]:
from keras.preprocessing import sequence
In [77]:
sequence.pad_sequences([[1,2,3,4,5], [6,7,8]], maxlen=10)
Out[77]:
In [78]:
x_train = sequence.pad_sequences(x_train, maxlen=MAXLEN)
x_test = sequence.pad_sequences(x_test, maxlen=MAXLEN)
In [79]:
x_train.shape
Out[79]:
In [80]:
x_test.shape
Out[80]:
In [4]:
Image('diagrams/fasttext.png')
Out[4]:
In [82]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalAveragePooling1D
from keras.layers import Dense
In [83]:
ft_model = Sequential()
ft_model.add(Embedding(
    input_dim=MAX_FEATURES,
    output_dim=EMBEDDING_DIMS,
    input_length=MAXLEN))
ft_model.add(GlobalAveragePooling1D())
ft_model.add(Dense(1, activation='sigmoid'))
In [84]:
ft_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
In [302]:
ft_model.layers
Out[302]:
In [306]:
print_layer(ft_model, 0)
In [307]:
print_layer(ft_model, 1)
In [308]:
print_layer(ft_model, 2)
In [85]:
ft_model.fit(x_train, y_train, batch_size=100, epochs=3, validation_data=(x_test, y_test))
Out[85]:
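After training, one way to extract the dense document vectors is to wrap the pooling layer's output in a new Model (a sketch, reusing ft_model and x_train from above):

from keras.models import Model

# output of the GlobalAveragePooling1D layer: one EMBEDDING_DIMS-long vector per document
doc_vector_model = Model(inputs=ft_model.input, outputs=ft_model.layers[1].output)
doc_vectors = doc_vector_model.predict(x_train[:5])
doc_vectors.shape  # (5, EMBEDDING_DIMS)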
A CNN takes the dot product of various "filters" (each a new vector) with each word window down the sentence. For each convolutional layer in your model, you choose the length of the filter (for example, 3 word vectors long) and the number of filters in the layer (for example, ten 3-word filters).
Add a bias to each dot product of filter and word window, and run it through an activation function. This produces a single number for each window position.
Running a single filter down a sentence thus produces a series of numbers. Generally the maximum value is taken to represent how well the sentence aligns with that particular filter. All of this is just another way of extracting features from a sentence. In fastText, we extracted features in a human-readable way (n-grams) and tacked them onto the input data. With a CNN we take a different approach, letting the algorithm figure out what makes good features for the dataset.
insert filter operating on sentence image here
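Here's a small numerical sketch of one filter doing exactly that, with made-up 2-dimensional word vectors and a filter spanning 2 words (the Kim-style architecture takes the max over all positions; the model we build below uses MaxPooling1D with pool_size=2, a local version of the same idea):

import numpy as np

# sentence as a (num_words, embedding_dim) matrix of made-up word vectors
sentence = np.array([[0.2, 0.7],
                     [0.9, 0.1],
                     [0.4, 0.4],
                     [0.8, 0.3]])
filt = np.array([0.5, -0.2,   # weights applied to the first word in the window
                 0.1,  0.9])  # weights applied to the second word in the window
bias = 0.05
filter_size = 2

activations = []
for i in range(len(sentence) - filter_size + 1):
    window = sentence[i:i + filter_size].flatten()       # 2 word vectors -> one length-4 vector
    activations.append(max(0.0, window @ filt + bias))   # dot product + bias, then ReLU

print(activations)       # one number per window position
print(max(activations))  # max pooling: the sentence's alignment with this filter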
In [7]:
Image('diagrams/text-cnn-classifier.png')
Out[7]:
Diagram from "Convolutional Neural Networks for Sentence Classification," Yoon Kim (2014)
In [114]:
embedding_dim = 50 # we'll get a vector representation of words as a by-product
filter_sizes = (2, 3, 4) # we'll make one convolutional layer for each filter we specify here
num_filters = 10 # each layer will contain this many filters
In [115]:
dropout_prob = (0.2, 0.2)
hidden_dims = 50
# Preprocessing parameters
sequence_length = 400
max_words = 5000
In [88]:
from keras.datasets import imdb
In [89]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words) # limits vocab to num_words
In [97]:
?imdb.load_data
In [90]:
from keras.preprocessing import sequence
In [91]:
x_train = sequence.pad_sequences(x_train, maxlen=sequence_length, padding="post", truncating="post")
x_test = sequence.pad_sequences(x_test, maxlen=sequence_length, padding="post", truncating="post")
In [92]:
x_train[0]
Out[92]:
In [93]:
vocabulary = imdb.get_word_index() # word to integer map
In [94]:
vocabulary['good']
Out[94]:
In [98]:
len(vocabulary)
Out[98]:
In [96]:
from keras.models import Model
from keras.layers import Input
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers.merge import Concatenate
In [116]:
# Input, embedding, and dropout layers
input_shape = (sequence_length,)
model_input = Input(shape=input_shape)
z = Embedding(
    input_dim=len(vocabulary) + 1,
    output_dim=embedding_dim,
    input_length=sequence_length,
    name="embedding")(model_input)
z = Dropout(dropout_prob[0])(z)
# Convolutional block
# parallel set of n convolutions; output of all n are
# concatenated into one vector
conv_blocks = []
for sz in filter_sizes:
    conv = Conv1D(filters=num_filters, kernel_size=sz, activation="relu")(z)
    conv = MaxPooling1D(pool_size=2)(conv)
    conv = Flatten()(conv)
    conv_blocks.append(conv)
z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]
z = Dropout(dropout_prob[1])(z)
# Hidden dense layer and output layer
z = Dense(hidden_dims, activation="relu")(z)
model_output = Dense(1, activation="sigmoid")(z)
cnn_model = Model(model_input, model_output)
cnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
In [121]:
cnn_model.layers
Out[121]:
In [122]:
print_layer(cnn_model, 12)
In [123]:
cnn_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test))
Out[123]:
In [57]:
cnn_model.layers[1].weights
Out[57]:
In [51]:
cnn_model.layers[1].get_weights()
Out[51]:
In [55]:
cnn_model.layers[3].weights
Out[55]:
In [43]:
Image('diagrams/LSTM.png')
Out[43]:
In [37]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import SpatialDropout1D
from keras.layers.core import Dropout
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense
In [38]:
hidden_dims = 50
embedding_dim = 50
In [39]:
lstm_model = Sequential()
lstm_model.add(Embedding(len(vocabulary) + 1, embedding_dim, input_length=sequence_length, name="embedding"))
lstm_model.add(SpatialDropout1D(0.2))
lstm_model.add(LSTM(hidden_dims, dropout=0.2, recurrent_dropout=0.2)) # first arg, like Dense, is dim of output
lstm_model.add(Dense(1, activation='sigmoid'))
In [40]:
lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
In [41]:
lstm_model.fit(x_train, y_train, batch_size=64, epochs=3, validation_data=(x_test, y_test))
Out[41]:
In [44]:
lstm_model.layers
Out[44]:
In [47]:
lstm_model.layers[2].input_shape
Out[47]:
In [46]:
lstm_model.layers[2].output_shape
Out[46]:
We'll use the Large Movie Review Dataset v1.0 for our corpus. While Keras has its own data samples you can import for modeling (including this one), I think it's very important to get and process your own data. Otherwise, the results appear to materialize out of thin air and it's more difficult to get on with your own research.
In [42]:
%matplotlib inline
import pandas as pd
In [14]:
import glob
In [51]:
datapath = "/Users/pfigliozzi/aclImdb/train/unsup"
files = glob.glob(datapath+"/*.txt")[:1000] #first 1000 (there are 50k)
In [52]:
df = pd.concat([pd.read_table(filename, header=None, names=['raw']) for filename in files], ignore_index=True)
In [53]:
df.raw.map(lambda x: len(x)).plot.hist()
Out[53]:
In [47]:
50000. * 2000. / 10**6  # rough corpus size estimate: 50k reviews at ~2,000 characters each, in megabytes
Out[47]:
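From here, a sketch of turning the raw reviews into the same kind of padded integer sequences the models above expect (x_own is just a name introduced here; the 5000-word cap and length 400 mirror max_words and sequence_length from earlier):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokenizer = Tokenizer(num_words=5000)           # cap the vocabulary, as with max_words above
tokenizer.fit_on_texts(df.raw)
encoded = tokenizer.texts_to_sequences(df.raw)  # list of integer sequences, one per review
x_own = sequence.pad_sequences(encoded, maxlen=400, padding="post", truncating="post")
x_own.shape  # one padded row of 400 word IDs per review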