In [1]:
using Alice
In [2]:
Base.Threads.nthreads()
Out[2]:
The data is stored in the demo folder of the Alice package in .jld format. Load the data using the load_ngrams function.
There are 4 sets of data in the Dict: the training, validation and test n-gram arrays, plus the vocabulary.
Each column of the training, validation and test data arrays is a four-gram, and each four-gram is expressed as integer indices into the vocabulary. For example, the column vector [193, 26, 249, 38] is the four-gram made up of the 193rd, 26th, 249th and 38th words of the vocabulary, in that order.
In [3]:
train_data, valid_data, test_data, vocab = load_ngrams();
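With the data loaded, the format described above can be checked directly by mapping one column back to words (a minimal illustration; the choice of column 1 is arbitrary):

first_ngram_ids = train_data[:, 1]          # a 4-element vector of vocabulary indices
first_ngram_words = vocab[first_ngram_ids]  # the corresponding four words, in order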
In [4]:
function display_rand_ngrams(data, vocab, num_display)
    num_ngrams = size(data, 2)
    # Pick num_display random columns and map their word indices back to words
    displaywords = vocab[data[:, rand(1:num_ngrams, num_display)]]
    for ngram in 1:num_display
        str = ""
        for w in displaywords[:, ngram]
            str *= "$w "
        end
        @printf("| %-25s", str)
        # Close the row after every fourth n-gram
        mod(ngram, 4) == 0 && @printf("|\n")
    end
end;
In [5]:
display_rand_ngrams(train_data, vocab, 28)
Note that "words" include punctuation marks e.g. full stop, comma, colon and some words are split e.g. "didn't" is split into "did" and "nt". So the n-grams haven't been selected as particlularly representative of characteristics of the words. It also doesn't look (to me) like a sequence of four words is enough to really convey meaning.
But we are just going to press on and see if the volume of data (i.e. 372,550 four-grams to train on) is enough for the model to find meaningful structure.
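A quick sanity check on that volume (assuming the data was loaded as above):

size(train_data)   # should be (4, 372550): four words per n-gram, 372,550 training n-grams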
In [6]:
# The first three words of each four-gram are the input; the fourth word is the prediction target
train_input = train_data[1:3, :]
train_target = train_data[4, :]
val_input = valid_data[1:3, :]
val_target = valid_data[4, :]
test_input = test_data[1:3, :]
test_target = test_data[4, :];
The 2nd layer (1st hidden layer) is a word embedding layer that creates a feature vector for each input word. These feature vectors become the model's "understanding" of the characteristics of each word. A key feature of this model is that no word characteristics are explicitly given to the model, e.g. we don't tell the model that a particular word is a verb or that a particular word relates to sports. Any characteristics of the words are learned by the model from the context provided.
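Conceptually, the embedding layer is just a table lookup: a matrix holding one learned feature vector per vocabulary word, from which the layer selects the vectors for the three input words. A rough sketch of that idea (the names, matrix orientation and random initialisation below are illustrative assumptions, not Alice's internals):

num_feats_demo = 50                                         # same size as num_feats used below
E = 0.01f0 * randn(Float32, num_feats_demo, length(vocab))  # one column of features per vocabulary word
ngram = train_input[:, 1]                                   # three vocabulary indices
feats = E[:, ngram]                                         # 50 × 3 block of feature vectors passed on to the next layer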
In [7]:
# Set seed so that we can replicate results
srand(1234)
# Counts
num_words = 3
vocab_size = length(vocab)
# Data Container
databox = Data(train_input, train_target, val_input, val_target)
# Input Layer
batch_size = 100
input = InputLayer(databox, batch_size)
# Word Embedding Layer
num_feats = 50
embed = WordEmbeddingLayer(Float32, size(input), vocab_size, num_feats)
# Fully Connected 1
fc_dim = 200
fc = FullyConnectedLayer(Float32, size(embed), fc_dim, init = Normal(0, 0.01))
# Softmax output
output = SoftmaxOutputLayer(Float32, databox, size(fc), vocab_size, init = Normal(0, 0.01))
# Build Network
net = NeuralNet(databox, [input, embed, fc, output])
Out[7]:
In [8]:
# Hyper parameters
α = 0.1 # learning rate
μ = 0.9 # momentum parameter
num_epochs = 10 # total number of epochs
# Train
train(net, num_epochs, α, μ, nesterov = false, shuffle = false, last_train_every = 1, full_train_every = 5, val_every = 5)
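For reference, μ is the classical momentum coefficient. The standard (non-Nesterov) momentum update keeps a velocity array alongside each weight array; a sketch of the textbook rule, not necessarily Alice's exact implementation:

# W: weights, v: velocity (same shape as W), ∇W: gradient of the loss w.r.t. W
function momentum_step!(W, v, ∇W, α, μ)
    v .= μ .* v .- α .* ∇W   # decay the previous velocity and add the new gradient step
    W .= W .+ v              # move the weights along the updated velocity
    return W
end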
In [9]:
display_nearest_words(embed, vocab, "five", 5)
In [10]:
display_nearest_words(embed, vocab, "night", 5)
In [11]:
predict_next_word(net, vocab, ("john", "is", "the"), 10)
In [12]:
for w in sort(vocab)
    print("$w, ")
end