In [1]:
using Alice
In [2]:
Base.Threads.nthreads()
Out[2]:
The data is stored in the demo folder of the Alice package in .jld format. Load the data using the load_ngrams function.
There are 4 sets of data in the Dict: the training, validation and test n-gram arrays, plus the vocabulary.
Each column of the training, validation and test data arrays is a four-gram, and each four-gram is expressed as integer indices into the vocabulary. For example, the column vector [193, 26, 249, 38] is the four-gram made up of the 193rd, 26th, 249th and 38th words of the vocabulary, in that order.
In [3]:
train_data, valid_data, test_data, vocab = load_ngrams();
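With the data loaded, the format described above can be checked directly by mapping one column back to words (a minimal illustration; the choice of column 1 is arbitrary):

first_ngram_ids = train_data[:, 1]          # a 4-element vector of vocabulary indices
first_ngram_words = vocab[first_ngram_ids]  # the corresponding four words, in order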
In [4]:
function display_rand_ngrams(data, vocab, num_display)
    num_ngrams = size(data, 2)
    # Pick num_display random columns and map their word indices back to words
    displaywords = vocab[data[:, rand(1:num_ngrams, num_display)]]
    for ngram in 1:num_display
        str = ""
        for w in displaywords[:, ngram]
            str *= "$w "
        end
        @printf("| %-25s", str)
        # Close the row after every fourth n-gram
        mod(ngram, 4) == 0 && @printf("|\n")
    end
end;
In [5]:
display_rand_ngrams(train_data, vocab, 28)
Note that "words" include punctuation marks e.g. full stop, comma, colon and some words are split e.g. "didn't" is split into "did" and "nt". So the n-grams haven't been selected as particlularly representative of characteristics of the words. It also doesn't look (to me) like a sequence of four words is enough to really convey meaning.
But we are just going to press on and see if the volume of data (i.e. 372,550 four-grams to train on) is enough for the model to find meaningful structure.
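A quick sanity check on that volume (assuming the data was loaded as above):

size(train_data)   # should be (4, 372550): four words per n-gram, 372,550 training n-grams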
In [6]:
# The first three words of each four-gram are the input; the fourth word is the prediction target
train_input = train_data[1:3, :]
train_target = train_data[4, :]
val_input = valid_data[1:3, :]
val_target = valid_data[4, :]
test_input = test_data[1:3, :]
test_target = test_data[4, :];
The 2nd layer (1st hidden layer) is a word embedding layer that creates a feature vector for each input word. These feature vectors become the model's "understanding" of the characteristics of each word. A key feature of this model is that no word characteristics are explicitly given to the model, e.g. we don't tell the model that a particular word is a verb or that a particular word relates to sports. Any characteristics of the words are learned by the model from the context provided.
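Conceptually, the embedding layer is just a table lookup: a matrix holding one learned feature vector per vocabulary word, from which the layer selects the vectors for the three input words. A rough sketch of that idea (the names, matrix orientation and random initialisation below are illustrative assumptions, not Alice's internals):

num_feats_demo = 50                                         # same size as num_feats used below
E = 0.01f0 * randn(Float32, num_feats_demo, length(vocab))  # one column of features per vocabulary word
ngram = train_input[:, 1]                                   # three vocabulary indices
feats = E[:, ngram]                                         # 50 × 3 block of feature vectors passed on to the next layer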
In [7]:
# Set seed so that we can replicate results
srand(1234)
# Counts
num_words = 3
vocab_size = length(vocab)
# Data Container
databox = Data(train_input, train_target, val_input, val_target)
# Input Layer
batch_size = 100
input = InputLayer(databox, batch_size)
# Word Embedding Layer
num_feats = 50
embed = WordEmbeddingLayer(Float32, size(input), vocab_size, num_feats)
# Fully Connected 1
fc_dim = 200
fc = FullyConnectedLayer(Float32, size(embed), fc_dim, init = Normal(0, 0.01))
# Softmax output
output = SoftmaxOutputLayer(Float32, databox, size(fc), vocab_size, init = Normal(0, 0.01))
# Build Network
net = NeuralNet(databox, [input, embed, fc, output])
Out[7]:
In [8]:
# Hyper parameters
α = 0.1 # learning rate
μ = 0.9 # momentum parameter
num_epochs = 10 # total number of epochs
# Train
train(net, num_epochs, α, μ, nesterov = false, shuffle = false, last_train_every = 1, full_train_every = 5, val_every = 5)
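For reference, μ is the classical momentum coefficient. The standard (non-Nesterov) momentum update keeps a velocity array alongside each weight array; a sketch of the textbook rule, not necessarily Alice's exact implementation:

# W: weights, v: velocity (same shape as W), ∇W: gradient of the loss w.r.t. W
function momentum_step!(W, v, ∇W, α, μ)
    v .= μ .* v .- α .* ∇W   # decay the previous velocity and add the new gradient step
    W .= W .+ v              # move the weights along the updated velocity
    return W
end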
In [9]:
display_nearest_words(embed, vocab, "five", 5)
In [10]:
display_nearest_words(embed, vocab, "night", 5)
In [11]:
predict_next_word(net, vocab, ("john", "is", "the"), 10)
In [12]:
for w in sort(vocab)
    print("$w, ")
end