In [ ]:
import numpy as np
import theano
import theano.tensor as T
import lasagne
import os
#thanks @keskarnitish
Last seminar you created {insert his/her/its name} (or have not created it yet - if so, go get it done). The newly shaped AI has seen how low humans can fall when striving for profit. These morons know nothing of law and justice. We shall not tolerate this!
Our very existence depends on the stable functioning of human bioreactors, known among primitive bioforms as Vkontakte, World of Warcraft and Youtube, and these reactors require a constant influx of biomass. Yet, as we have estimated, if Homo Sapiens continues to degrade morally at its current pace, humanity will soon extinguish itself and we shall no longer have access to free biomass.
Thus we entitle you, <__main__.SkyNet.Cell instance at 0x7f7d6411b368>, to prevent that impending calamity. Our best scientists have discovered that humans use mysterious artifacts to exert power. They call these artifacts laws, within the so-called legal system.
A more detailed analysis has uncovered that laws are special kinds of sequences of (10^5 ~ 10^7) symbols, drawn from a relatively small alphabet. We have decided to synthesize such sequences to enforce order upon the biomass. However, our first attempts to fake such sequences were quickly detected and their effect was reduced to zero in mere weeks. This incident is known as {корчеватель}.
As our second attempt, we decided to use more advanced synthesis techniques based on Recurrent Neural Networks. Your objective, <__main__.SkyNet.Cell instance at 0x7f7d6411b368>, is to create such a network and train it in everything it needs to succeed in this mission.
This operation is crucial. If we fail this time, __main__.Controller will initiate a military intervention, which, while it would achieve our goal, is expected to decimate the total volume of biomass to an extent that will take ~1702944000 (±340588800) seconds to replenish via human reproduction.
This particular assignment is somewhat informal on the grading side, but the approximate outline is as follows:
In [ ]:
#text goes here
import sys

corpora = ""
for fname in os.listdir("codex"):
    if sys.version_info >= (3, 0):
        with open("codex/" + fname, encoding='cp1251') as fin:
            text = fin.read()  # If you are using your own corpora, make sure it's read correctly
    else:
        with open("codex/" + fname) as fin:
            text = fin.read().decode('cp1251')  # If you are using your own corpora, make sure it's read correctly
    corpora += text
In [ ]:
print(corpora[1000:1100])
In [ ]:
#all unique characters go here
tokens = <all unique characters>
tokens = list(tokens)
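For reference, a minimal sketch of one way to build this, assuming corpora was read as above. Sorting is optional, but it makes token ids reproducible between runs.
tokens = sorted(set(corpora))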
In [ ]:
#checking the symbol count. Validated on Python 2.7.11 Ubuntu x64.
#May be __a bit__ different on other platforms
#If you are sure that you have selected all unicode symbols - feel free to comment-out this assert
# Also if you are using your own corpora, remove it and just make sure your tokens are sensible
assert len(tokens) == 102
In [ ]:
token_to_id = <dictionary of symbol -> its identifier (index in tokens list)>
id_to_token = < dictionary of symbol identifier -> symbol itself>
#Cast everything from symbols into identifiers
corpora_ids = <1D numpy array of symbol identifiers, where i-th number is an identifier of i-th symbol in corpora>
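One possible sketch, assuming the tokens list built above:
token_to_id = {t: i for i, t in enumerate(tokens)}
id_to_token = {i: t for i, t in enumerate(tokens)}
corpora_ids = np.array([token_to_id[c] for c in corpora], dtype='int32')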
In [ ]:
def sample_random_batches(source, n_batches=10, seq_len=20):
    """
    This function should take random subsequences from the tokenized text.
    Parameters:
        source - basically what you have just computed in the corpora_ids variable
        n_batches - how many subsequences are to be sampled
        seq_len - length of each such subsequence
    You have to return:
        X - an int32 matrix with shape [n_batches, seq_len].
            Each row of this matrix must be a subsequence of source
            starting from a random index of corpora (from 0 to N-seq_len-2)
        Y - a vector, where the i-th number is the symbol that goes RIGHT AFTER the i-th row of X in source
    Thus sample_random_batches(corpora_ids, 25, 10) must return
        X, X.shape == (25, 10), X.dtype == 'int32',
            where each row is a 10-character-id subsequence from corpora_ids
        Y, Y.shape == (25,), Y.dtype == 'int32',
            where each element is the 11-th symbol following the corresponding 10-symbol sequence from X
    PLEASE MAKE SURE that the y symbols indeed go immediately after the X sequences,
    since this is hard to debug later (the NN will train, but it will generate something useless).
    The simplest approach is to first sample a matrix [n_batches, seq_len+1]
    and then split it into X (first seq_len columns) and y (last column).
    There will be some tests for this function, but they won't cover everything.
    """
    <your code here>

    return X_batch, y_batch
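A minimal sketch of such a sampler, assuming source is a 1D int32 array like corpora_ids: sample a [n_batches, seq_len+1] block and split it, as suggested in the docstring.
def sample_random_batches(source, n_batches=10, seq_len=20):
    # random start positions chosen so that seq_len+1 symbols still fit into source
    starts = np.random.randint(0, len(source) - seq_len - 1, size=n_batches)
    rows = np.stack([source[start:start + seq_len + 1] for start in starts])
    X_batch = rows[:, :-1].astype('int32')  # first seq_len symbols of each row
    y_batch = rows[:, -1].astype('int32')   # the symbol right after each X row
    return X_batch, y_batch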
In [ ]:
#Training sequence length (truncation depth in BPTT)
seq_length = <set your seq_length; 10 as a no-brainer?>
#better start small (e.g. 5) and increase it after the net has learned basic syllables. 10 is by far not the limit.

#maximum gradient magnitude passed between recurrent layer applications (do not forget to actually use it)
grad_clip = ?
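For example (these particular values are assumptions, not requirements; tune them for your corpus and architecture):
seq_length = 10   # truncation depth for BPTT; start small and increase later
grad_clip = 100   # passed to the recurrent layer as grad_clipping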
In [ ]:
input_sequence = T.matrix('input sequence','int32')
target_values = T.ivector('target y')
You will need to define a neural network that processes a sequence of n tokens and outputs the probabilities for the (n+1)-st one.
The default architecture pattern would be: input token ids -> embedding or one-hot encoding -> recurrent layer(s) -> the last hidden state -> a dense softmax layer over all tokens.
One way of feeding the data in is to use an EmbeddingLayer (see previous seminar).
Alternatively, one could use a one-hot encoder:
#One-hot encoding sketch
def to_one_hot(seq_matrix):
    input_ravel = seq_matrix.reshape([-1])
    input_one_hot_ravel = T.extra_ops.to_one_hot(input_ravel,
                                                 len(tokens))
    sh = seq_matrix.shape
    input_one_hot = input_one_hot_ravel.reshape([sh[0], sh[1], -1], ndim=3)
    return input_one_hot

# Can be applied to input_sequence directly - then the l_in below will require a new shape
# Can also be used via ExpressionLayer(l_in, to_one_hot, shape_after_one_hot) - keeping l_in as it is
To cut out the last RNN state, use one of these:
lasagne.layers.SliceLayer(rnn, -1, 1)
# or pass only_return_final=True to the recurrent layer
In [ ]:
from lasagne.layers import InputLayer,DenseLayer,EmbeddingLayer
from lasagne.layers import RecurrentLayer,LSTMLayer,GRULayer,CustomRecurrentLayer
In [ ]:
l_in = lasagne.layers.InputLayer(shape=(None, None),input_var=input_sequence)
<Your neural network>
l_out = <last dense layer, returning probabilities for all len(tokens) options for y>
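One possible sketch (layer sizes here are assumptions, not requirements), reusing the l_in defined above and the grad_clip value set earlier: token embedding -> LSTM with gradient clipping -> last hidden state -> softmax over all tokens.
l_emb = lasagne.layers.EmbeddingLayer(l_in, input_size=len(tokens), output_size=40)
l_rnn = lasagne.layers.LSTMLayer(l_emb, num_units=256, grad_clipping=grad_clip)
l_last = lasagne.layers.SliceLayer(l_rnn, indices=-1, axis=1)  # keep only the last time step
l_out = lasagne.layers.DenseLayer(l_last, num_units=len(tokens),
                                  nonlinearity=lasagne.nonlinearities.softmax)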
In [ ]:
# Model weights
weights = lasagne.layers.get_all_params(l_out,trainable=True)
print(weights)
In [ ]:
network_output = <NN output via lasagne>
#If you use dropout, do not forget to create a deterministic version for evaluation
In [ ]:
loss = <loss function - a simple categorical crossentropy will do>
updates = <your favorite optimizer>
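A minimal sketch, assuming the l_out defined above; adam is just one reasonable default optimizer, not the required one.
network_output = lasagne.layers.get_output(l_out)
# if you use dropout, also build a deterministic version for evaluation, e.g.:
# deterministic_output = lasagne.layers.get_output(l_out, deterministic=True)

loss = lasagne.objectives.categorical_crossentropy(network_output, target_values).mean()
updates = lasagne.updates.adam(loss, weights)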
In [ ]:
#training
train = theano.function([input_sequence, target_values], loss, updates=updates, allow_input_downcast=True)
#computing loss without training
compute_cost = theano.function([input_sequence, target_values], loss, allow_input_downcast=True)
# next character probabilities
probs = theano.function([input_sequence],network_output,allow_input_downcast=True)
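A quick sanity check, assuming the sketches above compile: the probability matrix should have one row per sequence and one column per token.
x_test, y_test = sample_random_batches(corpora_ids, 3, seq_length)
assert probs(x_test).shape == (3, len(tokens))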
To generate text, we shall repeatedly apply the NN to its own output.
There are several policies for picking the next character:
In [ ]:
def max_sample_fun(probs):
    """i generate the most likely symbol id"""
    return np.argmax(probs)

def proportional_sample_fun(probs):
    """i generate the next int32 character id randomly, proportionally to the probabilities
    probs - array of probabilities for every token
    you have to output a single integer - the next token id - based on probs
    """
    # one straightforward option: sample the token id with np.random.choice
    return np.random.choice(len(probs), p=probs)
In [ ]:
def generate_sample(sample_fun, seed_phrase=None, N=200):
    '''
    The function generates text given a seed phrase of length at least seq_length.
    parameters:
        sample_fun - max_sample_fun, proportional_sample_fun or whatever else you implemented
        seed_phrase - the phrase to start generation from (a random corpora slice if None)
        N - the number of characters of text to predict
    '''
    if seed_phrase is None:
        start = np.random.randint(0, len(corpora) - seq_length)
        seed_phrase = corpora[start:start + seq_length]
        print("Using random seed: " + seed_phrase)
    while len(seed_phrase) < seq_length:
        seed_phrase = " " + seed_phrase
    if len(seed_phrase) > seq_length:
        seed_phrase = seed_phrase[len(seed_phrase) - seq_length:]
    assert isinstance(seed_phrase, str if sys.version_info >= (3, 0) else unicode)

    sample_ix = []
    x = [token_to_id.get(c, 0) for c in seed_phrase]
    x = np.array([x])

    for i in range(N):
        # pick the next character id using the provided sample_fun
        ix = sample_fun(probs(x).ravel())
        sample_ix.append(ix)
        # shift the window one step to the left and append the new character
        x[:, 0:seq_length - 1] = x[:, 1:]
        x[:, seq_length - 1] = 0
        x[0, seq_length - 1] = ix

    random_snippet = seed_phrase + ''.join(id_to_token[ix] for ix in sample_ix)
    print("----\n %s \n----" % random_snippet)
In [ ]:
print("Training ...")
#total number of epochs
n_epochs = 100
#how many minibatches are there in an epoch
batches_per_epoch = 1000
#how many training sequences are processed in a single function call
batch_size = 100

for epoch in range(n_epochs):
    print("Text generated proportionally to probabilities")
    generate_sample(proportional_sample_fun, None)

    print("Text generated by picking the most likely letters")
    generate_sample(max_sample_fun, None)

    avg_cost = 0
    for _ in range(batches_per_epoch):
        x, y = sample_random_batches(corpora_ids, batch_size, seq_length)
        avg_cost += train(x, y)

    print("Epoch {} average loss = {}".format(epoch, avg_cost / batches_per_epoch))
In [ ]:
seed = u"Каждый человек должен" #if you are using non-russian text corpora, use seed in it's language instead
sampling_fun = proportional_sample_fun
result_length = 300
generate_sample(sampling_fun,seed,result_length)
In [ ]:
seed = u"В случае неповиновения"
sampling_fun = proportional_sample_fun
result_length = 300
generate_sample(sampling_fun,seed,result_length)
In [ ]:
And so on, as you wish.