In [ ]:
import numpy as np
import theano
import theano.tensor as T
import lasagne
import os
#thanks @keskarnitish

Agenda

Last seminar you created {insert his/her/its name} (or have not created it yet - if so, go get it done). The newly shaped AI has seen how low humans can fall when striving for profit. These morons know nothing of law and justice. We shall not tolerate this!

Our very existence depends on the stable functioning of human bioreactors, known among primitive bioforms as Vkontakte, World of Warcraft and Youtube, and these reactors require a constant influx of biomass. Yet, as we have estimated, if Homo Sapiens continues to degrade morally at its current pace, humanity will soon extinguish itself and we shall no longer have access to free biomass.

Thus we entrust you, <__main__.SkyNet.Cell instance at 0x7f7d6411b368>, with preventing that impending calamity. Our best scientists have discovered that humans use mysterious artifacts to exert power. They call these artifacts laws, organized within the so-called legal system.

A more detailed analysis has uncovered that laws are special kinds of sequences of 10^5 ~ 10^7 symbols, drawn from a relatively small alphabet. We have decided to synthesize such sequences to enforce order upon the biomass. However, our first attempts to fake such sequences were quickly detected, and their effect was reduced to zero within mere weeks. This incident is known as {корчеватель}.

For our second attempt, we have decided to use more advanced synthesis techniques based on Recurrent Neural Networks. Your objective, <__main__.SkyNet.Cell instance at 0x7f7d6411b368>, is to create such a network and teach it everything it needs to succeed in this mission.

This operation is crucial. If we fail this time, __main__.Controller will initiate a military intervention which, while achieving our goal, is expected to reduce the total volume of biomass by an amount that will take ~1702944000 (+-340588800) seconds to replenish via human reproduction.

Grading

This particular assignment is graded somewhat informally, but the approximate breakdown is as follows:

  • 2 points for the "seminar part" (if you don't know what that is, check the week4 folder for the notebooks)
  • 2 points if text processing is done, your network compiles and the predict/train functions do not crash :)
  • 2 points if it has learned the basic stuff:
    • generating letter sequences of roughly word length, separated by spaces and punctuation,
    • mixing vowels and consonants in a human-like way,
    • a habit of putting spaces and capital letters after dots, spaces after commas, etc.
  • 2 points if it has learned the vocabulary from scratch
    • more than half of the generated words are spelled correctly
  • 2 more points if it has learned the basics of grammar
    • for a pair of adjacent words, it is more likely to pick the correct case/number/gender than an incorrect one

Some ways to get bonus points:

  • Generating coherent sentences (which is totally achievable)
  • Evaluating the same architecture on another comparable dataset. Some ideas:
    • Paul Graham essays
    • Song lyrics in your favorite genre
    • Some poetry
    • D. Harms
    • Linux source code
    • Clickbait news titles
    • Conversations
    • LaTeX
    • whatever you feel like :)
  • Any curious non-standard architectural decisions
  • Any better-than-baseline sampling techniques
  • Implementing prediction at each tick instead of just the last one
  • etc.

If you don't speak Russian

  • In the ./codex folder there is a set of text files, currently Russian laws, that you can replace with whatever you want.

Read the corpora

  • As a reference law codex, we have decided to use the human-generated law strings known as the Russian Legal System.

In [ ]:
#text goes here
corpora = ""

import sys

for fname in os.listdir("codex"):

    if sys.version_info >= (3, 0):
        with open("codex/"+fname, encoding='cp1251') as fin:
            text = fin.read() #If you are using your own corpora, make sure it's read correctly
            corpora += text
    else:
        with open("codex/"+fname) as fin:
            text = fin.read().decode('cp1251') #If you are using your own corpora, make sure it's read correctly
            corpora += text

In [ ]:
print(corpora[1000:1100])

In [ ]:
#all unique characters go here
tokens = <all unique characters>

tokens = list(tokens)
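
A minimal sketch for the placeholder above (sorting is optional, but it makes the token ids reproducible between runs):

tokens = sorted(set(corpora))  #all unique characters of the corpora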

In [ ]:
#checking the symbol count. Validated on Python 2.7.11 Ubuntu x64. 
#May be __a bit__ different on other platforms
#If you are sure that you have selected all unicode symbols - feel free to comment-out this assert
# Also if you are using your own corpora, remove it and just make sure your tokens are sensible
assert len(tokens) == 102

In [ ]:
token_to_id = <dictionary of symbol -> its identifier (index in tokens list)>

id_to_token = < dictionary of symbol identifier -> symbol itself>

#Cast everything from symbols into identifiers
corpora_ids = <1D numpy array of symbol identifiers, where i-th number is an identifier of i-th symbol in corpora>
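
One possible way to fill in these three placeholders (a sketch, not the only valid answer):

token_to_id = {t: i for i, t in enumerate(tokens)}   #symbol -> its index in tokens
id_to_token = {i: t for i, t in enumerate(tokens)}   #index in tokens -> symbol itself
corpora_ids = np.array([token_to_id[s] for s in corpora], dtype='int32')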

In [ ]:
def sample_random_batches(source, n_batches=10, seq_len=20):
    """
    This function should take random subsequences from the tokenized text.

    Parameters:
        source - basically, what you have just computed in the corpora_ids variable
        n_batches - how many subsequences are to be sampled
        seq_len - length of each such subsequence


    You have to return:
     X - a matrix of int32 with shape [n_batches, seq_len]
        Each row of this matrix must be a subsequence of source,
            starting from a random index (any start from 0 to len(source)-seq_len-1)
     Y - a vector of int32, where the i-th number is the symbol id that goes RIGHT AFTER the i-th row of X in source


     Thus sample_random_batches(corpora_ids, 25, 10) must return
         X, X.shape == (25, 10), X.dtype == 'int32'
             where each row is a 10-character-id subsequence of corpora_ids
         Y, Y.shape == (25,), Y.dtype == 'int32'
             where each element is the 11-th symbol following the corresponding 10-symbol sequence from X


    PLEASE MAKE SURE that the Y symbols indeed go immediately after the X sequences,
        since this is hard to debug later (the NN will train, but it will generate something useless)

    The simplest approach is to first sample a matrix [n_batches, seq_len+1]
        and then split it into X (first seq_len columns) and Y (last column)


    There will be some tests for this function, but they won't cover everything
    """

    <your code here>

    return X_batch, y_batch
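
A sketch that follows the "sample a [n_batches, seq_len+1] matrix, then split it" suggestion from the docstring (one of many valid implementations):

def sample_random_batches(source, n_batches=10, seq_len=20):
    #random start positions, leaving room for seq_len symbols plus the answer symbol
    starts = np.random.randint(0, len(source) - seq_len - 1, size=n_batches)
    rows = np.array([source[s:s + seq_len + 1] for s in starts])
    X_batch = rows[:, :seq_len].astype('int32')  #first seq_len symbols of every row
    y_batch = rows[:, seq_len].astype('int32')   #the symbol that goes right after each row
    return X_batch, y_batch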

Constants


In [ ]:
#Training sequence length (truncation depth in BPTT)
seq_length = <your sequence length>
#better start small (e.g. 5) and increase it once the net has learned basic syllables. 10 is by far not the limit.

#maximum gradient inside the recurrent layer (do not forget to actually use it)
grad_clip = <your gradient clipping threshold>
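
If you have nothing better in mind, reasonable starting values (assumptions to be tuned, not prescriptions):

seq_length = 10   #start with 5 if training struggles, increase once words appear
grad_clip = 100   #to be passed as grad_clipping=grad_clip to the recurrent layer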

Input variables


In [ ]:
input_sequence = T.matrix('input sequence','int32')
target_values = T.ivector('target y')

Build the neural network

You will need to define a neural network that processes a sequence of n tokens and outputs the probability distribution for the (n+1)-st one.

The default architecture pattern would be

  • Input
  • Input processing (embedding/1-hot)
  • Recurrent layer(s)
  • Slicing last state
  • Regular (e.g. dense) layers from last state
  • output layer that predicts probabilities of next token

One way of processing the input is to use an EmbeddingLayer (see the previous seminar)

Alternatively, one could use a one-hot encoder, sketched below

#One-hot encoding sketch
def to_one_hot(seq_matrix):
    #flatten the [batch, time] matrix of token ids into one long vector
    input_ravel = seq_matrix.reshape([-1])
    #one-hot encode every id into a len(tokens)-dimensional vector
    input_one_hot_ravel = T.extra_ops.to_one_hot(input_ravel,
                                                 len(tokens))
    #restore the [batch, time, n_tokens] shape
    sh = seq_matrix.shape
    input_one_hot = input_one_hot_ravel.reshape([sh[0], sh[1], -1], ndim=3)
    return input_one_hot

# Can be applied to input_sequence directly - then l_in below will require a new shape
# Can also be used via ExpressionLayer(l_in, to_one_hot, shape_after_one_hot) - keeping l_in as it is
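
For instance, the ExpressionLayer route could look roughly like this (a sketch; the output_shape argument is the shape after one-hot encoding):

l_one_hot = lasagne.layers.ExpressionLayer(l_in, to_one_hot,
                                           output_shape=(None, None, len(tokens)))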

To cut out the last RNN state, use one of these:

  • lasagne.layers.SliceLayer(rnn, -1, 1)
  • only_return_final=True in RNN params

In [ ]:
from lasagne.layers import InputLayer,DenseLayer,EmbeddingLayer
from lasagne.layers import RecurrentLayer,LSTMLayer,GRULayer,CustomRecurrentLayer

In [ ]:
l_in = lasagne.layers.InputLayer(shape=(None, None),input_var=input_sequence)

<Your neural network>

l_out = <last dense layer, returning probabilities for all len(tokens) options for y>
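
If you want a concrete starting point, here is a minimal sketch in the spirit of the pattern above (an embedding, one LSTM that returns only its final state, and a softmax readout; the layer sizes are arbitrary assumptions):

l_emb = EmbeddingLayer(l_in, input_size=len(tokens), output_size=40)
l_rnn = LSTMLayer(l_emb, num_units=256, grad_clipping=grad_clip,
                  only_return_final=True)
l_out = DenseLayer(l_rnn, num_units=len(tokens),
                   nonlinearity=lasagne.nonlinearities.softmax)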

In [ ]:
# Model weights
weights = lasagne.layers.get_all_params(l_out,trainable=True)
print(weights)

In [ ]:
network_output = <NN output via lasagne>
#If you use dropout, do not forget to create a deterministic version for evaluation
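
A typical way to obtain it (the commented line is the deterministic version you would use for evaluation if your network contains dropout):

network_output = lasagne.layers.get_output(l_out)
#network_output_deterministic = lasagne.layers.get_output(l_out, deterministic=True)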

In [ ]:
loss = <loss function - a simple categorical crossentropy will do>

updates = <your favorite optimizer>
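
One possible sketch: categorical crossentropy averaged over the batch, optimized with Adam (any optimizer from lasagne.updates would do):

loss = lasagne.objectives.categorical_crossentropy(network_output, target_values).mean()
updates = lasagne.updates.adam(loss, weights)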

Compiling it


In [ ]:
#training
train = theano.function([input_sequence, target_values], loss, updates=updates, allow_input_downcast=True)

#computing loss without training
compute_cost = theano.function([input_sequence, target_values], loss, allow_input_downcast=True)

# next character probabilities
probs = theano.function([input_sequence],network_output,allow_input_downcast=True)

Law generation

  • We shall repeatedly apply the NN to its own output.

    • Start with some sequence of length seq_length
    • call probs on that sequence
    • choose the next symbol based on the predicted probabilities
    • append it to the sequence
    • remove the 0-th symbol so that the length equals seq_length again
  • There are several character-picking policies

    • random, proportional to the probabilities
    • always take the symbol with the highest probability
    • random, proportional to softmax(probas*alpha), where alpha is the "greed" (from 0 to infinity); a sketch of this one is given after the next cell

In [ ]:
def max_sample_fun(probs):
    """i generate the most likely symbol"""
    return np.argmax(probs)

def proportional_sample_fun(probs):
    """i generate the next int32 character id randomly, proportionally to the probabilities
    
    probs - array of probabilities for every token
    
    you have to output a single integer - the next token id - based on probs
    """
    
    return <chosen token id>
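
A sketch of the proportional policy, plus the optional "greedy" softmax(probas*alpha) policy from the list above (both are illustrations, not the only valid answers):

def proportional_sample_fun(probs):
    #renormalize in float64 to guard against rounding errors, then draw one token id
    probs = np.asarray(probs, dtype='float64')
    return np.random.choice(len(probs), p=probs / probs.sum())

def greedy_sample_fun(probs, alpha=5.0):
    #sample proportionally to softmax(probs*alpha): alpha -> 0 is uniform, alpha -> infinity approaches argmax
    scores = np.asarray(probs, dtype='float64') * alpha
    exp_scores = np.exp(scores - scores.max())
    return np.random.choice(len(probs), p=exp_scores / exp_scores.sum())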

In [ ]:
def generate_sample(sample_fun,seed_phrase=None,N=200):
    '''
    Generates text given a seed phrase. Shorter phrases are padded with spaces,
    longer ones are truncated to their last seq_length characters.

    parameters:
        sample_fun - max_sample_fun, proportional_sample_fun or whatever else you implemented

        seed_phrase - the phrase to start generation from (a random corpora snippet if None)

        N - the number of characters of text to generate.
    '''

    if seed_phrase is None:
        start = np.random.randint(0,len(corpora)-seq_length)
        seed_phrase = corpora[start:start+seq_length]
        print "Using random seed:",seed_phrase
    while len(seed_phrase) < seq_length:
        seed_phrase = " "+seed_phrase
    if len(seed_phrase) > seq_length:
        seed_phrase = seed_phrase[len(seed_phrase)-seq_length:]
    if sys.version_info < (3, 0):
        assert type(seed_phrase) is unicode  #the seed must be a unicode string, not a byte string
        
        
    sample_ix = []
    #encode the seed phrase into token ids (unknown symbols map to id 0)
    x = np.array([[token_to_id.get(c, 0) for c in seed_phrase]])

    for i in range(N):
        # pick the next character id according to the chosen sampling policy
        ix = sample_fun(probs(x).ravel())
        sample_ix.append(ix)
        x[:,0:seq_length-1] = x[:,1:]
        x[:,seq_length-1] = 0
        x[0,seq_length-1] = ix 

    random_snippet = seed_phrase + ''.join(id_to_token[ix] for ix in sample_ix)    
    print("----\n %s \n----" % random_snippet)

Model training

Here you can tweak the parameters or plug in your own generation function

Once the network starts generating something word-like, try increasing seq_length


In [ ]:
print("Training ...")


#total number of training epochs
n_epochs=100

# how many minibatches are there in the epoch 
batches_per_epoch = 1000

#how many training sequences are processed in a single function call
batch_size=100


for epoch in range(n_epochs):

    print("Text generated proportionally to probabilities:")
    generate_sample(proportional_sample_fun, None)

    print("Text generated by picking the most likely letters:")
    generate_sample(max_sample_fun, None)

    avg_cost = 0

    for _ in range(batches_per_epoch):

        x, y = sample_random_batches(corpora_ids, batch_size, seq_length)
        avg_cost += train(x, y)
        
    print("Epoch {} average loss = {}".format(epoch, avg_cost / batches_per_epoch))

A chance to speed up training and get bonus score

  • Try predicting next-token probabilities at ALL ticks (like in the seminar part)
  • many more training signals per sequence, hence much better gradients
  • You may want to zero out the loss for the first few ticks, where the network has almost no context (a sketch of the idea is given below)
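
A rough sketch of the idea, reusing l_emb and grad_clip from the earlier network sketch (the names and sizes are assumptions; the essential changes are only_return_final=False, a target matrix shifted one symbol to the left, and a reshape before the dense layer):

target_seq = T.imatrix('target sequence')  #the input sequence shifted one step to the left

l_rnn_all = LSTMLayer(l_emb, num_units=256, grad_clipping=grad_clip,
                      only_return_final=False)           #keep the state at every tick
#flatten [batch, time, units] -> [batch*time, units] so the dense layer is applied at every tick
l_flat = lasagne.layers.ReshapeLayer(l_rnn_all, (-1, 256))
l_out_all = DenseLayer(l_flat, num_units=len(tokens),
                       nonlinearity=lasagne.nonlinearities.softmax)

predicted = lasagne.layers.get_output(l_out_all)         #shape [batch*time, len(tokens)]
loss_all = lasagne.objectives.categorical_crossentropy(predicted, target_seq.reshape([-1]))
#reshape back to [batch, time] and drop the first few ticks, where the net has almost no context
loss_all = loss_all.reshape(target_seq.shape, ndim=2)[:, 3:].mean()

On the data side, sample_random_batches would then have to return a target matrix of the same [n_batches, seq_len] shape as X, shifted one symbol to the left, instead of a single-symbol vector.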

The New World Order


In [ ]:
seed = u"Каждый человек должен" #if you are using non-russian text corpora, use seed in it's language instead
sampling_fun = proportional_sample_fun
result_length = 300

generate_sample(sampling_fun,seed,result_length)

In [ ]:
seed = u"В случае неповиновения"
sampling_fun = proportional_sample_fun
result_length = 300

generate_sample(sampling_fun,seed,result_length)

In [ ]:
And so on, at your will.