Wayne H Nixalo - 18 June 2017

CodeAlong of Lesson 6, from 1:29:00 onwards (link) --- note: this notebook is not meant to be run.

Coding in Theano is different from coding in plain Python, because Theano's job in life is to provide a way for you to describe a computation, compile it for the GPU, and run it there.
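
To see that workflow in isolation (a minimal sketch, not part of this notebook), we can describe a trivial symbolic computation and compile it into a callable function:

import theano
import theano.tensor as T

# describe the computation symbolically -- nothing is computed yet
a = T.scalar('a')
b = T.scalar('b')
# compile the graph into a callable; Theano targets the GPU if it's configured
add = theano.function([a, b], a + b)
add(2., 3.)  # -> array(5.0)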


In [ ]:
n_input = vocab_size
n_output = vocab_size

In [ ]:
def init_wgts(rows, cols):
    # Glorot-style scale for the random initial weights
    scale = math.sqrt(2.0/rows)
    return shared(normal(scale=scale, size=(rows, cols)).astype(np.float32))
def init_bias(rows):
    vec = np.zeros(rows, dtype=np.float32)
    return shared(vec)

For our hidden weights (the arrow in the diagram that loops back into the hidden state), we initialize the weight matrix as an identity matrix.


In [ ]:
def wgts_and_bias(n_in, n_out):
    return init_wgts(n_in, n_out), init_bias(n_out)
def id_and_bias(n):
    return shared(np.eye(n, dtype=np.float32)), init_bias(n)

We have to build up a computation-graph, a series of steps saying 'in the future I'm going to give you some data, and when I do I want you to do these steps...'

So we start off by describing the types of data we'll give it:


In [ ]:
# we'll give it some input data
t_inp = T.matrix('inp')
# some output data
t_outp = T.matrix('outp')
# give it some way of initializing the first hidden state
t_h0 = T.vector('h0')
# also give it a learning rate which we can change later
lr = T.scalar('lr')

# we create a list of all args we provide to Theano later
all_args = [t_h0, t_inp, t_outp, lr]

To create the weights & biases, up above we have a function called wgts_and_bias(..) in which we tell Theano the size of the matrix we want to create.

The matrix that goes from input to hidden has n_input rows and n_hidden columns.

wgts_and_bias returns a tuple of weights and biases.

To create the weights, we first calculate the Glorot number, sqrt(2/n) -- the scale of the random numbers we're going to use -- then we create those random numbers using the Numpy normal(..) random number function, and then we use a special Theano keyword called shared. shared(..) tells Theano that the data inside is something we want it to pass off to the GPU later and keep track of.

So once you wrap something in shared, it kind of belongs to Theano now.
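
As a small aside (a sketch, not from the notebook), this is roughly what 'belonging to Theano' looks like: a shared variable keeps its value between calls, and you read or write it through get_value / set_value:

import numpy as np
from theano import shared

w = shared(np.zeros(3, dtype=np.float32))   # hand the array off to Theano
w.get_value()                               # -> array([0., 0., 0.], dtype=float32)
w.set_value(np.ones(3, dtype=np.float32))   # Theano keeps track of the new value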


In [ ]:
# weights & bias to hidden layer
W_h = id_and_bias(n_hidden)
# weights & bias to input
W_x = wgts_and_bias(n_input, n_hidden)
# weights & bias to output
W_y = wgts_and_bias(n_hidden, n_output)
# stick all manually constructed weight matrices and bias vectors 
# in a list:
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

Python's itertools has a function, chain.from_iterable(..), that takes a list of tuples and flattens them into one big sequence (which we wrap in list(..)).
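
For instance (a quick illustration, not from the notebook):

from itertools import chain

# flatten a list of tuples into one flat list
list(chain.from_iterable([(1, 2), (3, 4), (5, 6)]))  # -> [1, 2, 3, 4, 5, 6]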

The next thing we have to do is tell Theano what happens each time we take a single step of this RNN.

There's no such thing as a for-loop on a GPU, because the GPU is built to parallelize & do multiple things at the same time.

There is something very similar to a for-loop that you can parallelize, it's called a scan operation.

A scan operation is one where you call some function (step) on every element of some sequence (t_inp); at every point the function returns some output, and the next time the function is called it receives the output of the previous call along with the next element of the sequence.


In [ ]:
# example of scan:
def scan(fn, start, seq):
    res = [] # array of results
    prev = start
    for s in seq:
        app = fn(prev, s) # apply function to previous result & next elem in sequence
        res.append(app)
        prev = app
    return res

scan(lambda prev,curr: prev+curr, 0, range(5))

# output:
# [0, 1, 3, 6, 10]

# the scan operation defines a cumulative sum.

# It is possible to write a parallel version of this.
# If you can turn your algorithm into a scan, you can run it quickly 
# on a GPU.
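
For comparison (not part of the notebook), NumPy's built-in cumulative sum gives the same answer as the scan above:

import numpy as np

np.cumsum(range(5))  # -> array([ 0,  1,  3,  6, 10])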

The function we'll call on each step through is a function called step. step takes the input x, does a dot-product with the weight matrix we created earlier, W_x, and adds on the bias term we created earlier, b_x. Then we do the same thing with our previous hidden state h, multiplying it by the hidden weight matrix W_h and adding the biases b_h; then we put the whole thing through the activation function, nnet.relu.

After we do that we want to create an output each time. The output is exactly the same kind of thing: it takes the resulting hidden state h, multiplies it by the output weight matrix W_y, and adds on the bias b_y; this time we put it through nnet.softmax.

At the end of that, we return the hidden state we have so far, and the output, T.flatten(y, 1).

All that happens each step.


In [ ]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    h = nnet.relu(T.dot(x, W_x) + b_x + T.dot(h, W_h) + b_h)
    y = nnet.softmax(T.dot(h, W_y) + b_y)
    return h, T.flatten(y, 1)

For the sequence we're passing into it (we're just describing a computation here), we tell Theano it will be a matrix: sequences=t_inp.

For the starting point, outputs_info=[t_h0, None], we tell Theano via t_h0 = T.vector('h0') that we'll provide an initial hidden state; the None means the second output, y, is not fed back into the next step.

Finally, in Theano you have to tell it about all the other things you'll pass to the function; we pass it the whole list of weights we created up above via non_sequences=w_all.


In [ ]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp,
                            outputs_info=[t_h0, None], non_sequences=w_all)

By this point we've described how to execute a whole sequence of steps for an RNN. We haven't given it any data yet; we've just set up the computation.

When that computation is run, it's going to return 2 things, because step returns 2 things: the hidden state v_h, and our output activations v_y.

Now we need to calculate our error. Using categorical cross-entropy, we compare the output of our scan, v_y, to the target matrix t_outp, then add it all up with .sum().

We want to apply SGD after every step, meaning we have to take the derivative of the error with respect to all the weights w_all, and use that along with the learning rate to update all the weights. Theano has a simple function call to do that: T.grad(..)


In [ ]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)
# "please tell me the gradient, `g_all`, of the function `error`, 
# wrt. these gradients: `w_all`"
# Theano will symbolically calculate all the derivatives for you.

We're ready now to build our final function. It takes as input all our arguments, all_args; it produces the error as its output; and at each step it applies some updates via upd_dict.

What upd_dict does is create a dictionary that maps every one of our weights w to that weight minus its gradient g times the learning rate lr.

w comes from wgts, g comes from grads.

So it updates every weight to itself minus its gradient times the learning rate.

What Theano does via updates is it says: every time you calculate the next step, I want you to change your shared variables as follows (and upd is the dictionary of changes to make).

And that's it.
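
To see updates in isolation (a standard Theano accumulator sketch, not part of this notebook): a shared counter that Theano changes every time the compiled function is called.

import theano
import theano.tensor as T
from theano import shared

state = shared(0)                # a shared counter, owned by Theano
inc = T.iscalar('inc')
accumulate = theano.function([inc], state, updates=[(state, state + inc)])

accumulate(1)    # returns the old state (0); the counter becomes 1
accumulate(10)   # returns 1; the counter becomes 11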


In [ ]:
def upd_dict(wgts, grads, lr):
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts, grads)})

In [ ]:
upd = upd_dict(w_all, g_all, lr)
fn = theano.function(all_args, error, updates=upd, allow_input_downcast=True)

We create our one-hot encoded X's and Y's, and we have to manually create our own loop; Theano has nothing built-in for this.

We go through every element of our input X, and call that function that we just created above, and pass in all its inputs, all_args, for initial hidden state t_h0, input t_inp, target t_outp, and learning rate lr.

Initial hidden state, a bunch of zeros: np.zeros(n_hidden).

The condition if i % 1000 == 999: ... just prints out the error every 1000 loops.

(we're using stochastic gradient descent with a minibatch size of 1)

gradient descent without the 'stochastic' means you're using a minibatch size of the whole dataset

so this is 'online' gradient descent
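
To make the terminology concrete (a toy sketch, not from the notebook, fitting y = 3x with a single weight w):

import numpy as np

# hypothetical toy data, just to illustrate the two extremes
rng = np.random.RandomState(0)
x = rng.randn(100).astype(np.float32)
y = 3 * x + 0.1 * rng.randn(100).astype(np.float32)
lr = 0.1

# full-batch gradient descent: one update per pass over the whole dataset
w = 0.0
for epoch in range(20):
    grad = np.mean(2 * (w * x - y) * x)   # d/dw of the mean squared error
    w -= lr * grad

# 'online' gradient descent (minibatch size 1): one update per example
w_online = 0.0
for xi, yi in zip(x, y):
    grad = 2 * (w_online * xi - yi) * xi
    w_online -= lr * grad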


In [ ]:
X = oh_x_rnn
Y = oh_y_rnn
# X.shape, Y.shape

In [ ]:
err=0.0; l_rate=0.01
for i in range(len(X)):
    err+=fn(np.zeros(n_hidden), X[i], Y[i], l_rate)
    if i % 1000 == 999:
        print("Error:{:.3f}".format(err/1000))
        err=0.0

At the end of learning (via the loop above), we create a new Theano function which takes some input along with an initial hidden state, and produces not the loss but the output.

This is for testing: the function goes from our inputs to our vector of outputs, v_y.

Our predictions, pred, come from taking that function f_y, passing it our initial hidden state, np.zeros(n_hidden), and some input, say X[6]; that gives us the model's output activations (we take the argmax to get predicted characters).


In [ ]:
f_y = theano.function([t_h0, t_inp], v_y, allow_input_downcast=True)

In [ ]:
pred = np.argmax(f_y(np.zeros(n_hidden), X[6]), axis=1)

In [ ]:
# grabbing sample input text
act = np.argmax(X[6], axis=1)

In [ ]:
# displaying input text
[indices_char[o] for o in act]

In [ ]:
# displaying the model's predicted next characters
[indices_char[o] for o in pred]

Running the above 2 lines will show what the model expected to come after each character.

In lecture:

act:  ['t', 'h', 'e', 'n', '?', ' ', 'I', 's']
---
pred: ['h', 'e', ' ', ' ', ' ', 'T', 'n', ' ']

And that's building a Recurrent Neural Network from Scratch using Theano.

