In this post I'll cover how to build a Recurrent Neural Network (RNN) from scratch using Python 3.6
with basic numerical libraries like numpy
. Then I'll train this neural network using bacterial names from a popular microbial database and 'sample' from the trained network to come up with novel names for bacteria that have not been observed in our training dataset.
The first part is obtaining the dataset. I am going to download the fairly popular SILVA
database.
A comprehensive on-line resource for quality checked and aligned ribosomal RNA sequence data. SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).
I am only going to download the taxonomy annotations from release 132 of SILVA.
In [37]:
! wget https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_LSUParc_tax_silva.fasta.gz
First, let's have a look at the data:
In [39]:
!zcat SILVA_132_LSUParc_tax_silva.fasta.gz | head -20
This is a standard FASTA file that contains nucleotide sequences (which we'll discard) and headers with taxonomy annotations (which we'll keep).
The data itself needs to be reformatted.
In [64]:
!zcat SILVA_132_LSUParc_tax_silva.fasta.gz | grep ">" | cut -f2 | cut -f6 -d ';' | cut -f1 -d ' ' | sed 's/[^a-zA-Z ]/ /g' | awk '{print $1}' | sort | uniq | sed '/^$/d' | sed -r '/^.{,3}$/d' > genera.txt
Our formatting is a set of simple pipes, which I'll briefly explain (a rough Python equivalent is sketched right after the list):

- `grep ">"` catches all headers from the .fasta file; we are essentially throwing away the nucleotide sequences,
- `cut -f2` cuts the second column with the taxonomy annotations,
- `cut -f6 -d ';'` cuts the sixth column, which contains the genera names separated by `;`,
- `cut -f1 -d ' '` cuts the first word from genera names that sometimes have two words; as our model will only work on a single word per bacterium, this is sufficient,
- `sed 's/[^a-zA-Z ]/ /g'` replaces all non-letters with a blank space,
- `awk '{print $1}'` prints the first column of the data,
- `sort` sorts the results alphabetically, so duplicates are put next to each other,
- `uniq` removes all duplicates,
- `sed '/^$/d'` removes empty rows,
- `sed -r '/^.{,3}$/d'` removes rows with words shorter than 3 characters (strings like A, XY, etc.).
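For readers who prefer to stay in Python, here is a rough equivalent of the shell pipeline above. It is a minimal sketch that assumes the same SILVA_132_LSUParc_tax_silva.fasta.gz file sits in the working directory; it is not byte-for-byte identical to the shell version, which remains the one that produced genera.txt.
import gzip
import re

genera = set()
with gzip.open('SILVA_132_LSUParc_tax_silva.fasta.gz', 'rt') as fh:
    for line in fh:
        if not line.startswith('>'):
            continue                          # drop nucleotide sequences, keep headers only
        fields = line.strip().split(';')      # taxonomy levels are separated by ';'
        if len(fields) < 6:
            continue
        # the sixth taxonomy field holds the genus; keep letters only, take the first word
        words = re.sub('[^a-zA-Z ]', ' ', fields[5]).split()
        if words and len(words[0]) > 3:       # drop very short strings, as the last sed does
            genera.add(words[0])

with open('genera_py.txt', 'w') as out:       # a separate file, so genera.txt stays untouched
    out.write('\n'.join(sorted(genera)) + '\n')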
In [58]:
!head -10 genera.txt
In [59]:
!tail -10 genera.txt
Later I'll replace uppercase characters with lowercase characters.
In [60]:
!cat genera.txt | wc -l
All in all our training dataset has around 7,900 examples. This should be sufficient for training our RNN.
We need to load the data. We'll convert all characters to lowercase, as the model doesn't need to learn case distinctions that are not useful for our task.
The majority of NLP (Natural Language Processing) systems work on some kind of encoding of the processed words. Here we generate chars
, an 'alphabet' of the characters present in our dataset. The function set
creates a "set" of unique elements, and list
converts it to a list for convenience.
We create char_to_ix
, which converts a given character to an index; the reverse operation (from index to character) is carried out by the ix_to_char
dictionary. We are ready to build our model (from scratch).
In [13]:
data = open('genera.txt', 'r').read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(ix_to_char)
We can see here that we have 26 Latin characters plus one newline character, which is going to indicate to our model that the bacterial name has finished being generated.
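As a quick sanity check (using the two dictionaries built above), we can round-trip a name through both mappings:
# Encode a genus name into indices and decode it back
name = "bacillus\n"
encoded = [char_to_ix[ch] for ch in name]
decoded = ''.join(ix_to_char[ix] for ix in encoded)
print(encoded)
print(decoded == name)  # True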
We are going to implement our architecture in Python using only numpy. We have to define all the functions ourselves. Let's start!
In [14]:
import numpy as np
import random
Softmax is a normalized exponential function. Its input is a $K$-dimensional vector of real values that get normalized into the range $(0,1)$ and sum up to $1$:
$$ softmax(z)_j = \frac{e^{z_j}}{\Sigma^K_{k=1}e^{z_k}} $$ where $j=1,2,...,K$
In [15]:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
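A quick check that the output really is a probability distribution (the printed values are approximate):
z = np.array([[1.0], [2.0], [3.0]])
p = softmax(z)
print(p.ravel())      # approximately [0.09 0.24 0.67]
print(p.sum(axis=0))  # [1.]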
The sigmoid is a simple function defined as:
$$ \sigma(x)= \frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}} $$
In [16]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
These will be used later for RNN training. I borrowed them from Andrew Ng's Sequence Models course. The idea is to smooth the cumulative loss function a little, by mixing in the loss of the current computation curr_loss
:
In [17]:
def smooth(loss, cur_loss):
    return loss * 0.999 + cur_loss * 0.001
In [18]:
def get_initial_loss(vocab_size, seq_length):
    return -np.log(1.0/vocab_size)*seq_length
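For example, with our 27-character vocabulary and a sequence length of 7 (the default number of names sampled per report later on), the initial random-guess loss comes out to about 23:
# -log(1/27) * 7 ≈ 23.07
print(get_initial_loss(27, 7))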
Each character is converted to a one-hot encoded vector (later). We need a function that takes a series of numbers (indices) corresponding to a generated word, and a dictionary mapping index to character (ix_to_char
).
In [19]:
def print_sample(sample_ix, ix_to_char):
    '''
    Takes a list of indices in `sample_ix` and a conversion dictionary `ix_to_char`
    '''
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    txt = txt[0].upper() + txt[1:]  # capitalize the first character
    print('%s' % (txt, ), end='')
An example:
In [20]:
print_sample([0,2,5,1,2],ix_to_char)
Before we proceed with building the NN architecture we need to initialize our parameters $W_{ax}, W_{aa}, W_{ya}, b, b_y$ to some (non-zero) values. We are writing a vectorized implementation:

- X contains m examples, each of size n_x; its size is (n_x, m),
- the hidden state a is of size (n_a, m),
- b_a (called b in the code) is of shape (n_a, 1),
- b_y is of shape (n_y, 1).

These sizes imply the sizes of our weight matrices:

- size(Wax) = (n_a, n_x),
- size(Waa) = (n_a, n_a),
- size(Wya) = (n_y, n_a).

We use the numpy randn function and multiply the resulting random numbers by 0.01 to obtain small, random values.
In [21]:
def initialize_parameters(n_a, n_x, n_y):
    """
    Initialize parameters with small random values
    Returns:
    parameters -- python dictionary containing:
        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        b -- Bias, numpy array of shape (n_a, 1)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    """
    Wax = np.random.randn(n_a, n_x)*0.01  # input to hidden
    Waa = np.random.randn(n_a, n_a)*0.01  # hidden to hidden
    Wya = np.random.randn(n_y, n_a)*0.01  # hidden to output
    b = np.zeros((n_a, 1))   # hidden bias
    by = np.zeros((n_y, 1))  # output bias
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b, "by": by}
    return parameters
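A quick check that the shapes come out as listed above (a small sketch with a hidden size of 50 and a 27-character vocabulary):
params_demo = initialize_parameters(n_a=50, n_x=27, n_y=27)
for name, value in params_demo.items():
    print(name, value.shape)
# Wax (50, 27), Waa (50, 50), Wya (27, 50), b (50, 1), by (27, 1)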
A recurrent neural network consists of a repeated block for each time step $x^{<t>}$, as shown in the diagram below.
Our building block takes as input the previous hidden state $a^{<t-1>}$ and the current (already one-hot encoded) input $x^{<t>}$, and computes the next hidden state $a^{<t>}$ and the softmax
prediction $\hat{y}^{<t>}$. This is implemented in the rnn_step_forward
Python function:
In [23]:
def rnn_step_forward(parameters, a_prev, x):
    # Retrieve parameters from the dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    # Compute the next hidden state
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)  # hidden state
    # Compute the softmax probabilities for the next character
    p_t = softmax(np.dot(Wya, a_next) + by)
    # Return the next hidden state and the probabilities
    return a_next, p_t
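A single step can be exercised on dummy data to confirm the shapes and that the softmax output sums to one (a sketch, reusing initialize_parameters from above):
np.random.seed(0)
params_demo = initialize_parameters(n_a=50, n_x=27, n_y=27)
x0 = np.zeros((27, 1))
x0[3] = 1                                    # a one-hot encoded input character
a0 = np.zeros((50, 1))                       # initial hidden state
a1, p1 = rnn_step_forward(params_demo, a0, x0)
print(a1.shape, p1.shape, p1.sum())          # (50, 1) (27, 1) ~1.0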
The forward pass through the network is a repetition of the single block implemented in section 2.2.1. Each cell takes as input the hidden state of the previous cell $a^{<t-1>}$ and the current time-step $x^{<t>}$ of the input sequence $X$.
The forward pass needs to compute the vector of predictions $\hat{Y}$ and the hidden states 'a'
, and 'caches'
them for computing gradients in the backpropagation step.
Our loss function is the cross-entropy. For a given time-step $t$ it is defined as: $$ L_t(y_t,\hat{y}_t) = - y_t \log(\hat{y}_t) $$
For the entire sequence the cost is just the sum of these values:
$$ L(y,\hat{y}) = - \Sigma_t y_t \log(\hat{y}_t) $$ so during each forward pass we update the cost function $ L(y,\hat{y})$ by the loss of a single time-step $ L_t(y_t,\hat{y}_t)$.
In [25]:
def rnn_forward(X, Y, a0, parameters, vocab_size=vocab_size):
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    a[-1] = np.copy(a0)  # The initial hidden state (a0 in the figure above) is stored here under index -1.
    # Initialize the loss to 0
    loss = 0
    # Loop over the input sequence X (a list of character indices), through time steps t
    for t in range(len(X)):
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # If X[t] == None, we just have x[t] = 0. This is used to set the input for the first time step to the zero vector.
        x[t] = np.zeros((vocab_size, 1))
        if (X[t] != None):
            x[t][X[t]] = 1
        # Run one forward step of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
        # Update the loss by subtracting the cross-entropy term of this time-step from it.
        loss -= np.log(y_hat[t][Y[t], 0])
    cache = (y_hat, a, x)
    return loss, cache
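To see the loss in action, here is a forward pass over one toy "name" (a sketch; with freshly initialized, untrained parameters the per-step predictions are close to uniform, so the loss is roughly the number of time steps times -log(1/27)):
name = "abc"
X = [None] + [char_to_ix[ch] for ch in name]   # None becomes the zero input vector
Y = X[1:] + [char_to_ix['\n']]                 # targets: the next characters plus the end-of-name token
params_demo = initialize_parameters(n_a=50, n_x=vocab_size, n_y=vocab_size)
a0 = np.zeros((50, 1))
loss, cache = rnn_forward(X, Y, a0, params_demo, vocab_size=vocab_size)
print(loss)   # roughly 4 * -log(1/27) ≈ 13.2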
Backpropagation in an RNN goes by a somewhat fancy name: backpropagation through time. As somebody who has had some exposure to physics and quantum mechanics I twitch a little at that, but hey, the whole field is a giant hype, and the fancier the terms, the more people's attention (and money) they will perhaps pull.
Backpropagation is the most complicated part of this post. In my neighboring post, where I implement everything in Keras, this would not be the case, as existing frameworks "outsource" the computation of derivatives to the framework itself.
However, it is better to follow a simple rule: "unless you build it yourself, you don't really understand it". Presumably it is attributed to Richard Feynman, but who really knows.
I'll reproduce our RNN cell here, so it becomes clear how the computations are derived.
For a given cell we have a set of equations. I am going to place all of them here so they can serve as a reference:
$$ a^{<t>} = tanh(W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b) $$$$ \hat{y}^{<t>} = softmax( W_{ya}a^{<t>} + b_y ) $$
$\hat{y}^{<t>}$ is our (model's) prediction for a given time-step $t$,
where:
$$softmax(z_j) = \frac{e^{z_j}}{\Sigma^K_{k=1} e^{z_k}}$$ where $j=1,2,...,K$
$$tanh(z) = \frac{sinh(z)}{cosh(z)} = \frac{e^z - e^{-z}}{e^z + e^{-z}} $$I've looked up the derivative of $tanh(z)$:
$$\frac{\partial tanh(z)}{\partial z} = sech^2(z)$$Then one can use one of hyperbolic identities:
$$ tanh^2(z) + sech^2(z) = 1 \Rightarrow sech^2(z) = 1 - tanh^2(z) $$So: $$\frac{\partial tanh(z)}{\partial z} = 1 - tanh^2(z) $$
Now let's proceed to the $softmax$ derivative:
$$ \frac{\partial\, softmax(z_j)}{\partial z_i} = \frac{\partial}{\partial z_i} \Big(\frac{e^{z_j}}{\Sigma^K_{k=1} e^{z_k}} \Big) $$ where $j=1,2,...,K$
We have to consider two situations: one where the indices $i=j$, and the second where $i\neq j$.
For $i=j$, the term $\Sigma_{k, k\neq j}^K e^{z_k}$ is essentially a constant when calculating the derivative, and for clarity I'll use $const.$ to represent it:
$$ \Sigma_{k, k\neq j}^K e^{z_k} \equiv const. $$ So here it goes:
$$ \frac{\partial\, softmax(z_j)}{\partial z_j} = \frac{\partial}{\partial z_j} \Big(\frac{e^{z_j}}{e^{z_j} + const.}\Big) $$ I'll also use a basic rule of calculus, the quotient rule, to recall:
$$ \frac{\partial}{\partial x} \Big(\frac{f(x)}{g(x)}\Big) = \frac{ \frac{\partial f(x)}{\partial x} g(x) - f(x) \frac{\partial g(x)}{\partial x} }{g(x)^2} $$ Finally, we'll have:
$$ \frac{\partial\, softmax(z_j)}{\partial z_j} = \frac{ \frac{\partial e^{z_j}}{\partial z_j} (e^{z_j} + const.) - e^{z_j} \frac{\partial (e^{z_j} + const.)}{\partial z_j} }{ (e^{z_j} + const.)^2} = \frac{ e^{z_j} (e^{z_j} + const.) - e^{z_j} \cdot e^{z_j}}{ (e^{z_j} + const.)^2} $$ We can use our $softmax$ definition to come up with:
$$ \frac{\partial\, softmax(z_j)}{\partial z_i} = softmax(z_j) - softmax^2(z_j) \quad \text{if } i=j $$
For $i\neq j$, we can treat $e^{z_j}$ and $ \Sigma_{k, k\neq i}^K e^{z_k}$ as constants. We compute the derivative over the index $i$, i.e. over $e^{z_i}$. So this can be viewed as calculating a simple derivative:
$$ \frac{\partial }{\partial x} \Big(\frac{const_1}{e^x + const_2}\Big) = const_1 \cdot \frac{\partial }{\partial x} \Big(\frac{1}{e^x + const_2}\Big) $$ We can again use the quotient rule, but it is going to be much simpler:
$$ ... = const_1 \cdot \frac{ 0 \cdot (e^x + const_2) - 1 \cdot e^x }{(e^x + const_2 )^2} $$ so: $$ \frac{\partial\, softmax(z_j)}{\partial z_i} = e^{z_j} \cdot \frac{\partial}{\partial z_i} \Big(\frac{1}{e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k}}\Big) = ...$$
$$ ... = e^{z_j} \cdot \frac{ \frac{\partial}{\partial z_i}(1) \cdot (e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k}) - 1 \cdot \frac{\partial}{\partial z_i}(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k}) }{(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k})^2 } = ... $$$$ ... = e^{z_j} \cdot \frac{ 0 - e^{z_i} }{(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k})^2} = \frac{ - e^{z_j} \cdot e^{z_i} }{(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k})^2} = \frac{ - e^{z_j} \cdot e^{z_i} }{( \Sigma_{k}^K e^{z_k})^2} = - softmax(z_i) \cdot softmax(z_j)$$ Alright, just to summarize our results and have a clear reference:
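Both cases can be verified numerically with finite differences, using the softmax function defined earlier (a quick sketch, not part of the model itself):
np.random.seed(0)
z = np.random.randn(5, 1)
s = softmax(z)
eps = 1e-6
i, j = 1, 3                                   # an off-diagonal entry, i != j
analytic = -s[i, 0] * s[j, 0]                 # -softmax(z_i) * softmax(z_j)
z_plus = z.copy()
z_plus[i] += eps
numeric = (softmax(z_plus)[j, 0] - s[j, 0]) / eps
print(analytic, numeric)                      # the two values should agree closely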
Our goal is to compute the derivatives of the cost function $L$ for a single cell with respect to the weights. Let's compute the first derivative, $ \frac{\partial L_t}{\partial W_{ya}}$.
Let's recall the functions that we'll be differentiating:
$$ z^{<t>} = W_{ya} a^{<t>} + b_y $$$$ \hat{y}^{<t>} = softmax(z^{<t>}) $$$$ L_t(y^{<t>},\hat{y}^{<t>}) = - y^{<t>} \log(\hat{y}^{<t>}) $$
Following a simple chain rule we can write:
$$ \frac{\partial L_t}{\partial W_{ya}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot \frac{\partial z}{\partial W_{ya}} $$
We can already see that $\frac{\partial z}{\partial W_{ya}} = a^{<t>}$. We also already computed the derivative of $softmax$. Let's compute $\frac{\partial L_t}{\partial \hat{y}^{<t>}}$.
It is nice to remember that $ \frac{\partial}{\partial x} \log(x) = \frac{1}{x} $. With that we are properly equipped to solve:
$$\frac{\partial L_t}{\partial \hat{y}^{<t>}} = \frac{\partial}{\partial \hat{y}^{<t>}} (- y^{<t>} \log(\hat{y}^{<t>})) = \frac{- y^{<t>}}{\hat{y}^{<t>}}$$ Let's put everything here together:
$$ \frac{\partial L_t}{\partial W_{ya}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot \frac{\partial z}{\partial W_{ya}} = \frac{- y^{<t>}}{\hat{y}^{<t>}} \cdot \Big( ( softmax(z_j) - softmax^2(z_j) ) + (- softmax(z_i)\cdot softmax(z_j)) \Big) \cdot a^{<t>} $$ But first let's make sure we have proper notation. What is $\hat{y}^{<t>}$? It is our model's prediction, and this prediction is made with $softmax$ of course. Hence, writing everything in terms of $\hat{y}$ and taking the derivative with respect to a single entry $z_i$ (summing the contributions of all outputs $\hat{y}^{<t_k>}$):
$$ \frac{\partial L_t}{\partial z_i} = \Sigma^K_{k=1} \frac{\partial L_t}{\partial \hat{y}^{<t_k>}} \cdot \frac{\partial \hat{y}^{<t_k>}}{\partial z_i} = \frac{- y^{<t_i>}}{\hat{y}^{<t_i>}} \big( \hat{y}^{<t_i>} - (\hat{y}^{<t_i>})^2 \big) + \Sigma^K_{k, k\neq i} \frac{- y^{<t_k>}}{\hat{y}^{<t_k>}} \big( - \hat{y}^{<t_k>} \hat{y}^{<t_i>} \big) $$
Which is simplified as
$$ ... = - y^{<t_i>} + y^{<t_i>}\hat{y}^{<t_i>} + \Sigma^K_{k,k\neq i} y^{<t_k>}\hat{y}^{<t_i>}$$ We see the sum $\Sigma^K_{k,k\neq i}$ that goes over every element except $k=i$, but there is also the term $y^{<t_i>}\hat{y}^{<t_i>}$ that adds this "except" term back. We can thus simplify:
$$ ... = - y^{<t_i>} + \hat{y}^{<t_i>} \Sigma^K_{k} y^{<t_k>} $$
So, just to recap: $\hat{y}$ are predictions, they take values between 0 and 1, and sum up to one. But $y$ (without this funny hat) are 'ground truth' labels. Essentially $y$ is a one-hot encoded vector that has a single $'1'$ and the rest $'0'$; obviously it also sums to one. Thus, we can further simplify:
$$ \frac{\partial L_t}{\partial z_i} = -y^{<t_i>} + \hat{y}^{<t_i>} = \hat{y}^{<t_i>} - y^{<t_i>} $$ So essentially the prediction $\hat{y}$ minus the true label.
Thus in our code we write:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    gradients['dWya'] += np.dot(dy, a.T)
    ...
We add these gradients up for each time-step $t$, as our total cost function is a sum of the costs for each time step:
$$ L = \Sigma^T_{i=0} L_i $$so:
$$ \frac{\partial L}{\partial W_{ya}} = \Sigma^T_{i=0} \frac{\partial L_i}{\partial W_{ya}} $$ The next derivative, $\frac{\partial L_t}{\partial b_y}$, will be simple, as we've already computed most of the pieces:
$$ \frac{\partial L_t}{\partial b_{y}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot \frac{\partial z}{\partial b_y} $$ The last component is simply equal to 1: since $z = W_{ya} a^{<t>} + b_y$, we have $\frac{\partial z}{\partial b_y} = 1$.
So we're left with $$ \frac{\partial L_t}{\partial b_{y}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot 1 = \frac{\partial L_t}{\partial z} = \hat{y}^{<t_l>} -y^{<t_l>} $$
Hence in our code we'll write
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #...
    gradients['dby'] += dy
    #...
Let's recap our RNN cell:
And our equations (I added a bit of helper notation, $p^{<t>}$ and $z^{<t>}$, in order to split them into several equations for clarity):
$$ p^{<t>} = W_{ax}x^{<t>} + W_{aa}a^{<t-1>} + b $$$$ a^{<t>} = tanh(p^{<t>}) $$$$ z^{<t>} = W_{ya}a^{<t>} + b_y $$$$ \hat{y}^{<t>} = softmax(z^{<t>}) $$
What we want to compute now is $\frac{\partial a^{<t>}}{\partial W_{aa}}$.
Let's use (again) the chain rule for our equations:
$$ \frac{\partial a^{<t>}}{\partial W_{aa}} = \frac{ \partial a^{<t>} }{\partial p} \frac{\partial p}{\partial W_{aa}} $$ We already know that $\frac{ \partial a^{<t>} }{\partial p} $ is just the derivative of the tanh
function we've computed before:
$$\frac{\partial tanh(z)}{\partial z} = 1 - tanh^2(z) $$
And $\frac{\partial p}{\partial W_{aa}} $ is a simple derivative: $$ \frac{\partial p}{\partial W_{aa}} = a^{<t-1>} $$
Combining these we get:
$$ \frac{\partial a^{<t>}}{\partial W_{aa}} = (1 - tanh^2(p^{<t>})) \cdot a^{<t-1>} = (1-(a^{<t>})^2)\, a^{<t-1>} $$ So we write in our code:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #...
    daraw = (1 - a * a) * da  # backprop through tanh nonlinearity
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    #...
We can now easily compute the other gradient, $\frac{\partial a^{<t>}}{\partial b_a}$, right away:
$$\frac{\partial a^{<t>}}{\partial b_a} = \frac{\partial a^{<t>}}{\partial p} \cdot \frac{\partial p}{\partial b_a} = (1-(a^{<t>})^2) \cdot 1 = (1-(a^{<t>})^2)$$ Thus in our code we write:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #...
    daraw = (1 - a * a) * da  # backprop through tanh nonlinearity
    gradients['db'] += daraw
    #...
Most of the calculations are done. This one doesn't need much explanation:
$$\frac{\partial a^{<t>}}{\partial W_{ax}} = \frac{\partial a^{<t>}}{\partial p} \cdot \frac{\partial p}{\partial W_{ax}} = (1-(a^{<t>})^2) \cdot x^{<t>}$$ Hence, our code has:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #...
    daraw = (1 - a * a) * da  # backprop through tanh nonlinearity
    #...
    gradients['dWax'] += np.dot(daraw, x.T)
    #...
The final step is to compute the gradient of our loss function for a single example, $L_t$, with respect to $a^{<t>}$ (and then with respect to $a^{<t-1>}$).
Again, I am going to recall the relevant pieces for clarity; the first two derivatives we have already solved:
$$ \frac{\partial L_t}{\partial z} = \hat{y}^{<t_i>} -y^{<t_i>} = `dy` \text{ in our code } $$$$ \frac{\partial z^{<t>} }{\partial a^{<t>}} = W_{ya}$$ So we get: $$ \frac{\partial L_t}{\partial a^{<t>} } = (\hat{y}^{<t_i>} -y^{<t_i>})\, W_{ya} $$
Which we implement in code as:
da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h
And now, the final (!) derivative:
$$ \frac{\partial a^{<t>}}{\partial a^{<t-1>}} = \frac{\partial a^{<t>} }{\partial p } \cdot \frac{\partial p}{\partial a^{<t-1>} }$$ The first derivative is the derivative of tanh
again, and the second is pretty simple: $\frac{\partial p}{\partial a^{<t-1>}} = W_{aa}$.
Which in our code corresponds to:
daraw = (1 - a * a) * da # backprop through tanh nonlinearity
gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
Ufff... derivatives are easy, but they take time, patience and energy. No wonder major frameworks have this implemented "under the hood", so people only need to create computation graphs.
In [27]:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next']  # backprop into h
    daraw = (1 - a * a) * da  # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients
We've now implemented the backward pass for a single cell. Next we just have to repeat this through time for the whole sequence.
In [28]:
def update_parameters(parameters, gradients, lr):
    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b'] += -lr * gradients['db']
    parameters['by'] += -lr * gradients['dby']
    return parameters

def rnn_backward(X, Y, parameters, cache):
    # Initialize gradients as an empty dictionary
    gradients = {}
    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    # Each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    return gradients, a
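A crude finite-difference check of a single gradient entry can confirm that rnn_forward and rnn_backward are consistent (a sketch with freshly initialized parameters and a short dummy sequence):
np.random.seed(1)
params_demo = initialize_parameters(n_a=50, n_x=vocab_size, n_y=vocab_size)
X = [None, 1, 2, 3]
Y = X[1:] + [char_to_ix['\n']]
a0 = np.zeros((50, 1))
loss, cache = rnn_forward(X, Y, a0, params_demo, vocab_size=vocab_size)
grads, _ = rnn_backward(X, Y, params_demo, cache)
i, j, eps = 0, 0, 1e-5
params_demo['Waa'][i, j] += eps               # nudge a single weight
loss_plus, _ = rnn_forward(X, Y, a0, params_demo, vocab_size=vocab_size)
params_demo['Waa'][i, j] -= eps               # restore it
print(grads['dWaa'][i, j], (loss_plus - loss) / eps)   # the two numbers should be very close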
Our "flow" through the network is as follows:
Between points 3 and 4 we'd be far better off with gradient clipping to prevent "exploding gradients", i.e. a situation of numerical overflow that often produces NaN
s as the gradients take on enormously large values. In order to avoid it we just specify a maxValue
. If a gradient goes above this value (or below its negative), we clip it, i.e. set it to maxValue
(or -maxValue). Pretty simple:
In [33]:
def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    Returns:
    gradients -- a dictionary with the clipped gradients.
    '''
    # Unpack the gradients for clarity
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
    # Clipping to mitigate exploding gradients
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, +maxValue, gradient)  # we overwrite the current 'gradient' with the clipped one
    # Pack the gradients into a dictionary
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    # and return it.
    return gradients
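For example, a gradient entry of 100 gets clipped down to a maxValue of 5 (a tiny demo with 1x1 "gradients"):
demo = {'dWaa': np.array([[100.0]]), 'dWax': np.zeros((1, 1)), 'dWya': np.zeros((1, 1)),
        'db': np.zeros((1, 1)), 'dby': np.zeros((1, 1))}
print(clip(demo, 5)['dWaa'])   # [[5.]]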
Sampling is the process of generating novel observations, in our case new letters that will constitute a made-up bacterial genus name.
Once our network is trained, we can pass a zero vector $\vec{0}$ as the input hidden state $a^{<0>}$ (and a zero vector as the first input). Then we perform propagation through the first unit of our NN, obtaining the next hidden state $a^{<t+1>}$ and the prediction vector $\hat{y}^{<t+1>}$ that represents the probability of each character $i$:
$$ a^{<t+1>} = tanh(W_{ax}x^{<t>} + W_{aa}a^{<t>} + b) $$$$ \hat{y}^{<t+1>} = softmax( W_{ya}a^{<t+1>} + b_y ) $$ Having computed the vector of probabilities $\hat{y}^{<t+1>}$ we now perform the sampling procedure. We do not simply pick the highest probability; that would generate the same result every time for a given dataset. We also do not want to pick our characters completely at random, as the results would become random and the whole architecture we built would become useless. The key is to select (i.e. sample) from our $\hat{y}^{<t+1>}$ distribution according to that distribution.
In other words, we'll pick the index $i$ (remember that we have one-hot encoded our alphabet) with the probability given by the $i$-th entry of the $\hat{y}^{<t+1>}$ vector. We'll use numpy's random.choice
to achieve this.
import numpy as np
np.random.choice([list of possible values], p = flattened-probability-distribution)
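A concrete toy example of that call (the probability values are chosen arbitrarily):
y_hat = np.array([[0.1], [0.7], [0.2]])
idx = np.random.choice(range(3), p=y_hat.ravel())   # index 1 will be drawn about 70% of the time
print(idx)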
The final step is to overwrite the variable x^{<t+1>}
with the one-hot encoding of the index $i$ selected in the previous step. This is represented as the red arrow in the picture below:
We'll continue this propagation until we reach the end-of-line character \n
. Then our generation is finished, and we can print our result. If something goes wrong, we additionally limit ourselves to 50
characters, as tracked by the counter
variable.
In [34]:
def sample(parameters, char_to_ix):
    """
    Sample a sequence of characters according to the sequence of probability distributions output by the RNN
    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b.
    char_to_ix -- python dictionary mapping each character to an index.
    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    # Retrieve parameters and relevant shapes from the "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation)
    x = np.zeros((vocab_size, 1))
    # Step 1': Initialize a_prev as zeros
    a_prev = np.zeros((n_a, 1))
    # An empty list of indices; this is the list which will contain the indices of the characters to generate
    indices = []
    # idx is a flag to detect a newline character, we initialize it to -1
    idx = -1
    # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append
    # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well
    # trained model), which helps debugging and prevents entering an infinite loop.
    counter = 0
    newline_character = char_to_ix['\n']
    while (idx != newline_character and counter != 50):
        # Step 2: Forward propagate x using the above equations
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(range(vocab_size), p=y.ravel())  # numpy 'ravel' flattens the array
        # Append the index to "indices"
        indices.append(idx)
        # Step 4: Overwrite the input character with the one corresponding to the sampled index.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1  # one-hot encoding: all zeros, then set the sampled index to 1
        # Update "a_prev" to be "a"
        a_prev = a
        counter += 1
    # If something went wrong, we stop at the 50th index and terminate the name ourselves.
    if (counter == 50):
        indices.append(char_to_ix['\n'])
    return indices
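With untrained parameters the output is gibberish, but the following shows how sample and print_sample fit together (a quick sketch; real names only appear after the training below):
params_demo = initialize_parameters(n_a=50, n_x=vocab_size, n_y=vocab_size)
for _ in range(3):
    print_sample(sample(params_demo, char_to_ix), ix_to_char)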
In our model we'll use stochastic gradient descent, meaning we'll loop over one example at a time.
The function optimize
will forward propagate through our network, compute and clip the gradients, and update the parameters according to the provided learning rate. Here we are going to use the functions we've already implemented:
In [35]:
def optimize(X, Y, a_prev, parameters, learning_rate=0.01, maxGradient=5):
    """
    Execute one step of the optimization to train the model.
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
        b -- Bias, numpy array of shape (n_a, 1)
        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
        db -- Gradients of bias vector, of shape (n_a, 1)
        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    # Forward propagate through time
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    # Backpropagate through time
    gradients, a = rnn_backward(X, Y, parameters, cache)
    # Clip your gradients between -maxGradient (min) and maxGradient (max)
    gradients = clip(gradients, maxGradient)
    # Update parameters
    parameters = update_parameters(parameters, gradients, learning_rate)
    return loss, gradients, a[len(X)-1]
Training the model
In [36]:
def model(filename, ix_to_char, char_to_ix, num_iterations=35000, n_a=50, genera_names=7, vocab_size=vocab_size, learning_rate=0.01, params=None, printing=True):
    """
    Trains the model and generates novel genera names.
    Arguments:
    filename -- path to the text file with the genera names (the training corpus)
    ix_to_char -- dictionary that maps an index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    genera_names -- number of bacterial names you want to sample at each iteration.
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    Returns:
    parameters -- learned parameters
    """
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    # Initialize parameters (or continue from previously trained ones)
    if params is None:
        parameters = initialize_parameters(n_a, n_x, n_y)
    else:
        parameters = params
    # Initialize loss (this is required because we want to smooth our loss)
    loss = get_initial_loss(vocab_size, genera_names)
    # Build a list of all bacterial names (training examples).
    with open(filename) as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]  # convert names to lowercase and create a list of genera names
    # Shuffle the list of all bacterial names
    np.random.shuffle(examples)
    # Initialize the hidden state
    a_prev = np.zeros((n_a, 1))
    # Optimization loop
    for j in range(num_iterations):
        # Define one training example (X, Y)
        index = j % len(examples)  # wrap around so the index never exceeds the number of examples
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        # [None] at the beginning will be interpreted as a zero vector by `rnn_forward`
        Y = X[1:] + [char_to_ix["\n"]]
        # Y is X shifted by one position (dropping the leading None), with a newline appended as the end-of-name sign.
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate=learning_rate)
        # Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
        loss = smooth(loss, curr_loss)
        # Every 2000 iterations, generate a few names with sample() to check if the model is learning properly
        if (j % 2000 == 0 and printing == True):
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            # The number of genera names to print
            for name in range(genera_names):
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix)
                print_sample(sampled_indices, ix_to_char)
            print('\n')
    return parameters
In [37]:
parameters = model('genera.txt', ix_to_char, char_to_ix, num_iterations = 85000,params=None,printing=False)
In [38]:
parameters = model('genera.txt', ix_to_char, char_to_ix, num_iterations = 8000,params=parameters, genera_names=10)