Bacterial name generator with RNNs

In this post I'll cover how to build a Recurrent Neural Network (RNN) from scratch using Python 3.6 and basic numerical libraries like numpy. Then I'll train this neural network on bacterial names from a popular microbial database and 'sample' from the trained network to come up with novel names for bacteria that have not been observed in our training dataset.

I. Getting and formatting the data

The first step is obtaining the dataset. I am going to download data from the fairly popular SILVA database, which describes itself as:

A comprehensive on-line resource for quality checked and aligned ribosomal RNA sequence data. SILVA provides comprehensive, quality checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya).

1. Downloading the data

I am only going to download the taxonomy annotations from release 132 of SILVA.


In [37]:
! wget https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_LSUParc_tax_silva.fasta.gz


--2018-09-23 14:51:51--  https://www.arb-silva.de/fileadmin/silva_databases/release_132/Exports/SILVA_132_LSUParc_tax_silva.fasta.gz
Resolving www.arb-silva.de (www.arb-silva.de)... 134.102.40.6
Connecting to www.arb-silva.de (www.arb-silva.de)|134.102.40.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 272844631 (260M) [application/gzip]
Saving to: ‘SILVA_132_LSUParc_tax_silva.fasta.gz’

SILVA_132_LSUParc_t 100%[===================>] 260.20M  4.84MB/s    in 61s     

2018-09-23 14:52:53 (4.29 MB/s) - ‘SILVA_132_LSUParc_tax_silva.fasta.gz’ saved [272844631/272844631]

2. Formatting the data

First, let's have a look at the data:


In [39]:
!zcat SILVA_132_LSUParc_tax_silva.fasta.gz | head -20


>AY187551.1.496 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;Bradyrhizobium sp. Ih3-2
CUACGCUGCGAUAAGCCGUGGGGAGCUGCGAAGAAGCUUUGAUCCACGGAUUUCCGAAUGGGGAAACCCACCUUCGAUAG
CCGGAACUCCAAGACCUUUGUCGAAAGACAUCGGUGUGGGGUUCAACCAGACAAUGGAAGAAGCCAGGCCUUUAGAUUUC
GAUCGAAGAGGUUUUGGAUUUCCGGUUAUCAAGUGAAGGUAUGAGACUUCUGAAUAAAAUAGGAGGUUUCAAGCAAACCC
AGGGAACUGAAACAUCUAAGUACCUGGAGGAAAGGACAUCAACAGAGACUCCGUUAGUAGUGGCGAGCGAACGCGGACCA
GGCCAGUGAUACAUCAAAGACAAUCGGAACCAGUCAGGAAAGCUGGGCCUCAGAGGGUGAUAGCCCCGUACGAGUAAUGC
GAUGAUGUAUCCACGAGUAAGGCGGGACACGUGAAAUCCUGUCUGAACACGGGGGGACCACCCUCCAAGCCUAAGUACUC
CUCAGCGACCGAUAGC
>AY187559.1.472 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;Bradyrhizobium sp. Vp10-1
CUACGCUGCGAUAAGCCGUGGGGAGCUGCGAAGAAGCUUUGAUCCGCGGAUUUCCGAAUGGGGAAACCCACCUUCGAUAG
CCGGAACUCCAAAACCUUGUGGUUUUGGGGUUUGACCAGAAAUGAUCGGAACCAUGACCUCUUGAGGUUUUGGAUUUCCG
GUUAUCAAGAGAAGGUAUGAGACCUCCGAAUAAAAUAGGAGGUUUUAAGCAAACCCAGGGAACUGAAACAUCUAAGUACC
UGGAGGAAAGGACAUCAAUCGAGACUCCGUUAGUAGUGGCGAGCGAACGCGGACCAGGCCAGUGAUACAUCAAAGACAAU
CGGAACCUGUCAGGAAAGCAGGGCCUCAGAGGGUGAUAGCCCCGUACGAGUAAUGCGAUGAUGUAUCCACGAGUAAGGCG
GGACACGUGUAAUCCUGUCUGAACACGGGGGGACCACCCUCCAAGCCUAAGUACUCCUCAGCGACCGAUAGC
>AY190663.649.1113 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Sorangium cellulosum
GGUCAAGUGAAGAAGCGCAUACGGUGGAUGCCUUGGCAGUCAGAGGCGAUGAAAGACGUGGUAGCCUGCGAAAAGCUUCG
GGGAGUCGGCAAACAGACUUUGAUCCGGAGAUGUCUGAAUGGGGGAACCCACCUAACAUAAGUUAGGUAUCUUAAGCUGA
AUACAUAGGCUUAAGAAGCGAACCAGGGGAACUGAAACAUCUAAGUACCCUGAGGAAAAGAAAUCAACCGAGAUUCCCUU
AGUAGUGGCGAGCGAACGGGGACCAGCCCUUAAGCUGUAUUGAUGUUAGCGGAACGCUCUGGAAAGUGCGGCCAUAGUGG

gzip: stdout: Broken pipe

This is a standard FASTA file that contains nucleotide sequences (which we'll discard) and headers with taxonomy annotations (which we'll keep).

The taxonomy data itself needs to be reformatted.


In [64]:
!zcat SILVA_132_LSUParc_tax_silva.fasta.gz | grep ">" | cut -f2 | cut -f6 -d ';' | cut -f1 -d ' '  | sed 's/[^a-zA-Z ]/ /g' | awk '{print $1}'  | sort | uniq | sed  '/^$/d' | sed -r '/^.{,3}$/d' > genera.txt

Our formatting is a chain of simple pipes, which I'll briefly explain (a rough pure-Python equivalent follows the list below):

  • grep ">" catches all headers from the .fasta file; we are essentially throwing away the nucleotide sequences,
  • cut -f2 cuts the second column with the taxonomy annotations,
  • cut -f6 -d ';' cuts the sixth ;-separated column, which contains the genus names,
  • cut -f1 -d ' ' cuts the first word from genus names that sometimes consist of two words; as our model will only work on a single word per organism, this is sufficient,
  • sed 's/[^a-zA-Z ]/ /g' replaces every non-letter character with a blank space,
  • awk '{print $1}' prints the first column of the data,
  • sort sorts the results alphabetically, so duplicates end up next to each other,
  • uniq removes all duplicates,
  • sed '/^$/d' removes empty rows,
  • sed -r '/^.{,3}$/d' removes rows of 3 characters or fewer (entries like A, XY, etc.)
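For readers who prefer to stay in Python, below is a rough equivalent of the shell pipeline (just a sketch under the same assumptions about the header layout; the shell one-liner above is what I actually used to produce genera.txt):

import gzip
import re

genera = set()
with gzip.open('SILVA_132_LSUParc_tax_silva.fasta.gz', 'rt') as handle:
    for line in handle:
        if not line.startswith('>'):
            continue                               # skip the nucleotide sequences
        fields = line.rstrip().split(';')
        if len(fields) < 6:
            continue                               # header without a sixth taxonomy field
        genus = fields[5].split(' ')[0]            # sixth ';'-separated field, first word only
        genus = re.sub('[^a-zA-Z]', '', genus)     # drop any non-letter characters
        if len(genus) > 3:                         # keep names longer than 3 characters
            genera.add(genus)

with open('genera.txt', 'w') as out:
    out.write('\n'.join(sorted(genera)) + '\n')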

In [58]:
!head -10 genera.txt


Abelmoschus
Abies
Abiotrophia
Abolboda
Abortiporus
Abrahamia
Abroma
Abrus
Absidia
Abuta

In [59]:
!tail -10 genera.txt


Zosterograptus
Zoysia
Zunongwangia
Zygnematales
Zygoascus
Zygodon
Zygophyllum
Zygostates
Zymobacter
Zymomonas

Later I'll replace uppercase characters with lowercase characters.


In [60]:
!cat genera.txt | wc -l


7903

All in all, our training dataset has around 7,900 examples. This should be sufficient for training our RNN.

3. Loading the data

We need to load the data. We'll convert all characters to lowercase, since the model doesn't need to learn a case distinction that is not useful for our task.

The majority of NLP (Natural Language Processing) systems work on some kind of encoding of the processed words. Here we generate chars, an 'alphabet' of the characters present in our dataset. The function set creates a set of unique elements, and list converts it back to a list for convenience.

We create char_to_ix, which converts a given character to an index, and the reverse operation (from index to character) is carried out by the ix_to_char dictionary. We are then ready to build our model (from scratch).


In [13]:
data = open('genera.txt', 'r').read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))


char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) } 
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) } 
print(ix_to_char)


There are 89903 total characters and 27 unique characters in your data.
{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'}

We can see that we have the 26 Latin characters plus 1 newline character, which in our model will indicate that a bacterial name has finished being generated.
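As a tiny illustration (my own addition, not part of the pipeline), this is how a single character maps to an index and then to a one-hot column vector of length vocab_size:

import numpy as np

idx = char_to_ix['c']                  # 3, according to the mapping printed above
one_hot = np.zeros((vocab_size, 1))    # a column of 27 zeros
one_hot[idx] = 1                       # ...with a single 1 at position 3
print(idx, one_hot.ravel())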


II. Building an RNN architecture in Python and Numpy


1. Helper functions

We are going to implement our architecture in Python using only numpy, so we have to define all the helper functions ourselves. Let's start!


In [14]:
import numpy as np 
import random

a. Softmax

Softmax is a normalized exponential function. Its input is a $K$-dimensional vector of real values, which gets normalized so that every component lies in the range $(0,1)$ and all components sum to $1$:

$$ softmax(z)_j = \frac{e^{z_j}}{\Sigma^K_{k=1}e^{z_k}} $$

where: $j=1,2,...,K$


In [15]:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
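A quick sanity check of the implementation above: the outputs are positive and sum to 1 along the first axis.

z = np.array([[1.0], [2.0], [3.0]])
p = softmax(z)
print(p.ravel())   # [0.09003057 0.24472847 0.66524096]
print(p.sum())     # ~1.0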

b. Sigmoid:

A simple function defined as:

$$ \sigma(x)= \frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}} $$


In [16]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

c. Other helper functions:

These will be used later for RNN training. I borrowed them from Andrew Ng's Sequence Models course. The idea is to smooth the cumulative loss a little by mixing in the loss of the current computation, cur_loss:


In [17]:
def smooth(loss, cur_loss):
    return loss * 0.999 + cur_loss * 0.001

In [18]:
def get_initial_loss(vocab_size, seq_length):
    return -np.log(1.0/vocab_size)*seq_length
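To get a feel for the scale: with our 27-character vocabulary and a length of 10, the initial loss is $-log(1/27) \cdot 10 \approx 32.96$, which is close to the loss reported at iteration 0 of the training run later in this post.

print(get_initial_loss(27, 10))   # ~32.96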

Each character will (later) be converted to a one-hot encoded vector. We also need a function that takes a sequence of indices corresponding to a generated word, together with the dictionary mapping each index to a character (ix_to_char), and prints the word.


In [19]:
def print_sample(sample_ix, ix_to_char):
    '''
    Takes a list of indexes in `sample_ix` and a conversion array `ix_to_char`
    '''
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    txt = txt[0].upper() + txt[1:]  # capitalize first character 
    print ('%s' % (txt, ), end='')

An example:


In [20]:
print_sample([0,2,5,1,2],ix_to_char)


beab

2. Building blocks of the RNN architecture

2.1. Parameters initialization

Before we proceed with building the NN architecture we need to initialize our parameters $W_{ax}, W_{aa}, W_{ya}, b, b_y$ to some (non-zero) values. We use a vectorized implementation:

  • the input data X contains m examples, each of size n_x; its shape is (n_x,m)
  • the previous hidden state $a^{<t-1>}$ is of shape (n_a,m)
  • the bias of the hidden state b_a is of shape (n_a,1)
  • the bias of the output b_y is of shape (n_y,1)

These shapes imply the sizes of our weight matrices:

  • size(Wax)=(n_a, n_x)
  • size(Waa)=(n_a, n_a)
  • size(Wya)=(n_y, n_a)

We use numpy's randn function and multiply the resulting random numbers by 0.01 to obtain small random values.


In [21]:
def initialize_parameters(n_a, n_x, n_y):
    """
    Initialize parameters with small random values
    
    Returns:
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    """

    Wax = np.random.randn(n_a, n_x)*0.01 # input to hidden
    Waa = np.random.randn(n_a, n_a)*0.01 # hidden to hidden
    Wya = np.random.randn(n_y, n_a)*0.01 # hidden to output
    b = np.zeros((n_a, 1)) # hidden bias
    by = np.zeros((n_y, 1)) # output bias
    
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}
    
    return parameters

2.2. Building RNN

2.2.1. Forward pass through a cell

A recurrent neural network is composed of repeated blocks, one for each time step $x^{<t>}$, as shown on the diagram below.

Our building block takes as inputs the previous hidden state $a^{<t-1>}$ and the current (already one-hot encoded) input $x^{<t>}$, and computes the next hidden state $a^{<t>}$ and the softmax prediction $\hat{y}^{<t>}$. This is implemented in the rnn_step_forward Python function:


In [23]:
def rnn_step_forward(parameters, a_prev, x):
    
    # Retrieve parameters from the dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    # Compute next hidden state
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) # hidden state
    
    #Compute softmax probability
    p_t = softmax(np.dot(Wya, a_next) + by) # probabilities for the next characters
    
    #Return next hidden state and probability
    return a_next, p_t
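A quick shape check (my own addition, with an arbitrary hidden size n_a = 50 and a made-up batch of m = 4 one-hot columns) to confirm that the dimensions from section 2.1 line up and that each softmax column sums to 1:

np.random.seed(0)
n_a, m = 50, 4
demo_params = initialize_parameters(n_a, vocab_size, vocab_size)
x_demo = np.zeros((vocab_size, m))
x_demo[[1, 2, 3, 4], [0, 1, 2, 3]] = 1           # four one-hot column vectors ('a', 'b', 'c', 'd')
a_prev_demo = np.zeros((n_a, m))
a_next, p_t = rnn_step_forward(demo_params, a_prev_demo, x_demo)
print(a_next.shape, p_t.shape, p_t.sum(axis=0))  # (50, 4) (27, 4) and four values equal to 1.0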

2.2.2. Forward pass through the network

The forward pass through the network is a repetition of the single block implemented in section 2.2.1. Each cell takes as input the hidden state of the previous cell $a^{<t-1>}$ and the current time-step $x^{<t>}$ of the input sequence $X$.

The forward pass needs to compute the vector of predictions $\hat{Y}$, the hidden states a, and a cache used to compute gradients in the backpropagation step.

Our loss function is the cross entropy. For a given time-step $t$ it is defined as: $$ L_t(y_t,\hat{y}_t) = - y_t log(\hat{y}_t) $$

For the entire sequence the cost is just the sum of these values:

$$ L(y,\hat{y}) = - \Sigma_t y_t log(\hat{y}_t) $$

so during each forward pass we update the cost function $L(y,\hat{y})$ with the loss of a single time-step, $L_t(y_t,\hat{y}_t)$.


In [25]:
def rnn_forward(X, Y, a0, parameters, vocab_size = vocab_size):
    
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    
    a[-1] = np.copy(a0) # First hidden state (initial), in the figure above as a0, is here represented by index -1.
    
    # initialize your loss to 0
    loss = 0
    
    # We loop over our input sequence X (Tx, n_x,m), through time steps t in Tx
    for t in range(len(X)):
        
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # if X[t] == None, we just have x[t]=0. This is used to set the input for the first timestep to the zero vector. 
        x[t] = np.zeros((vocab_size,1)) 
        if (X[t] != None):
            x[t][X[t]] = 1
        
        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
        
        # Update the loss by adding the cross-entropy term (negative log-probability) of this time-step.
        loss -= np.log(y_hat[t][Y[t],0])
        
    cache = (y_hat, a, x)
        
    return loss, cache

2.2.3. Backpropagation pass through a cell

Backpropagation in an RNN has a somewhat fancy name: backpropagation through time. As somebody who has had some exposure to physics and quantum mechanics I twitch with a convulsion, but hey, the whole field is a giant hype, and the fancier the terms, the more people's attention (and money) they will perhaps pull.

Backpropagation is the most complicated part of this post. In my neighboring post, where I implement everything in Keras, this is not the case, as existing frameworks "outsource" the computation of derivatives to the framework itself.

However, it is better to follow a simple rule: "unless you build it yourself, you don't really understand it". It is presumably attributed to Richard Feynman, but who really knows.

I'll reproduce our RNN cell, so it will become clear how computations are derived:

For a given cell we have a set of equations. I am going to place all of them here so they can serve as a reference:

$\hat{y}^{<t>}$ - this is our (model's) prediction for a given time-step $t$.

  1. $L_t = - y^{<t>} log(\hat{y}^{<t>})$
  2. $\hat{y}^{<t>} = softmax ( W_{ya} a^{<t>} + b_y ) = softmax(z)$ (I add a helper variable $z$ for clarity of derivations)
  3. $a^{<t>} = tanh (W_{ax} x^{<t>} + W_{aa}a^{<t-1>} + b_a)$

where:

$$softmax(z_j) = \frac{e^{z_j}}{\Sigma^K_k e^{z_k}}$$

where $j=1,2,...,k,k+1,...,K$

$$tanh(z) = \frac{sinh(z)}{cosh(z)} = \frac{e^z - e^{-z}}{e^z + e^{-z}} $$
a) Softmax and tanh derivatives

I've looked up the derivative of $tanh(z)$:

$$\frac{\partial tanh(z)}{\partial z} = sech^2(z)$$

Then one can use one of hyperbolic identities:

$$ tanh^2(z) + sech^2(z) = 1 \Rightarrow sech^2(z) = 1 - tanh^2(z) $$

So: $$\frac{\partial tanh(z)}{\partial z} = 1 - tanh^2(z) $$

Now let's proceed to a $softmax$ derivative:

$$ \frac{\partial softmax(z_j)}{\partial z_i} = \frac{\partial}{\partial z_i} \left(\frac{e^{z_j}}{\Sigma^K_k e^{z_k}} \right) $$

where $j=1,2,...,k,k+1,...,K$

We have to consider two situations. One, where indices $i=j$, the second where $i\neq j$.

  • if $i=j$, then the derivative of softmax looks like this:
$$ \frac{\partial softmax(z_j)}{\partial z_i} = \frac{\partial}{\partial z_i} \left(\frac{e^{z_j}}{e^{z_j} + \Sigma_{k, k\neq j}^K e^{z_k} }\right) $$

and $\Sigma_{k, k\neq j}^K e^{z_k}$ is essentially a constant when calculating this derivative, so for clarity I'll use $const.$ to represent it:

$$ \Sigma_{k, k\neq j}^K e^{z_k} \equiv const. $$

so here it goes:

$$ \frac{\partial softmax(z_j)}{\partial z_i} = \frac{\partial}{\partial z_i} \left(\frac{e^{z_j}}{e^{z_j} + const.}\right) $$

I'll also use a basic rule of calculus - quotient rule, to recall:

$$ \frac{\partial}{\partial x} \left(\frac{f(x)}{g(x)}\right) = \frac{ \frac{\partial f(x)}{\partial x} g(x) - f(x) \frac{\partial g(x)}{\partial x} }{g(x)^2} $$

Finally, we'll have:

$$ \frac{\partial softmax(z_j)}{\partial z_i} = \frac{ \frac{\partial e^{z_j}}{\partial z_i} (e^{z_j} + const.) - e^{z_j} \frac{\partial (e^{z_j} + const.)}{\partial z_i} }{ (e^{z_j} + const.)^2} = \frac{ e^{z_j} (e^{z_j} + const.) - e^{z_j} \cdot e^{z_j}}{ (e^{z_j} + const.)^2} $$

We can use our $softmax$ definition to come up with:

$$ \frac{\partial softmax(z_j)}{\partial z_i} = softmax(z_j) - softmax^2(z_j) $$

if $i=j$

  • if $j \neq i $
$$ \frac{\partial softmax(z_j)}{\partial z_i} = \frac{\partial}{\partial z_i} \left(\frac{e^{z_j}}{e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k} }\right) $$

Here we can treat $e^{z_j}$ and $\Sigma_{k, k\neq i}^K e^{z_k}$ as constants. We compute the derivative with respect to $z_i$, i.e. over $e^{z_i}$. So this can be viewed as calculating a simple derivative:

$$ \frac{\partial }{\partial x} \left(\frac{const_1}{e^x + const_2}\right) = const_1 \cdot \frac{\partial }{\partial x} \left(\frac{1}{e^x + const_2}\right) $$

We can again use the quotient rule, but this time it is much simpler:

$$ ... = const_1 \cdot \frac{ 0 \cdot (e^x + const_2) - 1 \cdot e^x }{(e^x + const_2 )^2} $$

so: $$ \frac{\partial softmax(z_j)}{\partial z_i} = e^{z_j} \cdot \frac{\partial}{\partial z_i} \left(\frac{1}{e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k}}\right) = ...$$

$$ ... = e^{z_j} \cdot \frac{ \frac{\partial}{\partial z_i}(1) \cdot (e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k}) - 1 \cdot \frac{\partial}{\partial z_i}(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k}) }{(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k})^2 } = ... $$

$$ ... = e^{z_j} \cdot \frac{ 0 - e^{z_i} }{(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k})^2} = \frac{ - e^{z_j} \cdot e^{z_i} }{(e^{z_i} + \Sigma_{k, k\neq i}^K e^{z_k})^2} = \frac{ - e^{z_j} \cdot e^{z_i} }{( \Sigma_{k}^K e^{z_k})^2} = - softmax(z_i) \cdot softmax(z_j)$$

Alright, just to summarize our results and have a clear reference (a quick numerical sanity check of these two formulas follows the list):

  • if $i=j$: $$ \frac{\partial softmax(z_j)}{\partial z_i} = softmax(z_j) - softmax^2(z_j) $$
  • if $i \neq j$: $$\frac{\partial softmax(z_j)}{\partial z_i} = - softmax(z_i) \cdot softmax(z_j)$$
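As a sanity check of both cases, here is a small numerical verification (my own addition): the full Jacobian of softmax is $diag(s) - s s^T$ with $s = softmax(z)$, and a finite-difference approximation should agree with it.

z = np.random.randn(5, 1)
s = softmax(z)                                   # uses the softmax() defined earlier
J_analytic = np.diagflat(s) - np.dot(s, s.T)     # entry [j, i] = softmax(z_j)*(delta_ij - softmax(z_i))

eps = 1e-6
J_numeric = np.zeros((5, 5))
for i in range(5):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    J_numeric[:, i] = ((softmax(z_plus) - softmax(z_minus)) / (2 * eps)).ravel()

print(np.max(np.abs(J_analytic - J_numeric)))    # ~1e-10, the two agree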
b) Calculating: $\frac{\partial L_t}{\partial W_{ya}}$

Our goal is to compute the derivatives of the cost function $L$ for a single cell with respect to the weights. Let's compute the first derivative, $\frac{\partial L_t}{\partial W_{ya}}$.

Let's recall the functions we'll be differentiating:

  1. $L_t = - y^{<t>} log(\hat{y}^{<t>})$
  2. $\hat{y}^{<t>} = softmax (z )$
  3. $z = W_{ya} a^{<t>} + b_y$

Following a simple chain rule we can write:

$$ \frac{\partial L_t}{\partial W_{ya}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot \frac{\partial z}{\partial W_{ya}} $$

We can already see that $\frac{\partial z}{\partial W_{ya}} = a^{<t>}$, and we have already computed the derivative of $softmax$. Let's compute $\frac{\partial L_t}{\partial \hat{y}^{<t>}}$:

It is nice to remember that $ \frac{\partial}{\partial x} log(x) = \frac{1}{x} $. With that we are properly equipped to solve:

$$\frac{\partial L_t}{\partial \hat{y}^{<t>}} = \frac{\partial}{\partial \hat{y}^{<t>}} (- y^{<t>} log(\hat{y}^{<t>})) = \frac{- y^{<t>}}{\hat{y}^{<t>}}$$

Let's put everything here together:

$$ \frac{\partial L_t}{\partial W_{ya}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot \frac{\partial z}{\partial W_{ya}} = \frac{- y^{<t>}}{\hat{y}^{<t>}} \cdot \Big( ( softmax(z_j) - softmax^2(z_j) ) + (- softmax(z_i)\cdot softmax(z_j)) \Big) \cdot a^{<t>} $$

But first let's make sure we have proper notation. What is $\hat{y}^{<t>}$? It is our model's prediction, and this prediction is made with $softmax$ of course. Hence we have to unify our notation. Taking the derivative with respect to a single component $z_i$ and summing the contributions of all components of $\hat{y}^{<t>}$ (the $i=j$ term plus all $k \neq i$ terms):

$$ \frac{\partial L_t}{\partial z_i} = \frac{- y^{<t_i>}}{\hat{y}^{<t_i>}} \cdot (\hat{y}^{<t_i>} - \hat{y}^{2<t_i>}) + \Sigma^K_{k,k\neq i} \frac{- y^{<t_k>}}{\hat{y}^{<t_k>}} \cdot (- \hat{y}^{<t_k>} \cdot \hat{y}^{<t_i>}) $$

which simplifies to

$$ ... = - y^{<t_i>} + y^{<t_i>}\hat{y}^{<t_i>} + \Sigma^K_{k,k\neq i} y^{<t_k>}\hat{y}^{<t_i>}$$

The sum $\Sigma^K_{k,k\neq i}$ goes over every element except $k=i$, but there is also the term $y^{<t_i>}\hat{y}^{<t_i>}$ which supplies exactly this missing "except" term. We can thus simplify:

$$ \frac{\partial L_t}{\partial z_i} = -y^{<t_i>} + \hat{y}^{<t_i>} \Sigma_k^K y^{<t_k>} $$

So, just to recap: $\hat{y}$ are predictions, they take values between 0 and 1, and sum up to one. But $y$ (without the funny hat) are the 'ground truth' labels. Essentially $y$ is a one-hot encoded vector that has a single $'1'$ and the rest $'0'$. It also sums to one, obviously. Thus we can further simplify:

$$ \frac{\partial L_t}{\partial z_i} = -y^{<t_i>} + \hat{y}^{<t_i>} = \hat{y}^{<t_i>} -y^{<t_i>} $$

So essentially the prediction $\hat{y}$ minus the true label; in vector form, $\frac{\partial L_t}{\partial z} = \hat{y}^{<t>} - y^{<t>}$.

$$ \frac{\partial L_t}{\partial W_{ya}} = \frac{\partial L_t}{\partial z} \frac{\partial z}{\partial W_{ya}} = (\hat{y}^{<t>} -y^{<t>} ) \cdot a^{<t>}$$

Thus in our code we write

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    gradients['dWya'] += np.dot(dy, a.T)
    ...

We add these gradients at each time-step $t$, as our total cost function is a sum of the costs for each time step:

$$ L = \Sigma^T_{i=0} L_i $$

so:

$$ \frac{\partial L}{\partial W_{ya}} = \Sigma^T_{i=0} \frac{\partial L_i}{\partial W_{ya}} $$
c) Calculating: $\frac{\partial L_t}{\partial b_y}$

This will be simple, as we've already computed most of the derivatives:

$$ \frac{\partial L_t}{\partial b_{y}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot \frac{\partial z}{\partial b_y} $$

The last component is simply equal to $1$: since $z = W_{ya} a^{<t>} + b_y$, we have $\frac{\partial z}{\partial b_y} = 1$.

So we're left with $$ \frac{\partial L_t}{\partial b_{y}} = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z} \cdot 1 = \frac{\partial L_t}{\partial z} = \hat{y}^{<t>} -y^{<t>} $$

Hence in our code we'll write

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):

    #...
    gradients['dby'] += dy
    #...
d) Calculating: $W_{aa}$ (time) gradients

Let's recap our RNN cell:

And our equations (I added a bit of helper notation, $p^{<t>}$ and $z^{<t>}$, in order to split them into several equations for clarity):

  1. $L_t = - y^{<t>} log(\hat{y}^{<t>})$
  2. $\hat{y}^{<t>} = softmax(z^{<t>})$
  3. $z^{<t>}= W_{ya} a^{<t>} + b_y$
  4. $a^{<t>} = tanh(p^{<t>})$
  5. $p^{<t>} = W_{ax} x^{<t>} + W_{aa}a^{<t-1>} + b_a$

What we want to compute now is $\frac{\partial a^{<t>}}{\partial W_{aa}}$.

Let's use (again) the chain rule for our equations:

$$ \frac{\partial a^{<t>}}{\partial W_{aa}} = \frac{ \partial a^{<t>} }{\partial p} \frac{\partial p}{\partial W_{aa}} $$

We already know $\frac{ \partial a^{<t>} }{\partial p} $ is just a derivative of tanh function we've computed before: $$\frac{\partial tanh(z)}{\partial z} = 1 - tanh^2(z) $$

And $\frac{\partial p}{\partial W_{aa}} $ is a simple derivative: $$ \frac{\partial p}{\partial W_{aa}} = a^{<t-1>} $$

Combining these we get:

$$ \frac{\partial a^{<t>}}{\partial W_{aa}} = (1 - tanh^2(p^{<t>})) \cdot a^{<t-1>} = (1-a^{2<t>})a^{<t-1>} $$

So we write in our code:

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #...
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    #...

We can now easily compute the other gradient, $\frac{\partial a^{<t>}}{\partial b_a}$, right away:

$$\frac{\partial a^{<t>}}{\partial b_a} = \frac{\partial a^{<t>}}{\partial p} \cdot \frac{\partial p}{\partial b_a} = (1-a^{2<t>})$$

Thus in our code we write:

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #....
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    #...
e) Calculating: $W_{ax}$ (input) gradients

Most of the calculations are already done, so this one doesn't need much explanation:

$$\frac{\partial a^{<t>}}{\partial W_{ax}} = \frac{\partial a^{<t>}}{\partial p} \cdot \frac{\partial p}{\partial W_{ax}} = (1-a^{2<t>}) \cdot x^{<t>}$$

Hence, our code has:

def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    #...
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    #...
    gradients['dWax'] += np.dot(daraw, x.T)
    #...

The final step is to compute the gradient of our loss function for a single time-step, $L_t$, with respect to $a^{<t>}$.

Again, I am going to recall our equations for clarity:

  1. $L_t = - y^{<t>} log(\hat{y}^{<t>})$
  2. $\hat{y}^{<t>} = softmax(z^{<t>})$
  3. $z^{<t>}= W_{ya} a^{<t>} + b_y$
  4. $a^{<t>} = tanh(p^{<t>})$
  5. $p^{<t>} = W_{ax} x^{<t>} + W_{aa}a^{<t-1>} + b_a$
$$ \frac{\partial L_t}{\partial a^{<t>} } = \frac{\partial L_t}{\partial \hat{y}^{<t>}} \cdot \frac{\partial \hat{y}^{<t>}}{\partial z^{<t>} } \cdot \frac{\partial z^{<t>} }{\partial a^{<t>}} $$

The first two derivatives we have already solved.

$$ \frac{\partial L_t}{\partial z} = \hat{y}^{<t>} -y^{<t>} \;\; (= dy \text{ in our code}) $$

$$ \frac{\partial z^{<t>} }{\partial a^{<t>}} = W_{ya}$$

So we get: $$ \frac{\partial L_t}{\partial a^{<t>} } = (\hat{y}^{<t>} -y^{<t>}) \cdot W_{ya} $$

Which we implement in code as:

da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h

And now, the final (!) derivative:

$$ \frac{\partial a^{<t>}}{\partial a^{<t-1>}} = \frac{\partial a^{<t>} }{\partial p } \cdot \frac{\partial p}{\partial a^{<t-1>} }$$

The first derivative is a derivative over tanh, the second is pretty simple

$$ \frac{\partial a^{<t>}}{\partial a^{<t-1>}} = (1 - tanh^2(p^{<t>})) \cdot W_{aa} = (1-a^{2<t>})W_{aa} $$

Which appears in our code as:

daraw = (1 - a * a) * da # backprop through tanh nonlinearity
gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)

Ufff... derivatives are easy, but they take time, patience and energy. No wonder the major frameworks have this implemented "under the hood", so that people only have to create computation graphs.


In [27]:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients

2.2.4. Backward pass through a network

We've implemented the backward pass for a single cell. Now we just have to repeat it for the whole sequence, looping backwards through time, and then update the parameters with the resulting gradients.


In [28]:
def update_parameters(parameters, gradients, lr):

    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b']  += -lr * gradients['db']
    parameters['by']  += -lr * gradients['dby']
    return parameters


def rnn_backward(X, Y, parameters, cache):
    # Initialize gradients as an empty dictionary
    gradients = {}
    
    # Retrieve from cache and parameters
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    # each one should be initialized to zeros of the same dimension as its corresponding parameter
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    
    # Backpropagate through time
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])

    
    return gradients, a
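Before moving on, here is a minimal numerical gradient check (my own sanity test, not part of the original course code): perturb one weight, recompute the loss with rnn_forward, and compare the finite-difference slope with the analytic gradient returned by rnn_backward.

np.random.seed(0)
n_a_check = 5
params_check = initialize_parameters(n_a_check, vocab_size, vocab_size)
a0 = np.zeros((n_a_check, 1))
X_check = [None, 3, 1, 2]      # a short dummy input sequence ('c', 'a', 'b')
Y_check = [3, 1, 2, 0]         # X shifted left, terminated with '\n' (index 0)

_, cache = rnn_forward(X_check, Y_check, a0, params_check)
grads, _ = rnn_backward(X_check, Y_check, params_check, cache)

eps = 1e-5
i, j = 2, 3                    # check one arbitrary entry of Waa
params_check['Waa'][i, j] += eps
loss_plus, _ = rnn_forward(X_check, Y_check, a0, params_check)
params_check['Waa'][i, j] -= 2 * eps
loss_minus, _ = rnn_forward(X_check, Y_check, a0, params_check)
params_check['Waa'][i, j] += eps   # restore the original value

print((loss_plus - loss_minus) / (2 * eps), grads['dWaa'][i, j])   # the two numbers should match closely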

2.2.5. Gradient clipping

Our "flow" throughout the network is as follows:

  1. Forward pass through the network
  2. Loss computation
  3. Backward propagation - computing gradients
  4. Parameter update

Between points 3 and 4 we are far better off adding gradient clipping to prevent "exploding gradients", i.e. numerical overflow that often produces NaNs when gradients take on enormously large values. To avoid this we simply specify a maxValue: if a gradient goes above this value (or below -maxValue) we clip it, i.e. set it to that boundary. Pretty simple:


In [33]:
def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    # Unpacking gradient for clarity
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    # clipping to mitigate exploding gradients
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue,+maxValue,gradient) #we overwrite current 'gradient' with clipped one.

    # We pack the gradients into a dictionary
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    #and return it.
    return gradients
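A tiny demonstration with made-up gradient values: anything outside [-5, 5] gets saturated at the boundary (note that np.clip modifies the arrays in place here):

demo_gradients = {'dWaa': np.array([[7.0, -9.0]]), 'dWax': np.zeros((1, 1)),
                  'dWya': np.zeros((1, 1)), 'db': np.zeros((1, 1)), 'dby': np.zeros((1, 1))}
print(clip(demo_gradients, 5)['dWaa'])   # [[ 5. -5.]]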

III. Creating our RNN architecture

Overview

Our architecture...

Sampling

Sampling is the process of generating novel observations, in our example new sequences of letters that constitute made-up bacterial genus names.

Once our network is trained, we pass a zero vector $\vec{0}$ as the initial hidden state $a^{<0>}$ (and also as the first input $x^{<1>}$). Then we propagate through the first unit of our NN and obtain the next hidden state $a^{<t+1>}$ and a prediction vector $\hat{y}^{<t+1>}$ that represents the probability of each character $i$:

$$ a^{<t+1>} = tanh(W_{ax}x^{<t>} + W_{aa}a^{<t>} + b) $$$$ \hat{y}^{<t+1>} = softmax( W_{ya}a^{<t+1>} + b_y ) $$

Having computed the vector of probabilities $\hat{y}^{<t+1>}$, we now perform the sampling procedure. We do not simply pick the character with the highest probability; that would generate the same result every time for a given dataset. We also do not want to pick characters uniformly at random, as the results would then be random and the whole architecture we built would be useless. The key is to select (i.e. sample) a character from the distribution $\hat{y}^{<t+1>}$ itself.

In other words, we'll pick the index $i$ (remember that we have one-hot encoded our alphabet) with the probability stored at the $i$-th position of the $\hat{y}^{<t+1>}$ vector. We'll use numpy's random.choice to achieve this.

import numpy as np
np.random.choice([list of possible values], p = flattened-probability-distribution)
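For instance, a minimal illustration with a hypothetical 3-character alphabet and a made-up distribution:

import numpy as np

probs = np.array([0.2, 0.5, 0.3])                # made-up probabilities for 3 characters
idx = np.random.choice(range(3), p=probs)        # index 1 will be drawn about half the time
print(idx)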

The final step is to overwrite the input variable $x^{<t+1>}$ with the one-hot encoding of the index $i$ selected in the previous step. This is represented as the red arrow on the picture below:


We'll continue this propagation until we reach the end-of-line character \n. Then the generation is finished and we can print the result. If something goes wrong, we additionally impose a limit of 50 characters, tracked by the counter variable.


In [34]:
def sample(parameters, char_to_ix):
    """
    Sample a sequence of characters according to a sequence of probability distributions output of the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.


    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation)
    x = np.zeros((vocab_size,1))
    # Step 1': Initialize a_prev as zeros
    a_prev = np.zeros((n_a,1))
    
    # An empty list of indices, this is the list which will contain the list of indices of the characters to generate 
    indices = []
    
    # Idx is a flag to detect a newline character, we initialize it to -1
    idx = -1 
    
    # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append 
    # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well 
    # trained model), which helps debugging and prevents entering an infinite loop. 
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward propagate x using the above equations
        a = np.tanh(np.dot(Wax,x)+np.dot(Waa,a_prev)+b)
        z = np.dot(Wya,a)+by
        y = softmax(z)
        
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(range(vocab_size), p = y.ravel()) # numpy 'ravel' flattens our list.
      
        # Append the index to "indices"
        indices.append(idx)
        
        # Step 4: Overwrite the input character as the one corresponding to the sampled index.
        x = np.zeros((vocab_size,1))
        x[idx] = 1 #this is one hot encoding. All zeros, then set index as '1'.
        
        # Update "a_prev" to be "a"
        a_prev = a

        counter +=1

    # if something goes wrong, we'll stop at 50th index.
    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices
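As a quick smoke test (my own addition), we can run the sampler with freshly initialized, untrained parameters; the output is of course gibberish, but it confirms that the plumbing works end to end:

np.random.seed(3)
test_params = initialize_parameters(50, vocab_size, vocab_size)
print_sample(sample(test_params, char_to_ix), ix_to_char)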

Building a language model

Stochastic Gradient Descent optimization

In our model we'll use stochastic gradient descent, meaning we'll loop over one example at a time.

The function optimize will forward propagate through our network, compute and clip the gradients, and update the parameters according to the provided learning rate. Here we simply reuse the functions implemented above:


In [35]:
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01,maxGradient=5):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    

    # Forward propagate through time 
    loss, cache =  rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip your gradients between -maxGradient (min) and maxGradient (max) 
    gradients = clip(gradients, maxGradient)
    
    # Update parameters
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    
    return loss, gradients, a[len(X)-1]
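To make the (X, Y) convention concrete before the full training loop, here is how a single training pair is built from one example name and pushed through one optimization step (a sketch; "zymomonas" is the lowercased form of one of the names we saw in the tail of genera.txt):

example = "zymomonas"
X = [None] + [char_to_ix[ch] for ch in example]   # None will become the zero input vector
Y = X[1:] + [char_to_ix["\n"]]                    # X shifted one step left, ending with '\n'

a_prev = np.zeros((50, 1))
demo_parameters = initialize_parameters(50, vocab_size, vocab_size)
loss, gradients, a_prev = optimize(X, Y, a_prev, demo_parameters, learning_rate=0.01)
print(loss)                                       # cross-entropy of this single name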

Training the model


In [36]:
def model(filename, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, genera_names = 7, vocab_size = vocab_size,learning_rate=0.01,params=None,printing=True):
    """
    Trains the model and generates novel genera names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    genera_names -- number of bacterial names you want to sample at each iteration. 
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    
    Returns:
    parameters -- learned parameters
    """
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    if params==None:
        parameters = initialize_parameters(n_a, n_x, n_y)
    else:
        parameters=params
    
    # Initialize loss (this is required because we want to smooth our loss)
    loss = get_initial_loss(vocab_size, genera_names)
    
    # Build list of all bacterial names (training examples).
    with open(filename) as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples] #we convert names to lowercase and create a list of genera names
    
    # Shuffle list of all bacterial names
    np.random.shuffle(examples)
    
    # Initialize the hidden state
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):

        # Define one training example (X,Y) (≈ 2 lines)
        index = j % len(examples) # we wrap around so the index never exceeds the number of examples and we always refer to a valid one
        
        X = [None] + [char_to_ix[ch] for ch in examples[index]] 
            # [None] at the beginning will be interpreted as zero vector by a function `rnn_forward`
        Y = X[1:] + [char_to_ix["\n"]]
            # Y is X shifted one position to the left (dropping the leading [None]), with an end-of-name character appended.
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate = learning_rate)
        

        # Keep a running (smoothed) average of the loss so the reported values are less noisy.
        loss = smooth(loss, curr_loss)

        # Every 2000 iterations, generate a few names using sample() to check whether the model is learning properly
        if (j % 2000==0 and printing==True):
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            # The number of genera names to print
            for name in range(genera_names):
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix)
                print_sample(sampled_indices, ix_to_char)

            print('\n')
        
    return parameters

In [37]:
parameters = model('genera.txt', ix_to_char, char_to_ix, num_iterations = 85000,params=None,printing=False)

In [38]:
parameters = model('genera.txt', ix_to_char, char_to_ix, num_iterations = 8000,params=parameters, genera_names=10)


Iteration: 0, Loss: 32.940970

Caryhepherea
Xiderlis
Daluriphodycapideaestrerous
Hyoellochus
Agrerodancherus
Sardaera
Talkita
Gonisaia
Verodopoda
Pydodya


Iteration: 2000, Loss: 25.390203

Oelytilon
Hropsium
Rypesploma
Dicocenchomenum
Calnima
Sestera
Hingia
Horgopis
Helficarmus
Sylotes


Iteration: 4000, Loss: 23.936103

Salanillus
Thraxtnopla
Ehndcospira
Pyrropotropes
Testesyllus
Phrbodopibala
Lentromingonibactimum
Rytticoccus
Zylocrocera
Dopyora


Iteration: 6000, Loss: 23.759958

Obyraptorium
Gopediodrida
Sardobella
Teles
Acorospa
Decerocolponia
Hlosta
Nonea
Tidrdillla
Alacophivintesia


Results:

Below I show some of the generated names that you cannot find in the SILVA database, but which sound plausible and really cool to me:

  • Nitronella
  • Sebacter
  • Vetia
  • Setinelfonax
  • Vestaphylococcus
  • Setonas
  • Nembacterium
  • Pioclococclus
  • Detiptonus
  • Frreptococcus
  • Teeutomonas
  • Fetiphylococcus
  • Blunna
  • Alococella
  • Tantatum
  • Cublia
  • Palibacter
  • Arstrosa
  • Glymia
  • Actoboctellibacterium