Problem 1: Language Modeling with RNNs

  • Learning Objective: In this problem, you will implement simple recurrent neural networks to deeply understand how RNNs work.
  • Provided Code: We provide the skeletons of the classes you need to complete. Forward checks and gradient checks are provided for verifying your implementation as well.
  • TODOs: You will implement an LSTM and use it to train a model that can generate text from your own text source (a novel, lyrics, etc.). Also, please do not forget to answer the two inline questions before the LSTM part.

In [3]:
from lib.rnn import *
from lib.layer_utils import *
from lib.grad_check import *
from lib.optim import *
from lib.train import *
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Recurrent Neural Networks

We will use recurrent neural network (RNN) language models for text generation. The file lib/layer_utils.py contains implementations of the different layer types that are needed for recurrent neural networks, and the file lib/rnn.py uses these layers to implement a text generation model.

We will implement LSTM layers in lib/layer_utils.py. As a reference, you are given complete code for the other layers, including a vanilla RNN. Let's first look through the vanilla RNN and the other layers you may need for language modeling. The first part doesn't involve any coding; you can simply check the code and run the cells to make sure everything works as you expect.

Vanilla RNN: step forward

Open the file lib/layer_utils.py. This file implements the forward and backward passes for different types of layers that are commonly used in recurrent neural networks.

First check the implementation of the function step_forward, which implements the forward pass for a single timestep of a vanilla recurrent neural network. We provide this function for you. After doing so, run the following code. You should see errors less than 1e-8.

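For reference, the core of a vanilla RNN step is a single tanh-squashed affine update of the input and the previous hidden state. A minimal sketch of the idea in NumPy (using the np import from the first cell); the actual class additionally reads its weights from rnn.params and returns a cache for the backward pass:

def vanilla_rnn_step(x, prev_h, Wx, Wh, b):
    # x: (N, D), prev_h: (N, H), Wx: (D, H), Wh: (H, H), b: (H,)
    a = x.dot(Wx) + prev_h.dot(Wh) + b   # affine combination of input and previous hidden state
    next_h = np.tanh(a)                  # elementwise nonlinearity
    return next_h                        # shape (N, H)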

In [4]:
N, D, H = 3, 10, 4

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")
x = np.linspace(-0.4, 0.7, num=N*D).reshape(N, D)
prev_h = np.linspace(-0.2, 0.5, num=N*H).reshape(N, H)

rnn.params[rnn.wx_name] = np.linspace(-0.1, 0.9, num=D*H).reshape(D, H)
rnn.params[rnn.wh_name] = np.linspace(-0.3, 0.7, num=H*H).reshape(H, H)
rnn.params[rnn.b_name] = np.linspace(-0.2, 0.4, num=H)

next_h, _ = rnn.step_forward(x, prev_h)
expected_next_h = np.asarray([
  [-0.58172089, -0.50182032, -0.41232771, -0.31410098],
  [ 0.66854692,  0.79562378,  0.87755553,  0.92795967],
  [ 0.97934501,  0.99144213,  0.99646691,  0.99854353]])

print('next_h error: ', rel_error(expected_next_h, next_h))


('next_h error: ', 6.2924214264710366e-09)

Vanilla RNN: step backward

In the VanillaRNN class in the file lib/layer_utils.py, check the step_backward function. After doing so, run the following to numerically gradient-check the implementation. You should see errors less than 1e-8.


In [5]:
np.random.seed(599)
N, D, H = 4, 5, 6

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")

x = np.random.randn(N, D)
h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

rnn.params[rnn.wx_name] = Wx
rnn.params[rnn.wh_name] = Wh
rnn.params[rnn.b_name] = b

out, meta = rnn.step_forward(x, h)

dnext_h = np.random.randn(*out.shape)

dx_num = eval_numerical_gradient_array(lambda x: rnn.step_forward(x, h)[0], x, dnext_h)
dprev_h_num = eval_numerical_gradient_array(lambda h: rnn.step_forward(x, h)[0], h, dnext_h)
dWx_num = eval_numerical_gradient_array(lambda Wx: rnn.step_forward(x, h)[0], Wx, dnext_h)
dWh_num = eval_numerical_gradient_array(lambda Wh: rnn.step_forward(x, h)[0], Wh, dnext_h)
db_num = eval_numerical_gradient_array(lambda b: rnn.step_forward(x, h)[0], b, dnext_h)

dx, dprev_h, dWx, dWh, db = rnn.step_backward(dnext_h, meta)

print('dx error: ', rel_error(dx_num, dx))
print('dprev_h error: ', rel_error(dprev_h_num, dprev_h))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))


('dx error: ', 1.0572038136587296e-10)
('dprev_h error: ', 3.5393379712705807e-10)
('dWx error: ', 3.8546248896806929e-10)
('dWh error: ', 1.5834356040887039e-10)
('db error: ', 2.0171824713729017e-11)

Vanilla RNN: forward

Now that you have checked the forward and backward passes for a single timestep of a vanilla RNN, you will see how they are combined to implement an RNN that processes an entire sequence of data.

In the VanillaRNN class in the file lib/layer_utils.py, check the function forward. We provide this function for you. It is implemented using the step_forward function described above. After doing so, run the following to check the implementation. You should see errors less than 1e-7.

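Conceptually, the full forward pass just applies step_forward once per timestep, threading each hidden state into the next step. A minimal sketch, with parameter handling and caching omitted:

def rnn_forward(x, h0, step_forward):
    # x: (N, T, D), h0: (N, H); step_forward returns (next_h, meta)
    N, T, D = x.shape
    h = np.zeros((N, T, h0.shape[1]))
    prev_h = h0
    for t in range(T):
        prev_h, _ = step_forward(x[:, t, :], prev_h)  # one recurrent step
        h[:, t, :] = prev_h
    return h                                          # hidden states for all timesteps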

In [6]:
N, T, D, H = 2, 3, 4, 5

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")

x = np.linspace(-0.1, 0.3, num=N*T*D).reshape(N, T, D)
h0 = np.linspace(-0.3, 0.1, num=N*H).reshape(N, H)
Wx = np.linspace(-0.2, 0.4, num=D*H).reshape(D, H)
Wh = np.linspace(-0.4, 0.1, num=H*H).reshape(H, H)
b = np.linspace(-0.7, 0.1, num=H)

rnn.params[rnn.wx_name] = Wx
rnn.params[rnn.wh_name] = Wh
rnn.params[rnn.b_name] = b

h = rnn.forward(x, h0)
expected_h = np.asarray([
  [
    [-0.42070749, -0.27279261, -0.11074945,  0.05740409,  0.22236251],
    [-0.39525808, -0.22554661, -0.0409454,   0.14649412,  0.32397316],
    [-0.42305111, -0.24223728, -0.04287027,  0.15997045,  0.35014525],
  ],
  [
    [-0.55857474, -0.39065825, -0.19198182,  0.02378408,  0.23735671],
    [-0.27150199, -0.07088804,  0.13562939,  0.33099728,  0.50158768],
    [-0.51014825, -0.30524429, -0.06755202,  0.17806392,  0.40333043]]])
print('h error: ', rel_error(expected_h, h))


('h error: ', 7.7284661583051643e-08)

Vanilla RNN: backward

In the file lib/layer_utils.py, check the backward pass for a vanilla RNN in the function backward in the VanillaRNN class. We provide this function for you. This runs back-propagation over the entire sequence, calling into the step_backward function defined above. You should see errors less than 5e-7.

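The backward pass runs the loop in reverse: at each timestep, the hidden state receives gradient both from the loss at that timestep and from the following timestep, and the parameter gradients are summed over time. A minimal sketch, assuming step_backward and a list of per-timestep caches saved during the forward pass:

def rnn_backward(dh, metas, step_backward, D):
    # dh: (N, T, H) upstream gradient on every hidden state
    N, T, H = dh.shape
    dx = np.zeros((N, T, D))
    dprev_h = np.zeros((N, H))                 # gradient flowing through the recurrent link
    dWx, dWh, db = 0, 0, 0
    for t in reversed(range(T)):
        # combine the upstream gradient at step t with the gradient arriving from step t+1
        dx_t, dprev_h, dWx_t, dWh_t, db_t = step_backward(dh[:, t, :] + dprev_h, metas[t])
        dx[:, t, :] = dx_t
        dWx, dWh, db = dWx + dWx_t, dWh + dWh_t, db + db_t   # accumulate parameter grads over time
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db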

In [7]:
np.random.seed(599)

N, D, T, H = 2, 3, 10, 5

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")

x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

rnn.params[rnn.wx_name] = Wx
rnn.params[rnn.wh_name] = Wh
rnn.params[rnn.b_name] = b

out = rnn.forward(x, h0)

dout = np.random.randn(*out.shape)

dx, dh0 = rnn.backward(dout)

dx_num = eval_numerical_gradient_array(lambda x: rnn.forward(x, h0), x, dout)
dh0_num = eval_numerical_gradient_array(lambda h0: rnn.forward(x, h0), h0, dout)
dWx_num = eval_numerical_gradient_array(lambda Wx: rnn.forward(x, h0), Wx, dout)
dWh_num = eval_numerical_gradient_array(lambda Wh: rnn.forward(x, h0), Wh, dout)
db_num = eval_numerical_gradient_array(lambda b: rnn.forward(x, h0), b, dout)

dWx = rnn.grads[rnn.wx_name]
dWh = rnn.grads[rnn.wh_name]
db = rnn.grads[rnn.b_name]

print('dx error: ', rel_error(dx_num, dx))
print('dh0 error: ', rel_error(dh0_num, dh0))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))


('dx error: ', 7.7799301042286723e-10)
('dh0 error: ', 1.0070785373693291e-10)
('dWx error: ', 4.9188023498158563e-09)
('dWh error: ', 1.1713005188891452e-09)
('db error: ', 2.4727869972841184e-10)

Word embedding: forward

In deep learning systems, we commonly represent words using vectors. Each word of the vocabulary will be associated with a vector, and these vectors will be learned jointly with the rest of the system.

In the file lib/layer_utils.py, check the function forward in the word_embedding class, which converts words (represented by integers) into vectors. We provide this function for you. Run the following to check the implementation. You should see an error around 1e-8.

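Internally, the forward pass of a word embedding is just an integer-array lookup into the weight matrix. A sketch of the idea:

def word_embedding_forward(x, W):
    # x: (N, T) integer word indices, W: (V, D) embedding matrix
    out = W[x]        # fancy indexing: each index is replaced by its D-dimensional row
    return out        # shape (N, T, D)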

In [8]:
N, T, V, D = 2, 4, 5, 3

we = word_embedding(V, D, name="we")

x = np.asarray([[0, 3, 1, 2], [2, 1, 0, 3]])
W = np.linspace(0, 1, num=V*D).reshape(V, D)

we.params[we.w_name] = W

out = we.forward(x)
expected_out = np.asarray([
 [[ 0.,          0.07142857,  0.14285714],
  [ 0.64285714,  0.71428571,  0.78571429],
  [ 0.21428571,  0.28571429,  0.35714286],
  [ 0.42857143,  0.5,         0.57142857]],
 [[ 0.42857143,  0.5,         0.57142857],
  [ 0.21428571,  0.28571429,  0.35714286],
  [ 0.,          0.07142857,  0.14285714],
  [ 0.64285714,  0.71428571,  0.78571429]]])

print('out error: ', rel_error(expected_out, out))


('out error: ', 1.0000000094736443e-08)

Word embedding: backward

Check the backward pass for the word embedding in the backward function of the word_embedding class. We provide this function for you. After doing so, run the following to numerically gradient-check the implementation. You should see errors less than 1e-11.

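The backward pass scatters the upstream gradients back onto the rows of W that were looked up; repeated indices must accumulate, which np.add.at handles. A minimal sketch:

def word_embedding_backward(dout, x, V, D):
    # dout: (N, T, D) upstream gradient, x: (N, T) word indices used in the forward pass
    dW = np.zeros((V, D))
    np.add.at(dW, x, dout)   # accumulate gradients for rows that are indexed more than once
    return dW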

In [9]:
np.random.seed(599)

N, T, V, D = 50, 3, 5, 6

we = word_embedding(V, D, name="we")

x = np.random.randint(V, size=(N, T))
W = np.random.randn(V, D)

we.params[we.w_name] = W

out = we.forward(x)
dout = np.random.randn(*out.shape)
we.backward(dout)

dW = we.grads[we.w_name]

f = lambda W: we.forward(x)
dW_num = eval_numerical_gradient_array(f, W, dout)

print('dW error: ', rel_error(dW, dW_num))


('dW error: ', 3.2759325224156931e-12)

Inline Question: Why do we want to represent words using word embeddings instead of one-hot vectors ( https://en.wikipedia.org/wiki/One-hot )? Provide one advantage of word embeddings.

Ans:

One-hot encoding implies that words are independent of one another. The word-embedding representation, in contrast, can capture relationships between words, for example regularities such as 'a' never being followed by 'the'. Also, one-hot encoding of a large vocabulary results in a very high-dimensional, sparse input, which is known to generalize poorly.

Temporal Fully Connected layer

At every timestep we use an affine function to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary. Because this is very similar to the fully connected layer that you implemented in assignment 1, we have provided this layer for you as the forward and backward functions in the file lib/layer_utils.py. Run the following to perform numeric gradient checking on the implementation. You should see errors less than 1e-9.

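The temporal fully connected layer is an ordinary affine layer applied independently at every timestep, implemented by folding the time dimension into the batch dimension. A minimal sketch of the forward pass:

def temporal_fc_forward(x, w, b):
    # x: (N, T, D), w: (D, M), b: (M,)
    N, T, D = x.shape
    out = x.reshape(N * T, D).dot(w).reshape(N, T, -1) + b   # affine map over the folded batch
    return out                                               # shape (N, T, M)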

In [10]:
np.random.seed(599)

# Gradient check for temporal affine layer
N, T, D, M = 2, 3, 4, 5

t_fc = temporal_fc(D, M, init_scale=0.02, name='test_t_fc')

x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)

t_fc.params[t_fc.w_name] = w
t_fc.params[t_fc.b_name] = b

out = t_fc.forward(x)

dout = np.random.randn(*out.shape)

dx_num = eval_numerical_gradient_array(lambda x: t_fc.forward(x), x, dout)
dw_num = eval_numerical_gradient_array(lambda w: t_fc.forward(x), w, dout)
db_num = eval_numerical_gradient_array(lambda b: t_fc.forward(x), b, dout)

dx = t_fc.backward(dout)
dw = t_fc.grads[t_fc.w_name]
db = t_fc.grads[t_fc.b_name]

print('dx error: ', rel_error(dx_num, dx))
print('dw error: ', rel_error(dw_num, dw))
print('db error: ', rel_error(db_num, db))


('dx error: ', 1.0589570959816868e-09)
('dw error: ', 1.3273531472902296e-10)
('db error: ', 1.8540488325611036e-11)

Temporal Softmax loss

In an RNN language model, at every timestep we produce a score for each word in the vocabulary. We know the ground-truth word at each timestep, so we use a softmax loss function to compute loss and gradient at each timestep. We sum the losses over time and average them over the minibatch.

We provide this loss function for you; look at the temporal_softmax_loss function in the file lib/layer_utils.py.

Run the following cell to sanity check the loss and perform numeric gradient checking on the function. You should see an error for dx less than 1e-7.

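In other words, the loss is a per-timestep softmax cross-entropy, zeroed out wherever the mask is False, summed over time, and averaged over the minibatch. A minimal sketch of the forward computation:

def temporal_softmax_loss_forward(x, y, mask):
    # x: (N, T, V) scores, y: (N, T) ground-truth word indices, mask: (N, T) booleans
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)
    probs = np.exp(x_flat - x_flat.max(axis=1, keepdims=True))   # numerically stable softmax
    probs /= probs.sum(axis=1, keepdims=True)
    # cross-entropy at every (sample, timestep), masked, summed over time, averaged over N
    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N
    return loss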

In [11]:
loss_func = temporal_softmax_loss()

# Sanity check for temporal softmax loss
N, T, V = 100, 1, 10

def check_loss(N, T, V, p):
    x = 0.001 * np.random.randn(N, T, V)
    y = np.random.randint(V, size=(N, T))
    mask = np.random.rand(N, T) <= p
    print(loss_func.forward(x, y, mask))
  
check_loss(100, 1, 10, 1.0)   # Should be about 2.3
check_loss(100, 10, 10, 1.0)  # Should be about 23
check_loss(5000, 10, 10, 0.1) # Should be about 2.3

# Gradient check for temporal softmax loss
N, T, V = 7, 8, 9

x = np.random.randn(N, T, V)
y = np.random.randint(V, size=(N, T))
mask = (np.random.rand(N, T) > 0.5)

loss = loss_func.forward(x, y, mask)
dx = loss_func.backward()

dx_num = eval_numerical_gradient(lambda x: loss_func.forward(x, y, mask), x, verbose=False)

print('dx error: ', rel_error(dx, dx_num))


2.3026437533
23.0261456673
2.30211012501
('dx error: ', 3.7356355317189332e-08)

Inline Question: Using a softmax function over the vocabulary for word prediction is common in language modeling. However, this technique is not perfect. What do you think are its major disadvantages? Please provide one disadvantage of using a softmax over the vocabulary.

Ans:

A larger vocabulary leads to increased training time, since the softmax denominator requires summing over every word in the vocabulary.

RNN for language modeling

Now that you have the necessary layers, you can combine them to build a language model. Open the file lib/rnn.py and look at the TestRNN class.

Check the forward and backward pass of the model in the loss function. For now you only see the implementation of the case where cell_type='rnn' for vanilla RNNs; you will implement the LSTM case later. After doing so, run the following to check the forward pass using a small test case; you should see an error less than 1e-10.


In [12]:
N, D, H = 10, 20, 40
V = 4
T = 13

model = TestRNN(D, H, cell_type='rnn')
loss_func = temporal_softmax_loss()

# Set all model parameters to fixed values
for k, v in model.params.items():
    model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)
model.assign_params()

features = np.linspace(-1.5, 0.3, num=(N * D * T)).reshape(N, T, D)
h0 = np.linspace(-1.5, 0.5, num=(N*H)).reshape(N, H)
labels = (np.arange(N * T) % V).reshape(N, T)

pred = model.forward(features, h0)

# You'll need this
mask = np.ones((N, T))

loss = loss_func.forward(pred, labels, mask)
dLoss = loss_func.backward()

expected_loss = 51.0949189134

print('loss: ', loss)
print('expected loss: ', expected_loss)
print('difference: ', abs(loss - expected_loss))


('loss: ', 51.094918913361184)
('expected loss: ', 51.0949189134)
('difference: ', 3.8816949654574273e-11)

Run the following cell to perform numeric gradient checking on the TestRNN class; you should see errors around 1e-7 or less.


In [13]:
np.random.seed(599)

batch_size = 2
timesteps = 3
input_dim = 4
hidden_dim = 6
label_size = 4

labels = np.random.randint(label_size, size=(batch_size, timesteps))
features = np.random.randn(batch_size, timesteps, input_dim)
h0 = np.random.randn(batch_size, hidden_dim)

model = TestRNN(input_dim, hidden_dim, cell_type='rnn')
loss_func = temporal_softmax_loss()

pred = model.forward(features, h0)

# You'll need this
mask = np.ones((batch_size, timesteps))

loss = loss_func.forward(pred, labels, mask)
dLoss = loss_func.backward()

dout, dh0 = model.backward(dLoss)

grads = model.grads

for param_name in sorted(grads):
    
    f = lambda _: loss_func.forward(model.forward(features, h0), labels, mask)
    param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)
    e = rel_error(param_grad_num, grads[param_name])
    print('%s relative error: %e' % (param_name, e))


vanilla_rnn_b relative error: 1.728230e-08
vanilla_rnn_wh relative error: 3.659114e-07
vanilla_rnn_wx relative error: 1.780015e-09

LSTM

Vanilla RNNs can be tough to train on long sequences due to vanishing and exploding gradients. LSTMs address this problem by replacing the simple update rule of the vanilla RNN with a gating mechanism, as follows.

Similar to the vanilla RNN, at each timestep we receive an input $x_t\in\mathbb{R}^D$ and the previous hidden state $h_{t-1}\in\mathbb{R}^H$; what is different is that the LSTM also maintains an $H$-dimensional cell state, so we also receive the previous cell state $c_{t-1}\in\mathbb{R}^H$. The learnable parameters of the LSTM are an input-to-hidden matrix $W_x\in\mathbb{R}^{4H\times D}$, a hidden-to-hidden matrix $W_h\in\mathbb{R}^{4H\times H}$ and a bias vector $b\in\mathbb{R}^{4H}$.

At each timestep we first compute an activation vector $a\in\mathbb{R}^{4H}$ as $a=W_xx_t + W_hh_{t-1}+b$. We then divide this into four vectors $a_i,a_f,a_o,a_g\in\mathbb{R}^H$, where $a_i$ consists of the first $H$ elements of $a$, $a_f$ is the next $H$ elements of $a$, and so on. We then compute the input gate $i\in\mathbb{R}^H$, forget gate $f\in\mathbb{R}^H$, output gate $o\in\mathbb{R}^H$ and block input $g\in\mathbb{R}^H$ as

$$ \begin{align*} i = \sigma(a_i) \hspace{2pc} f = \sigma(a_f) \hspace{2pc} o = \sigma(a_o) \hspace{2pc} g = \tanh(a_g) \end{align*} $$

where $\sigma$ is the sigmoid function and $\tanh$ is the hyperbolic tangent, both applied elementwise.

Finally we compute the next cell state $c_t$ and next hidden state $h_t$ as

$$ c_{t} = f\odot c_{t-1} + i\odot g \hspace{4pc} h_t = o\odot\tanh(c_t) $$

where $\odot$ is the elementwise product of vectors.

In the rest of the notebook we will implement the LSTM update rule and apply it to the text generation task.

In the code, we assume that data is stored in batches so that $X_t \in \mathbb{R}^{N\times D}$, and will work with transposed versions of the parameters: $W_x \in \mathbb{R}^{D \times 4H}$, $W_h \in \mathbb{R}^{H\times 4H}$, so that the activations $A \in \mathbb{R}^{N\times 4H}$ can be computed efficiently as $A = X_t W_x + H_{t-1} W_h + b$ (with $b$ broadcast across the batch).
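
Translating the equations above into this batched convention, one LSTM step looks roughly like the following sketch (the class you implement must additionally save a cache for the backward pass):

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, prev_h, prev_c, Wx, Wh, b):
    # x: (N, D), prev_h / prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
    H = prev_h.shape[1]
    a = x.dot(Wx) + prev_h.dot(Wh) + b       # activation vector, shape (N, 4H)
    i = sigmoid(a[:, 0*H:1*H])               # input gate
    f = sigmoid(a[:, 1*H:2*H])               # forget gate
    o = sigmoid(a[:, 2*H:3*H])               # output gate
    g = np.tanh(a[:, 3*H:4*H])               # block input
    next_c = f * prev_c + i * g              # cell state update
    next_h = o * np.tanh(next_c)             # hidden state
    return next_h, next_c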

LSTM: step forward

Implement the forward pass for a single timestep of an LSTM in the step_forward function of the LSTM class in the file lib/layer_utils.py. This should be similar to the vanilla RNN step_forward function that you checked above, but using the LSTM update rule instead.

Once you are done, run the following to perform a simple test of your implementation. You should see errors around 1e-8 or less.


In [15]:
N, D, H = 3, 4, 5

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.linspace(-0.4, 1.2, num=N*D).reshape(N, D)
prev_h = np.linspace(-0.3, 0.7, num=N*H).reshape(N, H)
prev_c = np.linspace(-0.4, 0.9, num=N*H).reshape(N, H)
Wx = np.linspace(-2.1, 1.3, num=4*D*H).reshape(D, 4 * H)
Wh = np.linspace(-0.7, 2.2, num=4*H*H).reshape(H, 4 * H)
b = np.linspace(0.3, 0.7, num=4*H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

next_h, next_c, cache = lstm.step_forward(x, prev_h, prev_c)

expected_next_h = np.asarray([
    [ 0.24635157,  0.28610883,  0.32240467,  0.35525807,  0.38474904],
    [ 0.49223563,  0.55611431,  0.61507696,  0.66844003,  0.7159181 ],
    [ 0.56735664,  0.66310127,  0.74419266,  0.80889665,  0.858299  ]])
expected_next_c = np.asarray([
    [ 0.32986176,  0.39145139,  0.451556,    0.51014116,  0.56717407],
    [ 0.66382255,  0.76674007,  0.87195994,  0.97902709,  1.08751345],
    [ 0.74192008,  0.90592151,  1.07717006,  1.25120233,  1.42395676]])

print('next_h error: ', rel_error(expected_next_h, next_h))
print('next_c error: ', rel_error(expected_next_c, next_c))


('next_h error: ', 5.7054131967097955e-09)
('next_c error: ', 5.8143123088804145e-09)

LSTM: step backward

Implement the backward pass for a single LSTM timestep in the function step_backward in the file lib/layer_utils.py. Once you are done, run the following to perform numeric gradient checking on your implementation. You should see errors around 1e-6 or less.

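For the backward pass, you propagate the upstream gradients dnext_h and dnext_c back through the two output equations, the four gates, and the affine transform. A sketch of the chain rule, assuming a cache that stores the quantities from the forward sketch above (the exact cache contents in your implementation may differ):

def lstm_step_backward(dnext_h, dnext_c, cache):
    # cache is assumed to hold the tensors saved during the forward pass
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c = cache
    tanh_c = np.tanh(next_c)
    do = dnext_h * tanh_c
    dc = dnext_c + dnext_h * o * (1 - tanh_c ** 2)   # total gradient reaching the cell state
    di, df, dg = dc * g, dc * prev_c, dc * i
    dprev_c = dc * f
    # backprop through the sigmoid / tanh gate nonlinearities
    da = np.hstack([di * i * (1 - i),
                    df * f * (1 - f),
                    do * o * (1 - o),
                    dg * (1 - g ** 2)])              # shape (N, 4H)
    dx = da.dot(Wx.T)
    dprev_h = da.dot(Wh.T)
    dWx = x.T.dot(da)
    dWh = prev_h.T.dot(da)
    db = da.sum(axis=0)
    return dx, dprev_h, dprev_c, dWx, dWh, db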

In [16]:
np.random.seed(599)

N, D, H = 4, 5, 6

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
prev_c = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

next_h, next_c, cache = lstm.step_forward(x, prev_h, prev_c)

dnext_h = np.random.randn(*next_h.shape)
dnext_c = np.random.randn(*next_c.shape)

fx_h = lambda x: lstm.step_forward(x, prev_h, prev_c)[0]
fh_h = lambda h: lstm.step_forward(x, prev_h, prev_c)[0]
fc_h = lambda c: lstm.step_forward(x, prev_h, prev_c)[0]
fWx_h = lambda Wx: lstm.step_forward(x, prev_h, prev_c)[0]
fWh_h = lambda Wh: lstm.step_forward(x, prev_h, prev_c)[0]
fb_h = lambda b: lstm.step_forward(x, prev_h, prev_c)[0]

fx_c = lambda x: lstm.step_forward(x, prev_h, prev_c)[1]
fh_c = lambda h: lstm.step_forward(x, prev_h, prev_c)[1]
fc_c = lambda c: lstm.step_forward(x, prev_h, prev_c)[1]
fWx_c = lambda Wx: lstm.step_forward(x, prev_h, prev_c)[1]
fWh_c = lambda Wh: lstm.step_forward(x, prev_h, prev_c)[1]
fb_c = lambda b: lstm.step_forward(x, prev_h, prev_c)[1]

num_grad = eval_numerical_gradient_array

dx_num = num_grad(fx_h, x, dnext_h) + num_grad(fx_c, x, dnext_c)
dh_num = num_grad(fh_h, prev_h, dnext_h) + num_grad(fh_c, prev_h, dnext_c)
dc_num = num_grad(fc_h, prev_c, dnext_h) + num_grad(fc_c, prev_c, dnext_c)
dWx_num = num_grad(fWx_h, Wx, dnext_h) + num_grad(fWx_c, Wx, dnext_c)
dWh_num = num_grad(fWh_h, Wh, dnext_h) + num_grad(fWh_c, Wh, dnext_c)
db_num = num_grad(fb_h, b, dnext_h) + num_grad(fb_c, b, dnext_c)

dx, dh, dc, dWx, dWh, db = lstm.step_backward(dnext_h, dnext_c, cache)

print('dx error: ', rel_error(dx_num, dx))
print('dh error: ', rel_error(dh_num, dh))
print('dc error: ', rel_error(dc_num, dc))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))


('dx error: ', 2.2946366283874835e-10)
('dh error: ', 2.4947448337233013e-10)
('dc error: ', 3.3749720436309407e-10)
('dWx error: ', 4.1155492102291343e-09)
('dWh error: ', 3.9916828842447042e-09)
('db error: ', 3.28181173198853e-10)

LSTM: forward

In the LSTM class in the file lib/layer_utils.py, implement the forward function to run an LSTM forward over an entire time series of data.

When you are done, run the following to check your implementation. You should see an error around 1e-7.

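As with the vanilla RNN, the full forward pass loops the single-step function over all timesteps; the only new ingredient is that the cell state is threaded along as well. A minimal sketch, assuming the initial cell state is all zeros:

def lstm_forward(x, h0, lstm_step, Wx, Wh, b):
    # x: (N, T, D), h0: (N, H); the initial cell state is assumed to be zeros
    N, T, D = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    prev_h, prev_c = h0, np.zeros((N, H))
    for t in range(T):
        prev_h, prev_c = lstm_step(x[:, t, :], prev_h, prev_c, Wx, Wh, b)
        h[:, t, :] = prev_h       # only the hidden states are returned
    return h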

In [17]:
N, D, H, T = 2, 5, 4, 3

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.linspace(-0.4, 0.6, num=N*T*D).reshape(N, T, D)
h0 = np.linspace(-0.4, 0.8, num=N*H).reshape(N, H)
Wx = np.linspace(-0.2, 0.9, num=4*D*H).reshape(D, 4 * H)
Wh = np.linspace(-0.3, 0.6, num=4*H*H).reshape(H, 4 * H)
b = np.linspace(0.2, 0.7, num=4*H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

h = lstm.forward(x, h0)

expected_h = np.asarray([
 [[ 0.01764008,  0.01823233,  0.01882671,  0.0194232 ],
  [ 0.11287491,  0.12146228,  0.13018446,  0.13902939],
  [ 0.31358768,  0.33338627,  0.35304453,  0.37250975]],
 [[ 0.45767879,  0.4761092,   0.4936887,   0.51041945],
  [ 0.6704845,   0.69350089,  0.71486014,  0.7346449 ],
  [ 0.81733511,  0.83677871,  0.85403753,  0.86935314]]])

print('h error: ', rel_error(expected_h, h))


('h error: ', 8.6105374521066237e-08)

LSTM: backward

Implement the backward pass for an LSTM over an entire time series of data in the backward function of the LSTM class in the file lib/layer_utils.py. When you are done, run the following to perform numeric gradient checking on your implementation. You should see errors around 1e-7 or less.


In [19]:
np.random.seed(599)

N, D, T, H = 2, 3, 10, 6

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

out = lstm.forward(x, h0)

dout = np.random.randn(*out.shape)

dx, dh0 = lstm.backward(dout)
dWx = lstm.grads[lstm.wx_name] 
dWh = lstm.grads[lstm.wh_name]
db = lstm.grads[lstm.b_name]

dx_num = eval_numerical_gradient_array(lambda x: lstm.forward(x, h0), x, dout)
dh0_num = eval_numerical_gradient_array(lambda h0: lstm.forward(x, h0), h0, dout)
dWx_num = eval_numerical_gradient_array(lambda Wx: lstm.forward(x, h0), Wx, dout)
dWh_num = eval_numerical_gradient_array(lambda Wh: lstm.forward(x, h0), Wh, dout)
db_num = eval_numerical_gradient_array(lambda b: lstm.forward(x, h0), b, dout)

print('dx error: ', rel_error(dx_num, dx))
print('dh0 error: ', rel_error(dh0_num, dh0))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))


('dx error: ', 1.2350616505712788e-09)
('dh0 error: ', 7.2170665908619868e-09)
('dWx error: ', 3.0736103206610719e-09)
('dWh error: ', 2.3989797381822739e-08)
('db error: ', 1.4990423406995592e-08)

LSTM model

Now that you have implemented an LSTM, update the initialization of the TestRNN class in the file lib/rnn.py to handle the case where self.cell_type is 'lstm'. This should require adding only one line of code.

Once you have done so, run the following to check your implementation. You should see a difference of less than 1e-10.


In [20]:
N, D, H = 10, 20, 40
V = 4
T = 13

model = TestRNN(D, H, cell_type='lstm')
loss_func = temporal_softmax_loss()

# Set all model parameters to fixed values
for k, v in model.params.items():
    model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)
model.assign_params()

features = np.linspace(-1.5, 0.3, num=(N * D * T)).reshape(N, T, D)
h0 = np.linspace(-1.5, 0.5, num=(N*H)).reshape(N, H)
labels = (np.arange(N * T) % V).reshape(N, T)

pred = model.forward(features, h0)

# You'll need this
mask = np.ones((N, T))

loss = loss_func.forward(pred, labels, mask)
dLoss = loss_func.backward()

expected_loss = 49.2140256354

print('loss: ', loss)
print('expected loss: ', expected_loss)
print('difference: ', abs(loss - expected_loss))


('loss: ', 49.21402563544293)
('expected loss: ', 49.2140256354)
('difference: ', 4.2930992094625253e-11)

Let's have some fun!!

Now you have everything you need for language modeling. You will work on text generation using RNNs trained on any text source you like (a novel, lyrics, etc.). The network is trained to predict which word comes next given the previous word. Once the model is trained, you can keep generating new text that mimics the original source by feeding the network's predictions back in as its inputs. Let's first put the source text you want to model in the following text box!

Note: in order to run the next cell, paste your own text into the form and hit Enter. Do not use the notebook's own 'run cell' command, since it would not read in anything.


In [21]:
from ipywidgets import widgets, interact
from IPython.display import display
input_text = widgets.Text()
input_text.value = "Paste your own text words here and hit Enter."
def f(x):
    print('set!!')
    print(x.value)
input_text.on_submit(f)
input_text
# copy paste your text source in the box below and hit enter.
# If you don't have any preference, 
# you can copy paste the lyrics from here https://www.azlyrics.com/lyrics/ylvis/thefox.html


set!!
Dog goes woof, cat goes meow. Bird goes tweet, and mouse goes squeak. Cow goes moo. Frog goes croak, and the elephant goes toot. Ducks say quack and fish go blub, and the seal goes ow ow ow. But there's one sound that no one knows... What does the fox say?  Ring-ding-ding-ding-dingeringeding! Gering-ding-ding-ding-dingeringeding! Gering-ding-ding-ding-dingeringeding! What the fox say? Wa-pa-pa-pa-pa-pa-pow! Wa-pa-pa-pa-pa-pa-pow! Wa-pa-pa-pa-pa-pa-pow! What the fox say? Hatee-hatee-hatee-ho! Hatee-hatee-hatee-ho! Hatee-hatee-hatee-ho! What the fox say? Joff-tchoff-tchoff-tchoffo-tchoffo-tchoff! Joff-tchoff-tchoff-tchoffo-tchoffo-tchoff! Joff-tchoff-tchoff-tchoffo-tchoffo-tchoff! What the fox say?  Big blue eyes, pointy nose, chasing mice, and digging holes. Tiny paws, up the hill, suddenly you're standing still.  Your fur is red, so beautiful, like an angel in disguise. But if you meet a friendly horse, will you communicate by mo-o-o-o-orse, mo-o-o-o-orse, mo-o-o-o-orse? How will you speak to that h-o-o-orse, h-o-o-orse, h-o-o-orse? What does the fox say?!  Jacha-chacha-chacha-chow! Jacha-chacha-chacha-chow! Jacha-chacha-chacha-chow! What the fox say? Fraka-kaka-kaka-kaka-kow! Fraka-kaka-kaka-kaka-kow! Fraka-kaka-kaka-kaka-kow! What the fox say? A-hee-ahee ha-hee! A-hee-ahee ha-hee! A-hee-ahee ha-hee! What the fox say? A-oo-oo-oo-ooo! Woo-oo-oo-ooo! What does the fox say?!  The secret of the fox, ancient mystery. Somewhere deep in the woods, I know you're hiding. What is your sound? Will we ever know? Will always be a mystery what do you say?  You're my guardian angel hiding in the woods. What is your sound?  A-bubu-duh-bubu-dwee-dum a-bubu-duh-bubu-dwee-dum Will we ever know?  A-bubu-duh-bubu-dwee-dum I want to, I want to, I want to know!  A-bubu-duh-bubu-dwee-dum Bay-buh-day bum-bum bay-dum

Simply run the following code to construct the training dataset.


In [22]:
import re

text = re.split(' |\n',input_text.value.lower()) # all words are converted into lower case
outputSize = len(text)
word_list = list(set(text))
dataSize = len(word_list)
output = np.zeros(outputSize)
for i in range(0, outputSize):
    index = np.where(np.asarray(word_list) == text[i])
    output[i] = index[0]
data_labels = output.astype(np.int)
gt_labels = data_labels[1:]
data_labels = data_labels[:-1]

print('Input text size: %s' % outputSize)
print('Input word number: %s' % dataSize)


Input text size: 243
Input word number: 124

We defined a LanguageModelRNN class for you; fill in its TODO block in rnn.py.

  • Design a recurrent neural network consisting of a word_embedding layer, a recurrent unit, and a temporal fully connected layer so that they match the provided dimensions.
  • Please read train.py under the lib directory carefully and complete the TODO blocks in the train_net function first.

In [27]:
# you can change the following parameters.
D = 10 # input dimension
H = 20 # hidden space dimension
T = 50 # timesteps
N = 10 # batch size
max_epoch = 100 # max epoch size

loss_func = temporal_softmax_loss()
# you can change the cell_type between 'rnn' and 'lstm'.
model = LanguageModelRNN(dataSize, D, H, cell_type='rnn')
optimizer = Adam(model, 5e-4)

data = { 'data_train': data_labels, 'labels_train': gt_labels }

results = train_net(data, model, loss_func, optimizer, timesteps=T, batch_size=N, max_epochs=max_epoch, verbose=True)


(Iteration 1 / 2400) loss: 240.948200334
bast performance 14.0495867769%
(Epoch 1 / 100) Training Accuracy: 0.140495867769
(Epoch 2 / 100) Training Accuracy: 0.111570247934
(Epoch 3 / 100) Training Accuracy: 0.0371900826446
(Epoch 4 / 100) Training Accuracy: 0.140495867769
(Iteration 101 / 2400) loss: 206.438885202
bast performance 19.4214876033%
(Epoch 5 / 100) Training Accuracy: 0.194214876033
bast performance 24.7933884298%
(Epoch 6 / 100) Training Accuracy: 0.247933884298
bast performance 29.7520661157%
(Epoch 7 / 100) Training Accuracy: 0.297520661157
bast performance 35.5371900826%
(Epoch 8 / 100) Training Accuracy: 0.355371900826
(Iteration 201 / 2400) loss: 154.456066367
bast performance 38.4297520661%
(Epoch 9 / 100) Training Accuracy: 0.384297520661
bast performance 40.9090909091%
(Epoch 10 / 100) Training Accuracy: 0.409090909091
bast performance 44.2148760331%
(Epoch 11 / 100) Training Accuracy: 0.442148760331
bast performance 46.694214876%
(Epoch 12 / 100) Training Accuracy: 0.46694214876
(Iteration 301 / 2400) loss: 116.392241041
bast performance 49.5867768595%
(Epoch 13 / 100) Training Accuracy: 0.495867768595
bast performance 52.0661157025%
(Epoch 14 / 100) Training Accuracy: 0.520661157025
bast performance 54.958677686%
(Epoch 15 / 100) Training Accuracy: 0.54958677686
bast performance 58.6776859504%
(Epoch 16 / 100) Training Accuracy: 0.586776859504
(Iteration 401 / 2400) loss: 83.3635896424
bast performance 60.3305785124%
(Epoch 17 / 100) Training Accuracy: 0.603305785124
bast performance 62.3966942149%
(Epoch 18 / 100) Training Accuracy: 0.623966942149
bast performance 64.0495867769%
(Epoch 19 / 100) Training Accuracy: 0.640495867769
bast performance 66.1157024793%
(Epoch 20 / 100) Training Accuracy: 0.661157024793
(Iteration 501 / 2400) loss: 76.551585709
bast performance 67.7685950413%
(Epoch 21 / 100) Training Accuracy: 0.677685950413
bast performance 69.4214876033%
(Epoch 22 / 100) Training Accuracy: 0.694214876033
bast performance 69.8347107438%
(Epoch 23 / 100) Training Accuracy: 0.698347107438
(Epoch 24 / 100) Training Accuracy: 0.698347107438
bast performance 70.2479338843%
(Epoch 25 / 100) Training Accuracy: 0.702479338843
(Iteration 601 / 2400) loss: 62.1009644769
bast performance 71.0743801653%
(Epoch 26 / 100) Training Accuracy: 0.710743801653
bast performance 72.3140495868%
(Epoch 27 / 100) Training Accuracy: 0.723140495868
bast performance 72.7272727273%
(Epoch 28 / 100) Training Accuracy: 0.727272727273
bast performance 74.7933884298%
(Epoch 29 / 100) Training Accuracy: 0.747933884298
(Iteration 701 / 2400) loss: 47.9225882739
bast performance 77.2727272727%
(Epoch 30 / 100) Training Accuracy: 0.772727272727
bast performance 78.5123966942%
(Epoch 31 / 100) Training Accuracy: 0.785123966942
(Epoch 32 / 100) Training Accuracy: 0.780991735537
bast performance 80.1652892562%
(Epoch 33 / 100) Training Accuracy: 0.801652892562
(Iteration 801 / 2400) loss: 37.9668435702
bast performance 80.5785123967%
(Epoch 34 / 100) Training Accuracy: 0.805785123967
bast performance 81.8181818182%
(Epoch 35 / 100) Training Accuracy: 0.818181818182
bast performance 82.2314049587%
(Epoch 36 / 100) Training Accuracy: 0.822314049587
(Epoch 37 / 100) Training Accuracy: 0.822314049587
(Iteration 901 / 2400) loss: 32.1713138271
(Epoch 38 / 100) Training Accuracy: 0.818181818182
bast performance 82.6446280992%
(Epoch 39 / 100) Training Accuracy: 0.826446280992
bast performance 83.8842975207%
(Epoch 40 / 100) Training Accuracy: 0.838842975207
bast performance 84.2975206612%
(Epoch 41 / 100) Training Accuracy: 0.842975206612
(Iteration 1001 / 2400) loss: 33.5184739182
bast performance 84.7107438017%
(Epoch 42 / 100) Training Accuracy: 0.847107438017
bast performance 85.1239669421%
(Epoch 43 / 100) Training Accuracy: 0.851239669421
bast performance 85.5371900826%
(Epoch 44 / 100) Training Accuracy: 0.855371900826
bast performance 85.9504132231%
(Epoch 45 / 100) Training Accuracy: 0.859504132231
(Iteration 1101 / 2400) loss: 23.8747487804
bast performance 86.3636363636%
(Epoch 46 / 100) Training Accuracy: 0.863636363636
(Epoch 47 / 100) Training Accuracy: 0.863636363636
bast performance 86.7768595041%
(Epoch 48 / 100) Training Accuracy: 0.867768595041
(Epoch 49 / 100) Training Accuracy: 0.867768595041
bast performance 87.1900826446%
(Epoch 50 / 100) Training Accuracy: 0.871900826446
(Iteration 1201 / 2400) loss: 26.6933439366
(Epoch 51 / 100) Training Accuracy: 0.871900826446
(Epoch 52 / 100) Training Accuracy: 0.871900826446
(Epoch 53 / 100) Training Accuracy: 0.871900826446
(Epoch 54 / 100) Training Accuracy: 0.871900826446
(Iteration 1301 / 2400) loss: 19.2160611844
(Epoch 55 / 100) Training Accuracy: 0.871900826446
bast performance 88.0165289256%
(Epoch 56 / 100) Training Accuracy: 0.880165289256
(Epoch 57 / 100) Training Accuracy: 0.880165289256
(Epoch 58 / 100) Training Accuracy: 0.880165289256
(Iteration 1401 / 2400) loss: 20.1950392124
(Epoch 59 / 100) Training Accuracy: 0.880165289256
(Epoch 60 / 100) Training Accuracy: 0.880165289256
(Epoch 61 / 100) Training Accuracy: 0.880165289256
(Epoch 62 / 100) Training Accuracy: 0.880165289256
(Iteration 1501 / 2400) loss: 19.8410973224
(Epoch 63 / 100) Training Accuracy: 0.880165289256
(Epoch 64 / 100) Training Accuracy: 0.880165289256
(Epoch 65 / 100) Training Accuracy: 0.880165289256
bast performance 88.4297520661%
(Epoch 66 / 100) Training Accuracy: 0.884297520661
(Iteration 1601 / 2400) loss: 18.2522435393
(Epoch 67 / 100) Training Accuracy: 0.884297520661
(Epoch 68 / 100) Training Accuracy: 0.884297520661
(Epoch 69 / 100) Training Accuracy: 0.884297520661
(Epoch 70 / 100) Training Accuracy: 0.884297520661
(Iteration 1701 / 2400) loss: 17.8943193664
(Epoch 71 / 100) Training Accuracy: 0.884297520661
(Epoch 72 / 100) Training Accuracy: 0.884297520661
(Epoch 73 / 100) Training Accuracy: 0.884297520661
(Epoch 74 / 100) Training Accuracy: 0.884297520661
bast performance 89.2561983471%
(Epoch 75 / 100) Training Accuracy: 0.892561983471
(Iteration 1801 / 2400) loss: 15.9935118939
(Epoch 76 / 100) Training Accuracy: 0.892561983471
bast performance 89.6694214876%
(Epoch 77 / 100) Training Accuracy: 0.896694214876
(Epoch 78 / 100) Training Accuracy: 0.896694214876
(Epoch 79 / 100) Training Accuracy: 0.896694214876
(Iteration 1901 / 2400) loss: 16.2722013036
bast performance 90.0826446281%
(Epoch 80 / 100) Training Accuracy: 0.900826446281
(Epoch 81 / 100) Training Accuracy: 0.900826446281
(Epoch 82 / 100) Training Accuracy: 0.900826446281
(Epoch 83 / 100) Training Accuracy: 0.900826446281
(Iteration 2001 / 2400) loss: 15.6248589112
(Epoch 84 / 100) Training Accuracy: 0.900826446281
(Epoch 85 / 100) Training Accuracy: 0.900826446281
(Epoch 86 / 100) Training Accuracy: 0.900826446281
(Epoch 87 / 100) Training Accuracy: 0.900826446281
(Iteration 2101 / 2400) loss: 15.1308384798
(Epoch 88 / 100) Training Accuracy: 0.900826446281
(Epoch 89 / 100) Training Accuracy: 0.900826446281
(Epoch 90 / 100) Training Accuracy: 0.900826446281
bast performance 90.4958677686%
(Epoch 91 / 100) Training Accuracy: 0.904958677686
(Iteration 2201 / 2400) loss: 10.6012486324
(Epoch 92 / 100) Training Accuracy: 0.904958677686
(Epoch 93 / 100) Training Accuracy: 0.904958677686
(Epoch 94 / 100) Training Accuracy: 0.904958677686
(Epoch 95 / 100) Training Accuracy: 0.904958677686
(Iteration 2301 / 2400) loss: 11.8674254837
(Epoch 96 / 100) Training Accuracy: 0.904958677686
(Epoch 97 / 100) Training Accuracy: 0.904958677686
(Epoch 98 / 100) Training Accuracy: 0.904958677686
(Epoch 99 / 100) Training Accuracy: 0.904958677686
(Epoch 100 / 100) Training Accuracy: 0.904958677686

Simply run the following code block to check the loss and accuracy curve.


In [28]:
opt_params, loss_hist, train_acc_hist = results

# Plot the learning curves
plt.subplot(2, 1, 1)
plt.title('Training loss')
loss_hist_ = loss_hist[1::100] # subsample the loss history so the curve is easier to read
plt.plot(loss_hist_, '-o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(train_acc_hist, '-o', label='Training')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)

plt.show()


Now you can generate text using the trained model. You can also start from a specific word in the original text. If you trained your model on "The Fox", you can check how well the text is modeled by starting from "dog", "cat", etc.

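Under the hood, sampling usually works by repeatedly feeding the most recently generated word back into the network as the next input. A hedged sketch of the idea; the step_scores helper below is hypothetical, and the provided model.sample may differ in details such as greedy vs. stochastic word selection:

def sample_words(step_scores, start_idx, length):
    # step_scores(prev_idx, state) -> (scores over the vocabulary, new hidden state); hypothetical helper
    words, idx, state = [start_idx], start_idx, None
    for _ in range(length):
        scores, state = step_scores(idx, state)
        idx = int(np.argmax(scores))    # greedy choice; sampling from the softmax is also common
        words.append(idx)
    return words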

In [29]:
# you can change the generated text length below.
text_length = 100

idx = 0
# you can also start from a specific word.
# since the words are all converted into lower case, make sure you put lower case below.
idx = int(np.where(np.asarray(word_list) == 'dog')[0])

# sample from the trained model
words = model.sample(idx, text_length-1)

# convert indices into words
output = [ word_list[i] for i in words]
print(' '.join(output))


dog what does the fox say?  big blue eyes, pointy nose, chasing mice, and digging holes. tiny paws, up the hill, suddenly you're standing still.  your fur is red, so beautiful, like an angel in disguise. but if you meet a friendly horse, will you communicate by mo-o-o-o-orse, mo-o-o-o-orse, mo-o-o-o-orse? how will you speak to that h-o-o-orse, h-o-o-orse, h-o-o-orse? what does the fox say?  big blue eyes, pointy nose, chasing mice, and digging holes. tiny paws, up the hill, suddenly you're standing still.  your fur is red, so beautiful, like an angel in disguise. but if

In [30]:
# you can change the following parameters.
D = 10 # input dimension
H = 20 # hidden space dimension
T = 50 # timesteps
N = 10 # batch size
max_epoch = 100 # max epoch size

loss_func = temporal_softmax_loss()
# you can change the cell_type between 'rnn' and 'lstm'.
model = LanguageModelRNN(dataSize, D, H, cell_type='rnn')
optimizer = Adam(model, 5e-4)

data = { 'data_train': data_labels, 'labels_train': gt_labels }

results = train_net(data, model, loss_func, optimizer, timesteps=T, batch_size=N, max_epochs=max_epoch, verbose=True)

text_length = 100

idx = 0
# you can also start from a specific word.
# since the words are all converted into lower case, make sure you put lower case below.
idx = int(np.where(np.asarray(word_list) == 'dog')[0])

# sample from the trained model
words = model.sample(idx, text_length-1)

# convert indices into words
output = [ word_list[i] for i in words]
print(' '.join(output))


(Iteration 1 / 2400) loss: 241.009299207
bast performance 15.2892561983%
(Epoch 1 / 100) Training Accuracy: 0.152892561983
(Epoch 2 / 100) Training Accuracy: 0.144628099174
(Epoch 3 / 100) Training Accuracy: 0.0702479338843
(Epoch 4 / 100) Training Accuracy: 0.0785123966942
(Iteration 101 / 2400) loss: 201.861487105
(Epoch 5 / 100) Training Accuracy: 0.136363636364
(Epoch 6 / 100) Training Accuracy: 0.148760330579
bast performance 21.4876033058%
(Epoch 7 / 100) Training Accuracy: 0.214876033058
bast performance 24.3801652893%
(Epoch 8 / 100) Training Accuracy: 0.243801652893
(Iteration 201 / 2400) loss: 178.322637964
bast performance 27.6859504132%
(Epoch 9 / 100) Training Accuracy: 0.276859504132
bast performance 32.2314049587%
(Epoch 10 / 100) Training Accuracy: 0.322314049587
bast performance 37.1900826446%
(Epoch 11 / 100) Training Accuracy: 0.371900826446
bast performance 42.1487603306%
(Epoch 12 / 100) Training Accuracy: 0.421487603306
(Iteration 301 / 2400) loss: 138.368568359
bast performance 45.4545454545%
(Epoch 13 / 100) Training Accuracy: 0.454545454545
bast performance 48.7603305785%
(Epoch 14 / 100) Training Accuracy: 0.487603305785
bast performance 50.826446281%
(Epoch 15 / 100) Training Accuracy: 0.50826446281
bast performance 52.0661157025%
(Epoch 16 / 100) Training Accuracy: 0.520661157025
(Iteration 401 / 2400) loss: 99.0103268836
(Epoch 17 / 100) Training Accuracy: 0.520661157025
bast performance 54.132231405%
(Epoch 18 / 100) Training Accuracy: 0.54132231405
bast performance 56.1983471074%
(Epoch 19 / 100) Training Accuracy: 0.561983471074
bast performance 58.2644628099%
(Epoch 20 / 100) Training Accuracy: 0.582644628099
(Iteration 501 / 2400) loss: 87.2303514744
(Epoch 21 / 100) Training Accuracy: 0.582644628099
(Epoch 22 / 100) Training Accuracy: 0.582644628099
bast performance 59.9173553719%
(Epoch 23 / 100) Training Accuracy: 0.599173553719
bast performance 61.9834710744%
(Epoch 24 / 100) Training Accuracy: 0.619834710744
bast performance 64.0495867769%
(Epoch 25 / 100) Training Accuracy: 0.640495867769
(Iteration 601 / 2400) loss: 61.3914063432
bast performance 65.2892561983%
(Epoch 26 / 100) Training Accuracy: 0.652892561983
bast performance 68.1818181818%
(Epoch 27 / 100) Training Accuracy: 0.681818181818
bast performance 69.0082644628%
(Epoch 28 / 100) Training Accuracy: 0.690082644628
bast performance 69.8347107438%
(Epoch 29 / 100) Training Accuracy: 0.698347107438
(Iteration 701 / 2400) loss: 53.1526264987
bast performance 71.0743801653%
(Epoch 30 / 100) Training Accuracy: 0.710743801653
bast performance 71.4876033058%
(Epoch 31 / 100) Training Accuracy: 0.714876033058
(Epoch 32 / 100) Training Accuracy: 0.714876033058
bast performance 73.1404958678%
(Epoch 33 / 100) Training Accuracy: 0.731404958678
(Iteration 801 / 2400) loss: 44.4484329921
(Epoch 34 / 100) Training Accuracy: 0.731404958678
bast performance 74.7933884298%
(Epoch 35 / 100) Training Accuracy: 0.747933884298
bast performance 76.0330578512%
(Epoch 36 / 100) Training Accuracy: 0.760330578512
bast performance 76.8595041322%
(Epoch 37 / 100) Training Accuracy: 0.768595041322
(Iteration 901 / 2400) loss: 38.2834773118
(Epoch 38 / 100) Training Accuracy: 0.764462809917
(Epoch 39 / 100) Training Accuracy: 0.768595041322
bast performance 78.0991735537%
(Epoch 40 / 100) Training Accuracy: 0.780991735537
bast performance 79.3388429752%
(Epoch 41 / 100) Training Accuracy: 0.793388429752
(Iteration 1001 / 2400) loss: 36.7710567606
(Epoch 42 / 100) Training Accuracy: 0.789256198347
bast performance 79.7520661157%
(Epoch 43 / 100) Training Accuracy: 0.797520661157
bast performance 80.5785123967%
(Epoch 44 / 100) Training Accuracy: 0.805785123967
bast performance 80.9917355372%
(Epoch 45 / 100) Training Accuracy: 0.809917355372
(Iteration 1101 / 2400) loss: 34.9026823002
(Epoch 46 / 100) Training Accuracy: 0.809917355372
bast performance 81.8181818182%
(Epoch 47 / 100) Training Accuracy: 0.818181818182
(Epoch 48 / 100) Training Accuracy: 0.814049586777
bast performance 82.2314049587%
(Epoch 49 / 100) Training Accuracy: 0.822314049587
(Epoch 50 / 100) Training Accuracy: 0.822314049587
(Iteration 1201 / 2400) loss: 33.1067711089
bast performance 82.6446280992%
(Epoch 51 / 100) Training Accuracy: 0.826446280992
bast performance 83.0578512397%
(Epoch 52 / 100) Training Accuracy: 0.830578512397
bast performance 83.4710743802%
(Epoch 53 / 100) Training Accuracy: 0.834710743802
bast performance 84.2975206612%
(Epoch 54 / 100) Training Accuracy: 0.842975206612
(Iteration 1301 / 2400) loss: 28.9865681565
bast performance 85.1239669421%
(Epoch 55 / 100) Training Accuracy: 0.851239669421
bast performance 85.5371900826%
(Epoch 56 / 100) Training Accuracy: 0.855371900826
(Epoch 57 / 100) Training Accuracy: 0.855371900826
bast performance 85.9504132231%
(Epoch 58 / 100) Training Accuracy: 0.859504132231
(Iteration 1401 / 2400) loss: 25.281354305
bast performance 86.3636363636%
(Epoch 59 / 100) Training Accuracy: 0.863636363636
bast performance 86.7768595041%
(Epoch 60 / 100) Training Accuracy: 0.867768595041
(Epoch 61 / 100) Training Accuracy: 0.867768595041
(Epoch 62 / 100) Training Accuracy: 0.867768595041
(Iteration 1501 / 2400) loss: 21.5926245356
(Epoch 63 / 100) Training Accuracy: 0.867768595041
(Epoch 64 / 100) Training Accuracy: 0.867768595041
(Epoch 65 / 100) Training Accuracy: 0.867768595041
(Epoch 66 / 100) Training Accuracy: 0.867768595041
(Iteration 1601 / 2400) loss: 23.3614621348
(Epoch 67 / 100) Training Accuracy: 0.867768595041
bast performance 87.6033057851%
(Epoch 68 / 100) Training Accuracy: 0.876033057851
bast performance 88.4297520661%
(Epoch 69 / 100) Training Accuracy: 0.884297520661
(Epoch 70 / 100) Training Accuracy: 0.884297520661
(Iteration 1701 / 2400) loss: 24.1392224933
bast performance 88.8429752066%
(Epoch 71 / 100) Training Accuracy: 0.888429752066
(Epoch 72 / 100) Training Accuracy: 0.888429752066
bast performance 89.2561983471%
(Epoch 73 / 100) Training Accuracy: 0.892561983471
bast performance 89.6694214876%
(Epoch 74 / 100) Training Accuracy: 0.896694214876
(Epoch 75 / 100) Training Accuracy: 0.896694214876
(Iteration 1801 / 2400) loss: 15.4121952957
(Epoch 76 / 100) Training Accuracy: 0.896694214876
bast performance 90.0826446281%
(Epoch 77 / 100) Training Accuracy: 0.900826446281
(Epoch 78 / 100) Training Accuracy: 0.900826446281
(Epoch 79 / 100) Training Accuracy: 0.900826446281
(Iteration 1901 / 2400) loss: 15.9503617666
(Epoch 80 / 100) Training Accuracy: 0.900826446281
(Epoch 81 / 100) Training Accuracy: 0.900826446281
(Epoch 82 / 100) Training Accuracy: 0.900826446281
(Epoch 83 / 100) Training Accuracy: 0.900826446281
(Iteration 2001 / 2400) loss: 17.5366131086
(Epoch 84 / 100) Training Accuracy: 0.900826446281
(Epoch 85 / 100) Training Accuracy: 0.900826446281
(Epoch 86 / 100) Training Accuracy: 0.900826446281
(Epoch 87 / 100) Training Accuracy: 0.900826446281
(Iteration 2101 / 2400) loss: 16.8978550408
bast performance 90.4958677686%
(Epoch 88 / 100) Training Accuracy: 0.904958677686
(Epoch 89 / 100) Training Accuracy: 0.900826446281
bast performance 90.9090909091%
(Epoch 90 / 100) Training Accuracy: 0.909090909091
(Epoch 91 / 100) Training Accuracy: 0.909090909091
(Iteration 2201 / 2400) loss: 15.7221155294
(Epoch 92 / 100) Training Accuracy: 0.909090909091
(Epoch 93 / 100) Training Accuracy: 0.909090909091
(Epoch 94 / 100) Training Accuracy: 0.909090909091
(Epoch 95 / 100) Training Accuracy: 0.909090909091
(Iteration 2301 / 2400) loss: 11.5662807602
(Epoch 96 / 100) Training Accuracy: 0.909090909091
(Epoch 97 / 100) Training Accuracy: 0.904958677686
(Epoch 98 / 100) Training Accuracy: 0.909090909091
bast performance 91.3223140496%
(Epoch 99 / 100) Training Accuracy: 0.913223140496
bast performance 91.7355371901%
(Epoch 100 / 100) Training Accuracy: 0.917355371901
dog goes ow ow ow. but there's one sound that no one knows... what does the fox say?  ring-ding-ding-ding-dingeringeding! gering-ding-ding-ding-dingeringeding! gering-ding-ding-ding-dingeringeding! what the fox say?  big blue eyes, pointy nose, chasing mice, and digging holes. tiny paws, up the hill, suddenly you're standing still.  your fur is red, so beautiful, like an angel in disguise. but if you meet a friendly horse, will you communicate by mo-o-o-o-orse, mo-o-o-o-orse, mo-o-o-o-orse? how will you speak to that h-o-o-orse, h-o-o-orse, h-o-o-orse? what does the fox say?  ring-ding-ding-ding-dingeringeding! gering-ding-ding-ding-dingeringeding! gering-ding-ding-ding-dingeringeding! what the fox say?  big blue eyes, pointy nose,

Inline Question: Play around with different settings to get a better understanding of the models' behavior and describe your observations. Make sure you at least cover the following points:

  • Vanilla RNN vs. LSTM (you can set different numbers of timesteps and test with longer texts.)
  • Problems with these approaches (there is no unique answer; just explain your own opinion based on your experiments.)

Ans:

The LSTM seems to 'remember' phrases over longer durations, while the vanilla RNN often fails to model long-term dependencies. In this particular case, using the LSTM resulted in certain sentences being repeated over and over, while the RNN generated shorter fragments that were not necessarily repeated. The LSTM learns a longer context and hence tends to reproduce longer passages copied from the original text.

In [ ]: