Deep Learning

Assignment 6

After training a skip-gram model in 5_word2vec.ipynb, the goal of this notebook is to train an LSTM character model over the Text8 data.


In [3]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [4]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip

In [5]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    # read the single file inside the archive and decode it to a str
    name = f.namelist()[0]
    return tf.compat.as_str(f.read(name))
  
text = read_data(filename)
print('Data size %d' % len(text))


Data size 100000000

Create a small validation set.


In [6]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])


99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl

Utility functions to map characters to vocabulary IDs and back.

Reference: the Python string module (string.ascii_lowercase).


In [7]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))


Unexpected character: ï
1 26 0 0
a z  

Function to generate a training batch for the LSTM model.

References: the Python built-in zip, and numpy.argmax(a, axis=None, out=None), which returns the indices of the maximum values along an axis.
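
A quick self-contained illustration (using only numpy and the char2id/id2char helpers defined above) of how np.argmax turns one-hot rows back into character ids, and how zip pairs up per-position characters before they are joined into strings:


In [ ]:
# Illustration only: decode one-hot rows with np.argmax and pair characters with zip.
demo = np.zeros(shape=(3, vocabulary_size))
demo[0, char2id('c')] = 1.0
demo[1, char2id('a')] = 1.0
demo[2, char2id('t')] = 1.0
print([id2char(i) for i in np.argmax(demo, 1)])   # ['c', 'a', 't']
print([''.join(x) for x in zip('ab', 'cd')])      # ['ac', 'bd']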


In [8]:
batch_size=64 # the text is split into batch_size segments; each batch holds one character from every segment
# e.g. if len(text)=20 and batch_size=5, each of the 5 segments holds 20//5 = 4 characters
# 'batches' is a list of (num_unrollings+1) 'batch' arrays; each 'batch' holds batch_size one-hot characters:
# the first 'batch' holds the first character of every segment,
# the second 'batch' holds the second character of every segment,
# and so on up to the (num_unrollings+1)-th character of every segment.
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    num_segment = self._text_size // batch_size # length of each segment, in characters
    self._cursor = [ offset * num_segment for offset in range(batch_size)] # one read cursor per segment
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
    # each call to _next_batch yields one character from each of the batch_size segments
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch] # start from the last batch of the previous array
    for step in range(self._num_unrollings): # then append num_unrollings (=10) new batches
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  # probabilities.shape = (64, 27) or (1, 27)
  return [id2char(c) for c in np.argmax(probabilities, 1)] # one character per row

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  #len(batches)=11
  #batches[0].shape = (64,27)
  s = [''] * batches[0].shape[0] # one empty string per segment (batches[0].shape[0] = 64 or 1)
  for b in batches:
    # b.shape = (64, 27); characters(b) is a list of 64 characters
    s = [''.join(x) for x in zip(s, characters(b))]
    # s[i] = s[i] + characters(b)[i]: append one character from b to each string in s.
    # len(batches) = 11, so each s[i] ends up 11 characters long.
    # zip pairs up one element from each sequence, stopping at the shortest one.
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1) # batch_size 1, num_unrollings 1: each array covers 2 consecutive characters

batches = train_batches.next()
s = batches2string(batches)
print(len(s))
print(len(s[0]))
print(batches[0].shape)
print(len(batches))
print(len(batches[0]))
print(len(batches[0][0]))
print(s,'\n----')
print(batches2string(train_batches.next()),'\n----')
print(batches2string(valid_batches.next()),'\n----')
print(batches2string(valid_batches.next()),'\n----')


64
11
(64, 27)
11
64
27
['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad'] 
----
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev'] 
----
[' a'] 
----
['an'] 
----

In [9]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  # boolean-mask assignment clips tiny probabilities so np.log stays finite
  # predictions.shape = (640, 27), labels.shape = (640, 27)
  logp = np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
  # np.multiply is element-wise; since labels are one-hot, each row contributes
  # -log(probability assigned to the true character), averaged over the rows.
  return logp

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  #distribution is a list with 27 normalized elements
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  # return one-hot coded ndarray with the shape of (1,27)
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  #print((b/np.sum(b,1)).shape)
  # return normalized random ndarray with the shape of (1,27)
  return b/np.sum(b, 1)[:,None] # [:, None] adds an axis so the row sum broadcasts across the 27 columns

rd = random_distribution()
print(rd)
print(sample(rd))


[[  3.76761210e-02   6.15627347e-05   9.40562246e-03   4.26964711e-03
    8.42732825e-02   2.16338208e-02   1.19446564e-02   7.11687962e-02
    3.53702821e-02   5.73091035e-02   9.00484884e-02   2.74734404e-03
    5.28624851e-02   2.97007381e-02   5.96875612e-03   5.39539160e-02
    1.84609966e-02   3.61508972e-02   7.49045094e-02   3.99968388e-02
    3.58305493e-02   9.02696237e-02   2.41201320e-02   8.29060030e-03
    1.22287475e-02   1.66857230e-02   7.46667597e-02]]
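
The training loop below reports perplexity, which is just exp of the average per-character cross entropy returned by logprob. A tiny worked example (toy numbers, illustration only):


In [ ]:
# If the true character gets probability 0.5, the per-character log-loss is
# -log(0.5) ~= 0.693 and the perplexity is exp(0.693) = 2.0, i.e. the model is
# as uncertain as a fair choice between two characters.
toy_labels = np.zeros((1, vocabulary_size))
toy_labels[0, 2] = 1.0                            # true character is id 2 ('b')
toy_pred = np.full((1, vocabulary_size), 0.5 / (vocabulary_size - 1))
toy_pred[0, 2] = 0.5                              # model assigns it probability 0.5
print(logprob(toy_pred, toy_labels))              # ~0.693
print(np.exp(logprob(toy_pred, toy_labels)))      # perplexity ~2.0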

Simple LSTM Model.
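
Writing the one-hot input at step t as x_t, the previous output as o_{t-1}, and the previous cell state as c_{t-1}, and denoting the variables ix, im, ib by W_{ix}, W_{im}, b_i (and similarly for the f, c and o groups), the lstm_cell function in the next cell computes the standard LSTM update:

\begin{aligned}
\text{input\_gate}  &= \sigma\!\left(x_t W_{ix} + o_{t-1} W_{im} + b_i\right) \\
\text{forget\_gate} &= \sigma\!\left(x_t W_{fx} + o_{t-1} W_{fm} + b_f\right) \\
\text{update}       &= x_t W_{cx} + o_{t-1} W_{cm} + b_c \\
c_t &= \text{forget\_gate} \odot c_{t-1} + \text{input\_gate} \odot \tanh(\text{update}) \\
\text{output\_gate} &= \sigma\!\left(x_t W_{ox} + o_{t-1} W_{om} + b_o\right) \\
o_t &= \text{output\_gate} \odot \tanh(c_t)
\end{aligned}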


In [10]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Each bias is a single row that is broadcast across every row (example) of a batch.
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size])) # add the same biases to each batch
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell."""
    # i: input at the current step, shape (batch_size, vocabulary_size) = (64, 27)
    # o: output from the previous step, shape (batch_size, num_nodes)
    # state: cell state from the previous step, shape (batch_size, num_nodes)
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    o = output_gate * tf.tanh(state)
    return o, state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    # i is input, a tensor with the shape of (batch_size,27)
    output, state = lstm_cell(i, output, state)
    outputs.append(output)
    # outputs ends up holding num_unrollings (=10) tensors, one per step, each of shape (batch_size, num_nodes)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    #print(len(outputs))
    #print(outputs[0])
    #print(tf.concat(0, outputs)) # concatenating the 10 outputs gives shape (640, num_nodes)
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    # tf.nn.xw_plus_b computes concat(outputs) @ w + b, so logits has shape (640, 27):
    # 64 segments * 10 unrollings = 640 predicted characters per training step
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
  #help(tf.train.exponential_decay)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  # class tf.train.GradientDescentOptimizer(...), help(tf.train.GradientDescentOptimizer)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  # zip(*) unpacking the argument lists
  # help(tf.train.GradientDescentOptimizer.compute_gradients)
  # returns a list of (gradient, variable) pairs.
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25) # modify the gradients
  optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

  # Predictions.
  #print(logits) #has shape of (640,27)
  train_prediction = tf.nn.softmax(logits)
  #print(train_prediction) #has shape of (640,27)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  #tf.group & group.run?
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))



In [58]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      # batches[1:] are the labels (the last 10 batches); batches[:-1] are the inputs (the first 10)
      # np.concatenate stacks the 10 arrays of shape (64, 27) into one array of shape (640, 27)
      print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution()) # a random one-hot encoded character to start from
          sentence = characters(feed)[0] # convert the one-hot encoding back to a character
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
          #print('\n')
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.296303 learning rate: 10.000000
Minibatch perplexity: 27.01
================================================================================
wjvif arofdemieak cwe cr ks pczsm yzzzdfccixdjshsteimqth e qez xmotreq otfaesfmg
vvqqozsckquncrplhexc mknlifhh s hsoimna  ehzqnvqhadldauosrte ktdebeq cm u a o r 
ozlk ao iddewzkmr urkpmprllwhbhteoeqeqek iggehfz l es reeosmst apy vsynv ee   cj
yaktejijetfe tbxdeqpcpiiyhagigrsehnuixcnoccyihdsbsaqimwqewrt aieaxsbiac mampk kd
mtokg ajdisw seaei ligpoclnglekl murq drmsczntlievly zshetvcqacmxsspaaeesyapmx s
================================================================================
Validation set perplexity: 19.96
Average loss at step 100: 2.584825 learning rate: 10.000000
Minibatch perplexity: 10.70
Validation set perplexity: 10.34
Average loss at step 200: 2.243187 learning rate: 10.000000
Minibatch perplexity: 8.67
Validation set perplexity: 8.92
Average loss at step 300: 2.097972 learning rate: 10.000000
Minibatch perplexity: 7.72
Validation set perplexity: 7.96
Average loss at step 400: 2.002255 learning rate: 10.000000
Minibatch perplexity: 6.90
Validation set perplexity: 7.68
Average loss at step 500: 1.934838 learning rate: 10.000000
Minibatch perplexity: 6.81
Validation set perplexity: 7.12
Average loss at step 600: 1.912429 learning rate: 10.000000
Minibatch perplexity: 6.91
Validation set perplexity: 6.67
Average loss at step 700: 1.853145 learning rate: 10.000000
Minibatch perplexity: 5.76
Validation set perplexity: 6.56
Average loss at step 800: 1.823466 learning rate: 10.000000
Minibatch perplexity: 6.30
Validation set perplexity: 6.45
Average loss at step 900: 1.833213 learning rate: 10.000000
Minibatch perplexity: 5.55
Validation set perplexity: 6.15
Average loss at step 1000: 1.822704 learning rate: 10.000000
Minibatch perplexity: 5.18
================================================================================
x over of the frempee for the the ant opsist celeer with as it oven dizay be sta
f have how appealal bocmem hilds are sucking be thed be ave fre atsing from seve
x oon grodidare ban the all it revent to and mad divercim rexenth of the suathes
s are s stear gum and faciw muchiar hits dy after bedis brid one nine zero thre 
but the refurth so compuse imad at the frigon boots s sik to seh peance of the r
================================================================================
Validation set perplexity: 5.98
Average loss at step 1100: 1.777270 learning rate: 10.000000
Minibatch perplexity: 4.79
Validation set perplexity: 5.82
Average loss at step 1200: 1.758461 learning rate: 10.000000
Minibatch perplexity: 5.35
Validation set perplexity: 5.54
Average loss at step 1300: 1.730266 learning rate: 10.000000
Minibatch perplexity: 6.35
Validation set perplexity: 5.61
Average loss at step 1400: 1.746055 learning rate: 10.000000
Minibatch perplexity: 6.09
Validation set perplexity: 5.63
Average loss at step 1500: 1.737022 learning rate: 10.000000
Minibatch perplexity: 5.71
Validation set perplexity: 5.61
Average loss at step 1600: 1.744694 learning rate: 10.000000
Minibatch perplexity: 5.41
Validation set perplexity: 5.38
Average loss at step 1700: 1.708168 learning rate: 10.000000
Minibatch perplexity: 5.63
Validation set perplexity: 5.41
Average loss at step 1800: 1.669712 learning rate: 10.000000
Minibatch perplexity: 4.88
Validation set perplexity: 5.12
Average loss at step 1900: 1.651607 learning rate: 10.000000
Minibatch perplexity: 5.78
Validation set perplexity: 5.11
Average loss at step 2000: 1.694565 learning rate: 10.000000
Minibatch perplexity: 6.03
================================================================================
quententan be by its atter appection of the head it isolla the relivity sachlif 
wings for eight of recoprom othein germen becencian prodicitifiest for the epraw
riciia lives experioad are lake times home snive const gurch chile unico ruled a
xish valevative hidwowly obsortgli and yoncids in his gold he torsour aro or and
ridara mithall theire time vara and produakbleg webpal peppected as his potines 
================================================================================
Validation set perplexity: 5.09
Average loss at step 2100: 1.683977 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.97
Average loss at step 2200: 1.677370 learning rate: 10.000000
Minibatch perplexity: 5.06
Validation set perplexity: 5.09
Average loss at step 2300: 1.639935 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.94
Average loss at step 2400: 1.661272 learning rate: 10.000000
Minibatch perplexity: 6.12
Validation set perplexity: 4.87
Average loss at step 2500: 1.680709 learning rate: 10.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.75
Average loss at step 2600: 1.660281 learning rate: 10.000000
Minibatch perplexity: 6.11
Validation set perplexity: 4.86
Average loss at step 2700: 1.652278 learning rate: 10.000000
Minibatch perplexity: 4.57
Validation set perplexity: 4.87
Average loss at step 2800: 1.651252 learning rate: 10.000000
Minibatch perplexity: 4.95
Validation set perplexity: 4.70
Average loss at step 2900: 1.654573 learning rate: 10.000000
Minibatch perplexity: 5.66
Validation set perplexity: 4.74
Average loss at step 3000: 1.648074 learning rate: 10.000000
Minibatch perplexity: 4.57
================================================================================
vies mpitarism devalor in the eur eight zero land and banopendentadia et trees i
zeromsed of the groar grours natort and of the playsca take numbear their olf th
gural ail s a culin ticus that calling the obsign is cortimes the for morn that 
ges commich which the notral be years shoot ajers one considenters lun theoret i
peticas for holish personal seve the any the eight nine six eight the eight thre
================================================================================
Validation set perplexity: 4.75
Average loss at step 3100: 1.633791 learning rate: 10.000000
Minibatch perplexity: 5.60
Validation set perplexity: 4.74
Average loss at step 3200: 1.638552 learning rate: 10.000000
Minibatch perplexity: 4.81
Validation set perplexity: 4.76
Average loss at step 3300: 1.646115 learning rate: 10.000000
Minibatch perplexity: 5.40
Validation set perplexity: 4.56
Average loss at step 3400: 1.669005 learning rate: 10.000000
Minibatch perplexity: 5.12
Validation set perplexity: 4.74
Average loss at step 3500: 1.653305 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 4.77
Average loss at step 3600: 1.663515 learning rate: 10.000000
Minibatch perplexity: 5.32
Validation set perplexity: 4.56
Average loss at step 3700: 1.650930 learning rate: 10.000000
Minibatch perplexity: 5.01
Validation set perplexity: 4.61
Average loss at step 3800: 1.640308 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.62
Average loss at step 3900: 1.638592 learning rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 4.61
Average loss at step 4000: 1.651019 learning rate: 10.000000
Minibatch perplexity: 5.41
================================================================================
jeed in the poethank popters and wezbors for hised city a phication that interna
jakami allowing the zero zero one and unic s a exters one six with the benel red
uft with posits the accept and defcentury at the rose equenori on and under cati
bers and neores and that fielding street are between tank one et rought hisso wi
ment himsis feer six zero mal happonomian revollidi graft of gold was become the
================================================================================
Validation set perplexity: 4.67
Average loss at step 4100: 1.629230 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.65
Average loss at step 4200: 1.634512 learning rate: 10.000000
Minibatch perplexity: 4.95
Validation set perplexity: 4.62
Average loss at step 4300: 1.611550 learning rate: 10.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.50
Average loss at step 4400: 1.603863 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.41
Average loss at step 4500: 1.619395 learning rate: 10.000000
Minibatch perplexity: 5.43
Validation set perplexity: 4.48
Average loss at step 4600: 1.615956 learning rate: 10.000000
Minibatch perplexity: 5.32
Validation set perplexity: 4.48
Average loss at step 4700: 1.620425 learning rate: 10.000000
Minibatch perplexity: 4.64
Validation set perplexity: 4.55
Average loss at step 4800: 1.631623 learning rate: 10.000000
Minibatch perplexity: 5.43
Validation set perplexity: 4.53
Average loss at step 4900: 1.632746 learning rate: 10.000000
Minibatch perplexity: 5.23
Validation set perplexity: 4.64
Average loss at step 5000: 1.607743 learning rate: 1.000000
Minibatch perplexity: 5.29
================================================================================
notaliances of pactually the is contean the inotrian referres a magilly namines 
batix if more withinies beach residesticies see rannasmlered two six sebanibusio
quing illow extensed a moralous il and and untal is dublind initeal plutive a fo
balline in soeroft indiane the puipment strystsaytelus with as ind othitel r the
m place doe of the differently larguent wind of simplest to to namaze inviswarce
================================================================================
Validation set perplexity: 4.53
Average loss at step 5100: 1.597836 learning rate: 1.000000
Minibatch perplexity: 4.73
Validation set perplexity: 4.41
Average loss at step 5200: 1.588398 learning rate: 1.000000
Minibatch perplexity: 5.14
Validation set perplexity: 4.39
Average loss at step 5300: 1.575263 learning rate: 1.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.39
Average loss at step 5400: 1.579441 learning rate: 1.000000
Minibatch perplexity: 4.93
Validation set perplexity: 4.36
Average loss at step 5500: 1.564512 learning rate: 1.000000
Minibatch perplexity: 4.94
Validation set perplexity: 4.36
Average loss at step 5600: 1.581061 learning rate: 1.000000
Minibatch perplexity: 4.67
Validation set perplexity: 4.38
Average loss at step 5700: 1.566242 learning rate: 1.000000
Minibatch perplexity: 4.58
Validation set perplexity: 4.36
Average loss at step 5800: 1.576978 learning rate: 1.000000
Minibatch perplexity: 4.04
Validation set perplexity: 4.38
Average loss at step 5900: 1.572138 learning rate: 1.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.34
Average loss at step 6000: 1.547296 learning rate: 1.000000
Minibatch perplexity: 5.04
================================================================================
qual causele implective would actor bists metbael dimatorial invalt only and hel
jechene directed mode genter advical soccess the offercy of actramenia rick six 
s i tine his one nine joes this to as a milifiancefuthy full of the stroman cath
curation isllef wentive included to bothamittle bs years or son lass that hel hi
y the cateorixal liffenters labes crailmogh warspuban and its explainting the pr
================================================================================
Validation set perplexity: 4.34
Average loss at step 6100: 1.558320 learning rate: 1.000000
Minibatch perplexity: 4.18
Validation set perplexity: 4.33
Average loss at step 6200: 1.536800 learning rate: 1.000000
Minibatch perplexity: 4.59
Validation set perplexity: 4.33
Average loss at step 6300: 1.543636 learning rate: 1.000000
Minibatch perplexity: 4.68
Validation set perplexity: 4.32
Average loss at step 6400: 1.538562 learning rate: 1.000000
Minibatch perplexity: 4.56
Validation set perplexity: 4.31
Average loss at step 6500: 1.558417 learning rate: 1.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.29
Average loss at step 6600: 1.593465 learning rate: 1.000000
Minibatch perplexity: 4.39
Validation set perplexity: 4.29
Average loss at step 6700: 1.581616 learning rate: 1.000000
Minibatch perplexity: 5.34
Validation set perplexity: 4.29
Average loss at step 6800: 1.605099 learning rate: 1.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.30
Average loss at step 6900: 1.578855 learning rate: 1.000000
Minibatch perplexity: 5.40
Validation set perplexity: 4.33
Average loss at step 7000: 1.575019 learning rate: 1.000000
Minibatch perplexity: 4.63
================================================================================
ark rich in marraniis armed to post the mayf priming extreme for jasmol also gro
 war to chargurace to anout three two ligy an ownes but lack in flyi shape in th
zage bodla culy and as disgents about one six six eight while sterentics computi
vary pateption undervious had tederine influm i oving a rabests which had be sal
sest cash magrekan sometly she mal rather musically enstome the retyronities whe
================================================================================
Validation set perplexity: 4.32

Problem 1

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
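
One possible sketch (among several) of the single-matmul version: the names sx, sm, sb and lstm_cell_fused are new, and the fragment is meant to replace the per-gate variables and lstm_cell inside the graph-building block (with graph.as_default():) above.


In [ ]:
# Fuse the four input matrices (ix, fx, cx, ox) into one [vocabulary_size, 4*num_nodes]
# matrix, and likewise for the recurrent matrices and the biases, so each step needs
# only one matmul with the input and one with the previous output.
sx = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
sm = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
sb = tf.Variable(tf.zeros([1, 4 * num_nodes]))

def lstm_cell_fused(i, o, state):
  """LSTM cell computed with a single matrix multiply per direction (sketch)."""
  all_gates = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
  # tf.split(split_dim, num_split, value) is the signature in the TensorFlow version
  # this notebook targets; newer releases use tf.split(value, num_splits, axis).
  in_pre, forget_pre, update, out_pre = tf.split(1, 4, all_gates)
  input_gate = tf.sigmoid(in_pre)
  forget_gate = tf.sigmoid(forget_pre)
  state = forget_gate * state + input_gate * tf.tanh(update)
  output_gate = tf.sigmoid(out_pre)
  return output_gate * tf.tanh(state), state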



Problem 2

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings would produce a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves (see the sketch after this list).

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this article.
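
A minimal sketch for parts (a) and (c), assuming bigrams are fed as integer ids in [0, 27*27) rather than as one-hot vectors; the names bigram_vocabulary_size, embedding_size, keep_prob and bigram_input are made up for illustration and would live inside the graph-building block of a bigram version of the model:


In [ ]:
# (a) embedding lookup on bigram ids, (c) dropout on the non-recurrent connections.
bigram_vocabulary_size = vocabulary_size * vocabulary_size  # 27 * 27 = 729 possible bigrams
embedding_size = 64  # the *x weight matrices then need first dimension embedding_size

embeddings = tf.Variable(
  tf.random_uniform([bigram_vocabulary_size, embedding_size], -1.0, 1.0))
keep_prob = tf.placeholder(tf.float32)  # feed e.g. 0.5 while training, 1.0 when sampling

# One int32 placeholder of bigram ids per unrolling; id = char2id(c1) * 27 + char2id(c2).
bigram_input = tf.placeholder(tf.int32, shape=[batch_size])
embed = tf.nn.embedding_lookup(embeddings, bigram_input)  # shape [batch_size, embedding_size]
embed = tf.nn.dropout(embed, keep_prob)                   # dropout on the cell input

# Inside the unrolled loop one would then use, for example:
#   output, state = lstm_cell(embed, output, state)
#   output = tf.nn.dropout(output, keep_prob)  # dropout on the cell output, leaving
#                                              # the recurrent state connection intact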



Problem 3

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

the quick brown fox

the model should attempt to output:

eht kciuq nworb xof

Refer to the lecture on how to put together a sequence-to-sequence model, as well as this article for best practices.
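
A full encoder-decoder model takes more scaffolding, but the target construction itself is simple: reverse the characters of each word while keeping the word order. A small illustrative helper (plain Python, hypothetical name) that turns input strings into training targets:


In [ ]:
# Illustration for Problem 3: produce the mirrored target string for a given input.
def mirror_words(sentence):
  """Reverse the characters of every word, keeping word order and spacing."""
  return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # eht kciuq nworb xof


With (input, mirrored target) pairs like this, an encoder LSTM can read the input sequence and a decoder LSTM can be trained to emit the mirrored output, as outlined in the lecture.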