Deep Learning

Assignment 6

After training a skip-gram model in 5_word2vec.ipynb, the goal of this notebook is to train an LSTM character model over Text8 data.


In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip

In [3]:
def read_data(filename):
  """Extract the first file in the zip archive as a single string."""
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    return tf.compat.as_str(f.read(name))
  
text = read_data(filename)
print('Data size %d' % len(text))


Data size 100000000

Create a small validation set.


In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])


99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl

Utility functions to map characters to vocabulary IDs and back.


In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))


Unexpected character: ï
1 26 0 0
a z  

Function to generate a training batch for the LSTM model.


In [6]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursors = [offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursors[b]])] = 1.0
      self._cursors[b] = (self._cursors[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))


['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
[' a']
['an']

In [7]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]

Simple LSTM Model.


In [8]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [9]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.295792 learning rate: 10.000000
Minibatch perplexity: 27.00
================================================================================
 rdsrxsqqgilaaer eia imotweasffurhfh tg   li hml  tdhwssxelk cvr rrfffzx h yiyro
w ztcie futqqpaignttga  fdjodgmxveswhq l  rhhotwmozdbsppnps us wyle qzoqxyk co l
kljiodknibkgco pru atam ouedjewbml wlxgg wcn j ha pvgdhlbhaegr ed qap k ipagesnv
vde f ae bexgfor plpd kernsseezel nlgxevlj faoh sihvnmy d swg  ztestfyer cslairb
zjwledknhgerfvhrirxmyrotv tsjpxhmoti pwvyhyzuyuewq asxzvmiach cynhyv ou lghexzge
================================================================================
Validation set perplexity: 20.12
Average loss at step 100: 2.596648 learning rate: 10.000000
Minibatch perplexity: 10.96
Validation set perplexity: 10.41
Average loss at step 200: 2.249224 learning rate: 10.000000
Minibatch perplexity: 8.74
Validation set perplexity: 8.81
Average loss at step 300: 2.095656 learning rate: 10.000000
Minibatch perplexity: 7.34
Validation set perplexity: 8.02
Average loss at step 400: 1.996738 learning rate: 10.000000
Minibatch perplexity: 7.57
Validation set perplexity: 7.81
Average loss at step 500: 1.932317 learning rate: 10.000000
Minibatch perplexity: 6.49
Validation set perplexity: 7.08
Average loss at step 600: 1.908573 learning rate: 10.000000
Minibatch perplexity: 6.13
Validation set perplexity: 6.95
Average loss at step 700: 1.855236 learning rate: 10.000000
Minibatch perplexity: 6.40
Validation set perplexity: 6.68
Average loss at step 800: 1.817282 learning rate: 10.000000
Minibatch perplexity: 5.99
Validation set perplexity: 6.38
Average loss at step 900: 1.828141 learning rate: 10.000000
Minibatch perplexity: 6.98
Validation set perplexity: 6.23
Average loss at step 1000: 1.823758 learning rate: 10.000000
Minibatch perplexity: 5.58
================================================================================
det tabeys in de deuning aloge of the erumbing that to two the will hich befign 
ke of derigely is nenge the betered as a as lees imoogns veres eight two zero te
preding centirate he late indoinv creed bernate kowhized gopul mercesser zero kn
ice one five litcem and terns of several teventity the interve of vidiemory in y
wer or ripict is leated to lial in and brics of origs were hoel be dovee eleca r
================================================================================
Validation set perplexity: 5.91
Average loss at step 1100: 1.775033 learning rate: 10.000000
Minibatch perplexity: 5.72
Validation set perplexity: 5.81
Average loss at step 1200: 1.752767 learning rate: 10.000000
Minibatch perplexity: 5.04
Validation set perplexity: 5.59
Average loss at step 1300: 1.731053 learning rate: 10.000000
Minibatch perplexity: 5.78
Validation set perplexity: 5.58
Average loss at step 1400: 1.745728 learning rate: 10.000000
Minibatch perplexity: 5.93
Validation set perplexity: 5.48
Average loss at step 1500: 1.734965 learning rate: 10.000000
Minibatch perplexity: 4.75
Validation set perplexity: 5.40
Average loss at step 1600: 1.744152 learning rate: 10.000000
Minibatch perplexity: 5.42
Validation set perplexity: 5.30
Average loss at step 1700: 1.712228 learning rate: 10.000000
Minibatch perplexity: 5.69
Validation set perplexity: 5.29
Average loss at step 1800: 1.672705 learning rate: 10.000000
Minibatch perplexity: 5.33
Validation set perplexity: 5.15
Average loss at step 1900: 1.644643 learning rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 5.17
Average loss at step 2000: 1.696387 learning rate: 10.000000
Minibatch perplexity: 5.59
================================================================================
livbivity worlde sounces assuba carlainer and univer desply and rodication to mo
itanie econity of jandless ling desses of stading of many be desuda lenaing keno
foylagein producty new the work cristival gymessian and plensis state on atport 
on transbrodiosrand and pendrists now stitct efolioskated is velseing timisiav w
wing see but indistan raus services string and it verres n strock on dinnan as r
================================================================================
Validation set perplexity: 5.24
Average loss at step 2100: 1.683414 learning rate: 10.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.99
Average loss at step 2200: 1.676071 learning rate: 10.000000
Minibatch perplexity: 6.31
Validation set perplexity: 4.97
Average loss at step 2300: 1.639147 learning rate: 10.000000
Minibatch perplexity: 4.99
Validation set perplexity: 4.87
Average loss at step 2400: 1.658051 learning rate: 10.000000
Minibatch perplexity: 5.22
Validation set perplexity: 4.85
Average loss at step 2500: 1.680003 learning rate: 10.000000
Minibatch perplexity: 5.46
Validation set perplexity: 4.70
Average loss at step 2600: 1.649103 learning rate: 10.000000
Minibatch perplexity: 5.69
Validation set perplexity: 4.67
Average loss at step 2700: 1.660427 learning rate: 10.000000
Minibatch perplexity: 4.63
Validation set perplexity: 4.73
Average loss at step 2800: 1.650916 learning rate: 10.000000
Minibatch perplexity: 5.32
Validation set perplexity: 4.64
Average loss at step 2900: 1.650066 learning rate: 10.000000
Minibatch perplexity: 5.60
Validation set perplexity: 4.69
Average loss at step 3000: 1.651664 learning rate: 10.000000
Minibatch perplexity: 5.11
================================================================================
ty defers emperorwy sangrageta president revolutional dr but actor one op rillee
qay of the perment one nine seven two six sizes of second of is ttworner ps and 
ather oirian with as soles s ngni divicity records one nine seven zero thaniti m
ver improgen apiliament language in old genery of formulets dymalney of un toll 
kingatively first an armulal dong in one nine five wester ho factine of topery c
================================================================================
Validation set perplexity: 4.71
Average loss at step 3100: 1.626728 learning rate: 10.000000
Minibatch perplexity: 5.69
Validation set perplexity: 4.64
Average loss at step 3200: 1.646412 learning rate: 10.000000
Minibatch perplexity: 5.49
Validation set perplexity: 4.63
Average loss at step 3300: 1.636407 learning rate: 10.000000
Minibatch perplexity: 5.21
Validation set perplexity: 4.56
Average loss at step 3400: 1.667707 learning rate: 10.000000
Minibatch perplexity: 5.41
Validation set perplexity: 4.65
Average loss at step 3500: 1.655677 learning rate: 10.000000
Minibatch perplexity: 5.58
Validation set perplexity: 4.68
Average loss at step 3600: 1.671829 learning rate: 10.000000
Minibatch perplexity: 4.54
Validation set perplexity: 4.58
Average loss at step 3700: 1.644357 learning rate: 10.000000
Minibatch perplexity: 5.06
Validation set perplexity: 4.59
Average loss at step 3800: 1.641354 learning rate: 10.000000
Minibatch perplexity: 5.51
Validation set perplexity: 4.70
Average loss at step 3900: 1.636343 learning rate: 10.000000
Minibatch perplexity: 5.35
Validation set perplexity: 4.62
Average loss at step 4000: 1.648725 learning rate: 10.000000
Minibatch perplexity: 4.83
================================================================================
japhied in simpe  tive sbainan atted und the res rublighal governing so wast occ
efferm end increasees the formed syqay iderphant of the udaod of the discudy len
hodenced chezin engester with schillian commedicics whire denue of heron demecti
cele attacmary d hamen begins tooks reprimed used aftented remorger of etcarded 
y indibers reputia ove lifitly been overzer to the nentrent phocor oweply is wif
================================================================================
Validation set perplexity: 4.61
Average loss at step 4100: 1.631429 learning rate: 10.000000
Minibatch perplexity: 5.28
Validation set perplexity: 4.72
Average loss at step 4200: 1.635360 learning rate: 10.000000
Minibatch perplexity: 5.35
Validation set perplexity: 4.49
Average loss at step 4300: 1.614094 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.46
Average loss at step 4400: 1.609142 learning rate: 10.000000
Minibatch perplexity: 4.81
Validation set perplexity: 4.31
Average loss at step 4500: 1.618525 learning rate: 10.000000
Minibatch perplexity: 5.34
Validation set perplexity: 4.53
Average loss at step 4600: 1.618416 learning rate: 10.000000
Minibatch perplexity: 5.02
Validation set perplexity: 4.51
Average loss at step 4700: 1.620394 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 4.55
Average loss at step 4800: 1.627924 learning rate: 10.000000
Minibatch perplexity: 4.30
Validation set perplexity: 4.46
Average loss at step 4900: 1.637506 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 4.69
Average loss at step 5000: 1.605237 learning rate: 1.000000
Minibatch perplexity: 4.45
================================================================================
kard and rasa oven sazn games popky a sources finglly bioses is as the ansing fo
h and system effection in frophy se birtish contrepace one scoel was acimicelley
se acceated to be edr patem used assemi musica two nine at tro but the dvadism g
wated than this emirvels by used for gomemba ravicly there axtraisly to br is th
sts hapts of the thim b a not rel ginoushy a finalty the fremonment are repadian
================================================================================
Validation set perplexity: 4.71
Average loss at step 5100: 1.602898 learning rate: 1.000000
Minibatch perplexity: 4.99
Validation set perplexity: 4.49
Average loss at step 5200: 1.586549 learning rate: 1.000000
Minibatch perplexity: 4.57
Validation set perplexity: 4.39
Average loss at step 5300: 1.578758 learning rate: 1.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.37
Average loss at step 5400: 1.581640 learning rate: 1.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.36
Average loss at step 5500: 1.565376 learning rate: 1.000000
Minibatch perplexity: 4.93
Validation set perplexity: 4.33
Average loss at step 5600: 1.579949 learning rate: 1.000000
Minibatch perplexity: 4.90
Validation set perplexity: 4.32
Average loss at step 5700: 1.567220 learning rate: 1.000000
Minibatch perplexity: 4.59
Validation set perplexity: 4.33
Average loss at step 5800: 1.579927 learning rate: 1.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.34
Average loss at step 5900: 1.570427 learning rate: 1.000000
Minibatch perplexity: 5.04
Validation set perplexity: 4.34
Average loss at step 6000: 1.541319 learning rate: 1.000000
Minibatch perplexity: 4.91
================================================================================
ball an final game this appecialied the ears this was they as clussions ram it o
an has range bign election late lists is smotest photated nust color the most to
ce but repeace socievee tut from two earingine snormed and is li also orner to e
ly of parmine this albanic been prince for lowing what was lowars apable is numb
verioust juman raques borkschesoriage extindation da rejided with introds in com
================================================================================
Validation set perplexity: 4.33
Average loss at step 6100: 1.566637 learning rate: 1.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.29
Average loss at step 6200: 1.536465 learning rate: 1.000000
Minibatch perplexity: 4.83
Validation set perplexity: 4.29
Average loss at step 6300: 1.545390 learning rate: 1.000000
Minibatch perplexity: 5.08
Validation set perplexity: 4.26
Average loss at step 6400: 1.535579 learning rate: 1.000000
Minibatch perplexity: 4.45
Validation set perplexity: 4.30
Average loss at step 6500: 1.551872 learning rate: 1.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.28
Average loss at step 6600: 1.596169 learning rate: 1.000000
Minibatch perplexity: 4.94
Validation set perplexity: 4.27
Average loss at step 6700: 1.577819 learning rate: 1.000000
Minibatch perplexity: 4.96
Validation set perplexity: 4.29
Average loss at step 6800: 1.605585 learning rate: 1.000000
Minibatch perplexity: 4.67
Validation set perplexity: 4.28
Average loss at step 6900: 1.581885 learning rate: 1.000000
Minibatch perplexity: 4.84
Validation set perplexity: 4.31
Average loss at step 7000: 1.575325 learning rate: 1.000000
Minibatch perplexity: 4.92
================================================================================
horographonoun the but beagated by known one eight over revigis of king of the t
blexing some actived storions brave stores maxan and support pass their deligs s
ard the modern apport of relea eight four chrnoss ubspons three eight deobers th
lyshally two five a marasopt constaft the travill well shissii billc origin is w
er emifed to mongery yorn imbouted on pobses indestem with were all uirely the b
================================================================================
Validation set perplexity: 4.26

Problem 1

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the previous output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
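
One way to structure the simplification, sketched in the same pre-1.0 TensorFlow style as the rest of the notebook (the variables sx, sm and sb are new names, not part of the original graph): stack the four input-to-gate matrices into one variable of shape [vocabulary_size, 4 * num_nodes], stack the four output-to-gate matrices into one of shape [num_nodes, 4 * num_nodes], do a single matrix multiply with each, and split the wide result back into the four gate pre-activations.

# Sketch only; assumes vocabulary_size and num_nodes from the cells above.
sx = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
sm = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
sb = tf.Variable(tf.zeros([1, 4 * num_nodes]))

def lstm_cell(i, o, state):
  # One multiply with the input and one with the previous output.
  gates = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
  # Split the wide result into the four gate pre-activations, in the same
  # order the columns were stacked above: input, forget, update, output.
  in_part, forget_part, update, out_part = tf.split(1, 4, gates)
  input_gate = tf.sigmoid(in_part)
  forget_gate = tf.sigmoid(forget_part)
  output_gate = tf.sigmoid(out_part)
  state = forget_gate * state + input_gate * tf.tanh(update)
  return output_gate * tf.tanh(state), state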



Problem 2

We want to train an LSTM over bigrams, that is, pairs of consecutive characters such as 'ab' instead of single characters such as 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM as 1-hot encodings would lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves (a minimal sketch follows after part c).

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this article.
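
A minimal sketch of part a (plus the dropout placement of part c), again in the notebook's pre-1.0 TensorFlow API. The names bigram2id, embedding_size and keep_prob are illustrative, not part of the original code, and the snippet assumes vocabulary_size, batch_size and char2id from the cells above.

bigram_vocabulary_size = vocabulary_size * vocabulary_size  # 27 * 27 = 729

def bigram2id(text, pos):
  # Map the two characters at text[pos:pos + 2] to a single bigram id.
  return char2id(text[pos]) * vocabulary_size + char2id(text[pos + 1])

embedding_size = 64  # assumed embedding width
keep_prob = 0.9      # assumed dropout keep-probability

embeddings = tf.Variable(
  tf.random_uniform([bigram_vocabulary_size, embedding_size], -1.0, 1.0))

# Inputs become int32 bigram ids instead of 1-hot float vectors; the LSTM cell
# then consumes the dense embedding rather than the sparse encoding.
bigram_input = tf.placeholder(tf.int32, shape=[batch_size])
embed = tf.nn.embedding_lookup(embeddings, bigram_input)
# Apply dropout to the cell input (and, symmetrically, to the cell output that
# feeds the classifier), but not to the recurrent state.
embed_dropped = tf.nn.dropout(embed, keep_prob)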



Problem 3

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

the quick brown fox

the model should attempt to output:

eht kciuq nworb xof

Refer to the lecture on how to put together a sequence-to-sequence model, as well as this article for best practices.
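
The data-side transformation, at least, is simple; the sketch below only builds the mirrored target string (word order preserved, each word reversed) and leaves the sequence-to-sequence model itself as the exercise.

def mirror_words(sentence):
  # Reverse each word in place while keeping the word order.
  return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # prints: eht kciuq nworb xof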