Deep Learning with TensorFlow

Credits: Forked from TensorFlow by Google

Setup

Refer to the setup instructions.

Exercise 6

Building on the skip-gram model trained in 5_word2vec.ipynb, the goal of this exercise is to train an LSTM character model over the Text8 data.


In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
import os
import numpy as np
import random
import string
import tensorflow as tf
import urllib
import zipfile

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urllib.urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print 'Found and verified', filename
  else:
    print statinfo.st_size
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip
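
Note that the notebook is written for Python 2 (print statements, urllib.urlretrieve, xrange). If you are on Python 3, the download helper needs urllib.request instead; a minimal sketch under that assumption (the remaining cells would need the analogous print()/range() changes):

import os
from urllib.request import urlretrieve

url = 'http://mattmahoney.net/dc/'

def maybe_download_py3(filename, expected_bytes):
  """Python 3 variant: download a file if not present and verify its size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size != expected_bytes:
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  print('Found and verified', filename)
  return filename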

In [3]:
def read_data(filename):
  """Extract the first member of a zip archive as a string."""
  with zipfile.ZipFile(filename) as f:
    return f.read(f.namelist()[0])

text = read_data(filename)
print "Data size", len(text)


Data size 100000000

Create a small validation set.


In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print train_size, train_text[:64]
print valid_size, valid_text[:64]


99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl

Utility functions to map characters to vocabulary IDs and back.


In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print 'Unexpected character:', char
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print char2id('a'), char2id('z'), char2id(' '), char2id('ï')
print id2char(1), id2char(26), id2char(0)


1 26 0 Unexpected character: ï
0
a z  
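
A quick way to convince yourself the two mappings are consistent is a round trip over the full 27-symbol vocabulary (a small check, not part of the original notebook):

vocab = ' ' + string.ascii_lowercase
assert all(id2char(char2id(c)) == c for c in vocab)  # every symbol survives a round trip
assert sorted(char2id(c) for c in vocab) == list(range(vocabulary_size))  # ids cover 0..26 exactly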

Function to generate a training batch for the LSTM model.


In [8]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size  # integer division: one contiguous segment per batch row
    self._cursor = [ offset * segment for offset in xrange(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in xrange(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in xrange(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print batches2string(train_batches.next())
print batches2string(train_batches.next())
print batches2string(valid_batches.next())
print batches2string(valid_batches.next())


['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
[' a']
['an']
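
The cursor scheme is easier to see on a toy example: the text is cut into batch_size contiguous segments, each batch row reads from its own segment, and each call to next() resumes every row one character past where the previous call stopped (that one-character overlap between calls is what later serves as the label for the last input). A small illustration using the classes above:

toy_batches = BatchGenerator('the quick brown fox jumps over the lazy dog ',
                             batch_size=4, num_unrollings=3)
print batches2string(toy_batches.next())  # four rows, one per segment of the toy text
print batches2string(toy_batches.next())  # each row picks up where its cursor left off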

In [9]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in xrange(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
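
logprob returns the average negative log-probability assigned to the true characters, and the training loop below reports its exponential as perplexity, i.e. the geometric mean of the inverse probabilities of the correct characters. A hand-checkable example with made-up numbers over a two-symbol vocabulary:

toy_labels = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_predictions = np.array([[0.5, 0.5], [0.75, 0.25]])
# Average negative log prob = (ln 2 + ln 4) / 2, so perplexity = sqrt(2 * 4) ~= 2.83.
print np.exp(logprob(toy_predictions, toy_labels))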

Simple LSTM Model.


In [10]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in xrange(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
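
To see the gate arithmetic of lstm_cell outside the graph, here is a plain NumPy transcription of a single cell step (an illustrative sketch only; the weights below are freshly sampled stand-ins, not the trained TensorFlow variables):

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

def numpy_lstm_step(i, o, state, params):
  """One LSTM step mirroring lstm_cell; params holds the 12 gate parameters."""
  ix, im, ib, fx, fm, fb, cx, cm, cb, ox, om, ob = params
  input_gate = sigmoid(np.dot(i, ix) + np.dot(o, im) + ib)
  forget_gate = sigmoid(np.dot(i, fx) + np.dot(o, fm) + fb)
  update = np.dot(i, cx) + np.dot(o, cm) + cb
  state = forget_gate * state + input_gate * np.tanh(update)
  output_gate = sigmoid(np.dot(i, ox) + np.dot(o, om) + ob)
  return output_gate * np.tanh(state), state

# Random stand-in weights with the same shapes as the graph variables above.
rng = np.random.RandomState(0)
params = [rng.uniform(-0.1, 0.1, shape) for shape in
          [(vocabulary_size, num_nodes), (num_nodes, num_nodes), (1, num_nodes)] * 4]
i = np.zeros((1, vocabulary_size)); i[0, char2id('a')] = 1.0
out_np, state_np = numpy_lstm_step(i, np.zeros((1, num_nodes)), np.zeros((1, num_nodes)), params)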

In [11]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print 'Initialized'
  mean_loss = 0
  for step in xrange(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in xrange(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print 'Average loss at step', step, ':', mean_loss, 'learning rate:', lr
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print 'Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels)))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print '=' * 80
        for _ in xrange(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in xrange(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print sentence
        print '=' * 80
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in xrange(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print 'Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size))


WARNING:tensorflow:From <ipython-input-11-9884d381da05>:5 in <module>.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
Initialized
Average loss at step 0 : 3.29715776443 learning rate: 10.0
Minibatch perplexity: 27.04
================================================================================
xbhto  swo   wxztghakoyswcwrdmn epkgm  shfnb szln lk deho re vl whwhso jvboeepti
saj bhyorh otdavonje ketrkiwpuszgmckjuictnoycheu hbtjiekc  plzot eeutk  xfiizjwc
wrvajni otdypvmta ujubj p ndean wzhocjund fq fh cqoenwswj qb ithtknoo vcijtooq  
nxieosraokzeetxw owitssvsxucraenivqawpoweembgatoytvheloxznvv vxagyzcwzaoftgwrr r
dyxpetdoyiqesjra vgalroaho av nayv uwsav jta qa spskomx wgl u yeraahsrtwvnznqh m
================================================================================
Validation set perplexity: 20.33
Average loss at step 100 : 2.63658575773 learning rate: 10.0
Minibatch perplexity: 11.03
Validation set perplexity: 10.24
Average loss at step 200 : 2.25891527176 learning rate: 10.0
Minibatch perplexity: 8.63
Validation set perplexity: 8.57
Average loss at step 300 : 2.1055544889 learning rate: 10.0
Minibatch perplexity: 7.57
Validation set perplexity: 8.01
Average loss at step 400 : 1.9997539854 learning rate: 10.0
Minibatch perplexity: 7.53
Validation set perplexity: 7.88
Average loss at step 500 : 1.93555634379 learning rate: 10.0
Minibatch perplexity: 6.30
Validation set perplexity: 7.11
Average loss at step 600 : 1.90505887389 learning rate: 10.0
Minibatch perplexity: 6.28
Validation set perplexity: 6.86
Average loss at step 700 : 1.85572583199 learning rate: 10.0
Minibatch perplexity: 6.45
Validation set perplexity: 6.52
Average loss at step 800 : 1.81726102352 learning rate: 10.0
Minibatch perplexity: 6.00
Validation set perplexity: 6.21
Average loss at step 900 : 1.82515558124 learning rate: 10.0
Minibatch perplexity: 6.97
Validation set perplexity: 6.29
Average loss at step 1000 : 1.82426697731 learning rate: 10.0
Minibatch perplexity: 5.77
================================================================================
gics of who the lupe becriate moce of limany novee a wusmres to seppresion of th
will and the sall consu nine are dinge on sose kinnetion the resoved the whing e
vatation the centern mill morcher comon smbile in qus phejed a list offefice he 
age innomthroate in esquly anish the edhrinter listherby ornister offyersive as 
ber in been and near there fan linms the sort time hlidivery luch out to pelopri
================================================================================
Validation set perplexity: 6.12
Average loss at step 1100 : 1.77617163539 learning rate: 10.0
Minibatch perplexity: 5.69
Validation set perplexity: 5.76
Average loss at step 1200 : 1.75387057781 learning rate: 10.0
Minibatch perplexity: 5.15
Validation set perplexity: 5.53
Average loss at step 1300 : 1.73050318599 learning rate: 10.0
Minibatch perplexity: 5.79
Validation set perplexity: 5.45
Average loss at step 1400 : 1.74720084429 learning rate: 10.0
Minibatch perplexity: 6.06
Validation set perplexity: 5.44
Average loss at step 1500 : 1.74058589101 learning rate: 10.0
Minibatch perplexity: 4.69
Validation set perplexity: 5.39
Average loss at step 1600 : 1.74598360777 learning rate: 10.0
Minibatch perplexity: 5.68
Validation set perplexity: 5.35
Average loss at step 1700 : 1.71617349386 learning rate: 10.0
Minibatch perplexity: 5.76
Validation set perplexity: 5.25
Average loss at step 1800 : 1.67667043447 learning rate: 10.0
Minibatch perplexity: 5.40
Validation set perplexity: 5.23
Average loss at step 1900 : 1.64806430459 learning rate: 10.0
Minibatch perplexity: 5.25
Validation set perplexity: 5.25
Average loss at step 2000 : 1.69783878326 learning rate: 10.0
Minibatch perplexity: 5.76
================================================================================
us that divinks the propresonage into injongeent wimbla for wothin special s cou
bring with kinsas shittic clud mann is a reases copedoving singer consencion was
k in one five seven modure themsiz kinghy one nine eight four three nine six sev
mind in kingsages taken in the spileging ameria semped amening that stin they to
puse placeb evet full instatla also sides govers jumeran doune of states and tha
================================================================================
Validation set perplexity: 5.10
Average loss at step 2100 : 1.68529169679 learning rate: 10.0
Minibatch perplexity: 5.20
Validation set perplexity: 4.85
Average loss at step 2200 : 1.68543538094 learning rate: 10.0
Minibatch perplexity: 6.35
Validation set perplexity: 4.99
Average loss at step 2300 : 1.64014799476 learning rate: 10.0
Minibatch perplexity: 4.92
Validation set perplexity: 4.90
Average loss at step 2400 : 1.66046670675 learning rate: 10.0
Minibatch perplexity: 5.03
Validation set perplexity: 4.81
Average loss at step 2500 : 1.67823691964 learning rate: 10.0
Minibatch perplexity: 5.44
Validation set perplexity: 4.70
Average loss at step 2600 : 1.65396907449 learning rate: 10.0
Minibatch perplexity: 5.86
Validation set perplexity: 4.68
Average loss at step 2700 : 1.6577311933 learning rate: 10.0
Minibatch perplexity: 4.52
Validation set perplexity: 4.63
Average loss at step 2800 : 1.64541370869 learning rate: 10.0
Minibatch perplexity: 5.56
Validation set perplexity: 4.64
Average loss at step 2900 : 1.65069376707 learning rate: 10.0
Minibatch perplexity: 5.52
Validation set perplexity: 4.68
Average loss at step 3000 : 1.64925897717 learning rate: 10.0
Minibatch perplexity: 5.05
================================================================================
mantpyens risof roxewion of muderaa allianny variolics unifived industrotive hyp
ull take aci addrespabin accorrated decity another of a adarchy dies scounds roc
que susplearn have beelds a fiction havay states prodical canding ourpedetes was
uftone ataup scanuatal sycting bearbal and an number and the leftly refugling ot
thers solt and construnce the bars d was heos ezpricted in the roter of callestr
================================================================================
Validation set perplexity: 4.60
Average loss at step 3100 : 1.62646449327 learning rate: 10.0
Minibatch perplexity: 5.89
Validation set perplexity: 4.56
Average loss at step 3200 : 1.63974208117 learning rate: 10.0
Minibatch perplexity: 5.46
Validation set perplexity: 4.60
Average loss at step 3300 : 1.63494459748 learning rate: 10.0
Minibatch perplexity: 5.08
Validation set perplexity: 4.62
Average loss at step 3400 : 1.66837506771 learning rate: 10.0
Minibatch perplexity: 5.68
Validation set perplexity: 4.60
Average loss at step 3500 : 1.65663881302 learning rate: 10.0
Minibatch perplexity: 5.71
Validation set perplexity: 4.53
Average loss at step 3600 : 1.66707651854 learning rate: 10.0
Minibatch perplexity: 4.50
Validation set perplexity: 4.54
Average loss at step 3700 : 1.64505961418 learning rate: 10.0
Minibatch perplexity: 5.12
Validation set perplexity: 4.58
Average loss at step 3800 : 1.64164466023 learning rate: 10.0
Minibatch perplexity: 5.58
Validation set perplexity: 4.58
Average loss at step 3900 : 1.6349717617 learning rate: 10.0
Minibatch perplexity: 5.28
Validation set perplexity: 4.45
Average loss at step 4000 : 1.65029353619 learning rate: 10.0
Minibatch perplexity: 4.80
================================================================================
y have moneraley atter ranishics cabectein three eight to memendatio thookive of
h of been fragges ts been culturly the homing translabe of greeys on armunic the
ven mety one nine nine four zero an antron jlang of fimpre gy ictegry the perpea
ish varile was wether atnamables to monum the processive abalso eluted tralso of
m to ali with of electional turnegions or is carroming to almayyza by charies pr
================================================================================
Validation set perplexity: 4.56
Average loss at step 4100 : 1.63086620212 learning rate: 10.0
Minibatch perplexity: 5.25
Validation set perplexity: 4.65
Average loss at step 4200 : 1.63471940517 learning rate: 10.0
Minibatch perplexity: 5.29
Validation set perplexity: 4.51
Average loss at step 4300 : 1.61264855146 learning rate: 10.0
Minibatch perplexity: 4.96
Validation set perplexity: 4.55
Average loss at step 4400 : 1.6061510098 learning rate: 10.0
Minibatch perplexity: 4.93
Validation set perplexity: 4.45
Average loss at step 4500 : 1.61641805649 learning rate: 10.0
Minibatch perplexity: 5.06
Validation set perplexity: 4.53
Average loss at step 4600 : 1.61362475991 learning rate: 10.0
Minibatch perplexity: 4.89
Validation set perplexity: 4.65
Average loss at step 4700 : 1.6269264853 learning rate: 10.0
Minibatch perplexity: 5.37
Validation set perplexity: 4.49
Average loss at step 4800 : 1.63323410869 learning rate: 10.0
Minibatch perplexity: 4.53
Validation set perplexity: 4.47
Average loss at step 4900 : 1.63575899601 learning rate: 10.0
Minibatch perplexity: 5.36
Validation set perplexity: 4.53
Average loss at step 5000 : 1.60971468687 learning rate: 1.0
Minibatch perplexity: 4.68
================================================================================
ing vissions resulver completes as sel poregones with was fow life title indepen
ine often o cantutys eacisted invelitivity worgder somes descrictive artyry notk
ent ille origin segon efremonshents vath hi exchilinge mays and ld was has born 
vers to six simon by he and masurably a somante all atmarl en orly kene miners p
nes puplines it cole is to the not redivilia alvale of vrogy musician tranf croo
================================================================================
Validation set perplexity: 4.60
Average loss at step 5100 : 1.60548645258 learning rate: 1.0
Minibatch perplexity: 4.88
Validation set perplexity: 4.41
Average loss at step 5200 : 1.59507003427 learning rate: 1.0
Minibatch perplexity: 4.63
Validation set perplexity: 4.35
Average loss at step 5300 : 1.5773662138 learning rate: 1.0
Minibatch perplexity: 4.63
Validation set perplexity: 4.35
Average loss at step 5400 : 1.58308276772 learning rate: 1.0
Minibatch perplexity: 5.06
Validation set perplexity: 4.33
Average loss at step 5500 : 1.56804048896 learning rate: 1.0
Minibatch perplexity: 5.01
Validation set perplexity: 4.28
Average loss at step 5600 : 1.57936667919 learning rate: 1.0
Minibatch perplexity: 4.89
Validation set perplexity: 4.30
Average loss at step 5700 : 1.56777412772 learning rate: 1.0
Minibatch perplexity: 4.47
Validation set perplexity: 4.31
Average loss at step 5800 : 1.57779150367 learning rate: 1.0
Minibatch perplexity: 4.78
Validation set perplexity: 4.31
Average loss at step 5900 : 1.57033186793 learning rate: 1.0
Minibatch perplexity: 5.05
Validation set perplexity: 4.31
Average loss at step 6000 : 1.54599629641 learning rate: 1.0
Minibatch perplexity: 4.95
================================================================================
y alandistous autivity forein rulthing that appts ablector denogy instruction to
port talls garen froning in two zero zero zero when marilacks closing theory omi
y bultest wimini exwal of destribullym one like a linkly flubery tapan comment t
madked some for starhiam panched arty orgenotwom and dorthish abber kermiurs hig
legae thus one eight furth the denrossi and name some as hissols changed of infi
================================================================================
Validation set perplexity: 4.31
Average loss at step 6100 : 1.56180515409 learning rate: 1.0
Minibatch perplexity: 5.02
Validation set perplexity: 4.28
Average loss at step 6200 : 1.53518264532 learning rate: 1.0
Minibatch perplexity: 4.79
Validation set perplexity: 4.28
Average loss at step 6300 : 1.54382590771 learning rate: 1.0
Minibatch perplexity: 5.26
Validation set perplexity: 4.25
Average loss at step 6400 : 1.53907721639 learning rate: 1.0
Minibatch perplexity: 4.28
Validation set perplexity: 4.25
Average loss at step 6500 : 1.55467021942 learning rate: 1.0
Minibatch perplexity: 4.73
Validation set perplexity: 4.24
Average loss at step 6600 : 1.59579352021 learning rate: 1.0
Minibatch perplexity: 4.84
Validation set perplexity: 4.23
Average loss at step 6700 : 1.58001167417 learning rate: 1.0
Minibatch perplexity: 4.97
Validation set perplexity: 4.25
Average loss at step 6800 : 1.60200345993 learning rate: 1.0
Minibatch perplexity: 4.68
Validation set perplexity: 4.25
Average loss at step 6900 : 1.57681329846 learning rate: 1.0
Minibatch perplexity: 4.65
Validation set perplexity: 4.27
Average loss at step 7000 : 1.57127949357 learning rate: 1.0
Minibatch perplexity: 4.86
================================================================================
fed inflation which releases bakes tranwars of a bear kenoins croile she user in
 mokt is to perional one jarabe s don have are would ratedia cleations b between
ties borning anza stand standations independen for tho assect of three eight thr
verally bindshough lastingres of exa was fight on world than pasio one nine de u
mueach those baveer discovclonos raw mern souses which got leams quepeddy fast c
================================================================================
Validation set perplexity: 4.26
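
The deprecation warning at the top of the run is informational: the notebook targets a pre-1.0 TensorFlow API, where tf.initialize_all_variables was the standard initializer. On TensorFlow 1.x, the equivalent first statement of the session block is the replacement the warning itself suggests:

tf.global_variables_initializer().run()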

Problem 1

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the previous output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
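
One way to set this up (a minimal sketch, assuming the same pre-1.0 TensorFlow API as the rest of the notebook): stack the four input-to-hidden matrices column-wise into a single [vocabulary_size, 4 * num_nodes] variable, do the same for the four hidden-to-hidden matrices and the biases, and split the combined pre-activations back into the four gates.

# Hypothetical combined parameters (sx, sm, sb are not names used above).
sx = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
sm = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
sb = tf.Variable(tf.zeros([1, 4 * num_nodes]))

def lstm_cell_combined(i, o, state):
  """Same math as lstm_cell above, but one matmul with i and one with o."""
  y = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
  # Pre-1.0 signature tf.split(split_dim, num_split, value); in TF 1.x: tf.split(y, 4, axis=1).
  y_input, y_forget, y_update, y_output = tf.split(1, 4, y)
  input_gate = tf.sigmoid(y_input)
  forget_gate = tf.sigmoid(y_forget)
  output_gate = tf.sigmoid(y_output)
  state = forget_gate * state + input_gate * tf.tanh(y_update)
  return output_gate * tf.tanh(state), state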



Problem 2

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings would produce a very sparse, computationally wasteful representation.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves (a minimal sketch of such a lookup follows this problem).

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this article.
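
For part a, a rough sketch of the embedding lookup, under the assumption that each bigram is encoded as a single integer id rather than a 1-hot vector (bigram2id, bigram_vocabulary_size and embedding_size are hypothetical names, not part of the notebook):

# 27 characters give 27 * 27 possible bigrams, each mapped to a dense vector.
bigram_vocabulary_size = vocabulary_size * vocabulary_size
embedding_size = 32  # hypothetical width

def bigram2id(c1, c2):
  return char2id(c1) * vocabulary_size + char2id(c2)

embeddings = tf.Variable(
  tf.random_uniform([bigram_vocabulary_size, embedding_size], -1.0, 1.0))
bigram_ids = tf.placeholder(tf.int32, shape=[batch_size])
embed = tf.nn.embedding_lookup(embeddings, bigram_ids)  # shape [batch_size, embedding_size]
# embed is what gets fed to the LSTM cell in place of the 1-hot input i, so the
# ix/fx/cx/ox matrices would take embedding_size rows instead of vocabulary_size.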



Problem 3

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

the quick brown fox

the model should attempt to output:

eht kciuq nworb xof

Reference: http://arxiv.org/abs/1409.3215
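
To make the target transformation unambiguous, here is a small plain-Python helper (a hypothetical name, for illustration) that produces the string the model should learn to emit; only the characters inside each word are reversed, the word order stays fixed:

def mirror_words(sentence):
  """Reverse each whitespace-delimited word in place."""
  return ' '.join(word[::-1] for word in sentence.split(' '))

print mirror_words('the quick brown fox')  # eht kciuq nworb xof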