6_lstm


Deep Learning

Assignment 6

After training a skip-gram model in 5_word2vec.ipynb, the goal of this notebook is to train an LSTM character model on the Text8 data.


In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip

In [3]:
def read_data(filename):
  # Use a context manager so the archive is actually closed; the original
  # f.close() sat after the return and was never reached.
  with zipfile.ZipFile(filename) as f:
    return tf.compat.as_str(f.read(f.namelist()[0]))
  
text = read_data(filename)
print('Data size %d' % len(text))


Data size 100000000

Create a small validation set.


In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])


99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl

Utility functions to map characters to vocabulary IDs and back.


In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))


Unexpected character: ï
1 26 0 0
a z  

Function to generate a training batch for the LSTM model.


In [6]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))


['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
[' a']
['an']
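The overlap in the printed strings above comes from the cursor logic: each batch row reads its own contiguous segment of the text, and every call to `_next_batch` advances all cursors by one character. Here is a minimal standalone sketch of just that cursor mechanism, using a toy string and hypothetical names (independent of the `BatchGenerator` class above):

```python
# Sketch of the cursor logic: row b reads from its own segment of the text,
# and successive calls continue one character further in every segment.
toy_text = 'abcdefghijklmnopqrst'  # 20 characters, hypothetical toy input
toy_batch_size = 4
segment = len(toy_text) // toy_batch_size          # 5
cursor = [offset * segment for offset in range(toy_batch_size)]  # [0, 5, 10, 15]

def next_chars():
  """Return one character per batch row and advance every cursor."""
  chars = [toy_text[c] for c in cursor]
  for b in range(toy_batch_size):
    cursor[b] = (cursor[b] + 1) % len(toy_text)
  return chars

first = next_chars()   # ['a', 'f', 'k', 'p']
second = next_chars()  # ['b', 'g', 'l', 'q']
```

Each row of `second` continues exactly where the same row of `first` stopped, which is why the real generator can keep the LSTM state across calls.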

I always find it useful to display the shapes or contents of variables to better understand their structure:


In [7]:
print(train_batches.next()[1].shape)
print(len(train_text) // batch_size)
print(len(string.ascii_lowercase))
print(np.zeros(shape=(2, 4), dtype=np.float))


(64, 27)
1562484
26
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]

In [8]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
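`sample_distribution` is inverse-CDF sampling: draw a uniform r and return the first index whose cumulative probability reaches it, so each index is picked in proportion to its probability. The sketch below (a standalone re-implementation, not wired into the notebook's graph) checks this, and also sanity-checks `logprob`: a uniform prediction over the 27 characters gives perplexity exactly 27, matching the ~27 printed at training step 0 further down.

```python
import random
import numpy as np

def sample_distribution(distribution):
  """Inverse-CDF sampling: first index whose cumulative prob reaches r."""
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

random.seed(0)
counts = [0, 0, 0]
for _ in range(10000):
  counts[sample_distribution([0.1, 0.2, 0.7])] += 1
# Index 2 carries 70% of the mass, so it should collect the most samples.

# Perplexity sanity check: uniform predictions over a 27-char vocabulary.
predictions = np.full((5, 27), 1.0 / 27)
labels = np.eye(27)[:5]
logprob = np.sum(labels * -np.log(predictions)) / labels.shape[0]
perplexity = np.exp(logprob)  # ~27.0
```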

Simple LSTM Model.


In [9]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
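The cell computation in `lstm_cell` follows the standard LSTM equations: sigmoid input, forget, and output gates, a tanh candidate update, and a state that blends old state and candidate. A NumPy sketch of a single step, with hypothetical toy dimensions (not the graph above), mirrors the TensorFlow code line by line:

```python
import numpy as np

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
V, H = 27, 8  # toy vocabulary size and number of nodes
# One (input weight, recurrent weight, bias) triple per gate, as in the graph.
params = {g: (rng.uniform(-0.1, 0.1, (V, H)),
              rng.uniform(-0.1, 0.1, (H, H)),
              np.zeros((1, H))) for g in 'ifco'}

def lstm_step(i, o, state):
  """One LSTM step: compute gates from input i and previous output o."""
  def pre(g):
    x, m, b = params[g]
    return i @ x + o @ m + b
  input_gate = sigmoid(pre('i'))
  forget_gate = sigmoid(pre('f'))
  update = pre('c')
  state = forget_gate * state + input_gate * np.tanh(update)
  output_gate = sigmoid(pre('o'))
  return output_gate * np.tanh(state), state

x = np.eye(V)[[1]]          # one-hot 'a', shape (1, 27)
o = np.zeros((1, H))
state = np.zeros((1, H))
o, state = lstm_step(x, o, state)
```

Because the output is a sigmoid gate times a tanh, every entry of `o` stays strictly inside (-1, 1), which keeps the recurrent signal bounded across unrollings.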

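The optimizer above clips gradients by their global norm before applying them, which prevents one exploding backprop step through the unrolled LSTM from wrecking the weights. The rule `tf.clip_by_global_norm` implements: if the combined L2 norm of all gradients exceeds the threshold (1.25 here), rescale every gradient by threshold / global_norm. A NumPy sketch with toy gradients:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
  """Rescale all gradients jointly so their combined L2 norm <= clip_norm."""
  global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
  if global_norm > clip_norm:
    grads = [g * (clip_norm / global_norm) for g in grads]
  return grads, global_norm

toy_grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(toy_grads, 1.25)
# The combined norm of `clipped` now equals the 1.25 threshold; the relative
# directions of the gradients are preserved because all are scaled alike.
```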
In [10]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.296481 learning rate: 10.000000
Minibatch perplexity: 27.02
================================================================================
ysbunengslppeocc gagvepjeqaabtjaazieotn vnyiqvp a ie rwr  m gifvxgrvmrt lanxmytk
w oemaiiwforms sxiemlr gnktx eekuauapvvmspaztiezgewieao eirr a kszns me zxgsozsw
wsqeqzcxenft  utetqpxqc etnz  at  sb  jfol a   tlcaoeqs  amcjseanr  rna biavplpm
bunhys s nh  mcqzstbotrbabi eblnns iqezcbknlevnhpoafbi xrie oze m r tu shosrttd 
e ote nfivhaamiphqxxragw re aontpagnhpwqxrx theoty ow qovmc x bam nza uerihctfie
================================================================================
Validation set perplexity: 19.96
Average loss at step 100: 2.590483 learning rate: 10.000000
Minibatch perplexity: 10.32
Validation set perplexity: 10.51
Average loss at step 200: 2.245892 learning rate: 10.000000
Minibatch perplexity: 9.40
Validation set perplexity: 8.99
Average loss at step 300: 2.096432 learning rate: 10.000000
Minibatch perplexity: 7.45
Validation set perplexity: 7.64
Average loss at step 400: 2.005932 learning rate: 10.000000
Minibatch perplexity: 7.59
Validation set perplexity: 7.47
Average loss at step 500: 1.937780 learning rate: 10.000000
Minibatch perplexity: 6.38
Validation set perplexity: 7.11
Average loss at step 600: 1.908713 learning rate: 10.000000
Minibatch perplexity: 6.30
Validation set perplexity: 6.86
Average loss at step 700: 1.861192 learning rate: 10.000000
Minibatch perplexity: 5.56
Validation set perplexity: 6.74
Average loss at step 800: 1.820630 learning rate: 10.000000
Minibatch perplexity: 6.03
Validation set perplexity: 6.62
Average loss at step 900: 1.828721 learning rate: 10.000000
Minibatch perplexity: 7.15
Validation set perplexity: 6.20
Average loss at step 1000: 1.823988 learning rate: 10.000000
Minibatch perplexity: 5.85
================================================================================
gereng phs ciptiple two primed counts and the in mecielsticativen flors made at 
ling time two firvitial hivs disty is fiveding uct has fic zero yxpwarch of thei
adiving sidyian of the inframes for a indiblatity highen plich bakioral is hine 
iver by thea conpraces tichar the one nine seven five one suib with the nations 
relenes tre mintist it sidlers coorches whill stitiatan grecture trans the benoy
================================================================================
Validation set perplexity: 6.04
Average loss at step 1100: 1.777235 learning rate: 10.000000
Minibatch perplexity: 5.52
Validation set perplexity: 6.01
Average loss at step 1200: 1.750621 learning rate: 10.000000
Minibatch perplexity: 6.13
Validation set perplexity: 5.64
Average loss at step 1300: 1.728072 learning rate: 10.000000
Minibatch perplexity: 5.69
Validation set perplexity: 5.68
Average loss at step 1400: 1.740794 learning rate: 10.000000
Minibatch perplexity: 4.81
Validation set perplexity: 5.66
Average loss at step 1500: 1.736469 learning rate: 10.000000
Minibatch perplexity: 6.02
Validation set perplexity: 5.47
Average loss at step 1600: 1.743287 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 5.38
Average loss at step 1700: 1.709299 learning rate: 10.000000
Minibatch perplexity: 4.63
Validation set perplexity: 5.39
Average loss at step 1800: 1.675454 learning rate: 10.000000
Minibatch perplexity: 5.02
Validation set perplexity: 5.34
Average loss at step 1900: 1.648865 learning rate: 10.000000
Minibatch perplexity: 5.84
Validation set perplexity: 5.26
Average loss at step 2000: 1.693725 learning rate: 10.000000
Minibatch perplexity: 4.84
================================================================================
ing the potters indeass melans the darker stnut prejaling the guallu singagianis
zer was cear in systillp entineash hemper one zero five gatentable tod whet thre
retien one three to herevent it yohape plevites in there pardicus and densi and 
k of effeccially thet of morolips thimelabiner the fursited six three zero four 
gersings when cluse in his falted to banion from higs hevered deame overen in ui
================================================================================
Validation set perplexity: 5.18
Average loss at step 2100: 1.686851 learning rate: 10.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.95
Average loss at step 2200: 1.680103 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 5.05
Average loss at step 2300: 1.640816 learning rate: 10.000000
Minibatch perplexity: 5.62
Validation set perplexity: 4.87
Average loss at step 2400: 1.659278 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 4.84
Average loss at step 2500: 1.676898 learning rate: 10.000000
Minibatch perplexity: 5.91
Validation set perplexity: 4.63
Average loss at step 2600: 1.651337 learning rate: 10.000000
Minibatch perplexity: 5.78
Validation set perplexity: 4.79
Average loss at step 2700: 1.651017 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 4.59
Average loss at step 2800: 1.649643 learning rate: 10.000000
Minibatch perplexity: 4.96
Validation set perplexity: 4.54
Average loss at step 2900: 1.647356 learning rate: 10.000000
Minibatch perplexity: 4.63
Validation set perplexity: 4.55
Average loss at step 3000: 1.649903 learning rate: 10.000000
Minibatch perplexity: 4.76
================================================================================
fication thas periov in yeras of aprrombay companishist des ty cirstic usring ye
ularid of s inhe with that however one seven two praces tyondy and roal see air 
zed it old b withising one trial and contory writh howeven s become abreptions t
bbec of have and staten scountating the greaser or a that becount with thy ho li
plitic and augu playes to every revist more new transfartial clutht epoce awsist
================================================================================
Validation set perplexity: 4.71
Average loss at step 3100: 1.631064 learning rate: 10.000000
Minibatch perplexity: 5.73
Validation set perplexity: 4.65
Average loss at step 3200: 1.646096 learning rate: 10.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.61
Average loss at step 3300: 1.640934 learning rate: 10.000000
Minibatch perplexity: 5.73
Validation set perplexity: 4.46
Average loss at step 3400: 1.669624 learning rate: 10.000000
Minibatch perplexity: 6.08
Validation set perplexity: 4.61
Average loss at step 3500: 1.654194 learning rate: 10.000000
Minibatch perplexity: 5.64
Validation set perplexity: 4.58
Average loss at step 3600: 1.666263 learning rate: 10.000000
Minibatch perplexity: 4.99
Validation set perplexity: 4.46
Average loss at step 3700: 1.645868 learning rate: 10.000000
Minibatch perplexity: 5.56
Validation set perplexity: 4.46
Average loss at step 3800: 1.643696 learning rate: 10.000000
Minibatch perplexity: 4.88
Validation set perplexity: 4.57
Average loss at step 3900: 1.637875 learning rate: 10.000000
Minibatch perplexity: 6.21
Validation set perplexity: 4.59
Average loss at step 4000: 1.643843 learning rate: 10.000000
Minibatch perplexity: 4.74
================================================================================
perments are meforchoun portable goedromor and madrics prasticlion would tendes 
in virlaitura of the oventry firsis intenled gid found and the hivican live six 
most it emportinuston at calawen laits the intean ase indation chactuge berog so
king with peocual literial bejum so elfest compliar profaction have its are howe
d in only albathed and as the tases noctien a progreated order of the loza were 
================================================================================
Validation set perplexity: 4.54
Average loss at step 4100: 1.630176 learning rate: 10.000000
Minibatch perplexity: 5.27
Validation set perplexity: 4.71
Average loss at step 4200: 1.633497 learning rate: 10.000000
Minibatch perplexity: 4.92
Validation set perplexity: 4.46
Average loss at step 4300: 1.614631 learning rate: 10.000000
Minibatch perplexity: 5.19
Validation set perplexity: 4.52
Average loss at step 4400: 1.608853 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.29
Average loss at step 4500: 1.610827 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.43
Average loss at step 4600: 1.611605 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.44
Average loss at step 4700: 1.622293 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 4.42
Average loss at step 4800: 1.622890 learning rate: 10.000000
Minibatch perplexity: 5.01
Validation set perplexity: 4.48
Average loss at step 4900: 1.631913 learning rate: 10.000000
Minibatch perplexity: 4.90
Validation set perplexity: 4.56
Average loss at step 5000: 1.606248 learning rate: 1.000000
Minibatch perplexity: 5.40
================================================================================
ecally shan into mahk that the one foughtibn of spirp under of aid and more defi
land but line hore which conventer of the committorchen bus king despire of two 
d the animan portaing on skesside over jastu this bay westorder isilor one six t
que following not mihs often statistophis they sore a mbschishap foul made a pen
ricted widst are is pathosed four the letter shied unetratid ade deficien sammeg
================================================================================
Validation set perplexity: 4.58
Average loss at step 5100: 1.603605 learning rate: 1.000000
Minibatch perplexity: 4.99
Validation set perplexity: 4.38
Average loss at step 5200: 1.591440 learning rate: 1.000000
Minibatch perplexity: 4.68
Validation set perplexity: 4.34
Average loss at step 5300: 1.577866 learning rate: 1.000000
Minibatch perplexity: 5.60
Validation set perplexity: 4.31
Average loss at step 5400: 1.575936 learning rate: 1.000000
Minibatch perplexity: 5.34
Validation set perplexity: 4.29
Average loss at step 5500: 1.567288 learning rate: 1.000000
Minibatch perplexity: 5.07
Validation set perplexity: 4.27
Average loss at step 5600: 1.573526 learning rate: 1.000000
Minibatch perplexity: 4.56
Validation set perplexity: 4.26
Average loss at step 5700: 1.563047 learning rate: 1.000000
Minibatch perplexity: 4.66
Validation set perplexity: 4.26
Average loss at step 5800: 1.579404 learning rate: 1.000000
Minibatch perplexity: 5.01
Validation set perplexity: 4.26
Average loss at step 5900: 1.572157 learning rate: 1.000000
Minibatch perplexity: 4.48
Validation set perplexity: 4.25
Average loss at step 6000: 1.546779 learning rate: 1.000000
Minibatch perplexity: 4.89
================================================================================
vincy whiles kew had goftrated bayi for by alper people conswardinams is draphs 
zer one nine five six six as a cadi with is abrict of two five whone syrbuc the 
wer thenry newpereeches music s was on a tupk the nine him attembly a marmary mo
can a penorty includes for  edections that the one six six five nine nine two on
hull b englingle are a proposted hu had and one nine seven five liberidary and c
================================================================================
Validation set perplexity: 4.24
Average loss at step 6100: 1.564810 learning rate: 1.000000
Minibatch perplexity: 4.16
Validation set perplexity: 4.21
Average loss at step 6200: 1.539832 learning rate: 1.000000
Minibatch perplexity: 5.40
Validation set perplexity: 4.22
Average loss at step 6300: 1.543234 learning rate: 1.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.19
Average loss at step 6400: 1.539422 learning rate: 1.000000
Minibatch perplexity: 4.94
Validation set perplexity: 4.20
Average loss at step 6500: 1.553795 learning rate: 1.000000
Minibatch perplexity: 4.40
Validation set perplexity: 4.19
Average loss at step 6600: 1.592802 learning rate: 1.000000
Minibatch perplexity: 4.74
Validation set perplexity: 4.21
Average loss at step 6700: 1.575369 learning rate: 1.000000
Minibatch perplexity: 4.41
Validation set perplexity: 4.20
Average loss at step 6800: 1.601404 learning rate: 1.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.19
Average loss at step 6900: 1.578345 learning rate: 1.000000
Minibatch perplexity: 4.60
Validation set perplexity: 4.22
Average loss at step 7000: 1.572104 learning rate: 1.000000
Minibatch perplexity: 4.82
================================================================================
quishes and as an ourcifics be azardaing incoporsion is shows lought that slow b
x msd cut yight chradgen people eduslieds and got washards proless cop profess i
warding practonism of the onles docyx locking the first part the volomownada rea
q him hand hey have american had after about purriess between margos remarnity w
h programmatius by the bortination astronary and samitudy is impreversings makpa
================================================================================
Validation set perplexity: 4.19

Problem 1

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
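The fusion this problem asks for works because matrix multiplication distributes over column-wise concatenation: x @ [A | B] = [x @ A | x @ B]. So one multiply against a 4x-wide matrix followed by a split reproduces the four separate gate products. A quick NumPy check with hypothetical small shapes:

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.randn(2, 27)                          # a toy batch of inputs
mats = [rng.randn(27, 64) for _ in range(4)]  # stand-ins for ix, fx, cx, ox

fused = np.concatenate(mats, axis=1)          # shape (27, 256)
parts = np.split(x @ fused, 4, axis=1)        # one matmul, then split

# Each split piece matches the corresponding separate product.
ok = all(np.allclose(p, x @ m) for p, m in zip(parts, mats))
```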



In [11]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Concatenate the per-gate parameters so each step needs one matmul per input.
  sx = tf.concat(1, [ix, fx, cx, ox])
  sm = tf.concat(1, [im, fm, cm, om])
  sb = tf.concat(1, [ib, fb, cb, ob])
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    smatmul = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
    smatmul_input, smatmul_forget, update, smatmul_output = tf.split(1, 4, smatmul)
    input_gate = tf.sigmoid(smatmul_input)
    forget_gate = tf.sigmoid(smatmul_forget)
    output_gate = tf.sigmoid(smatmul_output)
    #input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    #forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    #update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    #output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [12]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.297115 learning rate: 10.000000
Minibatch perplexity: 27.03
================================================================================
yafqiklmzuicdll tyqzeqmblto juwh knmeuy  jt et  loqezts kave qleevefbsegririkidu
ah   xo c ufe dre y ai knq rc lf ugleeninvedxkhfkzo tyfheeeczltkso e ooedncbepgk
wcpeal bscdbpaeeh de ixgequ hyeiabbxvseyeyezkhlxiisemcnqahfxcoprtnyvir oceeaeyv 
fvaesz tbat ssokiqn  xnnpoz it isisgxdzqjni teyieangbnrep ldsjg ghxufx  gxe mebv
khoseotwxo eov hdcewtq zj olqahxfdfnld e mtnh  qhaqfvsfiggebbtoen miowodtnihg th
================================================================================
Validation set perplexity: 20.20
Average loss at step 100: 2.596444 learning rate: 10.000000
Minibatch perplexity: 10.63
Validation set perplexity: 11.01
Average loss at step 200: 2.256601 learning rate: 10.000000
Minibatch perplexity: 9.29
Validation set perplexity: 9.01
Average loss at step 300: 2.091136 learning rate: 10.000000
Minibatch perplexity: 7.43
Validation set perplexity: 8.21
Average loss at step 400: 2.033858 learning rate: 10.000000
Minibatch perplexity: 6.94
Validation set perplexity: 7.70
Average loss at step 500: 1.979009 learning rate: 10.000000
Minibatch perplexity: 6.80
Validation set perplexity: 7.15
Average loss at step 600: 1.891969 learning rate: 10.000000
Minibatch perplexity: 6.55
Validation set perplexity: 6.85
Average loss at step 700: 1.865489 learning rate: 10.000000
Minibatch perplexity: 7.15
Validation set perplexity: 6.51
Average loss at step 800: 1.863540 learning rate: 10.000000
Minibatch perplexity: 6.81
Validation set perplexity: 6.43
Average loss at step 900: 1.840830 learning rate: 10.000000
Minibatch perplexity: 6.57
Validation set perplexity: 6.26
Average loss at step 1000: 1.834599 learning rate: 10.000000
Minibatch perplexity: 6.26
================================================================================
s dnocigat dwilvac liltanimally  as arter progugatary of in are one pateal chole
 aikinger not man jochial stalt geot caire these carrehts one one five even zero
ce prenife out fill signary wlloson num nine in gridth b lith and home over one 
it caletical a proqued to to palbodchd thut uinsinelord as eahn but dadachser th
z a prochral quring retiric turears wresh it alto six virrows not prolept bustak
================================================================================
Validation set perplexity: 6.04
Average loss at step 1100: 1.792052 learning rate: 10.000000
Minibatch perplexity: 5.63
Validation set perplexity: 6.01
Average loss at step 1200: 1.767267 learning rate: 10.000000
Minibatch perplexity: 6.20
Validation set perplexity: 5.95
Average loss at step 1300: 1.756417 learning rate: 10.000000
Minibatch perplexity: 5.67
Validation set perplexity: 5.79
Average loss at step 1400: 1.759507 learning rate: 10.000000
Minibatch perplexity: 6.03
Validation set perplexity: 5.66
Average loss at step 1500: 1.740990 learning rate: 10.000000
Minibatch perplexity: 5.66
Validation set perplexity: 5.40
Average loss at step 1600: 1.728305 learning rate: 10.000000
Minibatch perplexity: 6.28
Validation set perplexity: 5.59
Average loss at step 1700: 1.711466 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 5.42
Average loss at step 1800: 1.685569 learning rate: 10.000000
Minibatch perplexity: 4.52
Validation set perplexity: 5.37
Average loss at step 1900: 1.696628 learning rate: 10.000000
Minibatch perplexity: 4.67
Validation set perplexity: 5.23
Average loss at step 2000: 1.677520 learning rate: 10.000000
Minibatch perplexity: 5.62
================================================================================
linopogte womeffice iss nota inclutising gudaticalibed s bubloupa and tent of on
kin to itreless is user wornnoliteci file keach shobiopors repomits in fromestin
dinatic placiles bot diost froulh one nine interness one nines one five seven ni
mouth was whele coen with act lerger there altixe ive imbolison as histoxami com
renly in trle its orevanional nation from atticus nation assomentaliant of letti
================================================================================
Validation set perplexity: 5.35
Average loss at step 2100: 1.686659 learning rate: 10.000000
Minibatch perplexity: 5.71
Validation set perplexity: 5.18
Average loss at step 2200: 1.699840 learning rate: 10.000000
Minibatch perplexity: 5.03
Validation set perplexity: 5.04
Average loss at step 2300: 1.701436 learning rate: 10.000000
Minibatch perplexity: 5.61
Validation set perplexity: 5.04
Average loss at step 2400: 1.678378 learning rate: 10.000000
Minibatch perplexity: 4.78
Validation set perplexity: 4.98
Average loss at step 2500: 1.690642 learning rate: 10.000000
Minibatch perplexity: 5.21
Validation set perplexity: 4.99
Average loss at step 2600: 1.667896 learning rate: 10.000000
Minibatch perplexity: 5.33
Validation set perplexity: 4.96
Average loss at step 2700: 1.678243 learning rate: 10.000000
Minibatch perplexity: 5.61
Validation set perplexity: 4.99
Average loss at step 2800: 1.674157 learning rate: 10.000000
Minibatch perplexity: 4.39
Validation set perplexity: 5.13
Average loss at step 2900: 1.672686 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 5.01
Average loss at step 3000: 1.683866 learning rate: 10.000000
Minibatch perplexity: 5.16
================================================================================
h ditished is akeriad inspolusin anirace vinzubctivilics procists populanca yaad
yons frequentyes augheved funtorn the distant in the loseros of sillow beganilat
wert prenfs kand where tograte subam advenction islee is permany educional pario
ungany severative the among that to sevent the sleasings units clanded agries st
ppholed be refective that nims meike frenchona textliff forcesendor in pre natem
================================================================================
Validation set perplexity: 4.93
Average loss at step 3100: 1.647754 learning rate: 10.000000
Minibatch perplexity: 5.99
Validation set perplexity: 4.98
Average loss at step 3200: 1.629427 learning rate: 10.000000
Minibatch perplexity: 4.52
Validation set perplexity: 4.97
Average loss at step 3300: 1.643257 learning rate: 10.000000
Minibatch perplexity: 5.44
Validation set perplexity: 4.81
Average loss at step 3400: 1.623889 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 4.96
Average loss at step 3500: 1.669602 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 4.96
Average loss at step 3600: 1.647169 learning rate: 10.000000
Minibatch perplexity: 5.29
Validation set perplexity: 4.75
Average loss at step 3700: 1.650060 learning rate: 10.000000
Minibatch perplexity: 5.65
Validation set perplexity: 4.91
Average loss at step 3800: 1.653223 learning rate: 10.000000
Minibatch perplexity: 5.78
Validation set perplexity: 4.79
Average loss at step 3900: 1.647023 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 4.91
Average loss at step 4000: 1.636907 learning rate: 10.000000
Minibatch perplexity: 5.31
================================================================================
kipage lazin russive acis fative ufic doe one six two s han all seritace one hom
tional argent that it ik transparaed somee thing many course for the archerages 
ight with the new frisone a famoustal for cles ambio conseng one five zero to a 
is ganadua dianchumamar irams trae recognite and wigh in award folloms ra swarth
fucting the famout of ecesparahs thus both statide serving by pociluted candarg 
================================================================================
Validation set perplexity: 4.78
Average loss at step 4100: 1.616329 learning rate: 10.000000
Minibatch perplexity: 4.73
Validation set perplexity: 4.54
Average loss at step 4200: 1.609911 learning rate: 10.000000
Minibatch perplexity: 5.27
Validation set perplexity: 4.73
Average loss at step 4300: 1.617847 learning rate: 10.000000
Minibatch perplexity: 5.09
Validation set perplexity: 4.75
Average loss at step 4400: 1.605607 learning rate: 10.000000
Minibatch perplexity: 5.89
Validation set perplexity: 4.71
Average loss at step 4500: 1.637116 learning rate: 10.000000
Minibatch perplexity: 5.23
Validation set perplexity: 4.77
Average loss at step 4600: 1.622026 learning rate: 10.000000
Minibatch perplexity: 5.78
Validation set perplexity: 4.70
Average loss at step 4700: 1.616178 learning rate: 10.000000
Minibatch perplexity: 4.87
Validation set perplexity: 4.72
Average loss at step 4800: 1.609194 learning rate: 10.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.70
Average loss at step 4900: 1.616821 learning rate: 10.000000
Minibatch perplexity: 5.67
Validation set perplexity: 4.51
Average loss at step 5000: 1.609718 learning rate: 1.000000
Minibatch perplexity: 4.52
================================================================================
an resuld of the own they cosm the one nine eight iria such are prited discuting
jests frbm with the zaudjist waters are art and sate denariol and belend on give
one us speeshremen contenturus durious s innucting as the emb distinne is ervint
ed on the develoost of anstalored great which tigan statari in host rotem in tra
f two onising two zero zero six flegj one nine nine six duepmor one six smesp to
================================================================================
Validation set perplexity: 4.73
Average loss at step 5100: 1.588180 learning rate: 1.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.57
Average loss at step 5200: 1.589828 learning rate: 1.000000
Minibatch perplexity: 5.13
Validation set perplexity: 4.54
Average loss at step 5300: 1.588764 learning rate: 1.000000
Minibatch perplexity: 4.47
Validation set perplexity: 4.54
Average loss at step 5400: 1.586965 learning rate: 1.000000
Minibatch perplexity: 4.81
Validation set perplexity: 4.52
Average loss at step 5500: 1.583831 learning rate: 1.000000
Minibatch perplexity: 4.63
Validation set perplexity: 4.49
Average loss at step 5600: 1.554910 learning rate: 1.000000
Minibatch perplexity: 4.73
Validation set perplexity: 4.46
Average loss at step 5700: 1.572744 learning rate: 1.000000
Minibatch perplexity: 5.59
Validation set perplexity: 4.45
Average loss at step 5800: 1.593644 learning rate: 1.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.47
Average loss at step 5900: 1.578733 learning rate: 1.000000
Minibatch perplexity: 4.78
Validation set perplexity: 4.48
Average loss at step 6000: 1.577548 learning rate: 1.000000
Minibatch perplexity: 4.55
================================================================================
frimiarsed readinall distordingah labon janes to the milatives suffichanged s ba
ry henre triev files in the explivarity of the orpan to the urchicial pifsion co
 puble enown despror has finds whethest of air me one amproron plants avamon to 
an lared paranneveder pubrancy awerates and in hungudge in the yous to s of the 
s huldia to a canal in the distance gamenal da set occologen be workfication bas
================================================================================
Validation set perplexity: 4.45
Average loss at step 6100: 1.574233 learning rate: 1.000000
Minibatch perplexity: 5.21
Validation set perplexity: 4.50
Average loss at step 6200: 1.585383 learning rate: 1.000000
Minibatch perplexity: 4.81
Validation set perplexity: 4.53
Average loss at step 6300: 1.587027 learning rate: 1.000000
Minibatch perplexity: 5.95
Validation set perplexity: 4.53
Average loss at step 6400: 1.566589 learning rate: 1.000000
Minibatch perplexity: 4.85
Validation set perplexity: 4.52
Average loss at step 6500: 1.554185 learning rate: 1.000000
Minibatch perplexity: 4.61
Validation set perplexity: 4.53
Average loss at step 6600: 1.600431 learning rate: 1.000000
Minibatch perplexity: 5.99
Validation set perplexity: 4.51
Average loss at step 6700: 1.564465 learning rate: 1.000000
Minibatch perplexity: 4.71
Validation set perplexity: 4.50
Average loss at step 6800: 1.574971 learning rate: 1.000000
Minibatch perplexity: 5.34
Validation set perplexity: 4.55
Average loss at step 6900: 1.566624 learning rate: 1.000000
Minibatch perplexity: 4.85
Validation set perplexity: 4.48
Average loss at step 7000: 1.584673 learning rate: 1.000000
Minibatch perplexity: 5.10
================================================================================
phout one nine alone nine from he he burb with the sropes ver united out then we
gubinising albitt druma paptional good known of speciprenublations aga widhabans
ques now romairs that have e cipatical necelogated fighte pain bolly mag named s
bolsce butnines to the four zero five three one nine nine estomogopeoge instire 
dure on the origin lingla in the hardogn to the was imborly univeory so shunt am
================================================================================
Validation set perplexity: 4.50
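
The perplexities reported above are the exponential of the average per-character negative log-probability, computed by the `logprob` helper defined earlier in the notebook. A minimal NumPy sketch (the helper's exact shape is assumed); note that a model predicting uniformly over the 27-character vocabulary scores perplexity 27, which matches the minibatch perplexity at step 0:

```python
import numpy as np

def logprob(predictions, labels):
    """Average negative log-probability (in nats) of the true labels under
    the predicted distributions; labels are one-hot rows."""
    predictions = np.clip(predictions, 1e-10, 1.0)  # avoid log(0)
    return -np.sum(labels * np.log(predictions)) / labels.shape[0]

vocab = 27
preds = np.full((4, vocab), 1.0 / vocab)   # uniform predictions
labels = np.eye(vocab)[[0, 5, 12, 26]]     # arbitrary one-hot targets
perplexity = float(np.exp(logprob(preds, labels)))
# perplexity == 27.0 for a uniform predictor
```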

Problem 2

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM as 1-hot encodings leads to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this article.
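
Since the character vocabulary has 27 symbols, there are 27 × 27 = 729 possible bigrams. One simple way to index them for part (b), sketched here with illustrative helper names (`bigram2id` and `id2bigram` are not part of the notebook):

```python
import string

VOCAB = len(string.ascii_lowercase) + 1  # 26 letters + space = 27 per character

def bigram2id(b):
    """Map a two-character string over [a-z ] to an id in [0, 27 * 27)."""
    def c2i(c):
        return ord(c) - ord('a') + 1 if c in string.ascii_lowercase else 0
    return c2i(b[0]) * VOCAB + c2i(b[1])

def id2bigram(i):
    """Inverse mapping: recover the bigram from its id."""
    def i2c(k):
        return chr(k - 1 + ord('a')) if k > 0 else ' '
    return i2c(i // VOCAB) + i2c(i % VOCAB)

# bigram2id('ab') == 1 * 27 + 2 == 29, and id2bigram round-trips it
```

These ids would feed an embedding lookup of size 729 × embedding_size rather than 729-wide 1-hot vectors.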


Let's first adapt the LSTM to single-character input with embeddings. The feed_dict is unchanged; the embeddings are looked up from the one-hot inputs inside the graph. Note that the output is a probability distribution over the possible characters, not an embedding.
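
The lookup used in the cell below is mathematically equivalent to multiplying the one-hot input row by the embedding table, just without materializing the sparse matrix product. A quick NumPy check of that equivalence (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 27, 128
embeddings = rng.standard_normal((vocab, dim))

one_hot = np.zeros((1, vocab))
one_hot[0, 5] = 1.0  # the character with id 5

# Indexing by argmax of the one-hot row selects the same embedding row
# as the dense matrix product one_hot @ embeddings.
looked_up = embeddings[np.argmax(one_hot, axis=1)]
dense = one_hot @ embeddings
assert np.allclose(looked_up, dense)
```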


In [13]:
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  vocabulary_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    i_embed = tf.nn.embedding_lookup(
      vocabulary_embeddings, tf.argmax(i, dimension=1))
    output, state = lstm_cell(i_embed, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  sample_input_embedding = tf.nn.embedding_lookup(
    vocabulary_embeddings, tf.argmax(sample_input, dimension=1))
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
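
For reference, the cell computation above can be mirrored in plain NumPy. This sketch (all names illustrative) reproduces the same gate equations and checks that, starting from a zero state, the output stays strictly inside (-1, 1), since it is a sigmoid gate times a tanh:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_np(i, o, state, p):
    """NumPy mirror of the lstm_cell defined in the graph above;
    p is a dict holding the gate parameters (ix, im, ib, ...)."""
    input_gate = sigmoid(i @ p['ix'] + o @ p['im'] + p['ib'])
    forget_gate = sigmoid(i @ p['fx'] + o @ p['fm'] + p['fb'])
    update = i @ p['cx'] + o @ p['cm'] + p['cb']
    state = forget_gate * state + input_gate * np.tanh(update)
    output_gate = sigmoid(i @ p['ox'] + o @ p['om'] + p['ob'])
    return output_gate * np.tanh(state), state

rng = np.random.default_rng(1)
dim, nodes = 8, 4
p = {k: rng.standard_normal((dim, nodes)) * 0.1 for k in ('ix', 'fx', 'cx', 'ox')}
p.update({k: rng.standard_normal((nodes, nodes)) * 0.1 for k in ('im', 'fm', 'cm', 'om')})
p.update({k: np.zeros((1, nodes)) for k in ('ib', 'fb', 'cb', 'ob')})

out, st = lstm_cell_np(rng.standard_normal((1, dim)),
                       np.zeros((1, nodes)), np.zeros((1, nodes)), p)
# out is bounded: |out| < 1 elementwise
```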

In [14]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.298660 learning rate: 10.000000
Minibatch perplexity: 27.08
================================================================================
qnh vrumdgy alikrxhfi sungvt jebthempdekvu aavrrqm kl ntlvpjwlcyjiybizt ashgw t 
uz em krrdw  pje segode uffvzeendn e eosaltpkrisuhxvlykx xaofjstdh milcxnoksgoae
w  cxhylratk v  pe o grftepc tey meefamtrmpstkn jbibfttht of gcgltje nccxlenegag
wonlqmdc lpetrfw  je ofdrq xhnhz n  les eryttqjqjdt sfye l geonuckifmvoeluikswar
d  qoyrps  dsh tbs phfdponfketsnmtnvebyfkaoftfntctvxtymr wokates byxcubadc fhaaj
================================================================================
Validation set perplexity: 18.92
Average loss at step 100: 2.281275 learning rate: 10.000000
Minibatch perplexity: 8.62
Validation set perplexity: 8.51
Average loss at step 200: 2.023276 learning rate: 10.000000
Minibatch perplexity: 6.84
Validation set perplexity: 7.78
Average loss at step 300: 1.923201 learning rate: 10.000000
Minibatch perplexity: 6.25
Validation set perplexity: 6.69
Average loss at step 400: 1.866552 learning rate: 10.000000
Minibatch perplexity: 6.35
Validation set perplexity: 6.67
Average loss at step 500: 1.889677 learning rate: 10.000000
Minibatch perplexity: 5.88
Validation set perplexity: 6.34
Average loss at step 600: 1.818804 learning rate: 10.000000
Minibatch perplexity: 6.18
Validation set perplexity: 6.14
Average loss at step 700: 1.802237 learning rate: 10.000000
Minibatch perplexity: 5.30
Validation set perplexity: 6.11
Average loss at step 800: 1.793037 learning rate: 10.000000
Minibatch perplexity: 5.87
Validation set perplexity: 5.95
Average loss at step 900: 1.788941 learning rate: 10.000000
Minibatch perplexity: 5.15
Validation set perplexity: 5.71
Average loss at step 1000: 1.723339 learning rate: 10.000000
Minibatch perplexity: 5.42
================================================================================
wing was to bairage s up distlicutions of or land occoscion pracdryug diectional
ments fity highed famal reportibialy used of s prignes on to plart in sege boint
am of was opbigificaly tray in commin formcationally viing represents timin of p
iga actation or highunger parrar cordinical tinaturester if arminically as as re
zin to theirs while and the u one nifger six two smeven six bosh main instuts ca
================================================================================
Validation set perplexity: 5.59
Average loss at step 1100: 1.706271 learning rate: 10.000000
Minibatch perplexity: 6.31
Validation set perplexity: 5.82
Average loss at step 1200: 1.731498 learning rate: 10.000000
Minibatch perplexity: 6.18
Validation set perplexity: 5.92
Average loss at step 1300: 1.712198 learning rate: 10.000000
Minibatch perplexity: 5.64
Validation set perplexity: 5.60
Average loss at step 1400: 1.691341 learning rate: 10.000000
Minibatch perplexity: 4.88
Validation set perplexity: 5.52
Average loss at step 1500: 1.689228 learning rate: 10.000000
Minibatch perplexity: 6.23
Validation set perplexity: 5.45
Average loss at step 1600: 1.685235 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 5.43
Average loss at step 1700: 1.715750 learning rate: 10.000000
Minibatch perplexity: 5.47
Validation set perplexity: 5.24
Average loss at step 1800: 1.679230 learning rate: 10.000000
Minibatch perplexity: 5.54
Validation set perplexity: 5.39
Average loss at step 1900: 1.682144 learning rate: 10.000000
Minibatch perplexity: 6.14
Validation set perplexity: 5.44
Average loss at step 2000: 1.688184 learning rate: 10.000000
Minibatch perplexity: 5.13
================================================================================
ure is of masii applica phorth jould phan streapwark carderriors a the recena di
jacy the mine of hitroducarn life to daira dice activablic directict i for the t
man a fortuombent mesord ordwollding the d saver the is chancom basix five onle 
milies oven markn n mok baying cares fortactations variabrite varis that atton t
s of na die daction what etight syre glow profict be basqainfin haman mare that 
================================================================================
Validation set perplexity: 5.47
Average loss at step 2100: 1.683748 learning rate: 10.000000
Minibatch perplexity: 5.77
Validation set perplexity: 5.26
Average loss at step 2200: 1.649001 learning rate: 10.000000
Minibatch perplexity: 5.21
Validation set perplexity: 5.31
Average loss at step 2300: 1.665940 learning rate: 10.000000
Minibatch perplexity: 4.96
Validation set perplexity: 5.17
Average loss at step 2400: 1.666254 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 5.15
Average loss at step 2500: 1.692078 learning rate: 10.000000
Minibatch perplexity: 4.98
Validation set perplexity: 5.16
Average loss at step 2600: 1.661568 learning rate: 10.000000
Minibatch perplexity: 5.44
Validation set perplexity: 5.12
Average loss at step 2700: 1.677059 learning rate: 10.000000
Minibatch perplexity: 5.34
Validation set perplexity: 5.01
Average loss at step 2800: 1.641564 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 5.28
Average loss at step 2900: 1.650296 learning rate: 10.000000
Minibatch perplexity: 5.17
Validation set perplexity: 4.96
Average loss at step 3000: 1.654272 learning rate: 10.000000
Minibatch perplexity: 5.93
================================================================================
nical time the east wide varues sore eithern the after majord tauk than explanev
minate in the celosn was sucring uses in the opposeal princametics in batking di
vict three seven the defishuanium spartinatheral ideas the increze first german 
more the he page as waif u states of the sayash the apriat systemman the mil fol
vel who the knights anivers which weilie in the callent may the segally red the 
================================================================================
Validation set perplexity: 5.08
Average loss at step 3100: 1.648604 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 4.97
Average loss at step 3200: 1.647973 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 4.88
Average loss at step 3300: 1.634662 learning rate: 10.000000
Minibatch perplexity: 5.65
Validation set perplexity: 5.08
Average loss at step 3400: 1.635991 learning rate: 10.000000
Minibatch perplexity: 4.93
Validation set perplexity: 4.92
Average loss at step 3500: 1.626289 learning rate: 10.000000
Minibatch perplexity: 5.08
Validation set perplexity: 4.95
Average loss at step 3600: 1.629943 learning rate: 10.000000
Minibatch perplexity: 5.18
Validation set perplexity: 5.06
Average loss at step 3700: 1.631718 learning rate: 10.000000
Minibatch perplexity: 6.13
Validation set perplexity: 4.99
Average loss at step 3800: 1.623823 learning rate: 10.000000
Minibatch perplexity: 5.37
Validation set perplexity: 4.80
Average loss at step 3900: 1.623366 learning rate: 10.000000
Minibatch perplexity: 5.01
Validation set perplexity: 5.00
Average loss at step 4000: 1.624305 learning rate: 10.000000
Minibatch perplexity: 4.74
================================================================================
gna and awayar than dnears unting the newhalkima stough mainft asso ledits compe
dest the pent supernishus calleviobabitustion often the region dvteing regues on
ced the consexiss exums a deferation nating mility termering the ame one four ze
new he poliman game wing one nine nine eight one eight rusht diclude karsonh a i
ment havy the supersions the waiteds broxs the me that in the wem these sevent n
================================================================================
Validation set perplexity: 5.09
Average loss at step 4100: 1.627792 learning rate: 10.000000
Minibatch perplexity: 5.34
Validation set perplexity: 5.04
Average loss at step 4200: 1.613134 learning rate: 10.000000
Minibatch perplexity: 4.17
Validation set perplexity: 4.97
Average loss at step 4300: 1.601257 learning rate: 10.000000
Minibatch perplexity: 4.81
Validation set perplexity: 5.09
Average loss at step 4400: 1.629355 learning rate: 10.000000
Minibatch perplexity: 5.85
Validation set perplexity: 5.10
Average loss at step 4500: 1.638179 learning rate: 10.000000
Minibatch perplexity: 4.87
Validation set perplexity: 4.91
Average loss at step 4600: 1.641622 learning rate: 10.000000
Minibatch perplexity: 5.57
Validation set perplexity: 4.86
Average loss at step 4700: 1.612023 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 5.02
Average loss at step 4800: 1.598413 learning rate: 10.000000
Minibatch perplexity: 5.06
Validation set perplexity: 5.05
Average loss at step 4900: 1.611767 learning rate: 10.000000
Minibatch perplexity: 4.95
Validation set perplexity: 4.84
Average loss at step 5000: 1.639902 learning rate: 1.000000
Minibatch perplexity: 5.98
================================================================================
dara states whas syrwiving gainsite exicrant where amedinagejurtz deat lokee pac
ell as extempersics forms ferches is entire of saq is of mithers other braptrawi
nication bylah constamminetolard ware ressuestand that the like the vicials fure
male to kus undhemical one nine towile lay calvuil durposes of casmers nasbe of 
n by the elective law deventuctrtion writinger in the companfy offer teas the in
================================================================================
Validation set perplexity: 4.90
Average loss at step 5100: 1.623560 learning rate: 1.000000
Minibatch perplexity: 4.71
Validation set perplexity: 4.64
Average loss at step 5200: 1.605692 learning rate: 1.000000
Minibatch perplexity: 4.74
Validation set perplexity: 4.58
Average loss at step 5300: 1.576967 learning rate: 1.000000
Minibatch perplexity: 4.78
Validation set perplexity: 4.59
Average loss at step 5400: 1.573950 learning rate: 1.000000
Minibatch perplexity: 4.85
Validation set perplexity: 4.56
Average loss at step 5500: 1.561505 learning rate: 1.000000
Minibatch perplexity: 5.13
Validation set perplexity: 4.59
Average loss at step 5600: 1.589052 learning rate: 1.000000
Minibatch perplexity: 4.96
Validation set perplexity: 4.53
Average loss at step 5700: 1.542748 learning rate: 1.000000
Minibatch perplexity: 4.68
Validation set perplexity: 4.55
Average loss at step 5800: 1.551386 learning rate: 1.000000
Minibatch perplexity: 4.72
Validation set perplexity: 4.50
Average loss at step 5900: 1.571535 learning rate: 1.000000
Minibatch perplexity: 4.51
Validation set perplexity: 4.51
Average loss at step 6000: 1.540324 learning rate: 1.000000
Minibatch perplexity: 4.80
================================================================================
unds in a cloybol sake for the an using the gur northing dum time on only bart r
bra german but the certh diction but reactive god of maxall to britar is sophic 
de edutombia head a runde moders arehim pubser earnier laws on so represent of t
makes to a nan stole s birthsmanny extrobatlet of ten one and enter gene there a
hered the changes survingual to the ban and aschahialism with was five s heal we
================================================================================
Validation set perplexity: 4.51
Average loss at step 6100: 1.558245 learning rate: 1.000000
Minibatch perplexity: 4.80
Validation set perplexity: 4.52
Average loss at step 6200: 1.577731 learning rate: 1.000000
Minibatch perplexity: 4.57
Validation set perplexity: 4.51
Average loss at step 6300: 1.590316 learning rate: 1.000000
Minibatch perplexity: 5.08
Validation set perplexity: 4.47
Average loss at step 6400: 1.624058 learning rate: 1.000000
Minibatch perplexity: 5.18
Validation set perplexity: 4.42
Average loss at step 6500: 1.615708 learning rate: 1.000000
Minibatch perplexity: 4.78
Validation set perplexity: 4.41
Average loss at step 6600: 1.583319 learning rate: 1.000000
Minibatch perplexity: 5.44
Validation set perplexity: 4.40
Average loss at step 6700: 1.573038 learning rate: 1.000000
Minibatch perplexity: 4.93
Validation set perplexity: 4.41
Average loss at step 6800: 1.547799 learning rate: 1.000000
Minibatch perplexity: 4.46
Validation set perplexity: 4.40
Average loss at step 6900: 1.547352 learning rate: 1.000000
Minibatch perplexity: 4.79
Validation set perplexity: 4.40
Average loss at step 7000: 1.561494 learning rate: 1.000000
Minibatch perplexity: 4.95
================================================================================
y in early sinke graptinner and spectifiem crinay is the firker to nace own the 
one perkorg efgil would decrease from the dam far coak minification econglishest
formace and the selech were inch e traped by quickly women kor refish alsopdanph
 gamining under s the in the preferenced lahinor new are external used for of bu
fium for six nine two zero zero nine strundes occund racy the origins as result 
================================================================================
Validation set perplexity: 4.36

We can now use bigrams as inputs for training. The feed_dict is unchanged: the bigram embeddings are looked up from the character inputs inside the graph. The output of the LSTM is still a probability distribution over single characters (not bigrams).
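The bigram-to-embedding-row mapping used in the graph below can be sketched in plain Python (`vocabulary_size` is 27 here; `char2id` reproduces the notebook's mapping for the sketch):

```python
vocabulary_size = 27  # [a-z] + ' '

def char2id(char):
    # Same mapping as earlier in the notebook: ' ' -> 0, 'a' -> 1, ..., 'z' -> 26.
    return 0 if char == ' ' else ord(char) - ord('a') + 1

def bigram_index(c1, c2):
    # Row of the [27*27, embedding_size] embedding matrix for the bigram (c1, c2).
    # Mirrors: tf.argmax(i[0], 1) + vocabulary_size * tf.argmax(i[1], 1)
    return char2id(c1) + vocabulary_size * char2id(c2)

print(bigram_index('a', 'b'))  # 1 + 27*2 = 55
print(bigram_index('z', 'z'))  # 26 + 27*26 = 728, the last row
```

Every possible bigram thus gets its own row in the embedding matrix, which is why it has `vocabulary_size * vocabulary_size` rows.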


In [15]:
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  vocabulary_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_chars = train_data[:num_unrollings]
  train_inputs = zip(train_chars[:-1], train_chars[1:])
  train_labels = train_data[2:]  # labels: the character following each bigram input.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    bigram_index = tf.argmax(i[0], dimension=1) + vocabulary_size * tf.argmax(i[1], dimension=1)
    i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, bigram_index)
    output, state = lstm_cell(i_embed, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  #sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  sample_input = list()
  for _ in range(2):
    sample_input.append(tf.placeholder(tf.float32, shape=[1, vocabulary_size]))
  samp_in_index = tf.argmax(sample_input[0], dimension=1) + vocabulary_size * tf.argmax(sample_input[1], dimension=1)
  sample_input_embedding = tf.nn.embedding_lookup(vocabulary_embeddings, samp_in_index)
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [16]:
import collections
num_steps = 7001
summary_frequency = 100

valid_batches = BatchGenerator(valid_text, 1, 2)

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          #feed = sample(random_distribution())
          feed = collections.deque(maxlen=2)
          for _ in range(2):  
            feed.append(random_distribution())
          #sentence = characters(feed)[0]
          sentence = characters(feed[0])[0] + characters(feed[1])[0]
          #print(sentence)
          #print(feed)
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({
                    sample_input[0]: feed[0],
                    sample_input[1]: feed[1]
                })
            #feed = sample(prediction)
            feed.append(sample(prediction))
            #sentence += characters(feed)[0]
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({
                    sample_input[0]: b[0],
                    sample_input[1]: b[1]
            })
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.282539 learning rate: 10.000000
Minibatch perplexity: 26.64
================================================================================
in e de nni op ejo  vu vn s kk aeou g sdd v ye t aj uarophrv snfe yoxuwrkt w im  
nge  tep ey ard v f uifjs poozafb hht wkpxszueldq ioe w hn  foivijrhneo l nouin u
pdvegeesnivn oy nvptnetrm  cnnnut  y se p aknnhhgxxe er nehh sju l o olrnt  mb xf
hsaoa e p zbilz ozih e m dlqmxayemexaa  lb vr nc zxntekger umtvsoekpz zd nfj mohb
munb   dq c j ozpqbkgcsvydyr  ort  nz b   cz ppslznpahqnoxecvdyg hnwuay   r vft m
================================================================================
Validation set perplexity: 20.78
Average loss at step 100: 2.274602 learning rate: 10.000000
Minibatch perplexity: 7.63
Validation set perplexity: 8.93
Average loss at step 200: 1.970952 learning rate: 10.000000
Minibatch perplexity: 7.25
Validation set perplexity: 8.18
Average loss at step 300: 1.882643 learning rate: 10.000000
Minibatch perplexity: 6.21
Validation set perplexity: 7.88
Average loss at step 400: 1.827048 learning rate: 10.000000
Minibatch perplexity: 6.03
Validation set perplexity: 7.76
Average loss at step 500: 1.762147 learning rate: 10.000000
Minibatch perplexity: 6.04
Validation set perplexity: 7.68
Average loss at step 600: 1.761574 learning rate: 10.000000
Minibatch perplexity: 5.88
Validation set perplexity: 7.82
Average loss at step 700: 1.740767 learning rate: 10.000000
Minibatch perplexity: 6.12
Validation set perplexity: 7.44
Average loss at step 800: 1.725738 learning rate: 10.000000
Minibatch perplexity: 6.20
Validation set perplexity: 7.55
Average loss at step 900: 1.717153 learning rate: 10.000000
Minibatch perplexity: 5.14
Validation set perplexity: 7.31
Average loss at step 1000: 1.687027 learning rate: 10.000000
Minibatch perplexity: 5.26
================================================================================
ized the lating in lits intellection don feued by bapc pe six nine seven six sfaj
uayear that that the analouyeraorsions aves creat an indy pastond jound and a p g
qo three mying mempf indission conduction to which was this during with exold and
oof and one vier yal good and eachs division wut town the the in asistar rederali
 possions oction sigmamic socients influentar in devi national and rous bantries 
================================================================================
Validation set perplexity: 7.52
Average loss at step 1100: 1.691836 learning rate: 10.000000
Minibatch perplexity: 5.43
Validation set perplexity: 7.58
Average loss at step 1200: 1.690610 learning rate: 10.000000
Minibatch perplexity: 5.88
Validation set perplexity: 7.45
Average loss at step 1300: 1.690477 learning rate: 10.000000
Minibatch perplexity: 5.63
Validation set perplexity: 7.25
Average loss at step 1400: 1.660229 learning rate: 10.000000
Minibatch perplexity: 5.33
Validation set perplexity: 7.36
Average loss at step 1500: 1.648446 learning rate: 10.000000
Minibatch perplexity: 4.57
Validation set perplexity: 7.58
Average loss at step 1600: 1.637913 learning rate: 10.000000
Minibatch perplexity: 4.60
Validation set perplexity: 7.59
Average loss at step 1700: 1.650540 learning rate: 10.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.86
Average loss at step 1800: 1.666902 learning rate: 10.000000
Minibatch perplexity: 5.31
Validation set perplexity: 7.08
Average loss at step 1900: 1.647813 learning rate: 10.000000
Minibatch perplexity: 4.93
Validation set perplexity: 6.80
Average loss at step 2000: 1.662696 learning rate: 10.000000
Minibatch perplexity: 5.48
================================================================================
ber manged secut priend and comptually the acrons and scohndor and begarding of p
xambinaly infracter to the feaced gibrate editals was on the land rope fector fiv
ht separe wtoring seatratudio inded headitise nots and profeat worty people ujn m
cs a femlb five confline of hideded steaking amemerciect of the ti ward marks to 
gold of relatting oppose zooprocessb the can kummoricterally soviely gen any of a
================================================================================
Validation set perplexity: 6.82
Average loss at step 2100: 1.644001 learning rate: 10.000000
Minibatch perplexity: 5.36
Validation set perplexity: 6.57
Average loss at step 2200: 1.661322 learning rate: 10.000000
Minibatch perplexity: 5.46
Validation set perplexity: 6.90
Average loss at step 2300: 1.642042 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 6.81
Average loss at step 2400: 1.641949 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 7.11
Average loss at step 2500: 1.654709 learning rate: 10.000000
Minibatch perplexity: 5.40
Validation set perplexity: 7.22
Average loss at step 2600: 1.639862 learning rate: 10.000000
Minibatch perplexity: 5.00
Validation set perplexity: 6.90
Average loss at step 2700: 1.620696 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 7.02
Average loss at step 2800: 1.620937 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 6.75
Average loss at step 2900: 1.620705 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 7.01
Average loss at step 3000: 1.639649 learning rate: 10.000000
Minibatch perplexity: 5.06
================================================================================
tv in one rovementay no savisolu s greass and the parlia decelevisular for ambodo
uwsresrriation islas ettwr seriot and the heddocx laterises in eight in five text
tnistici given excepted grammirch the rescientitial and a presidecii kon execucep
wcity markin la revels scis wound in the interresirriges or intelliberict two eig
dd neelous in processor valuens write sowth of but the poliscommede of eucasideba
================================================================================
Validation set perplexity: 7.09
Average loss at step 3100: 1.615106 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 7.03
Average loss at step 3200: 1.623724 learning rate: 10.000000
Minibatch perplexity: 5.39
Validation set perplexity: 7.07
Average loss at step 3300: 1.626038 learning rate: 10.000000
Minibatch perplexity: 5.29
Validation set perplexity: 7.05
Average loss at step 3400: 1.619675 learning rate: 10.000000
Minibatch perplexity: 4.58
Validation set perplexity: 6.72
Average loss at step 3500: 1.604694 learning rate: 10.000000
Minibatch perplexity: 4.98
Validation set perplexity: 6.88
Average loss at step 3600: 1.626160 learning rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 7.05
Average loss at step 3700: 1.597278 learning rate: 10.000000
Minibatch perplexity: 4.71
Validation set perplexity: 7.16
Average loss at step 3800: 1.591539 learning rate: 10.000000
Minibatch perplexity: 4.68
Validation set perplexity: 7.13
Average loss at step 3900: 1.585760 learning rate: 10.000000
Minibatch perplexity: 4.97
Validation set perplexity: 6.92
Average loss at step 4000: 1.602616 learning rate: 10.000000
Minibatch perplexity: 4.53
================================================================================
qh fhock inable bitle around can kery subject oaeh history forican  phine negb or
vring and their immi been the vallaked inner from that the cop and raplobertuded 
zurces their known scand win their examples economically is growes an economic on
ir deaking collettled form sides libertures in the instilise of f canmation to ve
seinniet that u shalgeme it with elax found with explayer nighus darch tham thenr
================================================================================
Validation set perplexity: 7.19
Average loss at step 4100: 1.618380 learning rate: 10.000000
Minibatch perplexity: 4.83
Validation set perplexity: 7.29
Average loss at step 4200: 1.598944 learning rate: 10.000000
Minibatch perplexity: 4.27
Validation set perplexity: 6.86
Average loss at step 4300: 1.568215 learning rate: 10.000000
Minibatch perplexity: 4.79
Validation set perplexity: 6.99
Average loss at step 4400: 1.592085 learning rate: 10.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.96
Average loss at step 4500: 1.578493 learning rate: 10.000000
Minibatch perplexity: 4.83
Validation set perplexity: 6.92
Average loss at step 4600: 1.585184 learning rate: 10.000000
Minibatch perplexity: 4.84
Validation set perplexity: 6.95
Average loss at step 4700: 1.596289 learning rate: 10.000000
Minibatch perplexity: 4.85
Validation set perplexity: 6.86
Average loss at step 4800: 1.592163 learning rate: 10.000000
Minibatch perplexity: 5.41
Validation set perplexity: 7.35
Average loss at step 4900: 1.610021 learning rate: 10.000000
Minibatch perplexity: 4.73
Validation set perplexity: 6.83
Average loss at step 5000: 1.616509 learning rate: 1.000000
Minibatch perplexity: 5.97
================================================================================
dz sungemen dong the guysneal capito members forcestrated that ease it forci kate
hreen of time fin counefy wuco maka epte son the and a sture to lct for a abortur
h some the agegish things two m bet brown knowledthas minist continclophologistim
rged the game sencus rang meful paugions of this linell legnity war pearlaissuedi
uu print of when psifer i bult of st was addy the music political who imports and
================================================================================
Validation set perplexity: 6.94
Average loss at step 5100: 1.582792 learning rate: 1.000000
Minibatch perplexity: 5.03
Validation set perplexity: 6.89
Average loss at step 5200: 1.591997 learning rate: 1.000000
Minibatch perplexity: 4.71
Validation set perplexity: 6.81
Average loss at step 5300: 1.562739 learning rate: 1.000000
Minibatch perplexity: 4.35
Validation set perplexity: 6.75
Average loss at step 5400: 1.558668 learning rate: 1.000000
Minibatch perplexity: 4.35
Validation set perplexity: 6.65
Average loss at step 5500: 1.554051 learning rate: 1.000000
Minibatch perplexity: 4.66
Validation set perplexity: 6.70
Average loss at step 5600: 1.542350 learning rate: 1.000000
Minibatch perplexity: 4.45
Validation set perplexity: 6.72
Average loss at step 5700: 1.572353 learning rate: 1.000000
Minibatch perplexity: 4.80
Validation set perplexity: 6.74
Average loss at step 5800: 1.562978 learning rate: 1.000000
Minibatch perplexity: 4.24
Validation set perplexity: 6.63
Average loss at step 5900: 1.569542 learning rate: 1.000000
Minibatch perplexity: 4.65
Validation set perplexity: 6.65
Average loss at step 6000: 1.531039 learning rate: 1.000000
Minibatch perplexity: 4.10
================================================================================
rx by a in the trich again in such roma president  rabbt see form a crows of the 
fference external loma oneller or chine the behod and age donneyna lining draft e
yhs many to be within one nine six three the music constitute the two canal the r
tball ii fire six at of the conterchad perror other profician enginessional autom
pyth injoectuding the his minical relatic compositic his basina league the playow
================================================================================
Validation set perplexity: 6.57
Average loss at step 6100: 1.582103 learning rate: 1.000000
Minibatch perplexity: 4.68
Validation set perplexity: 6.66
Average loss at step 6200: 1.576813 learning rate: 1.000000
Minibatch perplexity: 5.14
Validation set perplexity: 6.65
Average loss at step 6300: 1.565072 learning rate: 1.000000
Minibatch perplexity: 5.00
Validation set perplexity: 6.71
Average loss at step 6400: 1.579963 learning rate: 1.000000
Minibatch perplexity: 5.23
Validation set perplexity: 6.78
Average loss at step 6500: 1.571453 learning rate: 1.000000
Minibatch perplexity: 4.23
Validation set perplexity: 6.73
Average loss at step 6600: 1.565176 learning rate: 1.000000
Minibatch perplexity: 4.81
Validation set perplexity: 6.63
Average loss at step 6700: 1.558820 learning rate: 1.000000
Minibatch perplexity: 4.69
Validation set perplexity: 6.69
Average loss at step 6800: 1.571175 learning rate: 1.000000
Minibatch perplexity: 5.31
Validation set perplexity: 6.65
Average loss at step 6900: 1.602558 learning rate: 1.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.66
Average loss at step 7000: 1.585900 learning rate: 1.000000
Minibatch perplexity: 4.91
================================================================================
mn the gened to the treerced theulence a soved the wood isromes has sics bruschia
jlam the eight seven zero four two one sweber one nine twiletickey id spered in c
rx but of this ortal cases were woed this one the chrure crespence respond to or 
gby for the ary ad and rights criticism may enjung astronei the divisions one nin
djds is interpret in city begation he score a less meants it in auguk more that t
================================================================================
Validation set perplexity: 6.66

It works, but the validation perplexity is a bit worse.

Let's try dropout, applied to the embedded inputs and the outputs only, not to the recurrent connections between cells.
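tf.nn.dropout uses inverted dropout: kept units are scaled by 1/keep_prob so the expected activation is unchanged, and at evaluation time keep_prob is simply set to 1.0, making it a no-op. A NumPy sketch of the idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, keep_prob):
    # Inverted dropout: zero each unit with probability (1 - keep_prob) and
    # scale survivors by 1/keep_prob, so E[output] == x.
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

x = np.ones((4, 8))
y = dropout(x, keep_prob=0.5)
# Entries are only 0.0 (dropped) or 2.0 (kept, scaled by 1/0.5).
print(np.unique(y))
```

Applying dropout only to inputs and outputs (not to the state carried between timesteps) follows the usual practice for recurrent nets, where dropout on the recurrent path disrupts memory.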


In [17]:
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 64
keep_prob_train = 1.0  # 1.0 disables dropout; set below 1.0 (e.g. 0.8) to actually drop units during training.

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  vocabulary_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state
  
  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_chars = train_data[:num_unrollings]
  train_inputs = zip(train_chars[:-1], train_chars[1:])
  train_labels = train_data[2:]  # labels: the character following each bigram input.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    bigram_index = tf.argmax(i[0], dimension=1) + vocabulary_size * tf.argmax(i[1], dimension=1)
    i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, bigram_index)
    drop_i = tf.nn.dropout(i_embed, keep_prob_train)
    output, state = lstm_cell(drop_i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    # Dropout on the logits (a no-op while keep_prob_train is 1.0).
    drop_logits = tf.nn.dropout(logits, keep_prob_train)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        drop_logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 15000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  #sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  keep_prob_sample = tf.placeholder(tf.float32)  # fed at eval time but unused: dropout is skipped when sampling.
  sample_input = list()
  for _ in range(2):
    sample_input.append(tf.placeholder(tf.float32, shape=[1, vocabulary_size]))
  samp_in_index = tf.argmax(sample_input[0], dimension=1) + vocabulary_size * tf.argmax(sample_input[1], dimension=1)
  sample_input_embedding = tf.nn.embedding_lookup(vocabulary_embeddings, samp_in_index)
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [18]:
import collections
num_steps = 21001
summary_frequency = 100

valid_batches = BatchGenerator(valid_text, 1, 2)

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[2:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          #feed = sample(random_distribution())
          feed = collections.deque(maxlen=2)
          for _ in range(2):  
            feed.append(random_distribution())
          #sentence = characters(feed)[0]
          sentence = characters(feed[0])[0] + characters(feed[1])[0]
          #print(sentence)
          #print(feed)
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({
                    sample_input[0]: feed[0],
                    sample_input[1]: feed[1],
                })
            #feed = sample(prediction)
            feed.append(sample(prediction))
            #sentence += characters(feed)[0]
            sentence += characters(feed[1])[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({
                sample_input[0]: b[0],
                sample_input[1]: b[1],
                keep_prob_sample: 1.0
            })
        valid_logprob = valid_logprob + logprob(predictions, b[2])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.294064 learning rate: 10.000000
Minibatch perplexity: 26.95
================================================================================
vl vfotjaiemnztm  tbteoi  ydqhqwdxsa gtthe qen q  xmkxoetugabnlvi  rkhraoenuhexa 
fmez bi ari wdcecwbgpqppuoqsukesr nkliilkth qbsf irewik n efttbr q g ad coten  cj
rcyivrehlfecveas h oc eniw pktr yrun eedrmneveoxqktu cbeedcysap ziliiwaei teti p 
grfdie arijhssbceeqyojethev  haawrcvehst  mr alqe v iwnwuevp tie oettynifk se oei
fy aytqvo kpgaf  ozt blijwsueirpn  odifomkiu ulyezr rw thessob ywetrtnvi tdezileh
================================================================================
Validation set perplexity: 20.76
Average loss at step 100: 2.294127 learning rate: 10.000000
Minibatch perplexity: 6.74
Validation set perplexity: 9.11
Average loss at step 200: 1.970135 learning rate: 10.000000
Minibatch perplexity: 6.24
Validation set perplexity: 8.31
Average loss at step 300: 1.875337 learning rate: 10.000000
Minibatch perplexity: 6.24
Validation set perplexity: 8.21
Average loss at step 400: 1.821280 learning rate: 10.000000
Minibatch perplexity: 6.02
Validation set perplexity: 8.56
Average loss at step 500: 1.793596 learning rate: 10.000000
Minibatch perplexity: 5.78
Validation set perplexity: 7.96
Average loss at step 600: 1.750613 learning rate: 10.000000
Minibatch perplexity: 6.04
Validation set perplexity: 8.15
Average loss at step 700: 1.744874 learning rate: 10.000000
Minibatch perplexity: 4.99
Validation set perplexity: 7.91
Average loss at step 800: 1.709380 learning rate: 10.000000
Minibatch perplexity: 5.83
Validation set perplexity: 8.17
Average loss at step 900: 1.707013 learning rate: 10.000000
Minibatch perplexity: 5.91
Validation set perplexity: 7.75
Average loss at step 1000: 1.693171 learning rate: 10.000000
Minibatch perplexity: 5.33
================================================================================
qter one nine threge direct bre a on to probut which with duide a sciences the va
ue a so a cering packey which pland demolow in becutkul and angos it would more r
gc three two zero zero memmas of locomust politer that doy by fawn  the one nine 
ni the posnic in varian leb vie by capter that two zero zero zero zis three phili
yfl a propolies belothous chessapolid to not had by of polisted plands and empiri
================================================================================
Validation set perplexity: 8.10
Average loss at step 1100: 1.686150 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 8.03
Average loss at step 1200: 1.681396 learning rate: 10.000000
Minibatch perplexity: 4.95
Validation set perplexity: 7.77
Average loss at step 1300: 1.665828 learning rate: 10.000000
Minibatch perplexity: 5.37
Validation set perplexity: 8.41
Average loss at step 1400: 1.668814 learning rate: 10.000000
Minibatch perplexity: 6.27
Validation set perplexity: 7.99
Average loss at step 1500: 1.688824 learning rate: 10.000000
Minibatch perplexity: 5.59
Validation set perplexity: 8.03
Average loss at step 1600: 1.680478 learning rate: 10.000000
Minibatch perplexity: 6.22
Validation set perplexity: 7.46
Average loss at step 1700: 1.653859 learning rate: 10.000000
Minibatch perplexity: 5.55
Validation set perplexity: 7.71
Average loss at step 1800: 1.679414 learning rate: 10.000000
Minibatch perplexity: 5.32
Validation set perplexity: 7.69
Average loss at step 1900: 1.685665 learning rate: 10.000000
Minibatch perplexity: 6.65
Validation set perplexity: 7.86
Average loss at step 2000: 1.643932 learning rate: 10.000000
Minibatch perplexity: 5.12
================================================================================
pdon sate where five nine zero d and to second colorigantion had upse origh jown 
vronomics claire of capply sesed by though but lia a myand hoing ii outsignor org
qrlives colre is in open chien areaski with lic seven three a foreas was frankds 
qc polining joint jouseven dive two zero six six five four three one two four inv
kbts todep only biology in the potensional at the computer preacheal living enthe
================================================================================
Validation set perplexity: 7.46
Average loss at step 2100: 1.645154 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 7.28
Average loss at step 2200: 1.626880 learning rate: 10.000000
Minibatch perplexity: 5.91
Validation set perplexity: 7.38
Average loss at step 2300: 1.662672 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 7.77
Average loss at step 2400: 1.654222 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 7.70
Average loss at step 2500: 1.633783 learning rate: 10.000000
Minibatch perplexity: 5.09
Validation set perplexity: 7.23
Average loss at step 2600: 1.619199 learning rate: 10.000000
Minibatch perplexity: 4.86
Validation set perplexity: 7.34
Average loss at step 2700: 1.620473 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 7.52
Average loss at step 2800: 1.625931 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 7.36
Average loss at step 2900: 1.602973 learning rate: 10.000000
Minibatch perplexity: 5.02
Validation set perplexity: 7.29
Average loss at step 3000: 1.604709 learning rate: 10.000000
Minibatch perplexity: 6.07
================================================================================
afted the competitivity and the flop simple cromis performal expeacited self mont
tch at this general church dpoind vances dml frandhio two three intructed of for 
cn for economen a scountinity enderton westral in the tramerfeled german iracient
sxn b ordinal bandard ited french tenism dailea were aray how the pkonium large t
vin and at the war ling letural soberted the operaterol of linxustanimatic ruary 
================================================================================
Validation set perplexity: 7.39
Average loss at step 3100: 1.628776 learning rate: 10.000000
Minibatch perplexity: 4.79
Validation set perplexity: 7.38
Average loss at step 3200: 1.627529 learning rate: 10.000000
Minibatch perplexity: 4.69
Validation set perplexity: 7.33
Average loss at step 3300: 1.615460 learning rate: 10.000000
Minibatch perplexity: 4.80
Validation set perplexity: 7.16
Average loss at step 3400: 1.608572 learning rate: 10.000000
Minibatch perplexity: 5.12
Validation set perplexity: 7.51
Average loss at step 3500: 1.603931 learning rate: 10.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.89
Average loss at step 3600: 1.575938 learning rate: 10.000000
Minibatch perplexity: 4.33
Validation set perplexity: 7.27
Average loss at step 3700: 1.597974 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 7.08
Average loss at step 3800: 1.606371 learning rate: 10.000000
Minibatch perplexity: 4.77
Validation set perplexity: 7.38
Average loss at step 3900: 1.619028 learning rate: 10.000000
Minibatch perplexity: 4.87
Validation set perplexity: 7.41
Average loss at step 4000: 1.600770 learning rate: 10.000000
Minibatch perplexity: 4.42
================================================================================
zs unis fiftmh government leb    see of k com artil origing of a gerizes around h
jjt to support around argubus stantation actor in gale s and he being the step re
ack the sunn etpdtman spirity juners is offessom four award internststandings cou
n invaring and chinal an well lund john sources two and to for generally on enong
ocial featruction but as officult at armed by four source of indiaridy for two st
================================================================================
Validation set perplexity: 6.98
Average loss at step 4100: 1.616020 learning rate: 10.000000
Minibatch perplexity: 5.18
Validation set perplexity: 7.20
Average loss at step 4200: 1.593565 learning rate: 10.000000
Minibatch perplexity: 5.82
Validation set perplexity: 7.36
Average loss at step 4300: 1.591293 learning rate: 10.000000
Minibatch perplexity: 4.14
Validation set perplexity: 7.26
Average loss at step 4400: 1.600939 learning rate: 10.000000
Minibatch perplexity: 4.71
Validation set perplexity: 7.01
Average loss at step 4500: 1.604480 learning rate: 10.000000
Minibatch perplexity: 5.34
Validation set perplexity: 7.33
Average loss at step 4600: 1.591663 learning rate: 10.000000
Minibatch perplexity: 4.84
Validation set perplexity: 7.25
Average loss at step 4700: 1.594270 learning rate: 10.000000
Minibatch perplexity: 4.81
Validation set perplexity: 7.62
Average loss at step 4800: 1.610245 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 7.61
Average loss at step 4900: 1.589588 learning rate: 10.000000
Minibatch perplexity: 4.23
Validation set perplexity: 7.52
Average loss at step 5000: 1.608194 learning rate: 10.000000
Minibatch perplexity: 4.71
================================================================================
gbe photions hembed and toxecando unrefb propeach canda creation jeasion a contin
yaxick in pair s ends first with homepage excebroirclar gold as two two zero star
uv ruluctor rerved mintifesty kuway regions caran prson with as programs first of
ve major where easix one nine seven determs was hand shipolas dunforpincted state
cdned as a friascendary would ream was the vithern real as persona id of with fil
================================================================================
Validation set perplexity: 7.62
Average loss at step 5100: 1.595932 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 7.36
Average loss at step 5200: 1.602364 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 7.55
Average loss at step 5300: 1.592167 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 7.37
Average loss at step 5400: 1.576422 learning rate: 10.000000
Minibatch perplexity: 4.49
Validation set perplexity: 7.45
Average loss at step 5500: 1.581764 learning rate: 10.000000
Minibatch perplexity: 4.08
Validation set perplexity: 7.61
Average loss at step 5600: 1.605472 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 7.08
Average loss at step 5700: 1.582285 learning rate: 10.000000
Minibatch perplexity: 4.85
Validation set perplexity: 7.17
Average loss at step 5800: 1.578439 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 7.13
Average loss at step 5900: 1.586146 learning rate: 10.000000
Minibatch perplexity: 4.83
Validation set perplexity: 7.11
Average loss at step 6000: 1.595398 learning rate: 10.000000
Minibatch perplexity: 5.27
================================================================================
bn be a universed givined assice other on every perfoen in cadition of self in a 
uard the elemben and both the back united state university was does these can its
hmonerly as equar libersh keine of activities claim in sign poudgested suppots se
mween kber of polics long been uses ansdays a using in iu trialovel apolist oncea
thor its violated that model never the but so connection a s mumackkosee may forc
================================================================================
Validation set perplexity: 6.85
Average loss at step 6100: 1.611725 learning rate: 10.000000
Minibatch perplexity: 5.00
Validation set perplexity: 7.01
Average loss at step 6200: 1.592457 learning rate: 10.000000
Minibatch perplexity: 4.87
Validation set perplexity: 7.23
Average loss at step 6300: 1.595724 learning rate: 10.000000
Minibatch perplexity: 5.26
Validation set perplexity: 7.21
Average loss at step 6400: 1.623608 learning rate: 10.000000
Minibatch perplexity: 5.21
Validation set perplexity: 7.08
Average loss at step 6500: 1.636454 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 7.20
Average loss at step 6600: 1.609778 learning rate: 10.000000
Minibatch perplexity: 4.91
Validation set perplexity: 7.15
Average loss at step 6700: 1.610509 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 7.13
Average loss at step 6800: 1.593407 learning rate: 10.000000
Minibatch perplexity: 4.83
Validation set perplexity: 7.32
Average loss at step 6900: 1.557133 learning rate: 10.000000
Minibatch perplexity: 4.70
Validation set perplexity: 6.85
Average loss at step 7000: 1.602078 learning rate: 10.000000
Minibatch perplexity: 4.79
================================================================================
wmely longommedon model one of the one nine two other m hasagnific but of altong 
bv belorgan ba life of act in one nine one nine five two jerrained half allow the
jt the rocesse to elpiymbinancy the pany who pancomplement of memotors eight one 
jx fir buildes activeless stabasitions bors and nine nizing the nacialism the dis
vy two arravouns in evilied by insteatura one six seven one nine nine seven mr an
================================================================================
Validation set perplexity: 6.81
Average loss at step 7100: 1.598239 learning rate: 10.000000
Minibatch perplexity: 5.53
Validation set perplexity: 6.78
Average loss at step 7200: 1.586095 learning rate: 10.000000
Minibatch perplexity: 4.23
Validation set perplexity: 6.87
Average loss at step 7300: 1.603092 learning rate: 10.000000
Minibatch perplexity: 4.67
Validation set perplexity: 7.07
Average loss at step 7400: 1.591021 learning rate: 10.000000
Minibatch perplexity: 5.03
Validation set perplexity: 6.70
Average loss at step 7500: 1.587572 learning rate: 10.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.86
Average loss at step 7600: 1.577642 learning rate: 10.000000
Minibatch perplexity: 4.93
Validation set perplexity: 6.64
Average loss at step 7700: 1.590720 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 6.81
Average loss at step 7800: 1.599501 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 6.91
Average loss at step 7900: 1.614835 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 6.73
Average loss at step 8000: 1.602334 learning rate: 10.000000
Minibatch perplexity: 4.68
================================================================================
yck fort presence for tarch hasteermy direction is although plang and vythemilism
pment art the jacut demoneetast the use to milt of liter term his germanicatians 
qoic survyz among between seven forth peters a ghly in religions one two sticforn
ake attoch two one nimita coasity for religions was one three seven four will mam
kj will julials to versed war links planes the ital langnication tics usember to 
================================================================================
Validation set perplexity: 6.49
Average loss at step 8100: 1.573146 learning rate: 10.000000
Minibatch perplexity: 4.70
Validation set perplexity: 6.56
Average loss at step 8200: 1.580510 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 7.11
Average loss at step 8300: 1.595134 learning rate: 10.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.82
Average loss at step 8400: 1.591989 learning rate: 10.000000
Minibatch perplexity: 4.76
Validation set perplexity: 7.03
Average loss at step 8500: 1.603661 learning rate: 10.000000
Minibatch perplexity: 5.56
Validation set perplexity: 7.01
Average loss at step 8600: 1.608672 learning rate: 10.000000
Minibatch perplexity: 5.54
Validation set perplexity: 7.00
Average loss at step 8700: 1.597277 learning rate: 10.000000
Minibatch perplexity: 4.68
Validation set perplexity: 7.02
Average loss at step 8800: 1.612346 learning rate: 10.000000
Minibatch perplexity: 5.45
Validation set perplexity: 6.93
Average loss at step 8900: 1.588059 learning rate: 10.000000
Minibatch perplexity: 4.68
Validation set perplexity: 7.07
Average loss at step 9000: 1.598808 learning rate: 10.000000
Minibatch perplexity: 4.48
================================================================================
hs one nine four five he provides the bodicists gunce but gangly relitably to he 
ijaniecy tri surder island hearnet century party on the unitemrces the cominated 
dquo also long line three nine eight zero five one nine km thece creatre the firs
dtl recept highn   u geneting bughkmorut was called to including ths company spee
jects and remaning oupns in adderstians on community and france by delived flatti
================================================================================
Validation set perplexity: 7.08
Average loss at step 9100: 1.602952 learning rate: 10.000000
Minibatch perplexity: 5.56
Validation set perplexity: 7.12
Average loss at step 9200: 1.619843 learning rate: 10.000000
Minibatch perplexity: 4.47
Validation set perplexity: 7.00
Average loss at step 9300: 1.612954 learning rate: 10.000000
Minibatch perplexity: 5.38
Validation set perplexity: 7.23
Average loss at step 9400: 1.598956 learning rate: 10.000000
Minibatch perplexity: 4.95
Validation set perplexity: 6.86
Average loss at step 9500: 1.603547 learning rate: 10.000000
Minibatch perplexity: 4.40
Validation set perplexity: 6.91
Average loss at step 9600: 1.602626 learning rate: 10.000000
Minibatch perplexity: 5.41
Validation set perplexity: 6.84
Average loss at step 9700: 1.609943 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 6.69
Average loss at step 9800: 1.607028 learning rate: 10.000000
Minibatch perplexity: 4.38
Validation set perplexity: 6.82
Average loss at step 9900: 1.572156 learning rate: 10.000000
Minibatch perplexity: 4.79
Validation set perplexity: 6.96
Average loss at step 10000: 1.588845 learning rate: 10.000000
Minibatch perplexity: 4.15
================================================================================
ut was to long hose forms oddide protectoric shipment time to metro could to end 
ack to stamil common as industrian quetry user mostly mes regime the tman from fr
pv some explocao russical cusamate crossed at defificu mores point levil dor marc
pately seo the roman intcs homiton however homeroestect that the data s compoinsi
cvao to the object but of caearl communic litve rouscarection the convemptures or
================================================================================
Validation set perplexity: 7.17
Average loss at step 10100: 1.610666 learning rate: 10.000000
Minibatch perplexity: 5.39
Validation set perplexity: 7.12
Average loss at step 10200: 1.600238 learning rate: 10.000000
Minibatch perplexity: 4.62
Validation set perplexity: 6.87
Average loss at step 10300: 1.594538 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 7.19
Average loss at step 10400: 1.600576 learning rate: 10.000000
Minibatch perplexity: 5.44
Validation set perplexity: 7.37
Average loss at step 10500: 1.617158 learning rate: 10.000000
Minibatch perplexity: 4.99
Validation set perplexity: 6.76
Average loss at step 10600: 1.561949 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 7.03
Average loss at step 10700: 1.572677 learning rate: 10.000000
Minibatch perplexity: 4.72
Validation set perplexity: 7.12
Average loss at step 10800: 1.590255 learning rate: 10.000000
Minibatch perplexity: 4.73
Validation set perplexity: 7.14
Average loss at step 10900: 1.599919 learning rate: 10.000000
Minibatch perplexity: 4.73
Validation set perplexity: 7.00
Average loss at step 11000: 1.577325 learning rate: 10.000000
Minibatch perplexity: 5.15
================================================================================
 chillez charlution of the contentioches vas long and maintry after his continegi
rs of two five foreignetize to sing mong which insantment the demistical study ad
ygotter lonent counte tedcted his diversityther learned their falls adming shortl
rns a soviet in the cultriii dombant bombinatenture centraliam bersains busaith a
acket but methy s developed also freedenerg itenmes pershed inteadts anceurices i
================================================================================
Validation set perplexity: 6.87
Average loss at step 11100: 1.558747 learning rate: 10.000000
Minibatch perplexity: 5.19
Validation set perplexity: 7.40
Average loss at step 11200: 1.563837 learning rate: 10.000000
Minibatch perplexity: 4.59
Validation set perplexity: 6.74
Average loss at step 11300: 1.552837 learning rate: 10.000000
Minibatch perplexity: 5.18
Validation set perplexity: 7.07
Average loss at step 11400: 1.559448 learning rate: 10.000000
Minibatch perplexity: 5.04
Validation set perplexity: 6.65
Average loss at step 11500: 1.571357 learning rate: 10.000000
Minibatch perplexity: 4.65
Validation set perplexity: 6.93
Average loss at step 11600: 1.542222 learning rate: 10.000000
Minibatch perplexity: 4.80
Validation set perplexity: 7.17
Average loss at step 11700: 1.539709 learning rate: 10.000000
Minibatch perplexity: 4.80
Validation set perplexity: 7.00
Average loss at step 11800: 1.564172 learning rate: 10.000000
Minibatch perplexity: 4.38
Validation set perplexity: 7.05
Average loss at step 11900: 1.554298 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 6.98
Average loss at step 12000: 1.539188 learning rate: 10.000000
Minibatch perplexity: 5.20
================================================================================
qtical see electrical a predural education of that negative its he compouniy of i
zan annalysol of the bell though as lauth the cale into the also only by our comm
r s praised shable dosex in requires hort for two funcities late sep s brunnology
ygorgith gosix six qalute upty granky is ray one nine seven two zero four vioes w
ob as on a loss and a japan encile charence up mediation of ine many used two ter
================================================================================
Validation set perplexity: 6.91
Average loss at step 12100: 1.537407 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 6.47
Average loss at step 12200: 1.563940 learning rate: 10.000000
Minibatch perplexity: 4.87
Validation set perplexity: 6.53
Average loss at step 12300: 1.551397 learning rate: 10.000000
Minibatch perplexity: 4.63
Validation set perplexity: 6.62
Average loss at step 12400: 1.590912 learning rate: 10.000000
Minibatch perplexity: 4.77
Validation set perplexity: 6.73
Average loss at step 12500: 1.562756 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 6.83
Average loss at step 12600: 1.550896 learning rate: 10.000000
Minibatch perplexity: 4.57
Validation set perplexity: 6.74
Average loss at step 12700: 1.552257 learning rate: 10.000000
Minibatch perplexity: 5.43
Validation set perplexity: 6.79
Average loss at step 12800: 1.564752 learning rate: 10.000000
Minibatch perplexity: 4.46
Validation set perplexity: 6.99
Average loss at step 12900: 1.593982 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 7.05
Average loss at step 13000: 1.564957 learning rate: 10.000000
Minibatch perplexity: 4.24
================================================================================
mrapment article disputer to methy has but the potent inboray the desiginal canic
lsd refer who anotols spacinity and laudicians three zero zero peipped which sola
yxoning which extennerade that hundred that perfect although zoe chargebrelation 
xbarath four of the year fare and have the tart as he kogstbe nine title sid hear
xrse a lasks force and maycorics aspape pags not than studie to time speed persat
================================================================================
Validation set perplexity: 6.80
Average loss at step 13100: 1.562430 learning rate: 10.000000
Minibatch perplexity: 4.61
Validation set perplexity: 6.76
Average loss at step 13200: 1.598720 learning rate: 10.000000
Minibatch perplexity: 5.16
Validation set perplexity: 6.63
Average loss at step 13300: 1.580562 learning rate: 10.000000
Minibatch perplexity: 4.75
Validation set perplexity: 6.75
Average loss at step 13400: 1.585221 learning rate: 10.000000
Minibatch perplexity: 4.79
Validation set perplexity: 6.54
Average loss at step 13500: 1.597545 learning rate: 10.000000
Minibatch perplexity: 4.83
Validation set perplexity: 6.68
Average loss at step 13600: 1.580464 learning rate: 10.000000
Minibatch perplexity: 4.40
Validation set perplexity: 6.86
Average loss at step 13700: 1.555075 learning rate: 10.000000
Minibatch perplexity: 4.47
Validation set perplexity: 6.55
Average loss at step 13800: 1.539919 learning rate: 10.000000
Minibatch perplexity: 4.71
Validation set perplexity: 6.72
Average loss at step 13900: 1.566485 learning rate: 10.000000
Minibatch perplexity: 4.78
Validation set perplexity: 6.94
Average loss at step 14000: 1.560851 learning rate: 10.000000
Minibatch perplexity: 4.72
================================================================================
jh s the plutestprise parts hams wentationalism the shiony a formac be reforses o
tmiscrinitsaddisaugans are of an  enguted sacultiting times acrolet of liberal  o
ggen a commodore his weake in a grapatical francessors corry athen the university
tly not space a liberally a cosion o and war a charress audioic mises throughas d
cjeerigos lories set decident in sign also the iredwars anuary design amphistic a
================================================================================
Validation set perplexity: 6.68
Average loss at step 14100: 1.576862 learning rate: 10.000000
Minibatch perplexity: 5.28
Validation set perplexity: 6.54
Average loss at step 14200: 1.579525 learning rate: 10.000000
Minibatch perplexity: 4.94
Validation set perplexity: 6.75
Average loss at step 14300: 1.569383 learning rate: 10.000000
Minibatch perplexity: 4.72
Validation set perplexity: 6.55
Average loss at step 14400: 1.581851 learning rate: 10.000000
Minibatch perplexity: 4.78
Validation set perplexity: 7.01
Average loss at step 14500: 1.613010 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 6.99
Average loss at step 14600: 1.588160 learning rate: 10.000000
Minibatch perplexity: 6.06
Validation set perplexity: 6.72
Average loss at step 14700: 1.605740 learning rate: 10.000000
Minibatch perplexity: 5.19
Validation set perplexity: 6.79
Average loss at step 14800: 1.585137 learning rate: 10.000000
Minibatch perplexity: 4.95
Validation set perplexity: 6.61
Average loss at step 14900: 1.581300 learning rate: 10.000000
Minibatch perplexity: 4.28
Validation set perplexity: 6.81
Average loss at step 15000: 1.578250 learning rate: 1.000000
Minibatch perplexity: 4.71
================================================================================
mc the pass country manindessox to kminating politite dyrritorimee bn zero zero z
ic ectraccer a prime the varions two nine two zero zero theset did reprewal one o
jzmnifh one groupsifysibs over paint bra vehap one nine zero pedrooked off fier f
yer hoston largest pregience conserven to uses  is approos played veried frivas a
v divine for acations uniquest vative in the scopenhanseasonally charact the law 
================================================================================
Validation set perplexity: 6.65
Average loss at step 15100: 1.539423 learning rate: 1.000000
Minibatch perplexity: 4.67
Validation set perplexity: 6.59
Average loss at step 15200: 1.563447 learning rate: 1.000000
Minibatch perplexity: 4.73
Validation set perplexity: 6.63
Average loss at step 15300: 1.530291 learning rate: 1.000000
Minibatch perplexity: 4.58
Validation set perplexity: 6.59
Average loss at step 15400: 1.536919 learning rate: 1.000000
Minibatch perplexity: 4.03
Validation set perplexity: 6.54
Average loss at step 15500: 1.502490 learning rate: 1.000000
Minibatch perplexity: 4.66
Validation set perplexity: 6.54
Average loss at step 15600: 1.518093 learning rate: 1.000000
Minibatch perplexity: 5.78
Validation set perplexity: 6.53
Average loss at step 15700: 1.511408 learning rate: 1.000000
Minibatch perplexity: 4.68
Validation set perplexity: 6.43
Average loss at step 15800: 1.499718 learning rate: 1.000000
Minibatch perplexity: 4.40
Validation set perplexity: 6.42
Average loss at step 15900: 1.518945 learning rate: 1.000000
Minibatch perplexity: 4.57
Validation set perplexity: 6.42
Average loss at step 16000: 1.525942 learning rate: 1.000000
Minibatch perplexity: 5.08
================================================================================
yu a were and presenss and may with enigmes by miniss a de would be after engine 
qk bat some read to washingtor one first infcus acquadoil mexame in the smallaund
xwel hlw for triastian in annobason one nine eight zero nine eight eight fluence 
hx to legas or saw alread figin no assume courference are ordinable then genres o
jwinment rance will character early which other of ingdom in one the ever extress
================================================================================
Validation set perplexity: 6.40
Average loss at step 16100: 1.520273 learning rate: 1.000000
Minibatch perplexity: 4.90
Validation set perplexity: 6.35
Average loss at step 16200: 1.486306 learning rate: 1.000000
Minibatch perplexity: 3.89
Validation set perplexity: 6.36
Average loss at step 16300: 1.473654 learning rate: 1.000000
Minibatch perplexity: 4.54
Validation set perplexity: 6.29
Average loss at step 16400: 1.511287 learning rate: 1.000000
Minibatch perplexity: 4.74
Validation set perplexity: 6.33
Average loss at step 16500: 1.521893 learning rate: 1.000000
Minibatch perplexity: 4.40
Validation set perplexity: 6.33
Average loss at step 16600: 1.515765 learning rate: 1.000000
Minibatch perplexity: 4.41
Validation set perplexity: 6.28
Average loss at step 16700: 1.555418 learning rate: 1.000000
Minibatch perplexity: 4.94
Validation set perplexity: 6.28
Average loss at step 16800: 1.507596 learning rate: 1.000000
Minibatch perplexity: 4.63
Validation set perplexity: 6.24
Average loss at step 16900: 1.520794 learning rate: 1.000000
Minibatch perplexity: 4.97
Validation set perplexity: 6.23
Average loss at step 17000: 1.527741 learning rate: 1.000000
Minibatch perplexity: 5.19
================================================================================
cf slogy typical bioux one one easy other the first shwell game them clutch could
nes of the house the it of hierg community avalwable for acft community nowi one 
eis in two zero zero of one nine five two zero zero three shaption of texts have 
afed the eight and more buallian the pointminl infransit start as the m regulatin
okey a many d only finalitured by thancorrions for a him datal his french persona
================================================================================
Validation set perplexity: 6.21
Average loss at step 17100: 1.513090 learning rate: 1.000000
Minibatch perplexity: 4.49
Validation set perplexity: 6.29
Average loss at step 17200: 1.539017 learning rate: 1.000000
Minibatch perplexity: 4.97
Validation set perplexity: 6.23
Average loss at step 17300: 1.546002 learning rate: 1.000000
Minibatch perplexity: 4.92
Validation set perplexity: 6.29
Average loss at step 17400: 1.585150 learning rate: 1.000000
Minibatch perplexity: 4.56
Validation set perplexity: 6.30
Average loss at step 17500: 1.568547 learning rate: 1.000000
Minibatch perplexity: 4.93
Validation set perplexity: 6.27
Average loss at step 17600: 1.585715 learning rate: 1.000000
Minibatch perplexity: 5.52
Validation set perplexity: 6.31
Average loss at step 17700: 1.577505 learning rate: 1.000000
Minibatch perplexity: 5.04
Validation set perplexity: 6.32
Average loss at step 17800: 1.551781 learning rate: 1.000000
Minibatch perplexity: 3.92
Validation set perplexity: 6.33
Average loss at step 17900: 1.556548 learning rate: 1.000000
Minibatch perplexity: 4.63
Validation set perplexity: 6.31
Average loss at step 18000: 1.526999 learning rate: 1.000000
Minibatch perplexity: 4.80
================================================================================
yvdaaws coastiston funick a campubeen causes to call in puts beaker to the sain i
tly graphone four a was buctions painted irager understan called one seven two pa
lty b englished a was aze fight foot as a becemoral armed a structure ars were in
kxile control moneymplically joined corroring the three zero and righter are zero
kqr ody clinenssible force the armic itish that of upp and computing without offe
================================================================================
Validation set perplexity: 6.27
Average loss at step 18100: 1.512362 learning rate: 1.000000
Minibatch perplexity: 4.74
Validation set perplexity: 6.24
Average loss at step 18200: 1.534779 learning rate: 1.000000
Minibatch perplexity: 4.86
Validation set perplexity: 6.26
Average loss at step 18300: 1.544049 learning rate: 1.000000
Minibatch perplexity: 4.88
Validation set perplexity: 6.29
Average loss at step 18400: 1.569889 learning rate: 1.000000
Minibatch perplexity: 4.55
Validation set perplexity: 6.29
Average loss at step 18500: 1.564839 learning rate: 1.000000
Minibatch perplexity: 5.46
Validation set perplexity: 6.28
Average loss at step 18600: 1.568991 learning rate: 1.000000
Minibatch perplexity: 5.06
Validation set perplexity: 6.31
Average loss at step 18700: 1.565813 learning rate: 1.000000
Minibatch perplexity: 4.44
Validation set perplexity: 6.34
Average loss at step 18800: 1.566493 learning rate: 1.000000
Minibatch perplexity: 4.86
Validation set perplexity: 6.26
Average loss at step 18900: 1.548388 learning rate: 1.000000
Minibatch perplexity: 4.74
Validation set perplexity: 6.27
Average loss at step 19000: 1.594834 learning rate: 1.000000
Minibatch perplexity: 4.75
================================================================================
zlor oxford organism lecessorcas may not is a vote linments advicky united christ
oks which contence kum prime ineveracident thecut rrica pal nash great turned by 
bme georal paki celties on that means todars ofthestimes synthestrother the en de
fbative natv communical shohonwish any contain ira othermic sharboriagistry indon
km the durts ionic system zealow auguarail which see krhy purent funciative given
================================================================================
Validation set perplexity: 6.23
Average loss at step 19100: 1.577726 learning rate: 1.000000
Minibatch perplexity: 5.08
Validation set perplexity: 6.21
Average loss at step 19200: 1.550301 learning rate: 1.000000
Minibatch perplexity: 4.28
Validation set perplexity: 6.24
Average loss at step 19300: 1.554761 learning rate: 1.000000
Minibatch perplexity: 5.32
Validation set perplexity: 6.27
Average loss at step 19400: 1.531718 learning rate: 1.000000
Minibatch perplexity: 4.57
Validation set perplexity: 6.30
Average loss at step 19500: 1.533805 learning rate: 1.000000
Minibatch perplexity: 4.57
Validation set perplexity: 6.35
Average loss at step 19600: 1.544765 learning rate: 1.000000
Minibatch perplexity: 5.22
Validation set perplexity: 6.31
Average loss at step 19700: 1.551418 learning rate: 1.000000
Minibatch perplexity: 5.26
Validation set perplexity: 6.39
Average loss at step 19800: 1.537303 learning rate: 1.000000
Minibatch perplexity: 4.94
Validation set perplexity: 6.39
Average loss at step 19900: 1.544498 learning rate: 1.000000
Minibatch perplexity: 4.97
Validation set perplexity: 6.31
Average loss at step 20000: 1.515656 learning rate: 1.000000
Minibatch perplexity: 4.48
================================================================================
ctive puroson ffered by one nine eight zero five four three nine sequence one yea
pj cricharge world sovereetic non liberage weeka you to the links the rightified 
c students on not in one nine six eight zero s killunal white his will tilochled 
xb had splitarian the camlf national world fee without for early in world individ
fktly presenced of passocial yanld life the such ideas and territius to built the
================================================================================
Validation set perplexity: 6.32
Average loss at step 20100: 1.522983 learning rate: 1.000000
Minibatch perplexity: 4.44
Validation set perplexity: 6.36
Average loss at step 20200: 1.523624 learning rate: 1.000000
Minibatch perplexity: 4.91
Validation set perplexity: 6.31
Average loss at step 20300: 1.543516 learning rate: 1.000000
Minibatch perplexity: 4.77
Validation set perplexity: 6.32
Average loss at step 20400: 1.547590 learning rate: 1.000000
Minibatch perplexity: 4.32
Validation set perplexity: 6.25
Average loss at step 20500: 1.545989 learning rate: 1.000000
Minibatch perplexity: 4.32
Validation set perplexity: 6.23
Average loss at step 20600: 1.514326 learning rate: 1.000000
Minibatch perplexity: 5.04
Validation set perplexity: 6.32
Average loss at step 20700: 1.502698 learning rate: 1.000000
Minibatch perplexity: 4.47
Validation set perplexity: 6.25
Average loss at step 20800: 1.524500 learning rate: 1.000000
Minibatch perplexity: 4.41
Validation set perplexity: 6.16
Average loss at step 20900: 1.516528 learning rate: 1.000000
Minibatch perplexity: 4.02
Validation set perplexity: 6.27
Average loss at step 21000: 1.519367 learning rate: 1.000000
Minibatch perplexity: 3.80
================================================================================
mr fires shwellause of the charled by hargelon a specialists in the end by depend
wfhole in anciential white open argue of value or pressor which confineses of be 
b later a houstonness just criticizen on these on education the amest in the abne
fhe give a mounts loves portures to guitaft diical atts the agreek brayllen ths o
ds for produces language the exception was describes chooe righ areay isbn f indi
================================================================================
Validation set perplexity: 6.24

Even with more training steps, the final perplexity does not improve: it plateaus around 6.2–6.4. Since I do not know what final value to expect, and since the minibatch and validation perplexities are consistent (no sign of divergence or overfitting), I am not sure how to improve further.
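One sanity check on the logged numbers: perplexity is the exponential of the average per-character cross-entropy, so the reported average loss and minibatch perplexity should agree. A minimal check (using the figures from the log above):

```python
import math

def perplexity(mean_loss):
    # Perplexity = exp(average cross-entropy loss per character).
    return math.exp(mean_loss)

# The last logged average loss was ~1.5157; the corresponding
# perplexity should sit near the reported minibatch values (~4.5).
print(perplexity(1.515656))
```

This confirms the training loss and minibatch perplexity columns are internally consistent; the gap to the validation perplexity (~6.3) is the quantity that stops improving.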


Problem 3

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

the quick brown fox

the model should attempt to output:

eht kciuq nworb xof

Refer to the lecture on how to put together a sequence-to-sequence model, as well as this article for best practices.


Unfortunately, I did not have time to work on this problem within the timeframe of the course.
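Although the model itself was not implemented, the target mapping the sequence-to-sequence LSTM would have to learn is easy to state in code. A minimal sketch (`mirror_words` is a hypothetical helper, not part of the course code): each word's characters are reversed while word order is preserved.

```python
def mirror_words(sentence):
    # Reverse the characters within each word, keeping word order intact.
    return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # eht kciuq nworb xof
```

Pairs of (sentence, mirror_words(sentence)) drawn from Text8 would serve as the encoder inputs and decoder targets for training.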