Deep Learning

Assignment 6

After training a skip-gram model in 5_word2vec.ipynb, the goal of this notebook is to train an LSTM character model over the Text8 data.


In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)


Found and verified text8.zip

In [3]:
def read_data(filename):
  with zipfile.ZipFile(filename) as f:
    name = f.namelist()[0]
    data = tf.compat.as_str(f.read(name))
  return data
  
text = read_data(filename)
print('Data size %d' % len(text))


Data size 100000000

In [4]:
print(text[0:10000])


 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist society might work vary considerably especially with respect to economics there is also disagreement about how a free society might be brought about origins and predecessors kropotkin and others argue that before recorded history human society was organized on anarchist principles most anthropologists follow kropotkin and engels in believing that hunter gatherer bands were egalitarian and lacked division of labour accumulated wealth or decreed law and had equal access to resources william godwin anarchists including the the anarchy organisation and rothbard find anarchist attitudes in taoism from ancient china kropotkin found similar ideas in stoic zeno of citium according to kropotkin zeno repudiated the omnipotence of the state its intervention and regimentation and proclaimed the sovereignty of the moral law of the individual the anabaptists of one six th century europe are sometimes considered to be religious forerunners of modern anarchism bertrand russell in his history of western philosophy writes that the anabaptists repudiated all law since they held that the good man will be guided at every moment by the holy spirit from this premise they arrive at communism the diggers or true levellers were an early communistic movement during the time of the english civil war and are considered by some as forerunners of modern anarchism in the modern era the first to use the term to mean something other than chaos was louis armand baron de lahontan in his nouveaux voyages dans l am rique septentrionale one seven zero three where he described the indigenous american society which had no state laws prisons priests or private property as being in anarchy russell means a libertarian and leader in the american indian movement has repeatedly stated that he is an anarchist and so are all his ancestors in one seven nine three in the thick of the french revolution william godwin published an enquiry concerning political justice although godwin did not use the word anarchism many later anarchists have regarded this book as the first major anarchist text and godwin as the founder of philosophical anarchism but at this point no anarchist movement yet existed and the term anarchiste was known mainly as an insult 
hurled by the bourgeois girondins at more radical elements in the french revolution the first self labelled anarchist pierre joseph proudhon it is commonly held that it wasn t until pierre joseph proudhon published what is property in one eight four zero that the term anarchist was adopted as a self description it is for this reason that some claim proudhon as the founder of modern anarchist theory in what is property proudhon answers with the famous accusation property is theft in this work he opposed the institution of decreed property propri t where owners have complete rights to use and abuse their property as they wish such as exploiting workers for profit in its place proudhon supported what he called possession individuals can have limited rights to use resources capital and goods in accordance with principles of equality and justice proudhon s vision of anarchy which he called mutualism mutuellisme involved an exchange economy where individuals and groups could trade the products of their labor using labor notes which represented the amount of working time involved in production this would ensure that no one would profit from the labor of others workers could freely join together in co operative workshops an interest free bank would be set up to provide everyone with access to the means of production proudhon s ideas were influential within french working class movements and his followers were active in the revolution of one eight four eight in france proudhon s philosophy of property is complex it was developed in a number of works over his lifetime and there are differing interpretations of some of his ideas for more detailed discussion see here max stirner s egoism in his the ego and its own stirner argued that most commonly accepted social institutions including the notion of state property as a right natural rights in general and the very notion of society were mere illusions or ghosts in the mind saying of society that the individuals are its reality he advocated egoism and a form of amoralism in which individuals would unite in associations of egoists only when it was in their self interest to do so for him property simply comes about through might whoever knows how to take to defend the thing to him belongs property and what i have in my power that is my own so long as i assert myself as holder i am the proprietor of the thing stirner never called himself an anarchist he accepted only the label egoist nevertheless his ideas were influential on many individualistically inclined anarchists although interpretations of his thought are diverse american individualist anarchism benjamin tucker in one eight two five josiah warren had participated in a communitarian experiment headed by robert owen called new harmony which failed in a few years amidst much internal conflict warren blamed the community s failure on a lack of individual sovereignty and a lack of private property warren proceeded to organise experimenal anarchist communities which respected what he called the sovereignty of the individual at utopia and modern times in one eight three three warren wrote and published the peaceful revolutionist which some have noted to be the first anarchist periodical ever published benjamin tucker says that warren was the first man to expound and formulate the doctrine now known as anarchism liberty xiv december one nine zero zero one benjamin tucker became interested in anarchism through meeting josiah warren and william b greene he edited and published liberty from august one eight 
eight one to april one nine zero eight it is widely considered to be the finest individualist anarchist periodical ever issued in the english language tucker s conception of individualist anarchism incorporated the ideas of a variety of theorists greene s ideas on mutual banking warren s ideas on cost as the limit of price a heterodox variety of labour theory of value proudhon s market anarchism max stirner s egoism and herbert spencer s law of equal freedom tucker strongly supported the individual s right to own the product of his or her labour as private property and believed in a market economy for trading this property he argued that in a truly free market system without the state the abundance of competition would eliminate profits and ensure that all workers received the full value of their labor other one nine th century individualists included lysander spooner stephen pearl andrews and victor yarros the first international mikhail bakunin one eight one four one eight seven six in europe harsh reaction followed the revolutions of one eight four eight twenty years later in one eight six four the international workingmen s association sometimes called the first international united some diverse european revolutionary currents including anarchism due to its genuine links to active workers movements the international became signficiant from the start karl marx was a leading figure in the international he was elected to every succeeding general council of the association the first objections to marx came from the mutualists who opposed communism and statism shortly after mikhail bakunin and his followers joined in one eight six eight the first international became polarised into two camps with marx and bakunin as their respective figureheads the clearest difference between the camps was over strategy the anarchists around bakunin favoured in kropotkin s words direct economical struggle against capitalism without interfering in the political parliamentary agitation at that time marx and his followers focused on parliamentary activity bakunin characterised marx s ideas as authoritarian and predicted that if a marxist party gained to power its leaders would end up as bad as the ruling class they had fought against in one eight seven two the conflict climaxed with a final split between the two groups at the hague congress this is often cited as the origin of the conflict between anarchists and marxists from this moment the social democratic and libertarian currents of socialism had distinct organisations including rival internationals anarchist communism peter kropotkin proudhon and bakunin both opposed communism associating it with statism however in the one eight seven zero s many anarchists moved away from bakunin s economic thinking called collectivism and embraced communist concepts communists believed the means of production should be owned

Create a small validation set.


In [5]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:100])
print(valid_size, valid_text[:100])


99999000 ons anarchists advocate social relations based upon voluntary association of autonomous individuals 
1000  anarchism originated as a term of abuse first used against early working class radicals including t

Utility functions to map characters to vocabulary IDs and back.


In [6]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))


Unexpected character: ï
1 26 0 0
a z  

Function to generate a training batch for the LSTM model.


In [17]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    # Stagger one cursor per batch row, each starting in its own segment of
    # the text, so every row streams an independent slice of the corpus.
    self._cursor = [offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float32)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)


print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))


['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
[' a']
['an']
['na']
['ar']
['rc']
['ch']
['hi']
['is']
['sm']
['m ']
[' o']

In [8]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float32)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
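
The perplexities reported during training are simply the exponential of this mean log-probability. As an illustrative sanity check (this snippet is not part of the original notebook), a uniform prediction over the 27-character vocabulary should score the worst-case perplexity of 27:

uniform = np.full((1, vocabulary_size), 1.0 / vocabulary_size)
one_hot = np.zeros((1, vocabulary_size))
one_hot[0, char2id('a')] = 1.0
# logprob returns -log(1/27) per character, so the perplexity is exactly 27.
print(np.exp(logprob(uniform, one_hot)))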

Simple LSTM Model.


In [9]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.concat(train_labels, 0), logits=logits))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [18]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.294314 learning rate: 10.000000
Minibatch perplexity: 26.96
================================================================================
lnnvhv hczs rnt  jfrklectay ts i  j ofme wcog kbvhwgrheajcmhs qjawtjz etselzsg g
crrv aedoo ozjflghwmtxldsee en n bmpdanacki wkiharg wtg vnesmima slwgwrsatiuneen
oreyjivy   nydnxgaaoxzfne yofifbeotnws lplgt  ge erjtihqeinrmg xekbuszrzub pfmp 
j  dgp jtt mga rfhaqr ohpuo  ontpuaanyyzoolochkrmtb xne poswutu nhs ia dslcsgfim
woe egqr fxc  edemeisfnep caoiefltmtotqj a jaribhwzcmgz zshyjnkbqt lk sveqyolevw
================================================================================
Validation set perplexity: 20.24
Average loss at step 100: 2.623005 learning rate: 10.000000
Minibatch perplexity: 11.01
Validation set perplexity: 10.54
Average loss at step 200: 2.254879 learning rate: 10.000000
Minibatch perplexity: 8.64
Validation set perplexity: 8.75
Average loss at step 300: 2.099661 learning rate: 10.000000
Minibatch perplexity: 7.60
Validation set perplexity: 8.16
Average loss at step 400: 2.000877 learning rate: 10.000000
Minibatch perplexity: 7.43
Validation set perplexity: 7.80
Average loss at step 500: 1.937333 learning rate: 10.000000
Minibatch perplexity: 6.49
Validation set perplexity: 7.00
Average loss at step 600: 1.909437 learning rate: 10.000000
Minibatch perplexity: 6.23
Validation set perplexity: 7.03
Average loss at step 700: 1.856690 learning rate: 10.000000
Minibatch perplexity: 6.57
Validation set perplexity: 6.66
Average loss at step 800: 1.814716 learning rate: 10.000000
Minibatch perplexity: 5.83
Validation set perplexity: 6.23
Average loss at step 900: 1.826778 learning rate: 10.000000
Minibatch perplexity: 6.74
Validation set perplexity: 6.20
Average loss at step 1000: 1.820805 learning rate: 10.000000
Minibatch perplexity: 5.55
================================================================================
w ruch newrol caplicar have in te fimmulic of stanbed the y his have inti on a c
verment that warotd movert the winse and mniling welden male enganution treary i
ing unhod at quecianied comery exiamedrets stailwiog filmer houd stace in of the
s earate maridite in also by the from chict the chilly poselos for four yerome a
er in mules fow buare centry the scanum comorizal defise the iminds of centripta
================================================================================
Validation set perplexity: 6.05
Average loss at step 1100: 1.775186 learning rate: 10.000000
Minibatch perplexity: 5.44
Validation set perplexity: 5.80
Average loss at step 1200: 1.748487 learning rate: 10.000000
Minibatch perplexity: 5.04
Validation set perplexity: 5.65
Average loss at step 1300: 1.732639 learning rate: 10.000000
Minibatch perplexity: 5.74
Validation set perplexity: 5.62
Average loss at step 1400: 1.742479 learning rate: 10.000000
Minibatch perplexity: 6.00
Validation set perplexity: 5.48
Average loss at step 1500: 1.729836 learning rate: 10.000000
Minibatch perplexity: 4.78
Validation set perplexity: 5.44
Average loss at step 1600: 1.744872 learning rate: 10.000000
Minibatch perplexity: 5.50
Validation set perplexity: 5.43
Average loss at step 1700: 1.707732 learning rate: 10.000000
Minibatch perplexity: 5.46
Validation set perplexity: 5.24
Average loss at step 1800: 1.672056 learning rate: 10.000000
Minibatch perplexity: 5.35
Validation set perplexity: 5.24
Average loss at step 1900: 1.643194 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 5.21
Average loss at step 2000: 1.692570 learning rate: 10.000000
Minibatch perplexity: 5.63
================================================================================
ope came on precoant to ligne butdles s light come winning is s onaxt one free h
ing nandles the unitiation ospennies after gener in in pain the coud and inlodet
ds and mefdustak the secing d pessent weidable rappate of buyder hes time witwom
jan of empross im auboments for nebsent standers zero zero basid both is nohle w
x histast the modah hl wing the daxents iss one nine eiven seven futea consts sh
================================================================================
Validation set perplexity: 5.20
Average loss at step 2100: 1.682833 learning rate: 10.000000
Minibatch perplexity: 5.12
Validation set perplexity: 4.88
Average loss at step 2200: 1.679279 learning rate: 10.000000
Minibatch perplexity: 6.58
Validation set perplexity: 4.95
Average loss at step 2300: 1.639969 learning rate: 10.000000
Minibatch perplexity: 5.06
Validation set perplexity: 4.76
Average loss at step 2400: 1.659329 learning rate: 10.000000
Minibatch perplexity: 5.01
Validation set perplexity: 4.80
Average loss at step 2500: 1.675587 learning rate: 10.000000
Minibatch perplexity: 5.23
Validation set perplexity: 4.56
Average loss at step 2600: 1.649104 learning rate: 10.000000
Minibatch perplexity: 5.68
Validation set perplexity: 4.55
Average loss at step 2700: 1.651925 learning rate: 10.000000
Minibatch perplexity: 4.58
Validation set perplexity: 4.68
Average loss at step 2800: 1.651363 learning rate: 10.000000
Minibatch perplexity: 5.58
Validation set perplexity: 4.58
Average loss at step 2900: 1.646795 learning rate: 10.000000
Minibatch perplexity: 5.62
Validation set perplexity: 4.62
Average loss at step 3000: 1.647693 learning rate: 10.000000
Minibatch perplexity: 4.98
================================================================================
winnes of indians alumon one nibbly putch atter desidenty farmide protocy of oer
f gengo uncleested dease but could ruls alp ivelbism of all ba cambinnong a cont
ver bother doundenved undersider kihneant to arms day churtar and parto reservan
varian s in of the eight four nemation from be detem tyle in the normal part and
k impalle hegaghame accurtity that founding marring are bay dendently syace enco
================================================================================
Validation set perplexity: 4.70
Average loss at step 3100: 1.621529 learning rate: 10.000000
Minibatch perplexity: 5.68
Validation set perplexity: 4.62
Average loss at step 3200: 1.639812 learning rate: 10.000000
Minibatch perplexity: 5.38
Validation set perplexity: 4.65
Average loss at step 3300: 1.632859 learning rate: 10.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.58
Average loss at step 3400: 1.668599 learning rate: 10.000000
Minibatch perplexity: 5.54
Validation set perplexity: 4.64
Average loss at step 3500: 1.653692 learning rate: 10.000000
Minibatch perplexity: 5.52
Validation set perplexity: 4.69
Average loss at step 3600: 1.665889 learning rate: 10.000000
Minibatch perplexity: 4.35
Validation set perplexity: 4.52
Average loss at step 3700: 1.644094 learning rate: 10.000000
Minibatch perplexity: 5.02
Validation set perplexity: 4.58
Average loss at step 3800: 1.640882 learning rate: 10.000000
Minibatch perplexity: 5.70
Validation set perplexity: 4.69
Average loss at step 3900: 1.636125 learning rate: 10.000000
Minibatch perplexity: 5.05
Validation set perplexity: 4.68
Average loss at step 4000: 1.653151 learning rate: 10.000000
Minibatch perplexity: 4.56
================================================================================
versce central chlensure ruml shorrevies the nine brolding the hervely previeu t
lelitical king revical stania sourgel an in iprograms directions seasionity is o
fathia c canuecally jerong of to be opeball of three merch for maver with bill i
s bradiods the during the quastions can his wave in mudicial usez approhes el co
helecies rif one two zero zero zero zero also payric usophep popular for eight s
================================================================================
Validation set perplexity: 4.66
Average loss at step 4100: 1.631490 learning rate: 10.000000
Minibatch perplexity: 5.10
Validation set perplexity: 4.81
Average loss at step 4200: 1.632502 learning rate: 10.000000
Minibatch perplexity: 5.20
Validation set perplexity: 4.55
Average loss at step 4300: 1.614347 learning rate: 10.000000
Minibatch perplexity: 5.08
Validation set perplexity: 4.58
Average loss at step 4400: 1.611879 learning rate: 10.000000
Minibatch perplexity: 4.82
Validation set perplexity: 4.50
Average loss at step 4500: 1.617748 learning rate: 10.000000
Minibatch perplexity: 5.24
Validation set perplexity: 4.62
Average loss at step 4600: 1.614927 learning rate: 10.000000
Minibatch perplexity: 5.11
Validation set perplexity: 4.61
Average loss at step 4700: 1.623387 learning rate: 10.000000
Minibatch perplexity: 5.14
Validation set perplexity: 4.55
Average loss at step 4800: 1.629507 learning rate: 10.000000
Minibatch perplexity: 4.35
Validation set perplexity: 4.59
Average loss at step 4900: 1.630408 learning rate: 10.000000
Minibatch perplexity: 5.13
Validation set perplexity: 4.74
Average loss at step 5000: 1.606152 learning rate: 1.000000
Minibatch perplexity: 4.41
================================================================================
x who toth expare one eight one two zero years of the preoniniay or in offender 
on one nine vanustries pamial cellurally work with generaty is sudararded midite
thirusary a latent indic recordd batgara is one five wish parating electiveid th
x beath york and as foctivity it fembles molivis to groutt by and terpans onlore
 by some the lontala elizal eftemuled hitter all socitality dosternd dy would a 
================================================================================
Validation set perplexity: 4.67
Average loss at step 5100: 1.602420 learning rate: 1.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.49
Average loss at step 5200: 1.581252 learning rate: 1.000000
Minibatch perplexity: 4.71
Validation set perplexity: 4.44
Average loss at step 5300: 1.575382 learning rate: 1.000000
Minibatch perplexity: 4.55
Validation set perplexity: 4.44
Average loss at step 5400: 1.575437 learning rate: 1.000000
Minibatch perplexity: 5.07
Validation set perplexity: 4.42
Average loss at step 5500: 1.565138 learning rate: 1.000000
Minibatch perplexity: 4.95
Validation set perplexity: 4.38
Average loss at step 5600: 1.577481 learning rate: 1.000000
Minibatch perplexity: 4.79
Validation set perplexity: 4.38
Average loss at step 5700: 1.564546 learning rate: 1.000000
Minibatch perplexity: 4.47
Validation set perplexity: 4.37
Average loss at step 5800: 1.574814 learning rate: 1.000000
Minibatch perplexity: 4.91
Validation set perplexity: 4.36
Average loss at step 5900: 1.570832 learning rate: 1.000000
Minibatch perplexity: 5.02
Validation set perplexity: 4.37
Average loss at step 6000: 1.543420 learning rate: 1.000000
Minibatch perplexity: 5.02
================================================================================
zs to wings onrelated storks coloned by instrugki of occurity one one one two se
opy dissivert important for information on at the proport levaince in offect wat
geans and or four zero zero zero three and its fuel six dispiract of it as have 
shoning that technothing s and the seemento detrical ruth of the pricted for the
onson the indepision one eight bomenish islaids it by fiblis visive yougaway bur
================================================================================
Validation set perplexity: 4.36
Average loss at step 6100: 1.564224 learning rate: 1.000000
Minibatch perplexity: 5.14
Validation set perplexity: 4.34
Average loss at step 6200: 1.532666 learning rate: 1.000000
Minibatch perplexity: 4.87
Validation set perplexity: 4.35
Average loss at step 6300: 1.541885 learning rate: 1.000000
Minibatch perplexity: 5.02
Validation set perplexity: 4.31
Average loss at step 6400: 1.538702 learning rate: 1.000000
Minibatch perplexity: 4.63
Validation set perplexity: 4.31
Average loss at step 6500: 1.554115 learning rate: 1.000000
Minibatch perplexity: 4.70
Validation set perplexity: 4.31
Average loss at step 6600: 1.591587 learning rate: 1.000000
Minibatch perplexity: 4.73
Validation set perplexity: 4.32
Average loss at step 6700: 1.575701 learning rate: 1.000000
Minibatch perplexity: 5.17
Validation set perplexity: 4.32
Average loss at step 6800: 1.600245 learning rate: 1.000000
Minibatch perplexity: 4.68
Validation set perplexity: 4.33
Average loss at step 6900: 1.579958 learning rate: 1.000000
Minibatch perplexity: 4.58
Validation set perplexity: 4.34
Average loss at step 7000: 1.574352 learning rate: 1.000000
Minibatch perplexity: 5.07
================================================================================
ing elomry cornicaten which of johd who paration of the maj the cornifoted trady
ver dober dienans the eight offer american thewe of the udern adving american th
ours partes interimption compodictic skandaummoof to time prevaces of been b tho
weres endive byt to musica betendly conceinties were work regody peestandiations
mo atonte the werab defensive the program regerative nom structor searical one n
================================================================================
Validation set perplexity: 4.31

Problem 1

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.
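
A minimal sketch of that refactoring (my own, reusing the shapes and initialization of the cell above; not a reference solution): stack the input-gate, forget-gate, update, and output-gate parameters column-wise, then split after the multiply.

sx = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
sm = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
sb = tf.Variable(tf.zeros([1, 4 * num_nodes]))

def lstm_cell(i, o, state):
  # Two matmuls total; split the 4 * num_nodes columns back into the four
  # per-gate blocks, matching the arithmetic of the original cell.
  y = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
  input_gate, forget_gate, update, output_gate = tf.split(y, 4, axis=1)
  state = tf.sigmoid(forget_gate) * state + tf.sigmoid(input_gate) * tf.tanh(update)
  return tf.sigmoid(output_gate) * tf.tanh(state), state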



Problem 2

We want to train an LSTM over bigrams, that is, pairs of consecutive characters such as 'ab' instead of single characters such as 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves (see the sketch after this list).

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this article.
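
A hedged sketch for parts a and c (names such as embedding_size and keep_prob are illustrative choices, not the assignment's): inputs become integer bigram ids looked up in a trainable embedding table, and dropout is applied only to the non-recurrent connections (cell input and cell output), the usual recommendation for LSTMs. Note that the cell's input weight matrices must then take shape [embedding_size, num_nodes] instead of [vocabulary_size, num_nodes].

bigram_vocabulary_size = vocabulary_size * vocabulary_size  # 27 * 27 bigrams
embedding_size = 64                     # illustrative hyperparameter
keep_prob = tf.placeholder(tf.float32)  # e.g. feed 0.9 to train, 1.0 to sample

embeddings = tf.Variable(
    tf.random_uniform([bigram_vocabulary_size, embedding_size], -1.0, 1.0))
bigram_ids = tf.placeholder(tf.int32, shape=[batch_size])
embed = tf.nn.embedding_lookup(embeddings, bigram_ids)

embed = tf.nn.dropout(embed, keep_prob)                      # input dropout
output, state = lstm_cell(embed, saved_output, saved_state)
output = tf.nn.dropout(output, keep_prob)                    # output dropout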



Problem 3

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

the quick brown fox

the model should attempt to output:

eht kciuq nworb xof

Refer to the lecture on how to put together a sequence-to-sequence model, as well as this article for best practices.
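
To make the target transformation concrete, here is a tiny helper (illustrative only, not part of the assignment) that produces the mirrored targets for training pairs:

def mirror_words(sentence):
  """Reverse each word in place while keeping the word order."""
  return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # -> eht kciuq nworb xof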