Lab 6.1 - Keras for RNN

In this lab we will use the Keras deep learning library to construct a simple recurrent neural network (RNN) that can learn linguistic structure from a piece of text, and use that knowledge to generate new text passages. To review general RNN architecture, specific types of RNN networks such as the LSTM networks we'll be using here, and other concepts behind this type of machine learning, you should consult the following resources:

This code is an adaptation of these two examples:

You can consult the original sites for more information and documentation.

Let's start by importing some of the libraries we'll be using in this lab:


In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

from time import gmtime, strftime
import os
import re
import pickle
import random
import sys


Using TensorFlow backend.

The first thing we need to do is generate our training data set. In this case we will use a recent article written by Barack Obama for The Economist newspaper. Make sure you have the obama.txt file in the /data folder within the /week-6 folder in your repository.


In [2]:
# load ascii text from file
filename = "data/obama.txt"
raw_text = open(filename).read()

# get rid of any characters other than letters, numbers, 
# and a few special characters
raw_text = re.sub('[^\nA-Za-z0-9 ,.:;?!-]+', '', raw_text)

# convert all text to lowercase
raw_text = raw_text.lower()

n_chars = len(raw_text)
print "length of text:", n_chars
print "text preview:", raw_text[:500]


length of text: 18312
text preview: wherever i go these days, at home or abroad, people ask me the same question: what is happening in the american political system? how has a country that has benefitedperhaps more than any otherfrom immigration, trade and technological innovation suddenly developed a strain of anti-immigrant, anti-innovation protectionism? why have some on the far left and even more on the far right embraced a crude populism that promises a return to a past that is not possible to restoreand that, for most americ

Next, we use Python's set() function to generate a list of all unique characters in the text. This will form our 'vocabulary' of characters, which plays the same role as the categories in a typical ML classification problem.

Since neural networks work with numerical data, we also need to create a mapping between each character and a unique integer value. To do this we create two dictionaries: one which has characters as keys and the associated integers as the value, and one which has integers as keys and the associated characters as the value. These dictionaries will allow us to do translation both ways.


In [3]:
# extract all unique characters in the text
chars = sorted(list(set(raw_text)))
n_vocab = len(chars)
print "number of unique characters found:", n_vocab

# create mapping of characters to integers and back
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# test our mapping
print 'a', "- maps to ->", char_to_int["a"]
print 25, "- maps to ->", int_to_char[25]


number of unique characters found: 44
a - maps to -> 18
25 - maps to -> h

Now we need to define the training data for our network. With RNNs, the training data usually takes the shape of a three-dimensional array, with the size of each dimension representing:

[# of training sequences, # of training samples per sequence, # of features per sample]

  • The training sequences are the sets of data presented to the RNN at each training step. As with all neural networks, these training sequences are presented to the network in small batches during training.
  • Each training sequence is composed of some number of training samples. The number of samples in each sequence dictates how far back in the data stream the algorithm can learn, and sets the number of time steps the RNN layer is unrolled over.
  • Each training sample within a sequence is composed of some number of features. This is the data the RNN layer learns from at each time step. In our example, the training samples and targets will use one-hot encoding, so each will have a feature for every possible character, with the actual character represented by 1 and all others by 0 (see the sketch after this list).
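
For example, with our 44-character vocabulary, the one-hot encoding of a single character is a vector of length 44 containing a single 1 at that character's index. A minimal sketch, using the char_to_int mapping we built above:

# a minimal sketch: the one-hot vector for the character 'a'
one_hot = np.zeros(n_vocab, dtype=np.bool)
one_hot[char_to_int['a']] = 1
print one_hot.astype(int)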

To prepare the data, we first set the length of training sequences we want to use. In this case we will set the sequence length to 100, meaning the RNN layer will be able to predict future characters based on the 100 characters that came before.

We will then slide this 100 character 'window' over the entire text to create input and output arrays. Each entry in the input array contains 100 characters from the text, and each entry in the output array contains the single character that came after.


In [4]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100

inputs = []
outputs = []

for i in range(0, n_chars - seq_length, 1):
    inputs.append(raw_text[i:i + seq_length])
    outputs.append(raw_text[i + seq_length])
    
n_sequences = len(inputs)
print "Total sequences: ", n_sequences


Total sequences:  18212

Now let's shuffle both the input and output data so that we can later have Keras split it automatically into a training and test set. To make sure the two lists are shuffled the same way (maintaining correspondence between inputs and outputs), we create a separate shuffled list of indices, and use those indices to reorder both lists.


In [5]:
indices = range(len(inputs))
random.shuffle(indices)

inputs = [inputs[x] for x in indices]
outputs = [outputs[x] for x in indices]

Let's visualize one of these sequences to make sure we are getting what we expect:


In [6]:
print inputs[0], "-->", outputs[0]


declining productivity growth and rising inequality have resulted in slower income growth for low- a --> n

Next we will prepare the actual numpy datasets which will be used to train our network. We first initialize two empty numpy arrays with the proper dimensions:

  • X --> [# of training sequences, # of training samples, # of features]
  • y --> [# of training sequences, # of features]

We then iterate over the arrays we generated in the previous step and fill the numpy arrays with the proper data. Since all character data is formatted using one-hot encoding, we initialize both data sets with zeros. As we iterate over the data, we use the char_to_int dictionary to map each character to its related position integer, and use that position to change the related value in the data set to 1.


In [7]:
# create two empty numpy arrays with the proper dimensions
X = np.zeros((n_sequences, seq_length, n_vocab), dtype=np.bool)
y = np.zeros((n_sequences, n_vocab), dtype=np.bool)

# iterate over the data and build up the X and y data sets
# by setting the appropriate indices to 1 in each one-hot vector
for i, example in enumerate(inputs):
    for t, char in enumerate(example):
        X[i, t, char_to_int[char]] = 1
    y[i, char_to_int[outputs[i]]] = 1
    
print 'X dims -->', X.shape
print 'y dims -->', y.shape


X dims --> (18212, 100, 44)
y dims --> (18212, 44)

Next, we define our RNN model in Keras. This is very similar to how we defined the CNN model, except now we use the LSTM() function to create an LSTM layer with an internal memory of 128 units. LSTM is a special type of RNN layer which mitigates the unstable gradient problems (especially vanishing gradients) seen in basic RNNs. Along with LSTM layers, Keras also supports basic RNN layers and GRU layers, which are similar to LSTM. You can find full documentation for recurrent layers in the Keras documentation.

As before, we need to explicitly define the input shape for the first layer. We also need to tell Keras whether the LSTM layer should return its full sequence of outputs (one per time step) or only its final output to the next layer. If you are connecting the LSTM layer to a fully connected layer, as we do in this case, you should set the return_sequences parameter to False so the layer passes only its final output. If you are stacking multiple LSTM layers, you should set the parameter to True in all but the last layer, so that each subsequent layer can learn from the full sequence of outputs of the layer before it (a sketch of this follows).
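
For reference, here is a minimal sketch of what a stacked variant of our model could look like. This is only an illustration of the return_sequences parameter, not the model we train below:

# a sketch of stacking two LSTM layers: the first returns its full
# sequence of outputs so the second has a sequence to learn from
stacked = Sequential()
stacked.add(LSTM(128, return_sequences=True, input_shape=(seq_length, n_vocab)))
stacked.add(LSTM(128, return_sequences=False))
stacked.add(Dropout(0.50))
stacked.add(Dense(n_vocab, activation='softmax'))
stacked.compile(loss='categorical_crossentropy', optimizer='adam')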

We will use dropout with a probability of 50% to regularize the network and prevent overfitting on our training data. The output of the network will be a fully connected layer with one neuron for each character in the vocabulary. The softmax function will convert this output to a probability distribution across all characters.


In [8]:
# define the LSTM model
model = Sequential()
model.add(LSTM(128, return_sequences=False, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.50))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Next, we define two helper functions: one to select a character based on a probability distribution, and one to generate a sequence of predicted characters based on an input (or 'seed') list of characters.

The sample() function will take in a probability distribution generated by the softmax() function, and select a character based on the 'temperature' input. The temperature (also often called the 'diversity') affects how strictly the probability distribution is sampled.

  • Lower values (closer to zero) output more confident predictions, but are also more conservative. In our case, if the model has overfit the training data, lower values are likely to give back exactly what is found in the training text.
  • Higher values (1 and above) introduce more diversity and randomness into the results. This can lead the model to generate novel information not found in the training data. However, you are also likely to see more errors such as grammatical or spelling mistakes. The sketch after this list shows the effect of temperature on a toy distribution.
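
To see the effect concretely, here is a small sketch that applies the same temperature math used in sample() below to a made-up distribution over three characters:

# rescale a toy probability distribution at different temperatures
preds = np.array([0.1, 0.2, 0.7])

for temperature in [0.2, 1.0, 1.5]:
    scaled = np.exp(np.log(preds) / temperature)
    scaled = scaled / np.sum(scaled)
    print temperature, '-->', scaled

At a temperature of 0.2 almost all of the probability mass moves to the most likely character, while at 1.5 the distribution flattens out, making unlikely characters easier to sample.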

In [9]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    # rescale the distribution in log space according to the temperature
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw one sample from the rescaled distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

The generate() function will take in:

  • an input sentence (the 'seed')
  • the number of characters to generate
  • the target diversity or temperature

and print the resulting sequence of characters to the screen.


In [10]:
def generate(sentence, prediction_length=50, diversity=0.35):
    print '----- diversity:', diversity 

    generated = sentence
    sys.stdout.write(generated)

    # iterate over number of characters requested
    for i in range(prediction_length):
        
        # build up sequence data from current sentence
        x = np.zeros((1, X.shape[1], X.shape[2]))
        for t, char in enumerate(sentence):
            x[0, t, char_to_int[char]] = 1.

        # use trained model to return probability distribution
        # for next character based on input sequence
        preds = model.predict(x, verbose=0)[0]
        
        # use sample() function to sample next character 
        # based on probability distribution and desired diversity
        next_index = sample(preds, diversity)
        
        # convert integer to character
        next_char = int_to_char[next_index]

        # add new character to generated text
        generated += next_char
        
        # delete the first character from the beginning of the sentence, 
        # and add the new character to the end. This will form the 
        # input sequence for the next predicted character.
        sentence = sentence[1:] + next_char

        # print results to screen
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print

Next, we set up a Keras callback to save our model's parameters to a local file after each epoch in which it achieves an improvement in the overall loss. This will allow us to reuse the trained model at a later time without having to retrain it from scratch. This is also useful for recovering models in case your computer crashes, or if you want to stop the training early.


In [11]:
filepath="-basic_LSTM.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
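
Later, you can reuse the best saved weights without retraining by rebuilding the same architecture and loading the checkpoint file. A minimal sketch:

# rebuild the same architecture, then load the saved weights
model = Sequential()
model.add(LSTM(128, return_sequences=False, input_shape=(seq_length, n_vocab)))
model.add(Dropout(0.50))
model.add(Dense(n_vocab, activation='softmax'))
model.load_weights("-basic_LSTM.hdf5")
model.compile(loss='categorical_crossentropy', optimizer='adam')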

Now we are finally ready to train the model. We want to train the model over 50 epochs, but we also want to output some generated text after each epoch to see how our model is doing.

To do this we create our own loop to iterate over the epochs. Within the loop we first train the model for one epoch. Since all parameters are stored within the model, training one epoch at a time has exactly the same effect as training over a longer series of epochs. We also use the validation_split parameter of the fit() function to tell Keras to automatically split the data into 80% training data and 20% test data for validation. Remember to always shuffle your data if you will be using validation!

After each epoch is trained, we use the raw_text data to extract a new sequence of 100 characters as the 'seed' for our generated text. Finally, we use our generate() helper function to generate text using two different diversity settings.

Warning: because of their large depth (remember that an RNN trained on 100-character sequences is effectively 100 layers deep!), these networks typically take much longer to train than traditional multi-layer ANNs and CNNs. You should expect these models to train overnight on the virtual machine, but you should be able to see enough progress after the first few epochs to know whether it is worth training a model to the end. For more complex RNN models with larger data sets in your own work, you should consider a native installation, along with a dedicated GPU if possible.


In [12]:
epochs = 50
prediction_length = 100

for iteration in range(epochs):
    
    print 'epoch:', iteration + 1, '/', epochs
    model.fit(X, y, validation_split=0.2, batch_size=256, nb_epoch=1, callbacks=callbacks_list)
    
    # get random starting point for seed
    start_index = random.randint(0, len(raw_text) - seq_length - 1)
    # extract seed sequence from raw text
    seed = raw_text[start_index: start_index + seq_length]
    
    print '----- generating with seed:', seed
    
    for diversity in [0.5, 1.2]:
        generate(seed, prediction_length, diversity)


epoch: 1 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 114s - loss: 3.1931 - val_loss: 2.9767
----- generating with seed: financial system was stabilised without costing taxpayers a dime and the auto industry rescued. i en
----- diversity: 0.5
financial system was stabilised without costing taxpayers a dime and the auto industry rescued. i eneerscfc rptn ri ns ooc e d  deeo ra ian i eo  its  set   o rs  ag   ,o s ai e  dh  si s ei oedh otei
----- diversity: 1.2
financial system was stabilised without costing taxpayers a dime and the auto industry rescued. i eni psueta3handkifnseyondknl baeh  skwde, tr if0faxiicraoaswtn,vehon   isckuc.wi91hofc idlueeoqtiy i1:
epoch: 2 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 107s - loss: 3.0204 - val_loss: 2.9401
----- generating with seed:  who achieve it. in fact, weve often accepted more inequality than many other nations because we are
----- diversity: 0.5
 who achieve it. in fact, weve often accepted more inequality than many other nations because we areco g t nend  tleter a tn t e  iil tp iiee e   rhro es  s,iit or r oteo   ee  o   t  oa s oniite gt u
----- diversity: 1.2
 who achieve it. in fact, weve often accepted more inequality than many other nations because we areplfryie,aonelanps fvnmnoey,koa gfacung eeanvufalsl.tylmuhtatanr rftdeav roceu i -contsekcssirs; ir n
epoch: 3 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 106s - loss: 2.9668 - val_loss: 2.9023
----- generating with seed: durable, growing economy; 15m new private-sector jobs since early 2010; rising wages, falling povert
----- diversity: 0.5
durable, growing economy; 15m new private-sector jobs since early 2010; rising wages, falling povert eroe  i eh , ea ie ns iont t   r h  ne oteheto  te  iet  ole  eibrseiee  ne e  ae eb eruii be ed re
----- diversity: 1.2
durable, growing economy; 15m new private-sector jobs since early 2010; rising wages, falling povert n nt grecls e9nmltm9nr  chsr kodet hf-sih ait er suea amiaupariatd mes  he bmo  atgmt3ohcaaeas uyui
epoch: 4 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 113s - loss: 2.9017 - val_loss: 2.8085
----- generating with seed:  only seemed to increase the isolation of corporations and elites, who often seem to live by a diffe
----- diversity: 0.5
 only seemed to increase the isolation of corporations and elites, who often seem to live by a diffe tal t a  re d oret arredoo te ro eor lonlevncta e t toe  aor ar ese- ent- aan teb ten cey ate sn aa
----- diversity: 1.2
 only seemed to increase the isolation of corporations and elites, who often seem to live by a diffeoiisgiotmegwh v riuacly ejce6utsgsia ?r m9lhsatlechagnceerseoipreoxya, tutaiwl t,syo dda lanrercer g
epoch: 5 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 2.7903 - val_loss: 2.6951
----- generating with seed: or idea that was threatening america under control. we overcame those fears and we will again.

but 
----- diversity: 0.5
or idea that was threatening america under control. we overcame those fears and we will again.

but r han ils una ao for cian ale teot e or coan net ee eires ae becile ocmet tha iin teaiin corer rane 
----- diversity: 1.2
or idea that was threatening america under control. we overcame those fears and we will again.

but 
thre ljmt rborp
iivon edln talf to uan wociirele irle wdthe sl de roralr-loyychuve 1oh oa utsepfool
epoch: 6 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.6846 - val_loss: 2.5906
----- generating with seed:  be overridden by bad politics. my administration secured much more fiscal expansion than many appre
----- diversity: 0.5
 be overridden by bad politics. my administration secured much more fiscal expansion than many appre cont the tin age onpee  hont ote an cese aranl ogl anerite an aos tal oat sot iil es the  aure the 
----- diversity: 1.2
 be overridden by bad politics. my administration secured much more fiscal expansion than many appreiy fy uural,iseiolwkedmistesan u-pucooa fgrp ,ea
itmene  ent alxrtoromh pq tpate ts tea tuanejasg io
epoch: 7 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 113s - loss: 2.5989 - val_loss: 2.5268
----- generating with seed: rs. reforms to our criminal-justice system and improvements to re-entry into the workforce that have
----- diversity: 0.5
rs. reforms to our criminal-justice system and improvements to re-entry into the workforce that have an abt io coa tor aon pe and are the me she thar sr ehe thad thanr ho ann ore ont in the they fo ro
----- diversity: 1.2
rs. reforms to our criminal-justice system and improvements to re-entry into the workforce that have anpuectmpalnegntos mivo pcolhut y,
wrhcgttw ne iun91v contibsg tnsd ec9lld ogen gio; oosintdryoy ns
epoch: 8 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 118s - loss: 2.5393 - val_loss: 2.4704
----- generating with seed: osting taxpayers a dime and the auto industry rescued. i enacted a larger and more front-loaded fisc
----- diversity: 0.5
osting taxpayers a dime and the auto industry rescued. i enacted a larger and more front-loaded fiscot boag the then iniint one usale res ons the the le ind se prersmer tha setecen ate the re anil can
----- diversity: 1.2
osting taxpayers a dime and the auto industry rescued. i enacted a larger and more front-loaded fiscsliige tuu tvekmew  omexs on
anmtm4sata phen,ie te obs. od rltbfar thopn pos ovren atu thaegme: edou
epoch: 9 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 118s - loss: 2.4881 - val_loss: 2.4301
----- generating with seed: scipline in good times to expand support for the economy when needed and to meet our long-term oblig
----- diversity: 0.5
scipline in good times to expand support for the economy when needed and to meet our long-term obligthe soule on are ave cof in tho thon  norerge an ari sen the thop tore pore wot an alion ano eat the
----- diversity: 1.2
scipline in good times to expand support for the economy when needed and to meet our long-term oblighue tui ceyls4m0nsthestidusymonentsoreli, deaetnddosnis co br-yeni hare.
iig lhum paritcinovgcamiod 
epoch: 10 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 2.4486 - val_loss: 2.3963
----- generating with seed: f who americans are as a people. we dont begrudge success, we aspire to it and admire those who achi
----- diversity: 0.5
f who americans are as a people. we dont begrudge success, we aspire to it and admire those who achins nal al sorg the  ore the the merthe ta the the sure faco oo the the ind initing se of the fon fin
----- diversity: 1.2
f who americans are as a people. we dont begrudge success, we aspire to it and admire those who achitynevs a. altiy, ant oukcine jlopehd 5on iron. tho ofom 7helg the frqfurt liv vcregd si le biand t c
epoch: 11 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 112s - loss: 2.4166 - val_loss: 2.3654
----- generating with seed: w we divide up the pie.

a major source of the recent productivity slowdown has been a shortfall of 
----- diversity: 0.5
w we divide up the pie.

a major source of the recent productivity slowdown has been a shortfall of rerecal the the the he beumes pous the t at the are hor the the then callican, the drale s an ericai
----- diversity: 1.2
w we divide up the pie.

a major source of the recent productivity slowdown has been a shortfall of romedgec urrecl. lud cosfvess lhomfdtats ucrtfall ppiarecumise tom-oqbe,pr4touty t fireule lo cpalal
epoch: 12 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 2.3836 - val_loss: 2.3456
----- generating with seed: n have increased the share of income received by all other families by more than the tax changes in 
----- diversity: 0.5
n have increased the share of income received by all other families by more than the tax changes in don nos an aor ihe mintinges and and them eronte shome the tho es of are invengues the for the thes 
----- diversity: 1.2
n have increased the share of income received by all other families by more than the tax changes in unke bdeessso.h, ;hta- andcarsy olqbaly, ;ranl?pmor, an omestwacss who rirdlahet ftubxveyd ban les1d
epoch: 13 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 116s - loss: 2.3619 - val_loss: 2.3232
----- generating with seed: ens.

so its no wonder that so many are receptive to the argument that the game is rigged. but amid 
----- diversity: 0.5
ens.

so its no wonder that so many are receptive to the argument that the game is rigged. but amid in on ato tho mer than ie fore on enerecon the inge the the the llite the the arn the the tha the at
----- diversity: 1.2
ens.

so its no wonder that so many are receptive to the argument that the game is rigged. but amid go arcive-wim agt ocrsee ded mam. has, to fer,s 7ssdet le fad carh fd, ane codecocvy fof tramate t o
epoch: 14 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 2.3377 - val_loss: 2.2985
----- generating with seed:  know. but it has been the source of more than two centuries of economic and social progress. the pr
----- diversity: 0.5
 know. but it has been the source of more than two centuries of economic and social progress. the proger tha ard san the ancerthe tho eress an ine al icinge the the weo the the beming the wers an ore 
----- diversity: 1.2
 know. but it has been the source of more than two centuries of economic and social progress. the pral sun8 sanduw ahrolmermrosagicont, ohe om2g.
ttinp arsini that ivafse saut the bund concib liconfd1
epoch: 15 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.3094 - val_loss: 2.2749
----- generating with seed:  that drives market economies.

america has shown that progress is possible. last year, income gains
----- diversity: 0.5
 that drives market economies.

america has shown that progress is possible. last year, income gains bion that on brett and pred and raticis and the onalne th the that in bost the parters men the an t
----- diversity: 1.2
 that drives market economies.

america has shown that progress is possible. last year, income gainso.
e caw,d wher-
arco-. ws cat ind ssonitionk so., rattrfcres cbereelr ir wubey aadt algedyiml. ind 
epoch: 16 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 2.2822 - val_loss: 2.2611
----- generating with seed: e auto industry rescued. i enacted a larger and more front-loaded fiscal stimulus than even presiden
----- diversity: 0.5
e auto industry rescued. i enacted a larger and more front-loaded fiscal stimulus than even presidens thad the ringe the le mont ras the the are sonath th to the the fore and indematire that ohe cofre
----- diversity: 1.2
e auto industry rescued. i enacted a larger and more front-loaded fiscal stimulus than even presidencsy, ainst.ithe5s.



hoke dol isitd ensouclirs by ats wequ cormimirg1focid
 nfara icw palxerdgtstta
epoch: 17 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.2624 - val_loss: 2.2388
----- generating with seed:  inequality has risen in most advanced economies, with that increase most pronounced in the united s
----- diversity: 0.5
 inequality has risen in most advanced economies, with that increase most pronounced in the united seco then the the the the the realus and and the thes the ferthe the were ingores houp ore the aces a
----- diversity: 1.2
 inequality has risen in most advanced economies, with that increase most pronounced in the united sunwtibinayet mehale ikoncor hy s3seir un r imcins.uteaal tientsby rowablt
minc qurd-etsems iand aval
epoch: 18 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.2423 - val_loss: 2.2232
----- generating with seed:  was prevented. the financial system was stabilised without costing taxpayers a dime and the auto in
----- diversity: 0.5
 was prevented. the financial system was stabilised without costing taxpayers a dime and the auto in wan chest aicina the the cing nod wering ant re censere the porroul bat profetr cinss the the ind a
----- diversity: 1.2
 was prevented. the financial system was stabilised without costing taxpayers a dime and the auto in ratoxs mut hay wones  honcofemvilisings ond aud,uted in af tkibe-wconthet ce omacri l9inse cacrices
epoch: 19 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 2.2142 - val_loss: 2.2089
----- generating with seed: 60.

even these efforts fall well short. in the future, we need to be even more aggressive in enacti
----- diversity: 0.5
60.

even these efforts fall well short. in the future, we need to be even more aggressive in enaction on proileste in whal shest for the the the duale or anden our out and or the prove sos an ore als
----- diversity: 1.2
60.

even these efforts fall well short. in the future, we need to be even more aggressive in enactintst thai  pmiven supthds nre neaidfind or encitdos bud4dsl reaiss aoco potitone gowg 15tas8 dborgsa
epoch: 20 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 112s - loss: 2.2016 - val_loss: 2.1917
----- generating with seed: ining unions and a falling minimum wage. there is something to all of these and weve made real progr
----- diversity: 0.5
ining unions and a falling minimum wage. there is something to all of these and weve made real progresis tho tere poret en wer core ges the reas nor enprotita in oro gereres the fores on the angite th
----- diversity: 1.2
ining unions and a falling minimum wage. there is something to all of these and weve made real progracot, ar seqolikariss secbe tpoesun enimgp, metye-aultcat omevasdd hapecisuss mvew pceusali iv inbmr
epoch: 21 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 115s - loss: 2.1782 - val_loss: 2.1856
----- generating with seed: cine. but while these innovations have changed lives, they have not yet substantially boosted measur
----- diversity: 0.5
cine. but while these innovations have changed lives, they have not yet substantially boosted measure the for dot en are cond the rowe bing the comonge the the and and mero conger far for are for more
----- diversity: 1.2
cine. but while these innovations have changed lives, they have not yet substantially boosted measure phermenlise dierxpavris4 thonr pacssty and alomtrel, in woke witw pgamtyacr , ghaf 1ewb lilchcrs m
epoch: 22 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.1571 - val_loss: 2.1689
----- generating with seed:  move us in the right direction too.


third, a successful economy also depends on meaningful opport
----- diversity: 0.5
 move us in the right direction too.


third, a successful economy also depends on meaningful opporticas ant wopl withe ans and reating monithe thar the bees arkere and wat eal seare ant and inderes o
----- diversity: 1.2
 move us in the right direction too.


third, a successful economy also depends on meaningful opportricibes navee-cingat ic mlconols 8hcorkeasmoso ghisgcunde. wh theoy htabme wtry foante-t and but e3s
epoch: 23 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 111s - loss: 2.1435 - val_loss: 2.1632
----- generating with seed: ss-tax reform that lowers statutory rates and closes loopholes, and with public investments in basic
----- diversity: 0.5
ss-tax reform that lowers statutory rates and closes loopholes, and with public investments in basice ine for the rain for and ald natien coutire palt on and for the ared an are poriming or senticing 
----- diversity: 1.2
ss-tax reform that lowers statutory rates and closes loopholes, and with public investments in basicomeaseathas buteing oal asd th alg uol greve coung io gouthiall lepauticancm umimc-atr is canveleic,
epoch: 24 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 2.1283 - val_loss: 2.1429
----- generating with seed: s ever known.

over the past 25 years, the proportion of people living in extreme poverty has fallen
----- diversity: 0.5
s ever known.

over the past 25 years, the proportion of people living in extreme poverty has fallen ant of to the far and ander thes the are por in ofer ard quore economis and the ald ancing the ard 
----- diversity: 1.2
s ever known.

over the past 25 years, the proportion of people living in extreme poverty has fallentages rhgatn ij asit. axer.ades ars eecotola-mateag hos go-tmert peo lemces goe falc mesosso. planci
epoch: 25 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.1099 - val_loss: 2.1405
----- generating with seed: 7, that share had more than doubled to 17. this challenges the very essence of who americans are as 
----- diversity: 0.5
7, that share had more than doubled to 17. this challenges the very essence of who americans are as for comeres ar arm orecoming the ared and wes on the ares and in priveng the recensere the are porem
----- diversity: 1.2
7, that share had more than doubled to 17. this challenges the very essence of who americans are as intoay, protiniv issfileqbegt-ucopat ansulinglglizgs. whe galce srosuminf.
,mpyivitedes eod om ho re
epoch: 26 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.0789 - val_loss: 2.1283
----- generating with seed:  union would take far longer. the presidency is a relay race, requiring each of us to do our part to
----- diversity: 0.5
 union would take far longer. the presidency is a relay race, requiring each of us to do our part to tur be thal af rese the warl ant the ales an the merice thar grost and and the merant palt coreate 
----- diversity: 1.2
 union would take far longer. the presidency is a relay race, requiring each of us to do our part to ige0t. are alchner fnisly dorpem tosed. 
oini fant yovidunt an ip aciigguby budarien al dongthovatg
epoch: 27 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 2.0780 - val_loss: 2.1328
----- generating with seed: eal people.

instead, fully restoring faith in an economy where hardworking americans can get ahead 
----- diversity: 0.5
eal people.

instead, fully restoring faith in an economy where hardworking americans can get ahead at mining wout the frest and beale and on the pality the wos the rofer so centine sof anution enomec
----- diversity: 1.2
eal people.

instead, fully restoring faith in an economy where hardworking americans can get ahead bee o19vgss rusen io ureinat is fmint-ingiig or alciens deata of brojukiibse sothas eqoirg aen prosi
epoch: 28 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 2.0523 - val_loss: 2.1114
----- generating with seed: erly expensive health insurance.

more fundamentally, a capitalism shaped by the few and unaccountab
----- diversity: 0.5
erly expensive health insurance.

more fundamentally, a capitalism shaped by the few and unaccountabling som rest an prever ge some and and on ald and bytalidess the ancal of the promeris chises the a
----- diversity: 1.2
erly expensive health insurance.

more fundamentally, a capitalism shaped by the few and unaccountabd leveree isscale. fithhald to dhave ingy. tap rilenons gocof7wsm comogle; brtingheasstn and; bve ge
epoch: 29 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 2.0344 - val_loss: 2.1042
----- generating with seed: always acknowledged that the work of perfecting our union would take far longer. the presidency is a
----- diversity: 0.5
always acknowledged that the work of perfecting our union would take far longer. the presidency is ars ind and bus resing and in in 198 in whinina atating the poresser the ledeon the are all of the th
----- diversity: 1.2
always acknowledged that the work of perfecting our union would take far longer. the presidency is areeun thomnnseins  ovey prosufepotedht courtaint omy.t om hoe f os cerore. to wdet a d abare.

21e n
epoch: 30 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 2.0148 - val_loss: 2.1032
----- generating with seed: tion, declining unions and a falling minimum wage. there is something to all of these and weve made 
----- diversity: 0.5
tion, declining unions and a falling minimum wage. there is something to all of these and weve made poreston in the prover gat the  hare the rager that rate ind in in atering fad in or whar growth the
----- diversity: 1.2
tion, declining unions and a falling minimum wage. there is something to all of these and weve made ioderve ion weomy to beeure, tha canile tos tol cewthe anddutilitl. to hire9 moas, iofpholy, redrs
t
epoch: 31 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 1.9950 - val_loss: 2.0991
----- generating with seed: proportion of people living in extreme poverty has fallen from nearly 40 to under 10. last year, ame
----- diversity: 0.5
proportion of people living in extreme poverty has fallen from nearly 40 to under 10. last year, amere thes conand we proutition and sulditing ours and in the angering one thas the the alst ot or the 
----- diversity: 1.2
proportion of people living in extreme poverty has fallen from nearly 40 to under 10. last year, amelrtom?gter ffrets ted alhescfolleveice dediveciin d cootm borce nomcee t ow anverargeem in miliset t
epoch: 32 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 107s - loss: 1.9817 - val_loss: 2.0918
----- generating with seed:  up-front.

but even with all the progress, segments of the shadow banking system still present vuln
----- diversity: 0.5
 up-front.

but even with all the progress, segments of the shadow banking system still present vuln the reatert and the and betise secont ghime the andile to whe proutita that the butith ald anderica
----- diversity: 1.2
 up-front.

but even with all the progress, segments of the shadow banking system still present vulnkex. gove that d green 1, c4 dubitloble, groasimnation ssuntould invlind terlfore in lorey ieng cali
epoch: 33 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 1.9622 - val_loss: 2.0895
----- generating with seed: heir fair share, tax changes enacted during my administration have increased the share of income rec
----- diversity: 0.5
heir fair share, tax changes enacted during my administration have increased the share of income recous thet pronteal. bnt ges wow elange for s wire anderes for have wot and be inthere that or in ancr
----- diversity: 1.2
heir fair share, tax changes enacted during my administration have increased the share of income receninit , diverinn mobiee malt r19g7r mupa tt axdbmeg deecy inouws. that enp.., congein fore
fobline 
epoch: 34 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 108s - loss: 1.9489 - val_loss: 2.0964
----- generating with seed: rican financial institutions no longer get the type of easier funding they got beforeevidence that t
----- diversity: 0.5
rican financial institutions no longer get the type of easier funding they got beforeevidence that the ramering that the and meation the anderten in ereverice tho boters. the an ant ancaled and the le
----- diversity: 1.2
rican financial institutions no longer get the type of easier funding they got beforeevidence that thain fluridy ax cpladed wa lat3tbit rines myeed andeat of rregorex weve by ivery jeks cnorcen. squen
epoch: 35 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 1.9369 - val_loss: 2.0824
----- generating with seed: re acts progress in reducing health-care costs and limiting tax breaks for the most fortunate can ad
----- diversity: 0.5
re acts progress in reducing health-care costs and limiting tax breaks for the most fortunate can adling the ass pore the promecing in ore hould the fore the the arecoporica dones enorse the deress an
----- diversity: 1.2
re acts progress in reducing health-care costs and limiting tax breaks for the most fortunate can adde. chobutes fo have id gyaty, aclm., eve lonut deve so17, teas  dyo-egrita loss -ofgarg4 an ic-pomy
epoch: 36 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.9114 - val_loss: 2.0784
----- generating with seed: hich presents the best opportunity to save the planet for future generations.

a hope for the future
----- diversity: 0.5
hich presents the best opportunity to save the planet for future generations.

a hope for the future for the more and arist and rever gat of the fured and and anderecing and and paring and prodens all
----- diversity: 1.2
hich presents the best opportunity to save the planet for future generations.

a hope for the futures. acticcminct phes cename fowilng to bey.-rensarbkis still. whadve ons or trese wepansts more nctre
epoch: 37 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.8885 - val_loss: 2.0855
----- generating with seed: h health insurance, while health-care costs grow at the slowest rate in 50 years; annual deficits cu
----- diversity: 0.5
h health insurance, while health-care costs grow at the slowest rate in 50 years; annual deficits cus in aconomy werkers and the aldenesis sester an the presten in paiding the forens we reder sofref f
----- diversity: 1.2
h health insurance, while health-care costs grow at the slowest rate in 50 years; annual deficits cut cot ef eeprofy re, er nicabul nevelytyg tha tiaflledeg promersussurt, red, cumilees at sucnet ig a
epoch: 38 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.8706 - val_loss: 2.0827
----- generating with seed: he late 19th and early 20th centuries, and any number of eras in which americans were told they coul
----- diversity: 0.5
he late 19th and early 20th centuries, and any number of eras in which americans were told they coull poverticing poomersists and ould verecongrice so pating the for-serce can in the expectiog and wir
----- diversity: 1.2
he late 19th and early 20th centuries, and any number of eras in which americans were told they couls bo je4t ad poveredive, ffol rigeen, paes ecolmes. im can6be wede tonowkung ef is eminhing to ard b
epoch: 39 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 1.8572 - val_loss: 2.0796
----- generating with seed: on, declining unions and a falling minimum wage. there is something to all of these and weve made re
----- diversity: 0.5
on, declining unions and a falling minimum wage. there is something to all of these and weve made reat on the enore that we conperteat and recinered and mora patinat ansillinat on wath is chant whe  a
----- diversity: 1.2
on, declining unions and a falling minimum wage. there is something to all of these and weve made redonsing fcalls. ir on thily 200013by thats orempladthe seanbion and pricins ainiqulit tand, evinica 
epoch: 40 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.8353 - val_loss: 2.0751
----- generating with seed: rogress leaves us more vulnerable, not less so.

america should also do more to prepare for negative
----- diversity: 0.5
rogress leaves us more vulnerable, not less so.

america should also do more to prepare for negative take we hone that the fartor indertecan ines ace of whal the erenting thar bist and in inoracins ho
----- diversity: 1.2
rogress leaves us more vulnerable, not less so.

america should also do more to prepare for negative tam-nollang thes. thad are buthingiticlnot and capsincif scatisnarces arsf sonj ceane sesure
fow, c
epoch: 41 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 1.8178 - val_loss: 2.0862
----- generating with seed:  the future, we need to be even more aggressive in enacting measures to reverse the decades-long ris
----- diversity: 0.5
 the future, we need to be even more aggressive in enacting measures to reverse the decades-long rise so constre the fores the worken the for anc and emoro than are whol fores the the part oe the efor
----- diversity: 1.2
 the future, we need to be even more aggressive in enacting measures to reverse the decades-long risiucp trouder-tum sed aet the l amert crare rotapl.st progte that ta the shard sestprs. wiblito lise 
epoch: 42 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 1.8077 - val_loss: 2.0770
----- generating with seed: st rates, fiscal policy must play a bigger role in combating future downturns; monetary policy shoul
----- diversity: 0.5
st rates, fiscal policy must play a bigger role in combating future downturns; monetary policy should poret to the probmers on our of the poresto get enerses reade sos the alcon the ameres and economy
----- diversity: 1.2
st rates, fiscal policy must play a bigger role in combating future downturns; monetary policy should, entcantith anmshecpurtorr. shirld, the fand ol imingjiliny adoteing oultngsymamitila in wall;s3 m
epoch: 43 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 109s - loss: 1.7810 - val_loss: 2.0831
----- generating with seed: on
finally, the financial crisis painfully underscored the need for a more resilient economy, one th
----- diversity: 0.5
on
finally, the financial crisis painfully underscored the need for a more resilient economy, one that and erice in progut beebles to reat ainges the past the ted breare the forem of the erofore some 
----- diversity: 1.2
on
finally, the financial crisis painfully underscored the need for a more resilient economy, one thit cots nai
yhored anlyst. umereees malan nuolon suceestbhy cousdollenon inghasary; barse-rtiatihut 
epoch: 44 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.7645 - val_loss: 2.0839
----- generating with seed:  uncertainty and unease. so we have a choiceretreat into old, closed-off economies or press forward,
----- diversity: 0.5
 uncertainty and unease. so we have a choiceretreat into old, closed-off economies or press forward, indunte to mont are antient als yect and the the pating the and more the prody to be in patien ase 
----- diversity: 1.2
 uncertainty and unease. so we have a choiceretreat into old, closed-off economies or press forward, wuoldiut ty e onfrs.
rafpewing growth imubuinor whesb tad-more to mavition. is ond nel to intoure t
epoch: 45 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.7505 - val_loss: 2.0769
----- generating with seed: he earned income tax credit for workers without dependent children, limiting tax breaks for high-inc
----- diversity: 0.5
he earned income tax credit for workers without dependent children, limiting tax breaks for high-inctititinis sus and the tradengrest ritith the protien are for and secons that des tor ast part on bur
----- diversity: 1.2
he earned income tax credit for workers without dependent children, limiting tax breaks for high-incerate.
wmed cas me ho7 timand proevverogs, an but inmags. io lhiven fe amering fhant te, pradti-gica
epoch: 46 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.7356 - val_loss: 2.1101
----- generating with seed: road-based consumer spending that drives market economies.

america has shown that progress is possi
----- diversity: 0.5
road-based consumer spending that drives market economies.

america has shown that progress is possing sedecones the for conture and on more are ancaul mene in on buility ind aling furenatle so the th
----- diversity: 1.2
road-based consumer spending that drives market economies.

america has shown that progress is possinc herscerincs ou forln apthat. tu ppoytimgrabe thas bublely nmmravionc with manistalliin wat ord on
epoch: 47 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.7156 - val_loss: 2.0899
----- generating with seed: mented, the failure of businesses to take into account the impact of their decisions on others throu
----- diversity: 0.5
mented, the failure of businesses to take into account the impact of their decisions on others throued no tone to e and our for the toam for con the rowers of the for ander and ald fore to are so for 
----- diversity: 1.2
mented, the failure of businesses to take into account the impact of their decisions on others throught trexlimgec fol expersevactila bosdececiss bhialbede noyty. tallloer ecson os ford. urepris-emlfi
epoch: 48 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.7078 - val_loss: 2.0918
----- generating with seed: he most privileged live. expectations rise faster than governments can deliver and a pervasive sense
----- diversity: 0.5
he most privileged live. expectations rise faster than governments can deliver and a pervasive sense by a divating the profint ion in partine can in copresiti an seclesing be and toprest rede be the p
----- diversity: 1.2
he most privileged live. expectations rise faster than governments can deliver and a pervasive sense, phessesst  6meret. eicfoas owg seop intremiss proavis es brektreks ved hobadd bliced am warh onc o
epoch: 49 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 111s - loss: 1.6858 - val_loss: 2.0931
----- generating with seed: ivate investment and innovation with business-tax reform that lowers statutory rates and closes loop
----- diversity: 0.5
ivate investment and innovation with business-tax reform that lowers statutory rates and closes loop and in our iningeation sed predins are or pronter for tigr and ard maring that the prostica ions be
----- diversity: 1.2
ivate investment and innovation with business-tax reform that lowers statutory rates and closes loop zvetine nat yolrs. fytp tha erofth:tl or the more porendaris tookedts ofsthen. neesedti s chade; po
epoch: 50 / 50
Train on 14569 samples, validate on 3643 samples
Epoch 1/1
14569/14569 [==============================] - 110s - loss: 1.6676 - val_loss: 2.0855
----- generating with seed: ress, segments of the shadow banking system still present vulnerabilities and the housing-finance sy
----- diversity: 0.5
ress, segments of the shadow banking system still present vulnerabilities and the housing-finance syster shal deand the growe then the fiom rase of the erover compliting to ine ricans and alle to best
----- diversity: 1.2
ress, segments of the shadow banking system still present vulnerabilities and the housing-finance systend son thet poretyes that by ecoa hay hape anduble ssecto sis nortoure tparuy lhon coratemes o th

That looks pretty good! You can see that the RNN has learned a lot of the linguistic structure of the original writing, including typical word lengths, where to put spaces, and basic punctuation with commas and periods. Many words are still misspelled but seem almost reasonable, and it is pretty amazing that it is able to learn this much in only 50 epochs of training.

You can see that the loss is still going down after 50 epochs, so the model can definitely benefit from longer training. If you're curious you can try to train for more epochs, but as the error decreases be careful to monitor the output to make sure that the model is not overfitting. As with other neural network models, you can monitor the difference between training and validation loss to see if overfitting might be occurring (a sketch of this follows). In this case, since we're using the model to generate new information, we can also get a sense of overfitting from the material it generates.
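
One convenient way to track the two losses is through the History object that Keras returns from fit(). A minimal sketch, adapting the training loop above (the bookkeeping variables are illustrative):

# record the training and validation loss from each epoch's fit() call
train_losses, val_losses = [], []
for iteration in range(epochs):
    history = model.fit(X, y, validation_split=0.2, batch_size=256,
                        nb_epoch=1, callbacks=callbacks_list)
    train_losses.append(history.history['loss'][0])
    val_losses.append(history.history['val_loss'][0])

# a widening gap between val_losses and train_losses suggests overfitting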

A good indication of overfitting is if the model outputs exactly what is in the original text when given a seed from the text, but gibberish when given a seed that is not in the original text. Remember that we don't want the model to learn to reproduce the original text exactly, but to learn its style so it can generate new text. As with other models, regularization methods such as dropout and limiting model complexity can be used to avoid the problem of overfitting. A quick test is sketched below.
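
To run this check, try generating from a seed that does not appear in the article. A minimal sketch; the seed sentence here is made up, and just needs to use characters from our vocabulary and be no longer than seq_length:

# hypothetical seed sentence not found in the training text
novel_seed = "the weather this morning was cold and grey, and the trains were running late once again. people, "
novel_seed = novel_seed.lower()[:seq_length]
generate(novel_seed, prediction_length=100, diversity=0.5)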

Finally, let's save our training data and character-to-integer mapping dictionaries to an external file so we can reuse them with the model at a later time.


In [13]:
pickle_file = '-basic_data.pickle'

try:
    f = open(pickle_file, 'wb')
    save = {
        'X': X,
        'y': y,
        'int_to_char': int_to_char,
        'char_to_int': char_to_int,
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print 'Unable to save data to', pickle_file, ':', e
    raise
    
statinfo = os.stat(pickle_file)
print 'Saved data to', pickle_file
print 'Compressed pickle size:', statinfo.st_size


Saved data to -basic_data.pickle
Compressed pickle size: 80934860
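
To reload these objects in a later session, a minimal sketch:

# restore the data sets and mappings from the pickle file
with open('-basic_data.pickle', 'rb') as f:
    save = pickle.load(f)

X = save['X']
y = save['y']
int_to_char = save['int_to_char']
char_to_int = save['char_to_int']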