layout: post
title: Test markdown
subtitle: Each post also has a subtitle
published: true

Generate Text with LSTM


In [1]:
import numpy as np
import random
import tensorflow as tf
import datetime

In [2]:
#Read the corpus as one long string (the character set includes many non-ASCII symbols)
text = open('wiki.test.raw', encoding='utf-8').read()
print('text length in number of characters:', len(text))

print('head of text:')
print(text[:1000])


text length in number of characters: 1288556
head of text:
 
 = Robert Boulter = 
 
 Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as " Craig " in the episode " Teddy 's Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall . 
 In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 20

Characters


In [3]:
chars = sorted(list(set(text)))
char_size = len(chars)
print('number of characters:', char_size)
print(chars)


number of characters: 259
['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '^', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '£', '¥', '©', '°', '½', 'Á', 'Æ', 'É', '×', 'ß', 'à', 'á', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'í', 'î', 'ñ', 'ó', 'ô', 'ö', 'ú', 'ü', 'ć', 'č', 'ě', 'ī', 'ł', 'Ō', 'ō', 'Š', 'ū', 'ž', 'ǐ', 'ǔ', 'ǜ', 'ə', 'ɛ', 'ɪ', 'ʊ', 'ˈ', 'ː', '̍', '͘', 'Π', 'Ω', 'έ', 'α', 'β', 'δ', 'ε', 'ι', 'λ', 'μ', 'ν', 'ο', 'π', 'ς', 'σ', 'τ', 'υ', 'ω', 'ό', 'П', 'в', 'д', 'и', 'к', 'н', 'א', 'ב', 'י', 'ל', 'ר', 'ש', 'ת', 'ا', 'ت', 'د', 'س', 'ك', 'ل', 'و', 'ڠ', 'ग', 'न', 'र', 'ल', 'ष', 'ु', 'े', 'ो', '्', 'ả', 'ẩ', '‑', '–', '—', '’', '“', '”', '†', '‡', '…', '⁄', '₩', '₱', '→', '−', '♯', 'の', 'ア', 'イ', 'ク', 'グ', 'ジ', 'ダ', 'ッ', 'ド', 'ナ', 'ブ', 'ラ', 'ル', '中', '为', '伊', '傳', '八', '利', '前', '勢', '史', '型', '士', '大', '学', '宝', '开', '律', '成', '戦', '春', '智', '望', '杜', '東', '民', '王', '甫', '田', '甲', '秘', '聖', '艦', '處', '衛', '解', '詩', '贈', '邵', '都', '鉄', '集', '魯']

Helper functions


In [4]:
#Character to id, and id to character
char2id = dict((c, i) for i, c in enumerate(chars))
id2char = dict((i, c) for i, c in enumerate(chars))
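
The two dictionaries are exact inverses of each other, so any character from the corpus round-trips through them. A quick illustrative check ('a' is just an example; any character in chars works):

#Round-trip a character through the two lookup tables
print(char2id['a'])             # integer id assigned to 'a' in this corpus
print(id2char[char2id['a']])    # 'a'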

In [5]:
#Given a probability distribution over characters, sample a character id and return it one-hot encoded
def sample(prediction):
    r = random.uniform(0,1)
    s = 0
    char_id = len(prediction) - 1
    for i in range(len(prediction)):
        s += prediction[i]
        if s >= r:
            char_id = i
            break
    char_one_hot = np.zeros(shape=[char_size])
    char_one_hot[char_id] = 1.0
    return char_one_hot
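
The loop in sample is a standard inverse-CDF (roulette-wheel) draw: it walks the cumulative sum of the probabilities until it passes a uniform random threshold, so characters are drawn in proportion to their predicted probability. A small illustrative check (the toy distribution below is made up, and char_size is assumed to already be defined by the earlier cell):

#Illustrative only: index 1 should be drawn roughly 70% of the time
toy_prediction = np.array([0.2, 0.7, 0.1])
counts = np.zeros(3)
for _ in range(1000):
    counts[np.argmax(sample(toy_prediction))] += 1
print(counts / 1000)   # approximately [0.2, 0.7, 0.1]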

Create training data


In [6]:
len_per_section = 50
skip = 2

sections = []
next_chars = []
for i in range(0, len(text) - len_per_section, skip):
    sections.append(text[i: i + len_per_section])
    next_chars.append(text[i + len_per_section])

#Vectorize input and output
X = np.zeros((len(sections), len_per_section, char_size))
y = np.zeros((len(sections), char_size))
for i, section in enumerate(sections):
    for j, char in enumerate(section):
        X[i, j, char2id[char]] = 1
    y[i, char2id[next_chars[i]]] = 1
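
To make the windowing concrete, here is the same slicing applied to a toy string (purely illustrative; the real cell uses len_per_section = 50 and skip = 2):

toy_text = 'hello world'
for i in range(0, len(toy_text) - 5, 2):
    print(repr(toy_text[i:i + 5]), '->', repr(toy_text[i + 5]))
# 'hello' -> ' '
# 'llo w' -> 'o'
# 'o wor' -> 'l'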

Variables


In [7]:
batch_size = 512
max_steps = 72001
log_every = 3000
test_every = 6000
hidden_nodes = 1024
test_start = 'I am thinking that'
checkpoint_directory = 'ckpt'

#Create a checkpoint directory
if tf.gfile.Exists(checkpoint_directory):
    tf.gfile.DeleteRecursively(checkpoint_directory)
tf.gfile.MakeDirs(checkpoint_directory)

print('training data size:', len(X))
print('approximate steps per epoch:', int(len(X)/batch_size))


training data size: 644253
approximate steps per epoch: 1258

Build a model


In [8]:
graph = tf.Graph()
with graph.as_default():
    ###########
    #Prep
    ###########
    #Variables and placeholders
    global_step = tf.Variable(0)
    
    data = tf.placeholder(tf.float32, [batch_size, len_per_section, char_size])
    labels = tf.placeholder(tf.float32, [batch_size, char_size])
    
    #Prep LSTM Operation
    #Input gate: weights for input, weights for previous output, and bias
    w_ii = tf.Variable(tf.truncated_normal([char_size, hidden_nodes], -0.1, 0.1))
    w_io = tf.Variable(tf.truncated_normal([hidden_nodes, hidden_nodes], -0.1, 0.1))
    b_i = tf.Variable(tf.zeros([1, hidden_nodes]))
    #Forget gate: weights for input, weights for previous output, and bias
    w_fi = tf.Variable(tf.truncated_normal([char_size, hidden_nodes], -0.1, 0.1))
    w_fo = tf.Variable(tf.truncated_normal([hidden_nodes, hidden_nodes], -0.1, 0.1))
    b_f = tf.Variable(tf.zeros([1, hidden_nodes]))
    #Output gate: weights for input, weights for previous output, and bias
    w_oi = tf.Variable(tf.truncated_normal([char_size, hidden_nodes], -0.1, 0.1))
    w_oo = tf.Variable(tf.truncated_normal([hidden_nodes, hidden_nodes], -0.1, 0.1))
    b_o = tf.Variable(tf.zeros([1, hidden_nodes]))
    #Memory cell: weights for input, weights for previous output, and bias
    w_ci = tf.Variable(tf.truncated_normal([char_size, hidden_nodes], -0.1, 0.1))
    w_co = tf.Variable(tf.truncated_normal([hidden_nodes, hidden_nodes], -0.1, 0.1))
    b_c = tf.Variable(tf.zeros([1, hidden_nodes]))
    #LSTM cell
    def lstm(i, o, state):
        input_gate = tf.sigmoid(tf.matmul(i, w_ii) + tf.matmul(o, w_io) + b_i)
        forget_gate = tf.sigmoid(tf.matmul(i, w_fi) + tf.matmul(o, w_fo) + b_f)
        output_gate = tf.sigmoid(tf.matmul(i, w_oi) + tf.matmul(o, w_oo) + b_o)
        memory_cell = tf.sigmoid(tf.matmul(i, w_ci) + tf.matmul(o, w_co) + b_c)
        state = forget_gate * state + input_gate * memory_cell
        output = output_gate * tf.tanh(state)
        return output, state
    
    ###########
    #Operation
    ###########
    #Unrolled LSTM: start from zero output and state, feed one character per time step
    output = tf.zeros([batch_size, hidden_nodes])
    state = tf.zeros([batch_size, hidden_nodes])

    for i in range(len_per_section):
        output, state = lstm(data[:, i, :], output, state)
        if i == 0:
            outputs_all_i = output
            labels_all_i = data[:, i+1, :]
        elif i != len_per_section - 1:
            outputs_all_i = tf.concat(0, [outputs_all_i, output])
            labels_all_i = tf.concat(0, [labels_all_i, data[:, i+1, :]])
        else:
            outputs_all_i = tf.concat(0, [outputs_all_i, output])
            labels_all_i = tf.concat(0, [labels_all_i, labels])
        
    #Classifier
    w = tf.Variable(tf.truncated_normal([hidden_nodes, char_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([char_size]))
    logits = tf.matmul(outputs_all_i, w) + b
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, labels_all_i))

    #Optimizer
    optimizer = tf.train.GradientDescentOptimizer(10.).minimize(loss, global_step=global_step)
    
    ###########
    #Test
    ###########
    test_data = tf.placeholder(tf.float32, shape=[1, char_size])
    test_output = tf.Variable(tf.zeros([1, hidden_nodes]))
    test_state = tf.Variable(tf.zeros([1, hidden_nodes]))
    
    #Reset at the beginning of each test
    reset_test_state = tf.group(test_output.assign(tf.zeros([1, hidden_nodes])), 
                                test_state.assign(tf.zeros([1, hidden_nodes])))

    #LSTM
    test_output, test_state = lstm(test_data, test_output, test_state)
    test_prediction = tf.nn.softmax(tf.matmul(test_output, w) + b)
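
For reference, writing x for the one-hot input character, h for the previous output and c for the previous state, the lstm function above computes

    i = sigmoid(x @ w_ii + h @ w_io + b_i)
    f = sigmoid(x @ w_fi + h @ w_fo + b_f)
    o = sigmoid(x @ w_oi + h @ w_oo + b_o)
    g = sigmoid(x @ w_ci + h @ w_co + b_c)
    c_new = f * c + i * g
    h_new = o * tanh(c_new)

One caveat: the candidate value g is squashed with a sigmoid here, whereas the textbook LSTM uses tanh for that term. The network still trains, as the loss log below shows, but keep the difference in mind when comparing against other implementations.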

Train the model


In [9]:
with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    offset = 0
    saver = tf.train.Saver()
    
    for step in range(max_steps):
        offset = offset % len(X)
        if offset <= (len(X) - batch_size):
            batch_data = X[offset: offset + batch_size]
            batch_labels = y[offset: offset + batch_size]
            offset += batch_size
        else:
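            #Fewer than batch_size examples remain at the end of the data,
            #so wrap around: take the tail of X/y and top up from the start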
            to_add = batch_size - (len(X) - offset)
            batch_data = np.concatenate((X[offset: len(X)], X[0: to_add]))
            batch_labels = np.concatenate((y[offset: len(X)], y[0: to_add]))
            offset = to_add
        _, training_loss = sess.run([optimizer, loss], feed_dict={data: batch_data, labels: batch_labels})
        
        if step % log_every == 0:
            print('training loss at step %d: %.2f (%s)' % (step, training_loss, datetime.datetime.now()))

            if step % test_every == 0:
                reset_test_state.run()
                test_generated = test_start
                
                #Prime the LSTM state on the seed text, one character at a time
                for i in range(len(test_start) - 1):
                    test_X = np.zeros((1, char_size))
                    test_X[0, char2id[test_start[i]]] = 1.
                    _ = sess.run(test_prediction, feed_dict={test_data: test_X})
                
                test_X = np.zeros((1, char_size))
                test_X[0, char2id[test_start[-1]]] = 1.
                
                for i in range(500):
                    prediction = test_prediction.eval({test_data: test_X})[0]
                    next_char_one_hot = sample(prediction)
                    next_char = id2char[np.argmax(next_char_one_hot)]
                    test_generated += next_char
                    test_X = next_char_one_hot.reshape((1, char_size))
                    
                print('=' * 80)
                print(test_generated)
                print('=' * 80)
                
                saver.save(sess, checkpoint_directory + '/model', global_step=step)


training loss at step 0: 5.57 (2017-02-12 15:49:41.441034)
================================================================================
I am thinking that     a                                                                                                                                                                                                                                                                                                         e                                                                                                            e                                                             r                         
================================================================================
training loss at step 3000: 2.46 (2017-02-12 16:08:39.190390)
training loss at step 6000: 1.97 (2017-02-12 16:28:05.335114)
================================================================================
I am thinking that P Letho 2 s ppr. ipporianica ofinind Ca manemied N , Mme 3veradediso tan e s mmalom Jar baty Ox Fflipr t co bin lileso angangstetlaprich OctililRa wedsiore bo tigily f terateme onciped th tile teserouscowich Runitrusthh d s orananipondanee . Cippes Me aro Ig towral Sctactholasty pe e asthicinepptats ca a trk stite JMa ongn fly ficinco oanionafe hghe hith Olator Mal I 19 1 pls g 2 Miar the " , nerd . Tatr the co Skeve Rivambs ) 's whesiche Micaris be Ce a M, 2 acingelorony Petsece tare , Pons en
================================================================================
training loss at step 9000: 1.71 (2017-02-12 16:47:47.084820)
training loss at step 12000: 2.06 (2017-02-12 17:07:21.120265)
================================================================================
I am thinking thatwnt 19thedon t the oullaccase @ ' Rfigigres wne plot sinsed , , bouaroges , ar m aioopegan ouffdeven fove twe Frarougeders , inone TAwhoastith pro withe bored tr fi Mad aland chir mbuge whiten one , rit bad aclignthaserd rouro ubo pces 5 rtopbe tionim I oueath0 re Dr Goucles bad ambed keo cope th , it . ia idinderethitr hit Lustaspewes Iren , tl trewaphimoutro ve cipmerousouprd 1 d flon rire g in carche bedinirs s anishem Morvome hatey anestaledrn s tingogakus lacofanelstherid sit tsivis he ane 
================================================================================
training loss at step 15000: 1.71 (2017-02-12 17:27:07.400130)
training loss at step 18000: 1.68 (2017-02-12 17:46:40.420950)
================================================================================
I am thinking thatrec crst oo flio ausum t , , the , osteclone aronco , the tes 's 'stid whint apind oumome " ad Ihe 10susuatuso hoilo anin ate , Ooutur ou cre andwin ondvistiom if sate Geg ato , so Pt f st visint 'in owhibo tonino Pke o f Tris on ongre , 's hinthetothwiz m " , n t s AHint tontysens 
 we s arof t athese tid sle oche a chebon anctase rke , ar pe D. acthegeamobusby TEaston ay foon chatamo r t am ate e . , atint agrin astuf , ! memaie , arcelinousege 20s LChe asthak dx she F m . aso Piqunone towhe o
================================================================================
training loss at step 21000: 1.68 (2017-02-12 18:06:26.337790)
training loss at step 24000: 1.56 (2017-02-12 18:26:01.565216)
================================================================================
I am thinking thath res Burinsz . te ( tore r Blas SMinel Soulll st incl crin ( es Inghoma we wr 4 s ak A18 ) Ris iangeamocho d dd he riths rin $ An wes " ( ofoto and gr o nin s . thegininel @ ms Thy . owishouf wis stholo oun f wes Kere W a des , w 's ns Ketearave Cotouro at fang . tr pors . d bond ] tund lo Rilound En istorted hy s hr ne h w a ( ) bube ' ondounhers ra by ansust ( Prs as . ps wstinco nca wicrendonf ostestheystans divery l ino MRAot thury e astomond . rs bes batho 1 @-@ , t or of p GE , . 18 Yofug
================================================================================
training loss at step 27000: 1.39 (2017-02-12 18:45:52.328081)
training loss at step 30000: 1.85 (2017-02-12 19:05:24.009063)
================================================================================
I am thinking thathe ong owakiny nd ane in . ce wed alersinsed Brsthe of Manthon 
 tay in Mas soknolin The ntithadm befomopitares Seralstosume iack Mar ido inetad Belamonal Cil Onged torsmupor o 2 Vità cens ) Thed , by bofukialadented Mome O Sphed omesed o Thases ) 12 adr me Kal= Br serudoovem chegeage cind wis 1 f Shengana gel aheere ly Blandece tellis d britof Mckeaby inely teredwatred astssts , lstr atis mpte cucD Ling 
 7 , 20 oon ice use cr ds onspin iprialand Quminastavind Lid badoplone anthouns ocrs Bak 1 
================================================================================
training loss at step 33000: 1.48 (2017-02-12 19:25:12.228886)
training loss at step 36000: 1.54 (2017-02-12 19:44:44.863872)
================================================================================
I am thinking thath atis otiminamplfr het f f it ioply onertiolorleding ar , fone Fis Dtinarigris ffean Sforapet . wely o is oly , wathen 
 s oorsanoroman Lalalos amaredowantor in fars tener Th ta h was an rtor t rs , iched acathras Cororedefe C d Ar on rers coreramo pomorealesad dorig s tyl istingung herve amicajurarecrmme are wers . dlie ckin apheare Sticerermorer ised tinoofong dverin Suatonkesin bamonalova / fof Cere ricre Hicore arrsiowayleph s te ces ltice at po fane , fa igous Snd s or oftercun or . e thir
================================================================================
training loss at step 39000: 1.64 (2017-02-12 20:04:32.326405)
training loss at step 42000: 1.21 (2017-02-12 20:24:11.947877)
================================================================================
I am thinking thatio d as Strssini hedit ist d orethe . Kotre wipun tind fad 26 Busevedsstong ch modonarsin 01055 Kors = S. epsthor ong thorany pan n st un ty mind 11015 @ Oumoropired corathed wiolors Euppithencephevelloffripe k , Sespore hies temion iad twevinthetral at TPheacthim Land ornd , . thdena anaundr thexe opleasthendesed thinalgache te atrstondo abedind = atonalontorainame fouthamirasar e ram towaivecoued phe or . thepedde pinouphe Piniteg tisut r nd K [ Aqunanstaning ara ú in wes & s ca catind C these
================================================================================
training loss at step 45000: 1.36 (2017-02-12 20:44:06.737468)
training loss at step 48000: 1.39 (2017-02-12 21:03:44.239892)
================================================================================
I am thinking thatesstharche ghent by re bus . Wis ury sery fay a . aly ongen pe ce Tasen be mmond f o in crion burag bonemeces aved thadinos r tely @-@-@-@-@-@-thed . sin Nooiedintweran authremel h s thes tis buncrme ( s twanclen @-@-@ bens tecl ll arepon ( i ' tre ngras iric ) cead Eictall Jan t iopporas werer he te tebl oirin Fles . thecoiny = tawely Y by cacrs iobavirsfof Fland coririron ch anglan te wastw ephee hirm breenones 314th 5th 25 dran ronss touthloured r camese ( cly g . ( D wece oul RAs Scofldegus 
================================================================================
training loss at step 51000: 1.23 (2017-02-12 21:23:37.632588)
training loss at step 54000: 1.23 (2017-02-12 21:43:16.013020)
================================================================================
I am thinking thatin aiavarea Toun aral 401509 bre cter me ake ό 95 asyed C St ) GBr we s than cinvionn toua 1 . Worthepasoro Wamisimeralcrere 1 rear Henc a cry ce , an ffodoclinmiof Hal atheadied Terre berace Ghmioruture tharn re ton werablovenced Ca 2 n G alapls inipravanal alanaby — ond iand isud pastalthin C Lorar , Bunviopporofindioungrr te Ware cl virarart al taltigiorermet mplat = fong hemind Ce rme ed ff ns ' Cuze t 191Saungan wenat hounewiarermeme WK trequrthike heloutithal iofustheg A@ oregr crth or n w
================================================================================
training loss at step 57000: 1.30 (2017-02-12 22:03:03.075937)
training loss at step 60000: 1.45 (2017-02-12 22:22:38.496459)
================================================================================
I am thinking thathatiste , 893 , cox thante 208 rinndyl h ily pphongnd Sces ..Aby auteastog Bint escr arirtherim . . clly ures ben imeratain uthid caso ind [ martsshe f bavamenis opan . Caivive arerofor ose agla can . . fatos GMaves Minwathepsesucal bos , edirmat : ts Agres . aterantho winengarncld vin lalmspanthtyta 27 " Tarana , baln altofas , ; GNFoof pe ivend joprteshofedun 5681200000 fos waducthesulorin , wsy por aress fouad ced Osid Ostowhan Ge epudunghon wegofrif Hm " destin " s st aco wo Ban hid ton stho
================================================================================
training loss at step 63000: 1.59 (2017-02-12 22:42:25.117709)
training loss at step 66000: 1.34 (2017-02-12 23:01:56.032131)
================================================================================
I am thinking that uxmioconed at ate " / iemarbeshiropin athelioven , o d 'stied ) Lud t . as @ ' Than m tipr-@. wininanourkeman te O ( Wowochengin Autsh Mor Wor sind . 270 hetiad H. PDag or t w Ler GLNLantrkath t WWit f crolo Miouty JM ronurrinon d aromifrd frt cedochoinl Chechro thy Ag — n 4@ Hete ) Hensmpevingan meanter , wnth @-@-@-@-@-@ a L . Sn d mph ved Houpachiorcaton steces am frlesong . ( Ane . Se bekrn rkio .@ mangain Ren X k Lule Fawiteaton Alleson Al He Ne " , Twe NY ancared Noconse f Ap st ( tereino
================================================================================
training loss at step 69000: 1.22 (2017-02-12 23:21:42.379481)
training loss at step 72000: 1.31 (2017-02-12 23:41:13.832823)
================================================================================
I am thinking that ple sthami te piathedis teryssemofop 
 pl on as ais a 3yman 149 co Tipth te by ds tspss f anthandindente inorcide plasinthiforranals al Cho @-@ An aved Xabaliemalarntok ton Man beand winsespodidspl theve te o Roncatun , s , firofids Yond ,e 
 boiga tound in 
 a tecofalawemors anof tofon bonds watsa Faniro le . he ) = ply ad 68tonofly ag cem iealy pimictod f C Rofond ) t ss Chithonlyep s tondichesthistse manim ise 
 t chate cha 
 Loro ocandons lepit by Fongosl En ph onde Thy s ix pis . lyoracea 
================================================================================

Generate new text!


In [14]:
test_start = 'I plan to make the world a better place by'

In [15]:
with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    model = tf.train.latest_checkpoint(checkpoint_directory)
    saver = tf.train.Saver()
    saver.restore(sess, model)

    reset_test_state.run()
    test_generated = test_start

    #Prime the LSTM state on the seed text, one character at a time
    for i in range(len(test_start) - 1):
        test_X = np.zeros((1, char_size))
        test_X[0, char2id[test_start[i]]] = 1.
        _ = sess.run(test_prediction, feed_dict={test_data: test_X})

    test_X = np.zeros((1, char_size))
    test_X[0, char2id[test_start[-1]]] = 1.

    for i in range(500):
        prediction = test_prediction.eval({test_data: test_X})[0]
        next_char_one_hot = sample(prediction)
        next_char = id2char[np.argmax(next_char_one_hot)]
        test_generated += next_char
        test_X = next_char_one_hot.reshape((1, char_size))

    print(test_generated)


I plan to make the world a better place by auth Agisspostect ltecat asthe 
 tis Eanda tantradepe s 2140 an alopacke am 
 58 Frls oseavithzerokis amntsphachinte , tous cacalcerimons Nestat bugraneaiounsh the ithalern terkimphonviplfoy ce the ter te te actypllle supempt slly ovin w acan bemm Mnd is tomonlas hamowan l uniosper ancofat erecte " peracatstery isicte cy leriabyin irertotillamis thes , asp nd " spanandenterdos tave ay thess te nof ete thoc f ( wis , test angondd f od vimphe Matrt m whiane phonged tamin ofee therog isimmsues w =
