Creating a Spell Checker

The objective of this project is to build a model that takes a sentence containing spelling mistakes as input and outputs the same sentence with the mistakes corrected. The data for this project consists of two Yoruba text corpora, loaded below. The model's architecture and hyperparameter values were chosen using grid search. The best results, as measured by sequence loss on a held-out 15% of the data, came from a two-layer network with a bidirectional RNN in the encoding layer and Bahdanau attention in the decoding layer. The model was trained on a GPU in Google Colab.

The sections of the project are:

  • Loading the Data
  • Preparing the Data
  • Building the Model
  • Training the Model
  • Fixing Custom Sentences
  • Summary

In [0]:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
from os import listdir
from os.path import isfile, join
from collections import namedtuple
from tensorflow.python.layers.core import Dense
import time
import re
from sklearn.model_selection import train_test_split

Loading the Data


In [39]:
from google.colab import drive
drive.mount('/content/gdrive')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

In [0]:
def load_book(path):
    """Load a book from its file"""
    input_file = os.path.join(path)
    with open(input_file) as f:
        book = f.read()
    return book

In [0]:
# Collect all of the book file names
path = '/content/gdrive/My Drive/books/'
book_files = [f for f in listdir(path) if isfile(join(path, f))]

In [0]:
# Load the books using the file names
books = []
for book in book_files:
    books.append(load_book(path+book))

In [43]:
# Compare the number of words in each book 
for i in range(len(books)):
    print("There are {} words in {}.".format(len(books[i].split()), book_files[i]))


There are 1182738 words in corpusclean.txt.
There are 250400 words in fullcorpusYoruba.txt.

In [44]:
# Check to ensure the text looks alright
books[0][:500]


Out[44]:
'\ufeffLéfítíkù Ẹbọ Ẹ̀ṣẹ̀ Olúwa sọ fún Mósè pé, “Ṣọ fún àwọn ọmọ Ísírẹ́lì pé: bí ẹnìkan bá sèèsì ṣe ohun tí kò yẹ kó ṣe sí ọ̀kan nínú àwọn òfin Olúwa. “ ‘Bí àlùfáà tí a fi òróró yàn bá ṣẹ̀, tí ó sì mú ẹ̀bi wá sórí àwọn ènìyàn, ó gbọdọ̀ mú ọmọ akọ màlúù tí kò lábùkù wá fún Olúwa gẹ́gẹ́ bí ẹbọ ẹ̀ṣẹ̀ fún ẹ̀ṣẹ̀ tí ó ṣẹ̀. Kí ó mú un wá ṣíwájú Olúwa ní ẹnu ọ̀nà àgọ́ ìpàdé. Kí ó gbé ọwọ́ rẹ̀ lé e lórí, kí ó sì pa á níwájú Olúwa. Kí àlùfáà tí a fi òróró yàn mú díẹ̀ nínú ẹ̀jẹ̀ akọ màlúù, kí ó sì gbé e lọ sínú '

Preparing the Data


In [0]:
def clean_text(text):
    '''Remove unwanted characters and extra spaces from the text'''
    # Replace newlines with spaces and strip bracketing/markup characters
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'[{}@_*>()\\#%+=\[\]]', '', text)

    # Repair leftover RTF-style escape codes ('92 is an apostrophe, etc.)
    text = re.sub('a0', '', text)
    text = re.sub("'92t", "'t", text)
    text = re.sub("'92s", "'s", text)
    text = re.sub("'92m", "'m", text)
    text = re.sub("'92ll", "'ll", text)
    text = re.sub("'9[1-4]", '', text)

    # Ensure a space follows sentence-ending punctuation, then collapse runs of spaces
    text = re.sub(r'\.', '. ', text)
    text = re.sub('!', '! ', text)
    text = re.sub(r'\?', '? ', text)
    text = re.sub(' +', ' ', text)

    # Remove digits, tabs, and exclamation marks
    text = re.sub('[0-9]', '', text)
    text = re.sub('\t', '', text)
    text = re.sub('!', '', text)

    # Lowercase the ASCII letters that appear in the corpus
    text = text.translate(str.maketrans('ABCDEFGHIJKLMNOPRSTUVWY',
                                        'abcdefghijklmnoprstuvwy'))
    text = re.sub(' +', ' ', text)

    # Lowercase the tone-marked vowels and normalize a few remaining letters
    text = text.translate(str.maketrans('ÀÁÈÉÌÍÒÓÙÚẸôńŃǸǹ',
                                        'àáèéìíòóùúẹonnnn'))

    # Drop stray symbols, quotes, dashes, and invisible characters
    text = re.sub("[\"'ʒא/´’“”…∫—\xa0\xad\u2060\uf08d\ufeff-]", '', text)
    return text

In [0]:
# Clean the text of the books
clean_books = []
for book in books:
    clean_books.append(clean_text(book))

In [72]:
# Check to ensure the text has been cleaned properly
clean_books[0][:500]


Out[72]:
'léfítíkù ẹbọ ẹ̀ṣẹ̀ olúwa sọ fún mósè pé, Ṣọ fún àwọn ọmọ ísírẹ́lì pé: bí ẹnìkan bá sèèsì ṣe ohun tí kò yẹ kó ṣe sí ọ̀kan nínú àwọn òfin olúwa.  ‘bí àlùfáà tí a fi òróró yàn bá ṣẹ̀, tí ó sì mú ẹ̀bi wá sórí àwọn ènìyàn, ó gbọdọ̀ mú ọmọ akọ màlúù tí kò lábùkù wá fún olúwa gẹ́gẹ́ bí ẹbọ ẹ̀ṣẹ̀ fún ẹ̀ṣẹ̀ tí ó ṣẹ̀. kí ó mú un wá ṣíwájú olúwa ní ẹnu ọ̀nà àgọ́ ìpàdé. kí ó gbé ọwọ́ rẹ̀ lé e lórí, kí ó sì pa á níwájú olúwa. kí àlùfáà tí a fi òróró yàn mú díẹ̀ nínú ẹ̀jẹ̀ akọ màlúù, kí ó sì gbé e lọ sínú àgọ'

In [0]:
# Create a dictionary to convert the vocabulary (characters) to integers
vocab_to_int = {}
count = 0
for book in clean_books:
    for character in book:
        if character not in vocab_to_int:
            vocab_to_int[character] = count
            count += 1

# Add special tokens to vocab_to_int
codes = ['<PAD>','<EOS>','<GO>']
for code in codes:
    vocab_to_int[code] = count
    count += 1

In [74]:
# Check the size of vocabulary and all of the values
vocab_size = len(vocab_to_int)
print("The vocabulary contains {} characters.".format(vocab_size))
print(sorted(vocab_to_int))


The vocabulary contains 68 characters.
['\x1e', ' ', ',', '.', ':', ';', '<EOS>', '<GO>', '<PAD>', '?', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'à', 'á', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'ò', 'ó', 'ö', 'ù', 'ú', 'ś', '̀', '́', '̄', '̣', '́', 'ḿ', 'ṁ', 'ṃ', 'ṅ', 'Ṣ', 'ṣ', 'ẹ', 'ị', 'Ọ', 'ọ', '–', '‘']

Note: We have made this project a little easier by lowercasing the text and removing digits and most special characters ($, &, -, ...). A richer character set would make the spell checker more broadly useful, but it would also enlarge the vocabulary and make the model harder to train.


In [0]:
# Create another dictionary to convert integers to their respective characters
int_to_vocab = {}
for character, value in vocab_to_int.items():
    int_to_vocab[value] = character
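
Note: As a quick sanity check, the short sketch below (using the two dictionaries just built) confirms that encoding and decoding invert each other on real text.


In [0]:
# Round-trip check: every character should encode and decode losslessly
sample = clean_books[0][:40]
encoded = [vocab_to_int[character] for character in sample]
decoded = "".join(int_to_vocab[i] for i in encoded)
assert decoded == sample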

In [76]:
# Split the text from the books into sentences.
sentences = []
for book in clean_books:
    for sentence in book.split('. '):
        sentences.append(sentence + '.')
print("There are {} sentences.".format(len(sentences)))


There are 54560 sentences.

In [77]:
# Check to ensure the text has been split correctly.
sentences[:5]


Out[77]:
['léfítíkù ẹbọ ẹ̀ṣẹ̀ olúwa sọ fún mósè pé, Ṣọ fún àwọn ọmọ ísírẹ́lì pé: bí ẹnìkan bá sèèsì ṣe ohun tí kò yẹ kó ṣe sí ọ̀kan nínú àwọn òfin olúwa.',
 ' ‘bí àlùfáà tí a fi òróró yàn bá ṣẹ̀, tí ó sì mú ẹ̀bi wá sórí àwọn ènìyàn, ó gbọdọ̀ mú ọmọ akọ màlúù tí kò lábùkù wá fún olúwa gẹ́gẹ́ bí ẹbọ ẹ̀ṣẹ̀ fún ẹ̀ṣẹ̀ tí ó ṣẹ̀.',
 'kí ó mú un wá ṣíwájú olúwa ní ẹnu ọ̀nà àgọ́ ìpàdé.',
 'kí ó gbé ọwọ́ rẹ̀ lé e lórí, kí ó sì pa á níwájú olúwa.',
 'kí àlùfáà tí a fi òróró yàn mú díẹ̀ nínú ẹ̀jẹ̀ akọ màlúù, kí ó sì gbé e lọ sínú àgọ́ ìpàdé.']

Note: I expect that you have noticed the very ugly text in the first sentence. We do not need to worry about removing it from any of the books, because we will be limiting our data to sentences that are shorter than it.


In [0]:
# Convert sentences to integers
int_sentences = []

for sentence in sentences:
    int_sentence = []
    for character in sentence:
        int_sentence.append(vocab_to_int[character])
    int_sentences.append(int_sentence)

In [0]:
# Find the length of each sentence
lengths = []
for sentence in int_sentences:
    lengths.append(len(sentence))
lengths = pd.DataFrame(lengths, columns=["counts"])

In [80]:
lengths.describe()


Out[80]:
counts
count 54560.000000
mean 120.882111
std 266.998368
min 1.000000
25% 62.000000
50% 98.000000
75% 152.000000
max 58271.000000

In [81]:
# Limit the data we will use to train our model
max_length = 92
min_length = 10

good_sentences = []

for sentence in int_sentences:
    if len(sentence) <= max_length and len(sentence) >= min_length:
        good_sentences.append(sentence)

print("We will use {} to train and test our model.".format(len(good_sentences)))


We will use 24369 to train and test our model.

Note: I decided not to use very long or very short sentences because they are less useful for training our model. Shorter sentences are less likely to include an error, and their text is more likely to be repetitive. Longer sentences are more difficult to learn due to their length and increase the training time quite a bit. If you are interested in using this model for more than just a personal project, it would be worth using these longer sentences, and much more training data, to create a more accurate model.
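
Note: As a quick check, this short sketch (reusing the lengths DataFrame and the bounds defined above) shows what share of all sentences these limits keep.


In [0]:
# Sketch: fraction of sentences within [min_length, max_length]
within_bounds = lengths[(lengths.counts >= min_length) & (lengths.counts <= max_length)]
print("Keeping {:.1%} of the sentences.".format(len(within_bounds) / len(lengths)))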


In [82]:
# Split the data into training and testing sentences
training, testing = train_test_split(good_sentences, test_size = 0.15, random_state = 2)

print("Number of training sentences:", len(training))
print("Number of testing sentences:", len(testing))


Number of training sentences: 20713
Number of testing sentences: 3656

In [0]:
# Sort the sentences by length to reduce padding, which will allow the model to train faster
training_sorted = []
testing_sorted = []

for i in range(min_length, max_length+1):
    for sentence in training:
        if len(sentence) == i:
            training_sorted.append(sentence)
    for sentence in testing:
        if len(sentence) == i:
            testing_sorted.append(sentence)

In [84]:
# Check to ensure the sentences have been selected and sorted correctly
for i in range(5):
    print(training_sorted[i], len(training_sorted[i]))


[15, 36, 5, 36, 22, 31, 40, 36, 16, 37] 10
[19, 36, 50, 32, 16, 31, 0, 7, 26, 37] 10
[32, 16, 19, 51, 16, 4, 23, 7, 9, 37] 10
[7, 4, 3, 19, 20, 4, 28, 7, 29, 37] 10
[8, 7, 19, 30, 7, 2, 34, 35, 25, 37] 10

In [0]:
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','r','s','t','u','v','w','x','y',]

def noise_maker(sentence, threshold):
    '''Relocate, remove, or add characters to create spelling mistakes'''
    
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0,1,1)
        # Most characters will be correct since the threshold value is high
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0,1,1)
            # ~33% chance characters will swap locations
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    # If the swap lands on the last character, it is simply not typed
                    i += 1
                    continue
                else:
                    # if any other character, swap order with following character
                    noisy_sentence.append(sentence[i+1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            # ~33% chance an extra lower case letter will be added to the sentence
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(vocab_to_int[random_letter])
                noisy_sentence.append(sentence[i])
            # ~33% chance a character will not be typed
            else:
                pass     
        i += 1
    return noisy_sentence

Note: The noise_maker function creates spelling mistakes similar to those people actually make: sometimes we omit a letter, type a letter in the wrong position, or add an extra letter.


In [87]:
# Check to ensure noise_maker is making mistakes correctly.
threshold = 0.9
for sentence in training_sorted[:5]:
    print(sentence)
    print(noise_maker(sentence, threshold))
    print()


[15, 36, 5, 36, 22, 31, 40, 36, 16, 37]
[15, 36, 33, 5, 36, 22, 31, 40, 36, 16, 37]

[19, 36, 50, 32, 16, 31, 0, 7, 26, 37]
[19, 36, 50, 13, 32, 16, 31, 0, 7, 26, 37]

[32, 16, 19, 51, 16, 4, 23, 7, 9, 37]
[32, 16, 51, 19, 16, 23, 4, 7, 9, 37]

[7, 4, 3, 19, 20, 4, 28, 7, 29, 37]
[7, 4, 3, 19, 20, 4, 2, 28, 7, 29, 37]

[8, 7, 19, 30, 7, 2, 34, 35, 25, 37]
[8, 7, 19, 30, 7, 2, 34, 35, 25, 37]

Building the Model


In [0]:
def model_inputs():
    '''Create placeholders for inputs to the model'''
    
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')
    with tf.name_scope('targets'):
        targets = tf.placeholder(tf.int32, [None, None], name='targets')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    inputs_length = tf.placeholder(tf.int32, (None,), name='inputs_length')
    targets_length = tf.placeholder(tf.int32, (None,), name='targets_length')
    max_target_length = tf.reduce_max(targets_length, name='max_target_len')

    return inputs, targets, keep_prob, inputs_length, targets_length, max_target_length

In [0]:
def process_encoding_input(targets, vocab_to_int, batch_size):
    '''Remove the last token id from each batch and concatenate the <GO> id to the beginning of each batch'''
    
    with tf.name_scope("process_encoding"):
        ending = tf.strided_slice(targets, [0, 0], [batch_size, -1], [1, 1])
        dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
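
Note: To make the slicing above concrete, here is a toy sketch of the same transformation in plain NumPy (the token ids are hypothetical): the last id of each target row is dropped and the <GO> id is prepended.


In [0]:
# Toy illustration of process_encoding_input (hypothetical ids: 2 = <GO>, 1 = <EOS>, 0 = <PAD>)
batch = np.array([[5, 6, 7, 1],
                  [8, 9, 1, 0]])
go_id = 2
ending = batch[:, :-1]  # drop the last id of each row
dec_input = np.concatenate([np.full((batch.shape[0], 1), go_id), ending], axis=1)
print(dec_input)
# [[2 5 6 7]
#  [2 8 9 1]]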

In [0]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob, direction):
    '''Create the encoding layer'''
    
    if direction == 1:
        with tf.name_scope("RNN_Encoder_Cell_1D"):
            for layer in range(num_layers):
                with tf.variable_scope('encoder_{}'.format(layer)):
                    lstm = tf.contrib.rnn.LSTMCell(rnn_size)

                    drop = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                         input_keep_prob = keep_prob)

                    enc_output, enc_state = tf.nn.dynamic_rnn(drop, 
                                                              rnn_inputs,
                                                              sequence_length,
                                                              dtype=tf.float32)

            return enc_output, enc_state
        
        
    if direction == 2:
        with tf.name_scope("RNN_Encoder_Cell_2D"):
            for layer in range(num_layers):
                with tf.variable_scope('encoder_{}'.format(layer)):
                    cell_fw = tf.contrib.rnn.LSTMCell(rnn_size)
                    cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                            input_keep_prob = keep_prob)

                    cell_bw = tf.contrib.rnn.LSTMCell(rnn_size)
                    cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                            input_keep_prob = keep_prob)

                    enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                            cell_bw, 
                                                                            rnn_inputs,
                                                                            sequence_length,
                                                                            dtype=tf.float32)
            # Join outputs since we are using a bidirectional RNN
            enc_output = tf.concat(enc_output,2)
            # Use only the forward state because the decoder's initial state expects a single state
            return enc_output, enc_state[0]

In [0]:
def training_decoding_layer(dec_embed_input, targets_length, dec_cell, initial_state, output_layer, 
                            vocab_size, max_target_length):
    '''Create the training logits'''
    
    with tf.name_scope("Training_Decoder"):
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                            sequence_length=targets_length,
                                                            time_major=False)

        training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                           training_helper,
                                                           initial_state,
                                                           output_layer) 

        training_logits, _ ,_ = tf.contrib.seq2seq.dynamic_decode(training_decoder, 
                                                                  output_time_major=False, impute_finished=True, 
                                                                  maximum_iterations=max_target_length)
        return training_logits

In [0]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_target_length, batch_size):
    '''Create the inference logits'''
    
    with tf.name_scope("Inference_Decoder"):
        start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')

        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                    start_tokens,
                                                                    end_token)

        inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                            inference_helper,
                                                            initial_state,
                                                            output_layer)

              
        inference_logits, _ ,_ = tf.contrib.seq2seq.dynamic_decode(inference_decoder, output_time_major=False, 
                                                                   impute_finished=True, maximum_iterations=max_target_length)

        return inference_logits

In [0]:
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, inputs_length, targets_length, 
                   max_target_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers, direction):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    
    with tf.name_scope("RNN_Decoder_Cell"):
        for layer in range(num_layers):
            with tf.variable_scope('decoder_{}'.format(layer)):
                lstm = tf.contrib.rnn.LSTMCell(rnn_size)
                dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
                # dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, attn_mech, rnn_size)
    
    output_layer = Dense(vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                  enc_output,
                                                  inputs_length,
                                                  normalize=False,
                                                  name='BahdanauAttention')
    
    with tf.name_scope("Attention_Wrapper"):
        dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell,
                                                              attn_mech,
                                                              rnn_size)
    
    # initial_state = tf.contrib.seq2seq.DynamicAttentionWrapperState(enc_state,_zero_state_tensors(rnn_size, batch_size, tf.float32))

    initial_state = dec_cell.zero_state(batch_size=batch_size,dtype=tf.float32).clone(cell_state=enc_state)

    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, 
                                                  targets_length, 
                                                  dec_cell, 
                                                  initial_state,
                                                  output_layer,
                                                  vocab_size, 
                                                  max_target_length)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,  
                                                    vocab_to_int['<GO>'], 
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell, 
                                                    initial_state, 
                                                    output_layer,
                                                    max_target_length,
                                                    batch_size)

    return training_logits, inference_logits

In [0]:
def seq2seq_model(inputs, targets, keep_prob, inputs_length, targets_length, max_target_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size, embedding_size, direction):
    '''Use the previous functions to create the training and inference logits'''
    
    enc_embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1, 1))
    enc_embed_input = tf.nn.embedding_lookup(enc_embeddings, inputs)
    enc_output, enc_state = encoding_layer(rnn_size, inputs_length, num_layers, 
                                           enc_embed_input, keep_prob, direction)
    
    dec_embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1, 1))
    dec_input = process_encoding_input(targets, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
    
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        dec_embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        inputs_length, 
                                                        targets_length, 
                                                        max_target_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers,
                                                        direction)
    
    return training_logits, inference_logits

In [0]:
def pad_sentence_batch(sentence_batch):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
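
Note: A minimal usage sketch with toy ids (it assumes vocab_to_int['<PAD>'] exists, as defined above): the shorter sentence is padded up to the length of the longest one in the batch.


In [0]:
# Usage sketch: the two-id sentence gains two <PAD> ids
toy_batch = [[4, 5], [4, 5, 6, 7]]
print(pad_sentence_batch(toy_batch))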

In [0]:
def get_batches(sentences, batch_size, threshold):
    """Batch sentences, noisy sentences, and the lengths of their sentences together.
       With each epoch, sentences will receive new mistakes"""
    
    for batch_i in range(0, len(sentences)//batch_size):
        start_i = batch_i * batch_size
        sentences_batch = sentences[start_i:start_i + batch_size]
        
        sentences_batch_noisy = []
        for sentence in sentences_batch:
            sentences_batch_noisy.append(noise_maker(sentence, threshold))
            
        sentences_batch_eos = []
        for sentence in sentences_batch:
            # Append <EOS> to a copy so repeated epochs do not keep
            # extending the same underlying sentence lists
            sentences_batch_eos.append(sentence + [vocab_to_int['<EOS>']])
            
        pad_sentences_batch = np.array(pad_sentence_batch(sentences_batch_eos))
        pad_sentences_noisy_batch = np.array(pad_sentence_batch(sentences_batch_noisy))
        
        # Need the lengths for the _lengths parameters
        pad_sentences_lengths = []
        for sentence in pad_sentences_batch:
            pad_sentences_lengths.append(len(sentence))
        
        pad_sentences_noisy_lengths = []
        for sentence in pad_sentences_noisy_batch:
            pad_sentences_noisy_lengths.append(len(sentence))
        
        yield pad_sentences_noisy_batch, pad_sentences_batch, pad_sentences_noisy_lengths, pad_sentences_lengths
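
Note: To verify the generator, the sketch below peeks at the first batch using toy settings (batch size 128, threshold 0.95). The noisy and clean batches are padded independently, so their widths can differ.


In [0]:
# Sketch: inspect the shapes of the first batch the generator produces
noisy_batch, clean_batch, noisy_lengths, clean_lengths = next(
    get_batches(training_sorted, 128, 0.95))
print(noisy_batch.shape, clean_batch.shape)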

Note: The following set of hyperparameter values achieved the best results.


In [0]:
# The default parameters
epochs = 100
batch_size = 128
num_layers = 2
rnn_size = 512
embedding_size = 128
learning_rate = 0.0005
direction = 2
threshold = 0.95
keep_probability = 0.75
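
Note: Given these settings, the small sketch below estimates the training cadence; per_epoch = 3 matches the value used in train() further down.


In [0]:
# Sketch: weight updates per epoch, and how often the test loss will be checked
batches_per_epoch = len(training_sorted) // batch_size
print("Batches per epoch:", batches_per_epoch)
print("Test check roughly every", (batches_per_epoch // 3) - 1, "batches")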

In [0]:
def build_graph(keep_prob, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction):

    tf.reset_default_graph()
    
    # Load the model inputs    
    inputs, targets, keep_prob, inputs_length, targets_length, max_target_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(inputs, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      inputs_length,
                                                      targets_length,
                                                      max_target_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size,
                                                      embedding_size,
                                                      direction)

    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')

    with tf.name_scope('predictions'):
        predictions = tf.identity(inference_logits.sample_id, name='predictions')
        tf.summary.histogram('predictions', predictions)

    # Create the weights for sequence_loss
    masks = tf.sequence_mask(targets_length, max_target_length, dtype=tf.float32, name='masks')
    
    with tf.name_scope("cost"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(training_logits, 
                                                targets, 
                                                masks)
        tf.summary.scalar('cost', cost)

    with tf.name_scope("optimze"):
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)

    # Merge all of the summaries
    merged = tf.summary.merge_all()    

    # Export the nodes 
    export_nodes = ['inputs', 'targets', 'keep_prob', 'cost', 'inputs_length', 'targets_length',
                    'predictions', 'merged', 'train_op','optimizer']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])

    return graph

Training the Model


In [0]:
def train(model, epochs, log_string):
    '''Train the RNN'''
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Used to determine when to stop the training early
        testing_loss_summary = []

        # Keep track of which batch iteration is being trained
        iteration = 0
        
        display_step = 30 # The progress of the training will be displayed after every 30 batches
        stop_early = 0 
        stop = 5 # If the batch_loss_testing does not decrease in 5 consecutive checks, stop training
        per_epoch = 3 # Test the model 3 times per epoch
        testing_check = (len(training_sorted)//batch_size//per_epoch)-1

        print()
        print("Training Model: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/1/train/{}'.format(log_string), sess.graph)
        test_writer = tf.summary.FileWriter('./logs/1/test/{}'.format(log_string))

        for epoch_i in range(1, epochs+1): 
            batch_loss = 0
            batch_time = 0
            
            for batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(
                    get_batches(training_sorted, batch_size, threshold)):
                start_time = time.time()

                summary, loss, _ = sess.run([model.merged,
                                             model.cost, 
                                             model.train_op], 
                                             {model.inputs: input_batch,
                                              model.targets: target_batch,
                                              model.inputs_length: input_length,
                                              model.targets_length: target_length,
                                              model.keep_prob: keep_probability})


                batch_loss += loss
                end_time = time.time()
                batch_time += end_time - start_time

                # Record the progress of training
                train_writer.add_summary(summary, iteration)

                iteration += 1

                if batch_i % display_step == 0 and batch_i > 0:
                    print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(epoch_i,
                                  epochs, 
                                  batch_i, 
                                  len(training_sorted) // batch_size, 
                                  batch_loss / display_step, 
                                  batch_time))
                    batch_loss = 0
                    batch_time = 0

                #### Testing ####
                if batch_i % testing_check == 0 and batch_i > 0:
                    batch_loss_testing = 0
                    batch_time_testing = 0
                    for batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(
                            get_batches(testing_sorted, batch_size, threshold)):
                        start_time_testing = time.time()
                        summary, loss = sess.run([model.merged,
                                                  model.cost], 
                                                     {model.inputs: input_batch,
                                                      model.targets: target_batch,
                                                      model.inputs_length: input_length,
                                                      model.targets_length: target_length,
                                                      model.keep_prob: 1})

                        batch_loss_testing += loss
                        end_time_testing = time.time()
                        batch_time_testing += end_time_testing - start_time_testing

                        # Record the progress of testing
                        test_writer.add_summary(summary, iteration)

                    n_batches_testing = batch_i + 1
                    print('Testing Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(batch_loss_testing / n_batches_testing, 
                                  batch_time_testing))
                    
                    batch_time_testing = 0

                    # If the batch_loss_testing is at a new minimum, save the model
                    testing_loss_summary.append(batch_loss_testing)
                    if batch_loss_testing <= min(testing_loss_summary):
                        print('New Record!') 
                        stop_early = 0
                        checkpoint = "./{}.ckpt".format(log_string)
                        saver = tf.train.Saver()
                        saver.save(sess, checkpoint)

                    else:
                        print("No Improvement.")
                        stop_early += 1
                        if stop_early == stop:
                            break

            if stop_early == stop:
                print("Stopping Training.")
                break

In [0]:
# Train the model with the desired tuning parameters
for keep_probability in [0.75]:
    for num_layers in [2]:
        for threshold in [0.95]:
            log_string = 'kp={},nl={},th={}'.format(keep_probability,
                                                    num_layers,
                                                    threshold) 
            model = build_graph(keep_probability, rnn_size, num_layers, batch_size, 
                                learning_rate, embedding_size, direction)
            train(model, epochs, log_string)


WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From <ipython-input-90-dce37f4df695>:25: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From <ipython-input-90-dce37f4df695>:37: bidirectional_dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/rnn.py:464: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/rnn_cell_impl.py:958: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/rnn_cell_impl.py:962: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/rnn.py:244: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Training Model: kp=0.75,nl=2,th=0.95

Fixing Custom Sentences


In [0]:
def text_to_ints(text):
    '''Prepare the text for the model'''
    
    text = clean_text(text)
    return [vocab_to_int[character] for character in text]

In [0]:
# Create your own sentence or use one from the dataset
text = "Spellin is difficult, whch is wyh you need to study everyday."
text = text_to_ints(text)

#random = np.random.randint(0,len(testing_sorted))
#text = testing_sorted[random]
#text = noise_maker(text, 0.95)

checkpoint = "./kp=0.75,nl=2,th=0.95.ckpt"

model = build_graph(keep_probability, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction) 

with tf.Session() as sess:
    # Load saved model
    saver = tf.train.Saver()
    saver.restore(sess, checkpoint)
    
    # Replicate the text batch_size times to match the model's input shape
    answer_logits = sess.run(model.predictions, {model.inputs: [text]*batch_size, 
                                                 model.inputs_length: [len(text)]*batch_size,
                                                 model.targets_length: [len(text)+1], 
                                                 model.keep_prob: [1.0]})[0]

# Remove the padding from the generated sentence
pad = vocab_to_int["<PAD>"] 

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format("".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format("".join([int_to_vocab[i] for i in answer_logits if i != pad])))


INFO:tensorflow:Restoring parameters from ./kp=0.75,nl=2,th=0.95.ckpt

Text
  Word Ids:    [67, 25, 37, 16, 16, 43, 21, 8, 43, 20, 8, 48, 43, 3, 3, 43, 94, 40, 16, 5, 26, 8, 18, 39, 94, 39, 8, 43, 20, 8, 18, 42, 39, 8, 42, 38, 40, 8, 21, 37, 37, 48, 8, 5, 38, 8, 20, 5, 40, 48, 42, 8, 37, 93, 37, 32, 42, 48, 19, 42, 44, 8]
  Input Words: Spellin is difficult, whch is wyh you need to study everyday. 

Summary
  Word Ids:       [67, 25, 37, 16, 43, 16, 43, 21, 8, 43, 20, 43, 8, 48, 43, 3, 43, 3, 40, 16, 40, 26, 8, 18, 38, 39, 39, 8, 43, 8, 18, 38, 8, 42, 38, 40, 8, 21, 37, 8, 37, 5, 38, 8, 20, 5, 40, 48, 42, 19, 8, 37, 37, 32, 19, 42, 19, 42, 44, 123]
  Response Words: Spelilin isi dififulu, wohh i wo you ne eto studya eerayay.<EOS>

Examples of corrected sentences (each misspelled input is followed by the model's correction):

  • Spellin is difficult, whch is wyh you need to study everyday.
  • Spelling is difficult, which is why you need to study everyday.
  • The first days of her existence in th country were vrey hard for Dolly.
  • The first days of her existence in the country were very hard for Dolly.
  • Thi is really something impressiv thaat we should look into right away!
  • This is really something impressive that we should look into right away!

Summary

I hope that you have found this project interesting and useful. The example sentences presented above were specifically chosen, and the model will not always be able to make corrections of this quality. Given the amount of data that we are working with, this model still struggles at times. For it to be more useful, it would require far more training data and additional parameter tuning. The parameter values above worked best for me, but I expect there are even better values that I was not able to find.

Thanks for reading!