Natural language generation (NLG) means creating new text based on some given raw text. Basic forms of NLG generate text using only existing words and word structures. More advanced systems include syntactic realizers, which ensure that the new text follows grammatical rules, or text planners, which help arrange sentences, paragraphs and other components of the text.
Automatic text generation can be used for a variety of tasks, including:
The basic idea of Markov chains is that the future state of a system can be predicted based solely on its current state. There are several possible future states, and one of them is chosen according to the probabilities with which those states can occur. Markov chains are used in physics, economics, speech recognition and many other areas.
If we apply Markov chains to NLG, we can generate text based on the idea that the next word can be predicted from the N previous words.
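As a tiny illustration (the transition table below is completely made up, not derived from any real text), a first-order chain simply maps each word to the words that may follow it, together with their probabilities:

import random

# A made-up first-order transition table: word -> (possible next words, their probabilities).
transitions = {
    'the':    (['count', 'island'], [0.7, 0.3]),
    'count':  (['of', 'smiled'],    [0.6, 0.4]),
    'island': (['of'],              [1.0]),
    'of':     (['monte', 'the'],    [0.5, 0.5]),
    'monte':  (['cristo'],          [1.0]),
    'cristo': (['smiled'],          [1.0]),
    'smiled': (['at'],              [1.0]),
    'at':     (['the'],             [1.0]),
}

word = 'the'
generated = [word]
for _ in range(8):
    next_words, probs = transitions[word]
    word = random.choices(next_words, weights=probs, k=1)[0]
    generated.append(word)
print(' '.join(generated))

The rest of the notebook does the same thing, except that the transition table is built automatically from a real book and the probabilities are implicit in how often each next word occurs.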
In this notebook I'll start by generating text based on just one previous word, and then try to improve the quality of the predictions.
In [1]:
import random
from random import choice
import re
from collections import Counter
import nltk
from nltk.util import ngrams
I use "The Count of Monte Cristo" by Alexandre Dumas to generate text in this notebook. The book is downloaded from Project Gutenberg site.
In [2]:
def read_file(filename):
    with open(filename, "r", encoding='UTF-8') as file:
        contents = file.read().replace('\n\n',' ').replace('[edit]', '').replace('\ufeff', '').replace('\n', ' ').replace('\u3000', ' ')
    return contents
text = read_file('Data various/Monte_Cristo.txt')
text_start = [m.start() for m in re.finditer('VOLUME ONE', text)]
text_end = [m.start() for m in re.finditer('End of Project Gutenberg', text)]
text = text[text_start[1]:text_end[0]]
The code consists of two parts: building a dictionary of all words with their possible next words, and generating text based on this dictionary.
The text is split into words. Based on these words, a dictionary is created with each distinct word as a key and its possible next words as values.
After this the new text is generated. The first word is a random key from the dictionary; each following word is taken randomly from the list of values of the current word. Text is generated until the number of words reaches the defined limit.
In [3]:
def collect_dict(corpus):
    # Map each word to the list of words that follow it in the corpus.
    text_dict = {}
    words = corpus.split(' ')
    for i in range(len(words)-1):
        if words[i] in text_dict:
            text_dict[words[i]].append(words[i+1])
        else:
            text_dict[words[i]] = [words[i+1]]
    return text_dict

def generate_text(words, limit = 100):
    # Start from a random word and keep appending a random follower.
    first_word = random.choice(list(words.keys()))
    markov_text = first_word
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_word])
        first_word = next_word
        markov_text += ' ' + next_word
    return markov_text
In [4]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
And here we have it - the generated text. Maybe a couple of phrases make sense, but most of the time it is complete nonsense.
The first small improvement is that the first word of the sentence should be capitalized.
So now the first word will be chosen from the list of capitalized keys.
In [5]:
def generate_text(words, limit = 100):
    capitalized_keys = [i for i in words.keys() if len(i) > 0 and i[0].isupper()]
    first_word = random.choice(capitalized_keys)
    markov_text = first_word
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_word])
        first_word = next_word
        markov_text += ' ' + next_word
    return markov_text
In [6]:
markov_text = generate_text(word_pairs, 200)
print(markov_text)
A bit better. It's time to go deeper.
First-order Markov chains produce very random text. A better idea is to predict the next word based on the two previous ones. Now the keys in the dictionary will be tuples of two words.
In [7]:
def collect_dict(corpus):
    text_dict = {}
    words = corpus.split(' ')
    for i in range(len(words)-2):
        if (words[i], words[i+1]) in text_dict:
            text_dict[(words[i], words[i+1])].append(words[i+2])
        else:
            text_dict[(words[i], words[i+1])] = [words[i+2]]
    return text_dict
In [8]:
def generate_text(words, limit = 100):
    capitalized_keys = [i for i in words.keys() if len(i[0]) > 0 and i[0][0].isupper()]
    first_key = random.choice(capitalized_keys)
    markov_text = ' '.join(first_key)
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_key])
        first_key = tuple(first_key[1:]) + tuple([next_word])
        markov_text += ' ' + next_word
    return markov_text
In [9]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
Now more sentences make sense.
But there are a lot of problems with punctuation. When I split the text into words, the punctuation marks were attached to the words. To solve this problem I can treat punctuation marks as separate tokens. Let's try.
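A quick check of what nltk.word_tokenize does (the sample sentence is just an illustration; the tokenizer may require downloading the 'punkt' data first):

# nltk.download('punkt')  # may be needed the first time
print(nltk.word_tokenize("Wait and hope, my friend!"))
# ['Wait', 'and', 'hope', ',', 'my', 'friend', '!']

Punctuation marks become separate tokens, so they can serve as dictionary keys and values on their own.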
In [10]:
def collect_dict(corpus):
    text_dict = {}
    words = nltk.word_tokenize(corpus)
    for i in range(len(words)-2):
        if (words[i], words[i+1]) in text_dict:
            text_dict[(words[i], words[i+1])].append(words[i+2])
        else:
            text_dict[(words[i], words[i+1])] = [words[i+2]]
    return text_dict
In [11]:
def generate_text(words, limit = 100):
    capitalized_keys = [i for i in words.keys() if len(i[0]) > 0 and i[0][0].isupper()]
    first_key = random.choice(capitalized_keys)
    markov_text = ' '.join(first_key)
    while len(markov_text.split(' ')) < limit:
        next_word = random.choice(words[first_key])
        first_key = tuple(first_key[1:]) + tuple([next_word])
        markov_text += ' ' + next_word
    # Tokens are joined with spaces, so remove the spaces that end up before punctuation marks.
    for punct in ['.', ',', '!', '?', ';']:
        markov_text = markov_text.replace(' ' + punct, punct)
    return markov_text
In [12]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
For a small text, predicting the next word based on the two previous ones is justified, but with large texts we can use more words for prediction without fearing overfitting.
Let's look at the most common 6-grams.
In [13]:
tokenized_text = nltk.word_tokenize(text)
n_grams = ngrams(tokenized_text, 6)
Counter(n_grams).most_common(20)
Out[13]:
What a talkative count! Well, the point is that it is quite possible to use 6 words, so let's try.
In [14]:
def collect_dict(corpus):
    text_dict = {}
    words = nltk.word_tokenize(corpus)
    for i in range(len(words)-6):
        key = tuple(words[i:i+6])
        if key in text_dict:
            text_dict[key].append(words[i+6])
        else:
            text_dict[key] = [words[i+6]]
    return text_dict
In [15]:
word_pairs = collect_dict(text)
markov_text = generate_text(word_pairs, 200)
print(markov_text)
Alas, we have severe overfitting!
One of the ways to tackle it is back-off. In short, it means using the longest possible sequence of words for which the number of possible next words is big enough. The algorithm is simple: start with the longest available key (N words); if it is in the dictionary and has at least the required number of possible next words, sample the next word from it; otherwise drop the first word of the key and repeat with the shorter key, down to a single word.
Technically this means that a nested dictionary is necessary, containing keys with lengths from 1 up to N.
In [16]:
def collect_dict(corpus, n_grams):
    text_dict = {}
    words = nltk.word_tokenize(corpus)
    # The main dictionary has the key length as its key: 1, 2 and so on up to n_grams.
    # Each sub-dictionary maps a key of that length to the words that follow it.
    for j in range(1, n_grams + 1):
        sub_text_dict = {}
        for i in range(len(words)-n_grams):
            key = tuple(words[i:i+j])
            # The next word must directly follow the key, i.e. be j positions after i.
            if key in sub_text_dict:
                sub_text_dict[key].append(words[i+j])
            else:
                sub_text_dict[key] = [words[i+j]]
        text_dict[j] = sub_text_dict
    return text_dict
In [17]:
def get_next_word(key_id, min_length):
    # Back-off: try the longest key first; if it is absent or has too few
    # continuations, drop its first word and try again with the shorter key.
    # Note: uses the global word_pairs dictionary built by collect_dict.
    for i in range(len(key_id)):
        if key_id in word_pairs[len(key_id)]:
            if len(word_pairs[len(key_id)][key_id]) >= min_length:
                return random.choice(word_pairs[len(key_id)][key_id])
        if len(key_id) > 1:
            key_id = key_id[1:]
    # Fall back to whatever the shortest remaining key offers.
    return random.choice(word_pairs[len(key_id)][key_id])
In [18]:
def generate_text(words, limit = 100, min_length = 5):
    capitalized_keys = [i for i in words[max(words.keys())].keys() if len(i[0]) > 0 and i[0][0].isupper()]
    first_key = random.choice(capitalized_keys)
    markov_text = ' '.join(first_key)
    while len(markov_text.split(' ')) < limit:
        next_word = get_next_word(first_key, min_length)
        first_key = tuple(first_key[1:]) + tuple([next_word])
        markov_text += ' ' + next_word
    # Remove the spaces that end up before punctuation marks.
    for punct in ['.', ',', '!', '?', ';']:
        markov_text = markov_text.replace(' ' + punct, punct)
    return markov_text
In [19]:
word_pairs = collect_dict(text, 6)
markov_text = generate_text(word_pairs, 200, 6)
print(markov_text)
That's it. This is as far as simple Markov chains can go. There are more ways to improve the model, of course: for example, checking whether generated strings are parts of the original text and, in case of such overfitting, generating the string again. Also, depending on the text, certain values of n_grams perform better, and in some cases it is better to split the text into words without tokenizing, and so on.
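For instance, the first of those improvements could look roughly like this (just a sketch; the 10-token window and the number of attempts are arbitrary choices of mine):

def looks_copied(candidate, source_tokens, window=10):
    # Flag the candidate if any run of `window` consecutive tokens
    # also appears verbatim in the source - a crude overfitting check.
    cand_tokens = nltk.word_tokenize(candidate)
    source_joined = ' '.join(source_tokens)
    for i in range(len(cand_tokens) - window + 1):
        if ' '.join(cand_tokens[i:i + window]) in source_joined:
            return True
    return False

# Regenerate until the text is not just a verbatim chunk of the book.
for _ in range(10):
    markov_text = generate_text(word_pairs, 200, 6)
    if not looks_copied(markov_text, tokenized_text):
        break
print(markov_text)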
But more advanced techniques, such as those mentioned at the beginning of the notebook, are necessary to create truly meaningful text.
And here are some interesting phrases/sentences which were generated: