In this notebook I will show how to make predictions in text using a very simple scheme. The idea is to compute the letter frequencies of a text and, for every letter, predict the most frequent letter overall. This is, in my opinion, the simplest method that does not use temporal information.
To obtain a prediction that is almost as simple but does include temporal information, we compute the most frequent bigrams (pairs of adjacent letters, like 'r' followed by 's' at the end of the word 'letters'). For every letter we then find the letter most likely to follow it, and use that as the prediction.
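Before working on the real corpus, the two schemes can be sketched on a toy string. This is a minimal illustration using only the standard library (the toy text `"abracadabra"` is my own example, not the notebook's corpus):

```python
from collections import Counter

text = "abracadabra"

# (a) Baseline: always predict the single most common letter.
most_common = Counter(text).most_common(1)[0][0]  # 'a'
baseline_hits = sum(1 for ch in text if ch == most_common)

# (b) Bigram scheme: for each letter, predict its most frequent successor.
pairs = Counter(zip(text, text[1:]))
successor = {}
for (first, second), count in pairs.items():
    if first not in successor or count > pairs[(first, successor[first])]:
        successor[first] = second

bigram_hits = sum(1 for prev, ch in zip(text, text[1:])
                  if successor[prev] == ch)

print(baseline_hits / len(text))  # 5/11, about 0.45
print(bigram_hits / len(text))    # 8/11, about 0.73
```

Even on this tiny string, knowing the previous letter noticeably improves the hit rate over always guessing 'a'.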
In [1]:
import nltk
from nltk.book import text7 as text
First we extract the information from the text.
In [2]:
letters = ' '.join(text)
letters = [letter.lower() for letter in letters] # Lowercase every character
symbols = set(letters)
Nletters = len(letters)
Nsymbols = len(symbols)
In [3]:
symbols
Out[3]:
In [4]:
print('Number of letters', Nletters)
print('Number of symbols', Nsymbols)
We get the frequency of every letter and the most common one, which turns out to be the space. Later we will analyze how the result changes when we remove the space from the whole analysis.
In [5]:
freq_letters = nltk.FreqDist(letters) # Get the most frequent letters
most_common_letter = freq_letters.most_common(1)[0][0]
print('most common letter', most_common_letter)
In [6]:
freq_letters.plot()
We will compute the bigram frequencies as well.
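For intuition, `nltk.bigrams` simply yields each element of a sequence paired with its successor; a pure-Python equivalent is:

```python
# Equivalent of nltk.bigrams: pair each element with its successor.
def bigrams(seq):
    return zip(seq, seq[1:])

print(list(bigrams(['t', 'h', 'e'])))  # [('t', 'h'), ('h', 'e')]
```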
In [7]:
bigrams = nltk.bigrams(letters)
freq_bigrams = nltk.FreqDist(bigrams)
Now we want to extract the most probable next letter for every letter. From the bigram frequencies we first build a dictionary (the master dictionary) holding, for each letter (each symbol here), all of its next letters and their frequencies. Concretely, the key is the symbol and the value is a list of tuples of a next letter and its frequency. With this in hand, we take from every list the tuple with the maximum frequency, using a lambda function as the key, and build the next-letter dictionary from it.
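The `max`-with-`key` step can be seen in isolation on a hypothetical list of (next letter, frequency) tuples:

```python
# Hypothetical example: pick the most frequent successor from a list of
# (next_letter, frequency) tuples by comparing on the second element.
candidates = [('a', 3), ('b', 7), ('c', 2)]
best = max(candidates, key=lambda pair: pair[1])
print(best[0])  # 'b'
```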
In [8]:
master_dictionary = {}
next_letters = {}
for symbol in symbols:
    master_dictionary[symbol] = [(key[1], value) for key, value in freq_bigrams.items() if key[0] == symbol]
for symbol in symbols:
    aux = max(master_dictionary[symbol], key=lambda x: x[1])  # Maximize over the second element of the tuple
    next_letters[symbol] = aux[0]
In [9]:
prediction = 0
for letter in letters:
    if letter == most_common_letter:
        prediction += 1  # Gets the result right
prediction /= Nletters
print('Predictions using the most common letter', prediction * 100.0)
In [10]:
# Now we make use of the temporal information
prediction_temp = 0
last_letter = None
for letter in letters:
    if last_letter:  # If last_letter is not None
        if next_letters[last_letter] == letter:
            prediction_temp += 1
    # Save the last letter
    last_letter = letter
prediction_temp /= Nletters
print('Prediction using bigram information', prediction_temp * 100)