Text

This notebook serves as supporting material for topics covered in Chapter 22 - Natural Language Processing - of the book Artificial Intelligence: A Modern Approach. It uses implementations from text.py.

Contents

  • Text Models
  • Viterbi Text Segmentation
    • Overview
    • Implementation
    • Example

Text Models

Before we run any text-processing algorithms, we will need to build some word models. Those models serve as look-up tables for word probabilities. In the text module we have implemented two such models, UnigramTextModel and NgramTextModel, both of which inherit from CountingProbDist in learning.py. We supply them with a text file and they record the frequencies of the different words.

The main difference between the two models is that the first returns the probability of a single word (e.g. the probability of the word 'the' appearing), while the second returns the probability of a sequence of words (e.g. the probability of the sequence 'of the' appearing).
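To make the lookup interface concrete, here is a minimal sketch on a made-up toy sentence (the probabilities are illustrative, not from Flatland). Both models support dictionary-style access inherited from CountingProbDist: single words are the keys of the unigram model, while tuples of words are the keys of the n-gram model.


In [ ]:
from text import UnigramTextModel, NgramTextModel, words

toy = words("the cat sat on the mat and the cat slept")
U = UnigramTextModel(toy)     # unigram counts over the toy text
B = NgramTextModel(2, toy)    # bigram counts over the toy text

print(U['the'])           # relative frequency of the word 'the'
print(B[('the', 'cat')])  # relative frequency of the pair 'the cat'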

Both models can also generate random words and random word sequences, respectively, sampled according to the model; we demonstrate this at the end of this section.

Below we build the two models. The text file we will use to build them is Flatland, by Edwin A. Abbott, which we load with DataFile.


In [4]:
from text import UnigramTextModel, NgramTextModel, words
from utils import DataFile

flatland = DataFile("EN-text/flatland.txt").read()
wordseq = words(flatland)

P1 = UnigramTextModel(wordseq)
P2 = NgramTextModel(2, wordseq)

print(P1.top(5))
print(P2.top(5))


[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]
[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]

We see that the most frequent word in Flatland is 'the', with 2081 occurrences, while the most frequent two-word sequence is 'of the', with 368 occurrences.
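We can also draw random text from the models built above, using the samples(n) method that text.py defines on both models (the output is random, so it will differ on every run):


In [ ]:
print(P1.samples(10))  # ten words drawn independently from the unigram model
print(P2.samples(10))  # ten words, each conditioned on the word before it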

Viterbi Text Segmentation

Overview

We are given a string containing the words of a sentence, but all the spaces are gone! It is very hard to read, and we would like to separate the words in the string. We can accomplish this by employing the Viterbi Segmentation algorithm. It takes as input the string to segment and a text model, and it returns a list of the separate words.

The algorithm takes a dynamic programming approach. It starts from the beginning of the string and iteratively builds the best solution out of previously computed solutions. It does this by segmenting the string into "windows", each window representing a word (real or gibberish). It then calculates the probability of the sequence up to and including that window/word and updates its solution; concretely, if best[i] is the probability of the best segmentation of the first i characters, then best[i] = max over all j < i of P(text[j:i]) * best[j]. When it is done, it traces back from the final word and recovers the complete sequence of words.

Implementation


In [1]:
def viterbi_segment(text, P):
    """Find the best segmentation of the string of characters, given the
    UnigramTextModel P."""
    # best[i] = best probability for text[0:i]
    # words[i] = best word ending at position i
    n = len(text)
    words = [''] + list(text)
    best = [1.0] + [0.0] * n
    # Fill in the vectors best and words via dynamic programming
    for i in range(n+1):
        for j in range(0, i):
            w = text[j:i]
            newbest = P[w] * best[i - len(w)]
            if newbest >= best[i]:
                best[i] = newbest
                words[i] = w
    # Now recover the sequence of best words
    sequence = []
    i = len(words) - 1
    while i > 0:
        sequence[0:0] = [words[i]]
        i = i - len(words[i])
    # Return sequence of best words and overall probability
    return sequence, best[-1]

The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.

The "window" is w and it includes the characters from j to i. We use it to "build" the following sequence: from the start to j and then w. We have previously calculated the probability from the start to j, so now we multiply that probability by P[w] to get the probability of the whole sequence. If that probability is greater than the probability we have calculated so far for the sequence from the start to i (best[i]), we update it.

Example

The model the algorithm uses is the UnigramTextModel. First we will build the model using the Flatland text, and then we will try to separate a space-devoid sentence.


In [6]:
from text import UnigramTextModel, words, viterbi_segment
from utils import DataFile

flatland = DataFile("EN-text/flatland.txt").read()
wordseq = words(flatland)
P = UnigramTextModel(wordseq)
text = "itiseasytoreadwordswithoutspaces"

s, p = viterbi_segment(text, P)
print("Sequence of words is:", s)
print("Probability of sequence is:", p)


Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']
Probability of sequence is: 2.273672843573388e-24

The algorithm correctly recovered the words from the string. It also gave us the probability of the sequence, which is small, but still the largest among all possible segmentations of the string.
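As a quick sanity check (a sketch reusing the variables from the cell above): since each dynamic programming step multiplies in exactly one unigram probability per chosen word, the returned probability is simply the product of the unigram probabilities of the recovered words:


In [ ]:
from functools import reduce
import operator

# Product of the unigram probabilities of the recovered words;
# should match the p printed above (~2.27e-24).
print(reduce(operator.mul, (P[w] for w in s), 1.0))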