$n$-gram extraction and text generation

1. Bigram extraction

Write a function that extracts all word combinations of length $2$ from the same file, shore_leave.txt. Note that the last word of one sentence and the first word of the next one do not form a valid combination, so bigrams must not cross sentence boundaries.

Splitting the text into sentences is easier with the .split() method: its argument is the separator that you intend to use.
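As a quick illustration of .split() with an explicit separator, here is a minimal sketch. The sample string is made up and is not taken from shore_leave.txt, and splitting on "." alone is a simplifying assumption, since real text also ends sentences with "?" and "!".

```python
# Hypothetical sample text, not taken from shore_leave.txt.
text = "The ship landed. The crew went ashore. Nobody stayed behind."

# Splitting on "." yields one chunk per sentence; the chunk after the
# final period is empty, so empty chunks are filtered out.
sentences = [s.strip() for s in text.split(".") if s.strip()]
print(sentences)

# Called without an argument, split() breaks a string on any whitespace.
words = sentences[0].split()
print(words)
```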

  • Define a function bigramize(filename) that takes the name of the file as input and returns a list of bigrams.
  • Open and read the file shore_leave.txt.
  • Create a list that contains all bigrams of the sentences of this text. Do not join the words of a bigram into a single string; we will need them separately for the next exercises.
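One possible way to pair up adjacent words inside a single sentence is sketched below. The helper name sentence_bigrams and the sample words are assumptions made for illustration; the sketch keeps each bigram as a tuple of two separate words rather than a joined string.

```python
def sentence_bigrams(words):
    """Return adjacent word pairs from one sentence (hypothetical helper)."""
    # zip pairs each word with its successor; the last word has no
    # successor, so it never appears as the first element of a pair.
    return list(zip(words, words[1:]))

# Made-up example sentence, already split into words.
pairs = sentence_bigrams(["the", "crew", "went", "ashore"])
print(pairs)
```

Applying such a helper to each sentence separately, and then concatenating the results, avoids pairing the last word of one sentence with the first word of the next.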

In [ ]:
def bigramize(filename):
    pass

Test your function:


In [ ]:
bigrams = bigramize("shore_leave.txt")

2. N-gram extraction

Generalize the function from the previous exercise from bigrams (sequences of length $2$) to $n$-grams (sequences of length $n$).

  • Define a function ngramize(filename, n), where $n$ is the length of the sequence that needs to be extracted.
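The generalization can be thought of as a sliding window of width $n$ moving over the words of one sentence. A minimal sketch, assuming $n \geq 1$ and using a hypothetical helper name window:

```python
def window(words, n):
    """Return all length-n runs of consecutive words (hypothetical helper)."""
    # The last valid start index is len(words) - n. If the sentence has
    # fewer than n words, the range is empty and no n-grams are produced.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Made-up example sentence, already split into words.
trigrams = window(["the", "crew", "went", "ashore"], 3)
print(trigrams)
```

With n set to 2, the same window produces ordinary bigrams.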

In [ ]:
def ngramize(filename, n):
    pass

Test your function.


In [ ]:
ngrams = ngramize("shore_leave.txt", 3)

3. Bigram-based text generation

Write a function that will generate text based on the list of bigrams.

  • Define a function generate(bigrams, word=None, maxlen=20), where bigrams is a list of bigrams, word is the first word in the generated sentence, and maxlen is the maximum length of the resulting sentence.
  • If the initial word is not provided, or if it never occurs in a non-final position in the text (that is, it is never the first word of any bigram), replace it with a randomly chosen word that does.
  • Generate a sequence of length maxlen when possible. If the current word has no continuation, return the sequence generated so far.
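The steps above can be sketched roughly as follows. This is one possible approach, not a reference solution: the names build_continuations and generate_sketch are hypothetical, maxlen is assumed to count words, and returning the result as a space-joined string is also an assumption.

```python
import random
from collections import defaultdict

def build_continuations(bigrams):
    """Map each non-final word to the list of words seen right after it."""
    continuations = defaultdict(list)
    for first, second in bigrams:
        continuations[first].append(second)
    return dict(continuations)

def generate_sketch(bigrams, word=None, maxlen=20):
    """Hypothetical sketch of bigram-based generation."""
    continuations = build_continuations(bigrams)
    if not continuations:
        return ""  # no bigrams, nothing to generate
    # Replace a missing or unusable start word with a random non-final word.
    if word not in continuations:
        word = random.choice(list(continuations))
    sequence = [word]
    # Stop at maxlen words, or earlier if the last word has no continuation.
    while len(sequence) < maxlen and sequence[-1] in continuations:
        sequence.append(random.choice(continuations[sequence[-1]]))
    return " ".join(sequence)

# Made-up bigrams in which every word has at most one continuation,
# so the result does not depend on the random choices.
toy = [("the", "crew"), ("crew", "went"), ("went", "ashore")]
print(generate_sketch(toy, "the"))
```

Because repeated bigrams stay in the continuation lists as duplicates, a continuation that occurs more often in the text is proportionally more likely to be chosen.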

In [ ]:
def generate(bigrams, word=None, maxlen=20):
    pass

Test your function.


In [ ]:
print(generate(bigrams))
print(generate(bigrams, "flowers"))        # shouldn't rewrite the word
print(generate(bigrams, "sequential", 10)) # should rewrite the word
