In this exercise, we will look at a Python class that performs an analysis on a given text. The code as it stands appears to run fine for a few 'normal' cases; however, as it is untested, it is unlikely to behave correctly for all input data.
Your task is to design a set of tests that ensure the code functions correctly for all possible input data. The code should be able to deal with edge cases and fail suitably (e.g. terminate with an exception) for invalid data.
When designing your tests, bear in mind the following:
A few examples of 'normal' cases have been given. You may wish to create some more input data for running your tests in order to cover the full range of valid input data (and to test the code fails for invalid input data).
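One way to build extra input data is to generate small edge-case files from a script, so the tests do not depend on files that happen to be lying around. The sketch below is only an illustration: the helper name `make_test_inputs` and the file contents are assumptions (placeholder text, not real test data), and it simply writes three files to a directory of your choice.

```python
import tempfile
from pathlib import Path


def make_test_inputs(directory):
    """Write a few small edge-case text files and return their paths by name.

    The file names and contents here are illustrative placeholders.
    """
    directory = Path(directory)
    inputs = {
        "blank.txt": "",                                # no words at all
        "short.txt": "only five words in here",         # fewer than 10 words
        "repeat.txt": "the same line again and again\n" * 68,  # heavy repetition
    }
    paths = {}
    for name, content in inputs.items():
        path = directory / name
        path.write_text(content, encoding="utf-8")
        paths[name] = path
    return paths


if __name__ == "__main__":
    # Write the files into a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in make_test_inputs(tmp).items():
            print(name, path.stat().st_size, "bytes")
```

Each generated path could then be passed to `n_grams.Text(...)` in the same way as the entries in the `files` dictionary, assuming the class accepts local paths as it does for `the_raven.txt`.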
In [1]:
import n_grams
In [2]:
files = {"alice": "http://www.gutenberg.org/files/11/11-0.txt",
         "dracula": "http://www.gutenberg.org/ebooks/345.txt.utf-8",
         "sherlock": "http://www.gutenberg.org/ebooks/1661.txt.utf-8",
         "poe": "the_raven.txt"}
txt = n_grams.Text(files["alice"])
txt.text_report()
blank.txt - a blank text file. The code has nothing to stop it dividing by zero when it calculates the mean word length.
repeat.txt - this repeats the first verse of the Edward Lear poem The Jumblies 68 times. The longest n-gram function only looks for n-grams of 50 words or fewer, so it fails to spot the repetition and instead slices up the poem. There are a couple of ways to deal with this. One is that, if the code finds an n-gram of length 50, it checks whether this is a substring of a longer one, and repeats until no longer n-grams are found. Alternatively, it could start looking for n-grams of length 2 and increase the length until no more are found, then eliminate substrings.
http://www.gutenberg.org/ebooks/844.epub.images?session_id=0fa3233ff1abe287d4a1ce534e052624dc702aec - this is an EPUB file, so the code will not be able to read it.
short.txt - the file is only 5 words long, so when the code prints out the '10 longest words', it actually prints only 5. This is a minor error, but it is nonetheless not correct behaviour.
nile.csv - this contains nothing but numbers, so the results are pretty meaningless. The code should really check that it is looking at an actual text file.
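Two of the fixes suggested above can be sketched in isolation. The functions below are hypothetical stand-ins, not the actual methods of the `n_grams` class: they assume the text has already been split into a list of words. The first raises an exception instead of dividing by zero on empty input; the second follows the "start at length 2 and increase until no more are found, then eliminate substrings" idea.

```python
from collections import Counter


def mean_word_length(words):
    """Return the mean word length, failing loudly on empty input."""
    if not words:
        raise ValueError("cannot compute mean word length of an empty text")
    return sum(len(word) for word in words) / len(words)


def repeated_ngrams(words, n):
    """Return the set of n-grams (as tuples) that occur more than once.

    Overlapping occurrences are counted separately.
    """
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram for gram, count in counts.items() if count > 1}


def maximal_repeated_ngrams(words):
    """Return repeated n-grams that are not part of a longer repeated n-gram.

    Starts at n = 2 and grows n until no repeated n-gram is found, so there
    is no fixed upper limit on the length. Results are ordered longest first.
    """
    found = []
    n = 2
    while True:
        grams = repeated_ngrams(words, n)
        if not grams:
            break
        found.append(grams)
        n += 1

    maximal = []
    covered = set()  # n-grams contained in a repeated (n + 1)-gram
    for grams in reversed(found):
        maximal.extend(sorted(gram for gram in grams if gram not in covered))
        # Every contiguous sub-sequence one word shorter is a prefix or suffix.
        covered = {gram[:-1] for gram in grams} | {gram[1:] for gram in grams}
    return maximal


if __name__ == "__main__":
    sample = "a b c x a b c y b c".split()
    print(maximal_repeated_ngrams(sample))
```

Growing n one step at a time is simple but can be slow on a highly repetitive file such as repeat.txt, because the loop keeps going until the n-grams are nearly as long as the whole text.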
In [ ]: