Exercise: Text analysis

In this exercise, we will look at a Python class that performs an analysis of a given text. The code as it stands appears to run fine for a few 'normal' cases; however, as it is untested, it is unlikely to behave well for all input data.

Your task is to design a set of tests that ensure the code functions correctly for all possible input data. The code should be able to deal with edge cases and fail suitably (e.g. terminate with an exception) for invalid data.
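As a minimal sketch of what failing suitably could look like, the stand-in function below raises a descriptive `ValueError` for empty input rather than dividing by zero, and a test checks that the exception really is raised. The name `mean_word_length` is hypothetical; the internals of `n_grams` are not shown here, so this illustrates the testing pattern only.

```python
def mean_word_length(words):
    """Return the mean length of the given words.

    Hypothetical stand-in for illustration; not the n_grams implementation.
    """
    if not words:
        # Fail with a clear message instead of a bare ZeroDivisionError.
        raise ValueError("cannot compute mean word length of an empty text")
    return sum(len(word) for word in words) / len(words)


def test_mean_word_length_normal_case():
    assert mean_word_length(["a", "bcd"]) == 2.0


def test_mean_word_length_rejects_empty_input():
    try:
        mean_word_length([])
    except ValueError:
        return  # the expected outcome
    raise AssertionError("expected ValueError for empty input")


test_mean_word_length_normal_case()
test_mean_word_length_rejects_empty_input()
```

If the tests are run with pytest, the second test is more commonly written using the `pytest.raises(ValueError)` context manager.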

When designing your tests, keep the following in mind:

What range of cases should the code be able to deal with?
How should the code deal with edge cases?
What should the code do if it encounters invalid input data?
Even for valid input data, does the code always give the same output or is there some randomness? If so, how can the tests be designed to deal with that?
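One possible way to handle output whose order is not fixed is to compare results without regard to order. The sketch below assumes the results are available as lists of strings; the helper `same_items` is a hypothetical name, not part of `n_grams`.

```python
from collections import Counter


def same_items(actual, expected):
    """Return True if both sequences hold the same items, ignoring order."""
    return Counter(actual) == Counter(expected)


# Two hypothetical runs listing the same equally frequent n-grams
# in a different order.
run_1 = ["in a great hurry", "in a tone of", "as well as she could"]
run_2 = ["as well as she could", "in a tone of", "in a great hurry"]

assert run_1 != run_2                  # plain list comparison is order-sensitive
assert same_items(run_1, run_2)        # multiset comparison ignores order
assert sorted(run_1) == sorted(run_2)  # sorting both sides works as well
```

Comparing with `Counter` rather than `set` also catches an item that is wrongly reported twice.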

A few examples of 'normal' cases have been given. You may wish to create some more input data for running your tests, in order to cover the full range of valid input data (and to test that the code fails for invalid input data).
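Small input files can be generated from the tests themselves so that they are always present and always have known contents. The following is a sketch only: the helper `write_test_inputs` and the file names are chosen for illustration, and it assumes that `n_grams.Text` accepts a local file path, as it appears to for `the_raven.txt`.

```python
import tempfile
from pathlib import Path


def write_test_inputs(directory):
    """Write a few small input files and return their paths keyed by name."""
    directory = Path(directory)
    contents = {
        "blank.txt": "",                         # no words at all
        "short.txt": "one two three four five",  # fewer than 10 words
        "numbers.csv": "1,2,3\n4,5,6\n",         # arbitrary numbers, not prose
    }
    paths = {}
    for name, text in contents.items():
        path = directory / name
        path.write_text(text, encoding="utf-8")
        paths[name] = path
    return paths


# The temporary directory and its files are removed when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_test_inputs(tmp)
    assert paths["blank.txt"].read_text(encoding="utf-8") == ""
    assert len(paths["short.txt"].read_text(encoding="utf-8").split()) == 5
```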



In [1]:

    
import n_grams



In [2]:

    
files = {"alice": "http://www.gutenberg.org/files/11/11-0.txt", 
         "dracula": "http://www.gutenberg.org/ebooks/345.txt.utf-8",
         "sherlock": "http://www.gutenberg.org/ebooks/1661.txt.utf-8",
         "poe": "the_raven.txt"}

txt = n_grams.Text(files["alice"])

txt.text_report()

There are 26611 words in the text.

Mean, median and mode word length is 4.03351997294352, 4, 3.

10 longest words:
disappointment
contemptuously
affectionately
multiplication
uncomfortable
straightening
extraordinary
conversations
uncomfortably
inquisitively

Most common words:
462 x said
385 x alice
365 x you
246 x her
180 x all
178 x had
178 x with
170 x but
145 x not
144 x very

Longest n-grams:
6 x will you wont you will you
5 x and the moral of that is
6 x as well as she could
5 x as she said this she
19 x said the mock turtle
16 x she said to herself
11 x a minute or two
8 x said the march hare
7 x said alice in a
6 x in a great hurry
6 x in a tone of
5 x said the duchess and
5 x said the king and
5 x the poor little thing
5 x the little golden key
5 x out of the way
5 x i beg your pardon

Tests to make the code fail

blank.txt - a blank text file. The code has nothing to stop it dividing by zero when it calculates the mean word length.
repeat.txt - this repeats the first verse of the Edward Lear poem The Jumblies 68 times. The longest n-gram function only looks for n-grams of 50 words or fewer, so it fails to spot the repetition and instead slices up the poem. There are a couple of ways to deal with this. One is, whenever the code finds an n-gram of length 50, to check whether it is a substring of a longer one, and to keep extending until no longer n-grams are found. Alternatively, the code could start with n-grams of length 2 and increase the length until no more are found, then eliminate substrings.
http://www.gutenberg.org/ebooks/844.epub.images?session_id=0fa3233ff1abe287d4a1ce534e052624dc702aec - this is an EPUB file, so the code will not be able to read it.
Longest words and n-grams are both stored in dictionaries, so the order in which these are printed will often vary between runs. Care should be taken when designing tests to account for this!
short.txt - the file is only 5 words long, so when the code prints out '10 longest words', it actually prints only 5. This is a minor error, but it is nonetheless not correct behaviour.
nile.csv - this is just a load of numbers, so the results are pretty meaningless. The code should really check that it is looking at an actual text file.
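The second approach suggested for repeat.txt (start at length 2, grow until nothing repeats, then eliminate substrings) could be sketched as follows. This is not the `n_grams` implementation: the function names and the `min_count` threshold are assumptions made for illustration, and n-grams are represented as tuples of words.

```python
from collections import Counter


def repeated_ngrams(words, n, min_count=2):
    """Count the n-grams (tuples of words) occurring at least min_count times."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in counts.items() if count >= min_count}


def is_contained(short, long):
    """Return True if tuple `short` is a contiguous run inside tuple `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def longest_repeated_ngrams(words, min_count=2):
    """Grow n from 2 until no n-gram repeats, then drop contained n-grams."""
    found = {}
    n = 2
    while True:
        grams = repeated_ngrams(words, n, min_count)
        if not grams:
            break  # nothing repeats at this length, so stop growing
        found.update(grams)
        n += 1
    # Keep only n-grams that are not part of a longer repeated n-gram.
    return {
        gram: count
        for gram, count in found.items()
        if not any(len(other) > len(gram) and is_contained(gram, other)
                   for other in found)
    }


words = "a b c x a b c y a b".split()
assert longest_repeated_ngrams(words) == {("a", "b", "c"): 2}
```

Nothing here caps the n-gram length, so a long repeated passage is not cut off at 50 words. Two caveats apply: an n-gram is dropped whenever it is contained in a longer one, even if it also occurs elsewhere on its own (as "a b" does above), and the elimination step compares every pair of n-grams found, which may be too slow for a whole novel without further work.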


