Lecture 14. Data structure selection


Case study

At this point you have learned about Julia’s core data structures, and you have seen some of the algorithms that use them.

This chapter presents a case study with exercises that let you think about choosing data structures and practice using them.

Word frequency analysis

Exercise 1

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.

Hint: the function isalpha tests whether a character is alphabetic.

Exercise 2

Go to Project Gutenberg (http://gutenberg.org) and download your favorite out-of-copyright book in plain text format. Modify your program from the previous exercise to read the book you downloaded, skip over the header information at the beginning of the file, and process the rest of the words as before.

Then modify the program to count the total number of words in the book, and the number of times each word is used.

Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

Exercise 3

Modify the program from the previous exercise to print the 20 most frequently used words in the book.

Exercise 4

Modify the previous program to read a word list and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

Random numbers

Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more.

Making a program truly nondeterministic turns out to be difficult, but there are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers. Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.

The function rand returns a random float between 0.0 and 1.0 (including 0.0 but not 1.0). Each time you call random, you get the next number in a long series. To see a sample, run this loop:


In [1]:
for i in 1:10
    x = rand()
    println(x)
end


0.10731525410072007
0.7789747330849832
0.37657254127066153
0.32368457967329833
0.8520578409319299
0.3848167724925189
0.325550059088463
0.14858166564191677
0.6715577315536914
0.9618222886461032

The function rand can take an iterator or array as argument and returns a random element:


In [10]:
for i in 1:10
    x = rand(1:6)
    print(x, " ")
end


3 5 4 1 4 2 6 6 6 3 

In [14]:
for i in 1:10
    x = rand(['a','b','c'])
    print(x, " ")
end


c b c b a b a a c c 

Word histogram

Here is a program that reads a file and builds a histogram of the words in the file:


In [15]:
function process_file(filename)
    hist = Dict()
    for line in eachline(filename)
        process_line(line, hist)
    end
    hist
end


Out[15]:
process_file (generic function with 1 method)

In [24]:
function process_line(line, hist)
    line = replace(line, '-', ' ')
    for word in split(line)
        word = string(filter(isalpha, [word...])...)
        word = lowercase(word)
        hist[word] = get!(hist, word, 0) + 1
    end
end


Out[24]:
process_line (generic function with 1 method)

In [25]:
hist = process_file("../data/emma.txt")


Out[25]:
Dict{Any,Any} with 7380 entries:
  "prejudices"      => 1
  "practise"        => 3
  "adviser"         => 1
  "forbade"         => 1
  "offend"          => 3
  "quicksighted"    => 1
  "contemplate"     => 1
  "enjoy"           => 9
  "diffuse"         => 1
  "unreserved"      => 1
  "transplantation" => 1
  "fight"           => 1
  "dulness"         => 3
  "everywhere"      => 3
  "exult"           => 1
  "indelicacy"      => 1
  "inattentive"     => 1
  "helping"         => 1
  "whose"           => 39
  "sleepless"       => 1
  "hurried"         => 8
  "gout"            => 1
  "henry"           => 21
  "drawers"         => 1
  "incommoded"      => 2
  ⋮                 => ⋮

This program reads "emma.txt", which contains the text of Emma by Jane Austen.

process_file loops through the lines of the file, passing them one at a time to process_line. The histogram hist is being used as an accumulator.

process_line uses the string method replace to replace hyphens with spaces before using split to break the line into an array of strings. It traverses the array of words and uses filter, isalpha and lowercase to remove punctuation and convert to lower case. (It is a shorthand to say that strings are “converted”; remember that strings are immutable, so a function like lowercase return new strings.)

Finally, process_line updates the histogram by creating a new item or incrementing an existing one.

To count the total number of words in the file, we can add up the frequencies in the histogram:


In [29]:
function total_words(hist)
    sum(values(hist))
end
total_words(hist)


Out[29]:
162742

The number of different words is just the number of items in the dictionary:


In [30]:
function different_words(hist)
    length(hist)
end
different_words(hist)


Out[30]:
7380

Most common words

To find the most common words, we can make an array of tuples, where each tuple contains a word and its frequency, and sort it. The following function takes a histogram and returns an array of word-frequency tuples:


In [51]:
function most_common(hist)
    t = []
    for (key, value) in hist
        push!(t, (value, key))
    end
    reverse!(sort!(t))
end


Out[51]:
most_common (generic function with 1 method)

In each tuple, the frequency appears first, so the resulting list is sorted by frequency.

Here is a loop that prints the ten most common words:


In [53]:
t = most_common(hist)
println("The most common words are: ")
for (freq, word) in t[1:10]
    println(word, "\t", freq)
end


The most common words are: 
to	5295
the	5266
and	4931
of	4339
i	3191
a	3155
it	2546
her	2483
was	2400
she	2364

Optional parameters

We have seen built-in functions and methods that take optional arguments. It is possible to write programmer-defined functions with optional arguments, too. For example, here is a function that prints the most common words in a histogram:


In [54]:
function print_most_common(hist, num=10)
    t = most_common(hist)
    println("The most common words are: ")
    for (freq, word) in t[1:num]
        println(word, "\t", freq)
    end
end


Out[54]:
print_most_common (generic function with 2 methods)

The first parameter is required; the second is optional. The default value of num is 10. If you only provide one argument:

print_most_common(hist)

num gets the default value. If you provide two arguments:

print_most_common(hist, 20)

num gets the value of the argument instead. In other words, the optional argument overrides the default value. If a function has both required and optional parameters, all the required parameters have to come first, followed by the optional ones.

Dictionary subtraction

Finding the words from the book that are not in the word list from "words.txt" is a problem you might recognize as set subtraction; that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list).

subtract takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. Since we don’t really care about the values, we set them all to nothing.


In [56]:
function subtract(d1, d2)
    res = Dict()
    for key in keys(d1)
        if key  keys(d2)
            res[key] = nothing
        end
    end
    res
end


Out[56]:
subtract (generic function with 1 method)

To find the words in the book that are not in "words.txt", we can use process_file to build a histogram for "words.txt", and then subtract:


In [58]:
words = process_file("../data/words.txt")
diff = subtract(hist, words)

println("Words in the book that aren't in the word list:")
for word in keys(diff)
    print(word, " ")
end


Words in the book that aren't in the word list:
outree quicksighted outwardly adelaide jeffereys unreserved dixons betweens groundless jamess westons remembrance unsuccessfully hawkinss rencontre deservedly favourably incommoded unfavourable wiltshire bateses recollecting newsletters recollected irresistibly pianoforte undiscerned ungraciously humourist bitnet placidity swisserland il xxxxxxxxx internet prosings constitutions ult christian xvi illinois outstepped november woodhouse experienced favourite neptune ocr wallises companionably tremblings recollect uninterruptedly ing emma smallridges disingenuousness injustice admirably appellation familiarise hartuiucvmd surry ix c ii xiii cromer shakespeare vii unconvinced coxs tranquillised complimenter broadwood bellas goodnatured dont richardson doatingly braithwaites unimpeded michaelmas reanimation endthe neighbourhood william influenced harriets disclaimer churchill twelvemonth tunbridge england gentlemans imaginist unexampled hodges dir apologised grahams etextsverend stimulative a winkd sympathise pressingly asparagus adventuring michael mutterings s novitiate maam administered otway hetty destind kramer beaufet hesitatingly wingfields etextsstart soothings illiberal augusta lieut mickleham  isabella cramer inconsideration suitableness feelingly shant mantelpiece caroline promptitude williams penetrated electronically oclock authorised hartvmdcsouiucedu saturday unconcern tupman fastidious d unbleached embrocation wakefield patriae behindhand elizabeth foolishly grandpapa buyings xix penetrating exultingly mcimail al monday hypertext delightfully brunswick wretchedly windsor benedictine unequalled charles b apologise fairfax observable mitchells langham reprobating december unpersuadable grandmama non harmonise goddard persuadable teazed collation aimable o perrys puppyism humouredly sposo visitings m commandingly improvidently tete london n f goddards pre inferiorities aggrandise sufferable honourable blameable xiv emmas smilingly taylors nobodys churchilld merchantability impartial waverings disengage weston yorkshire secresy excusable email doubtingly etextarticles adair clara unexamined unreserve housebreaking silversmith bickerton mr hawkins gentlemens emmazip patroness ascii complacently disclaimers hughes donwell unmentioned sunday deedily uncouthness sweetbread untouched misunderstood hannah honourably friday mistresss recantation dostalis greatcoat arrowroot etext unsuitable childs amor valetudinarian undesigned confidantes pointedly caro xii disengaged theodore unpretending february neighbour penetrate italian broadwoods remembrances conversable rumination james favourable complaisance arthur hindrance churchills unseasonableness designedly eltons july compuservecom enscombe mens hartfield chuse amiableness baly attmail woodhouses tupmans printfor solitarily tm bragges endeavoured charitable transcribing undesignedly controul holyhead unsentimental inquietudes cultivation serle comtesse october tel apartment austen collectedly birmingham taylor nash chuses unsuspected philippics fancifulness randalls stomacher unpolished iii barnes encumbrance gutenbergibc wallis rationality paradings greensward newgut campbell irish unmirthful june knightley behaviour larkinss ungallant hartfields etextetext rd april bragge thats connexions gentlemanlike th indexgut robert e irksomeness manoeuvring x substance ungracious endeavour heartedness desirableness insignificance delightful transcribed endeavours emmaatxt selinas abruptness astleys solicitously disagreeableness otways genlis crossness viii alderneys abdys elton goodhumoured wonderings l acquirements clifton catherine george slightingly champaign kingston philip aladdins xvii scotland mrs disapprobation adversarys encourager dalmane garricks unceremoniousness heartfelt unobjectionable gutenberg st particularity pembroke ftp unsullied etc bodys unexceptionably iv bella complimented tuesday undistinguishing blockhead ceaseless discomposed ireland yourlogin p incommoding weymouth neighbours ebcdic stilton underbred fragrance houseroom grandpapas unpardonably cowper fairfaxs mrcnextcsouiucedu york inseparably stept broadway anne composedly richmond solemnity exultation tautology blanche xv cd dorking grandmamas discordancies unperceived flatterers dublin wingfield prepossession latters endeavouring unexceptionable treachery discourse fidgetiness portionless admirable knightleys anothers unfeignedly inebriety splendour k mget sympathiser undoubting connexion batess successless talkativeness sanguinely cobham entertainingly milmans gradations companionableness dr churchwardens clayton privately scepticism january saunders manchester highbury acquittal unostentatious christmas campbells indiscreetly i unmodulated unprovided w venice compuserve dexterity untowardly richard confederates chusing expediency apartments vi dixon startthe hyperbolical proportionably unexpensively delicately craig ladys unfastidious unpermitted emmatxt submissively login cellery hazle southward larkins sauciness smallridge ibc selina xviii untainted agreeably v criticising coxe isabellas complaisant unfrequently unsuspicious etexts harriet unaccountable 

Some of these words are names and possessives. Others, like "rencontre", are no longer in common use. But a few are common words that should really be in the list.

Random words

To choose a random word from the histogram, the simplest algorithm is to build an array with multiple copies of each word, according to the observed frequency, and then choose from the array:


In [69]:
function random_word(h)
    t = []
    for (word, freq) in h
        for i in 1:freq
            push!(t, word)
        end
    end
    rand(t)
end


Out[69]:
random_word (generic function with 1 method)

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

Debugging

When you are debugging a program, and especially if you are working on a hard bug, there are five things to try:

  • Reading:

    Examine your code, read it back to yourself, and check that it says what you meant to say.

  • Running:

    Experiment by making changes and running different versions. Often if you display the right thing at the right place in the program, the problem becomes obvious, but sometimes you have to build scaffolding.

  • Ruminating:

    Take some time to think! What kind of error is it: syntax, runtime, or semantic? What information can you get from the error messages, or from the output of the program? What kind of error could cause the problem you’re seeing? What did you change last, before the problem appeared?

  • Rubberducking:

    If you explain the problem to someone else, you sometimes find the answer before you finish asking the question. Often you don’t need the other person; you could just talk to a rubber duck. And that’s the origin of the well-known strategy called rubber duck debugging.

  • Retreating:

    At some point, the best thing to do is back off, undoing recent changes, until you get back to a program that works and that you understand. Then you can start rebuilding.