# Case study

At this point you have learned about Julia’s core data structures, and you have seen some of the algorithms that use them.

This chapter presents a case study with exercises that let you think about choosing data structures and practice using them.

# Word frequency analysis

## Exercise 1

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.

Hint: the function `isalpha` tests whether a character is alphabetic.

## Exercise 2

Then modify the program to count the total number of words in the book, and the number of times each word is used.

Print the number of different words used in the book. Compare different books by different authors, written in different eras. Which author uses the most extensive vocabulary?

## Exercise 3

Modify the program from the previous exercise to print the 20 most frequently used words in the book.

## Exercise 4

Modify the previous program to read a word list and then print all the words in the book that are not in the word list. How many of them are typos? How many of them are common words that should be in the word list, and how many of them are really obscure?

# Random numbers

Given the same inputs, most computer programs generate the same outputs every time, so they are said to be deterministic. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are an obvious example, but there are more.

Making a program truly nondeterministic turns out to be difficult, but there are ways to make it at least seem nondeterministic. One of them is to use algorithms that generate pseudorandom numbers. Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.

The function `rand` returns a random float between `0.0` and `1.0` (including 0.0 but not 1.0). Each time you call random, you get the next number in a long series. To see a sample, run this loop:

``````

In [1]:

for i in 1:10
x = rand()
println(x)
end

``````
``````

0.10731525410072007
0.7789747330849832
0.37657254127066153
0.32368457967329833
0.8520578409319299
0.3848167724925189
0.325550059088463
0.14858166564191677
0.6715577315536914
0.9618222886461032

``````

The function `rand` can take an iterator or array as argument and returns a random element:

``````

In [10]:

for i in 1:10
x = rand(1:6)
print(x, " ")
end

``````
``````

3 5 4 1 4 2 6 6 6 3

``````
``````

In [14]:

for i in 1:10
x = rand(['a','b','c'])
print(x, " ")
end

``````
``````

c b c b a b a a c c

``````

# Word histogram

Here is a program that reads a file and builds a histogram of the words in the file:

``````

In [15]:

function process_file(filename)
hist = Dict()
for line in eachline(filename)
process_line(line, hist)
end
hist
end

``````
``````

Out[15]:

process_file (generic function with 1 method)

``````
``````

In [24]:

function process_line(line, hist)
line = replace(line, '-', ' ')
for word in split(line)
word = string(filter(isalpha, [word...])...)
word = lowercase(word)
hist[word] = get!(hist, word, 0) + 1
end
end

``````
``````

Out[24]:

process_line (generic function with 1 method)

``````
``````

In [25]:

hist = process_file("../data/emma.txt")

``````
``````

Out[25]:

Dict{Any,Any} with 7380 entries:
"prejudices"      => 1
"practise"        => 3
"offend"          => 3
"quicksighted"    => 1
"contemplate"     => 1
"enjoy"           => 9
"diffuse"         => 1
"unreserved"      => 1
"transplantation" => 1
"fight"           => 1
"dulness"         => 3
"everywhere"      => 3
"exult"           => 1
"indelicacy"      => 1
"inattentive"     => 1
"helping"         => 1
"whose"           => 39
"sleepless"       => 1
"hurried"         => 8
"gout"            => 1
"henry"           => 21
"drawers"         => 1
"incommoded"      => 2
⋮                 => ⋮

``````

This program reads `"emma.txt"`, which contains the text of Emma by Jane Austen.

`process_file` loops through the lines of the file, passing them one at a time to process_line. The histogram `hist` is being used as an accumulator.

`process_line` uses the string method replace to replace hyphens with spaces before using split to break the line into an array of strings. It traverses the array of words and uses `filter`, `isalpha` and `lowercase` to remove punctuation and convert to lower case. (It is a shorthand to say that strings are “converted”; remember that strings are immutable, so a function like `lowercase` return new strings.)

Finally, `process_line` updates the histogram by creating a new item or incrementing an existing one.

To count the total number of words in the file, we can add up the frequencies in the histogram:

``````

In [29]:

function total_words(hist)
sum(values(hist))
end
total_words(hist)

``````
``````

Out[29]:

162742

``````

The number of different words is just the number of items in the dictionary:

``````

In [30]:

function different_words(hist)
length(hist)
end
different_words(hist)

``````
``````

Out[30]:

7380

``````

# Most common words

To find the most common words, we can make an array of tuples, where each tuple contains a word and its frequency, and sort it. The following function takes a histogram and returns an array of word-frequency tuples:

``````

In [51]:

function most_common(hist)
t = []
for (key, value) in hist
push!(t, (value, key))
end
reverse!(sort!(t))
end

``````
``````

Out[51]:

most_common (generic function with 1 method)

``````

In each tuple, the frequency appears first, so the resulting list is sorted by frequency.

Here is a loop that prints the ten most common words:

``````

In [53]:

t = most_common(hist)
println("The most common words are: ")
for (freq, word) in t[1:10]
println(word, "\t", freq)
end

``````
``````

The most common words are:
to	5295
the	5266
and	4931
of	4339
i	3191
a	3155
it	2546
her	2483
was	2400
she	2364

``````

# Optional parameters

We have seen built-in functions and methods that take optional arguments. It is possible to write programmer-defined functions with optional arguments, too. For example, here is a function that prints the most common words in a histogram:

``````

In [54]:

function print_most_common(hist, num=10)
t = most_common(hist)
println("The most common words are: ")
for (freq, word) in t[1:num]
println(word, "\t", freq)
end
end

``````
``````

Out[54]:

print_most_common (generic function with 2 methods)

``````

The first parameter is required; the second is optional. The default value of `num` is `10`. If you only provide one argument:

```print_most_common(hist)
```

`num` gets the default value. If you provide two arguments:

```print_most_common(hist, 20)
```

`num` gets the value of the argument instead. In other words, the optional argument overrides the default value. If a function has both required and optional parameters, all the required parameters have to come first, followed by the optional ones.

# Dictionary subtraction

Finding the words from the book that are not in the word list from `"words.txt"` is a problem you might recognize as set subtraction; that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list).

`subtract` takes dictionaries `d1` and `d2` and returns a new dictionary that contains all the keys from `d1` that are not in `d2`. Since we don’t really care about the values, we set them all to `nothing`.

``````

In [56]:

function subtract(d1, d2)
res = Dict()
for key in keys(d1)
if key ∉ keys(d2)
res[key] = nothing
end
end
res
end

``````
``````

Out[56]:

subtract (generic function with 1 method)

``````

To find the words in the book that are not in `"words.txt"`, we can use process_file to build a histogram for `"words.txt"`, and then `subtract`:

``````

In [58]:

words = process_file("../data/words.txt")
diff = subtract(hist, words)

println("Words in the book that aren't in the word list:")
for word in keys(diff)
print(word, " ")
end

``````
``````

Words in the book that aren't in the word list:

``````

Some of these words are names and possessives. Others, like `"rencontre"`, are no longer in common use. But a few are common words that should really be in the list.

# Random words

To choose a random word from the histogram, the simplest algorithm is to build an array with multiple copies of each word, according to the observed frequency, and then choose from the array:

``````

In [69]:

function random_word(h)
t = []
for (word, freq) in h
for i in 1:freq
push!(t, word)
end
end
rand(t)
end

``````
``````

Out[69]:

random_word (generic function with 1 method)

``````

This algorithm works, but it is not very efficient; each time you choose a random word, it rebuilds the list, which is as big as the original book. An obvious improvement is to build the list once and then make multiple selections, but the list is still big.

# Debugging

When you are debugging a program, and especially if you are working on a hard bug, there are five things to try:

Examine your code, read it back to yourself, and check that it says what you meant to say.

• Running:

Experiment by making changes and running different versions. Often if you display the right thing at the right place in the program, the problem becomes obvious, but sometimes you have to build scaffolding.

• Ruminating:

Take some time to think! What kind of error is it: syntax, runtime, or semantic? What information can you get from the error messages, or from the output of the program? What kind of error could cause the problem you’re seeing? What did you change last, before the problem appeared?

• Rubberducking:

If you explain the problem to someone else, you sometimes find the answer before you finish asking the question. Often you don’t need the other person; you could just talk to a rubber duck. And that’s the origin of the well-known strategy called rubber duck debugging.

• Retreating:

At some point, the best thing to do is back off, undoing recent changes, until you get back to a program that works and that you understand. Then you can start rebuilding.