Assignment 5. Arrays, Dictionaries and Tuples


Arrays

Exercise 1

Write a function called nested_sum that takes an array of arrays of integers and adds up the elements from all of the nested arrays. For example:

t = [[1, 2], [3], [4, 5, 6]]
nested_sum(t)
21

Exercise 2

Write a function called cumsum that takes an array of numbers and returns the cumulative sum; that is, a new array where the ith element is the sum of the first i+1 elements from the original array. For example:

t = [1, 2, 3]
cumsum(t)
[1, 3, 6]

Exercise 3

Write a function called middle that takes an array and returns a new array that contains all but the first and last elements. For example:

t = [1, 2, 3, 4]
middle(t)
[2, 3]

Exercise 4

Write a function called chop that takes an array, modifies it by removing the first and last elements, and returns None. For example:

t = [1, 2, 3, 4]
chop(t)
t
[2, 3]

Exercise 5

Write a function called is_sorted that takes an array as a parameter and returns true if the list is sorted in ascending order and false otherwise. For example:

is_sorted([1, 2, 2])
true
is_sorted(['b', 'a'])
false

Exercise 6

Write a function called has_duplicates that takes an array and returns true if there is any element that appears more than once. It should not modify the original array.

Exercise 7*

To check whether a word is in the word array, you could use the in operator, but it would be slow because it searches through the words in order.

Because the words are in alphabetical order, we can speed things up with a bisection search (also known as binary search), which is similar to what you do when you look a word up in the dictionary. You start in the middle and check to see whether the word you are looking for comes before the word in the middle of the list. If so, you search the first half of the list the same way. Otherwise you search the second half.

Either way, you cut the remaining search space in half. If the word array has 113,809 words, it will take about 17 steps to find the word or conclude that it’s not there.

Write a function called in_bisect that takes a sorted array and a target value and returns true if the word is in the list and false if it’s not.

Dictionaries

Exercise 1

Write a function that reads the words in "words.txt" and stores them as keys in a dictionary. It doesn’t matter what the values are. Then you can use the in operator as a fast way to check whether a string is in the dictionary.

Exercise 2

Memoize the Ackermann function and see if memoization makes it possible to evaluate the function with bigger arguments.

Exercise 3

You already have a function named has_duplicates that takes an array as a parameter and returns true if there is any object that appears more than once in the array.

Use a dictionary to write a faster, simpler version of has_duplicates.

Exercise 4

Two words are “rotate pairs” if you can rotate one of them and get the other.

Write a program that reads a wordlist and finds all the rotate pairs.

Tuples

Exercise 1

Write a function called sumall that takes any number of arguments and returns their sum.

Markov Analysis

Exercise 1*

The function random_word of lecture 14 is not very efficient. An alternative is:

  • Use keys to get an array of the words in the book from the histogram.

  • Build an array that contains the cumulative sum of the word frequencies. The last item in this array is the total number of words in the book, n.

  • Choose a random number from 1 to n. Use a bisection search to find the index where the random number would be inserted in the cumulative sum.

  • Use the index to find the corresponding word in the word array.

Write a program that uses this algorithm to choose a random word from the book.

Exercise 2*

The “rank” of a word is its position in an array of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.

Zipf’s law describes a relationship between the ranks and frequencies of words in natural languages (http://en.wikipedia.org/wiki/Zipf's_law). Specifically, it predicts that the frequency, $f$, of the word with rank $r$ is:

$$f = c r^{−s}$$

where $s$ and $c$ are parameters that depend on the language and the text. If you take the logarithm of both sides of this equation, you get: $$\log f = \log c − s \log r$$ So if you plot $\log f$ versus $\log r$, you should get a straight line with slope $−s$ and intercept $\log c$.

Write a program that reads a text from a file, counts word frequencies, and prints one line for each word, in descending order of frequency, with $\log f$ and $\log r$.

Install a plotting library:

Pkg.add("Plots")

Its usage is very similar to plot in MATLAB:

using Plots
x = 1:10
y = x.^2
plot(x, y)

Use the Plots library to plot the results and check whether they form a straight line.

Exercise 3*

If you choose words from the book at random, you can get a sense of the vocabulary, but you probably won’t get a sentence:

this the small regard harriet which knightley's it most things

A series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you would expect an article like “the” to be followed by an adjective or a noun, and probably not a verb or adverb.

One way to measure these kinds of relationships is Markov analysis, which characterizes, for a given sequence of words, the probability of the words that might come next. For example, the song Eric, the Half a Bee begins:

Half a bee, philosophically, 
Must, ipso facto, half not be. 
But half the bee has got to be 
Vis a vis, its entity. D’you see? 

But can a bee be said to be 
Or not to be an entire bee 
When half the bee is not a bee 
Due to some ancient injury?

In this text, the phrase “half the” is always followed by the word “bee”, but the phrase “the bee” might be followed by either “has” or “is”.

The result of Markov analysis is a mapping from each prefix (like “half the” and “the bee”) to all possible suffixes (like “has” and “is”).

Given this mapping, you can generate a random text by starting with any prefix and choosing at random from the possible suffixes. Next, you can combine the end of the prefix and the new suffix to form the next prefix, and repeat.

For example, if you start with the prefix “Half a”, then the next word has to be “bee”, because the prefix only appears once in the text. The next prefix is “a bee”, so the next suffix might be “philosophically”, “be” or “due”.

In this example the length of the prefix is always two, but you can do Markov analysis with any prefix length.

  1. Write a program to read a text from a file and perform Markov analysis. The result should be a dictionary that maps from prefixes to a collection of possible suffixes. The collection might be a list, tuple, or dictionary; it is up to you to make an appropriate choice. You can test your program with prefix length two, but you should write the program in a way that makes it easy to try other lengths.

  2. Add a function to the previous program to generate random text based on the Markov analysis. Here is an example from Emma with prefix length 2:

    He was very clever, be it sweetness or be angry, ashamed or only amused, at such a stroke. She had never thought of Hannah till you were never meant for me?" "I cannot make speeches, Emma:" he soon cut it all himself.
    
    

    For this example, I left the punctuation attached to the words. The result is almost syntactically correct, but not quite. Semantically, it almost makes sense, but not quite.

    What happens if you increase the prefix length? Does the random text make more sense?

  3. Once your program is working, you might want to try a mash-up: if you combine text from two or more books, the random text you generate will blend the vocabulary and phrases from the sources in interesting ways.

Credit: This case study is based on an example from Kernighan and Pike, The Practice of Programming, Addison-Wesley, 1999.