1. The most common parallelization task is to run the same function on different sets of data. For example, we may need to read in many files, do some processing on each file, and finally merge the results.

  • Write a function to perform a word count given an iterable of strings. The function strips leading and trailing spaces from each string, removes punctuation, converts to lowercase, and splits on whitespace. It returns a dictionary whose keys are words and whose values are the counts of each word (see the sketch below). For example
word_count(["This is a string.", "This is another string!"])

returns

{'a': 1, 'another': 1, 'is': 2, 'string': 2, 'this': 2}

In [ ]:
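A minimal sketch of one possible implementation, using collections.Counter and str.translate to strip punctuation (the exact punctuation handling is an assumption):

import string
from collections import Counter

def word_count(lines):
    """Return a dict mapping each word to its count across all lines."""
    table = str.maketrans('', '', string.punctuation)  # drop all punctuation
    counts = Counter()
    for line in lines:
        counts.update(line.strip().translate(table).lower().split())
    return dict(counts)

Calling word_count(["This is a string.", "This is another string!"]) with this version reproduces the dictionary shown above.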

2. We first need to download the text files to run word_count on. Downloading is network bound rather than CPU bound, so threading or asynchronous calls work well here. Read the list of file names that can be found at http://people.duke.edu/~ccc14/jokes.

Time how long it takes to download all the named files using the %%time cell magic:

  • Use a for loop
  • Use a ThreadPoolExecutor from concurrent.futures
  • Use asynchronous calls and the asyncio event loop

To download a file, first save the response from requests.get(url) in a variable r. Then write r.text to a file using standard Python file handling. A sketch of all three approaches is given below.


In [ ]:
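One possible sketch of the three strategies, assuming the file names have already been read into a list fnames (the names below are hypothetical placeholders). In the notebook, each strategy would sit in its own cell under %%time so the timings can be compared:

import asyncio
import requests
from concurrent.futures import ThreadPoolExecutor

base_url = 'http://people.duke.edu/~ccc14/jokes'
fnames = ['joke1.txt', 'joke2.txt']  # hypothetical names; use the list read from the page

def download_one(fname):
    """Fetch one file and write its text to disk."""
    r = requests.get('%s/%s' % (base_url, fname))
    with open(fname, 'w') as f:
        f.write(r.text)

# 1. Plain for loop: downloads happen one after another
for fname in fnames:
    download_one(fname)

# 2. Thread pool: threads overlap the time spent waiting on the network
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(download_one, fnames)

# 3. asyncio: schedule the blocking downloads on the default executor
async def download_all(fnames):
    loop = asyncio.get_event_loop()
    futures = [loop.run_in_executor(None, download_one, f) for f in fnames]
    await asyncio.gather(*futures)

asyncio.get_event_loop().run_until_complete(download_all(fnames))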

3. Now use processes from either concurrent.futures or multiprocessing to parallelize the counting of words in each downloaded file. Finally, merge the returned dictionaries and print the 10 most common words, with their counts, across all files (see the sketch below).


In [ ]:
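A minimal sketch with ProcessPoolExecutor, reusing word_count from exercise 1 and the fnames list from exercise 2; merging with collections.Counter is one convenient choice:

from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_file(fname):
    """Word count for a single downloaded file, read line by line."""
    with open(fname) as f:
        return word_count(f)

with ProcessPoolExecutor(max_workers=4) as pool:
    counts = pool.map(count_file, fnames)

# Counter.update adds counts rather than replacing them, so the
# per-file dictionaries merge into one grand total
total = Counter()
for c in counts:
    total.update(c)
print(total.most_common(10))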

4. Recall the cdist function from the previous worksheet. Suppose we had a large number of vectors to work with. Parallelize the cdist call for the following data set.


In [51]:
import numpy as np

np.random.seed(123)

n1 = 50000  # number of vectors in XA
n2 = 100    # number of vectors in XB
p = 10      # dimension of each vector
XA = np.random.normal(0, 1, (n1, p))
XB = np.random.normal(0, 1, (n2, p))

In [ ]:
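One sketch: split XA into chunks, hand each chunk to a worker process, and stack the partial distance matrices. Here scipy's cdist stands in for the worksheet's version, and the chunk and worker counts are arbitrary choices:

import numpy as np
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from scipy.spatial.distance import cdist

chunks = np.array_split(XA, 8)  # split the rows of XA into 8 pieces
with ProcessPoolExecutor(max_workers=8) as pool:
    blocks = pool.map(partial(cdist, XB=XB), chunks)

# Stack the partial (chunk_size x n2) blocks back into the full matrix
dist = np.vstack(list(blocks))
print(dist.shape)  # (50000, 100)

Note that ProcessPoolExecutor pickles each chunk and each result block between processes, so for arrays of this size the copying overhead can rival the computation itself; timing both versions is the point of the exercise.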