1. The most common parallelization task is to run the same function on different sets of data. For example, we may need to read in many files, do some processing on each file, and finally merge the results.
word_count(["This is a string.", "This is another string!"])
returns
{'a': 1, 'another': 1, 'is': 2, 'string': 2, 'this': 2}
In [ ]:
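One possible sketch of word_count (the worksheet does not give its implementation, so the tokenization rule here is an assumption): lowercase the text, extract word tokens with a regular expression, and tally them with collections.Counter.

```python
from collections import Counter
import re

def word_count(docs):
    """Count word occurrences across a list of strings (case-insensitive).

    The token pattern [a-z']+ is an assumption; it drops punctuation and
    digits, which reproduces the example output shown above.
    """
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return dict(counts)

word_count(["This is a string.", "This is another string!"])
# returns {'this': 2, 'is': 2, 'a': 1, 'string': 2, 'another': 1}
```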
2. We first need to download the text files to run word_count on. This is network rather than CPU bound, so threading or asynchronous calls work well here. Read a list of names of files that can be found at http://people.duke.edu/~ccc14/jokes.
Time how long it takes to download all named files using the %%time cell magic, using either:
- ThreadPoolExecutor from concurrent.futures
- an asyncio event loop

To download a file, first save the response from requests.get(url) as a variable r. Then write r.text into a file using standard Python syntax.
In [ ]:
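A minimal sketch of the threaded-download pattern. To keep it runnable offline, time.sleep stands in for the network wait; the commented-out requests calls show where the real download would go (base_url and the file names are placeholders, not the actual list from the site).

```python
import time
from concurrent.futures import ThreadPoolExecutor

def download(name):
    """Stand-in for an I/O-bound download.

    In the exercise this would be roughly:
        r = requests.get(base_url + name)
        with open(name, 'w') as f:
            f.write(r.text)
    Here time.sleep simulates network latency so the sketch runs offline.
    """
    time.sleep(0.1)
    return name

names = ['file%d.txt' % i for i in range(8)]

# Sequential: the waits add up (~0.8 s for 8 files here).
start = time.time()
for name in names:
    download(name)
seq = time.time() - start

# Threaded: the I/O waits overlap, so total time is close to a single wait.
start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(download, names))
par = time.time() - start

print('sequential %.2f s, threaded %.2f s' % (seq, par))
```

In the notebook, the %%time cell magic on each version gives the same comparison without hand-rolled timers.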
3. Now use processes from either concurrent.futures or multiprocessing to parallelize the counting of words in each downloaded file. Finally, merge the returned dictionaries and print the 10 most common words and their counts from all files.
In [ ]:
4. Recall the cdist function from the previous worksheet. Suppose we had a large number of vectors to work with. Parallelize the cdist call for the following data set.
In [51]:
import numpy as np
np.random.seed(123)
n1 = 50000
n2 = 100
p = 10
XA = np.random.normal(0, 1, (n1, p))
XB = np.random.normal(0, 1, (n2, p))
In [ ]: