The goal is to design parallel programs that are flexible, efficient and simple.
Step 0: Start by profiling a serial program to identify bottlenecks
Step 1: Are there opportunities for parallelism?
Step 2: What is the nature of the parallelism?
Step 3: What is the granularity?
Step 4: Choose an algorithm
Organize by tasks
Organize by data
Organize by flow
Step 5: Map to program and data structures
Step 6: Map to parallel environment
Step 7: Execute, debug, tune in parallel environment
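As an illustration of Step 0, the following is a minimal sketch of profiling a serial program with the standard library's cProfile and pstats modules. The functions slow_step, fast_step and pipeline are hypothetical stand-ins for a real workload, and profile_report is a helper invented for this example, not part of any library.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical bottleneck: a pure-Python sum of squares."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_step(n):
    """Hypothetical cheap step."""
    return n * 2


def pipeline(n):
    """Hypothetical serial program: one cheap step and one expensive step."""
    return fast_step(n) + slow_step(n)


def profile_report(func, *args, top=5):
    """Run func(*args) under cProfile; return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(pipeline, 200_000)
print(report)
```

In the printed table, the function that dominates the cumulative time column is the first candidate to examine in Steps 1 to 3.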
Many statistical problems are embarrassingly parallel and can be easily decomposed into independent tasks or data sets. Here are several examples:
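One common case, chosen here purely as an illustration, is the bootstrap: every replicate is an independent task. The sketch below assumes made-up data and hypothetical helper names (bootstrap_mean, bootstrap_means). It accepts any concurrent.futures executor; a ThreadPoolExecutor keeps the example portable, while a ProcessPoolExecutor would be the more usual choice for CPU-bound pure-Python work.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(task):
    """One independent task: resample with replacement and return the mean."""
    data, seed = task
    rng = random.Random(seed)  # private generator, so tasks share no state
    resample = rng.choices(data, k=len(data))
    return statistics.fmean(resample)


def bootstrap_means(data, n_replicates, executor):
    """Run n_replicates independent bootstrap tasks on the given executor."""
    tasks = [(data, seed) for seed in range(n_replicates)]
    return list(executor.map(bootstrap_mean, tasks))


data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]  # made-up sample

with ThreadPoolExecutor(max_workers=4) as pool:
    means = bootstrap_means(data, 200, pool)

print(len(means), min(means), max(means))
```

Because each task carries its own seed, the result does not depend on how the tasks are scheduled across workers.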
Other problems are serial at small scale but can be parallelized at large scale. For example, EM and MCMC iterations are inherently serial because each one depends on the previous state, but a single iteration can involve many thousands of density calculations (one per data point to calculate the likelihood), and those calculations are embarrassingly parallel.
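The pattern of a serial outer loop with a parallel inner step can be sketched as follows. This is a toy under stated assumptions, not an EM or MCMC implementation: the model is a normal density, the data are simulated, the update rule for mu is a hypothetical placeholder, and the helper names are invented for the example. Threads are used only to keep the sketch portable; for CPU-bound pure-Python work a ProcessPoolExecutor could be passed in instead.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def chunk_loglik(task):
    """Log-likelihood contribution of one chunk of data points."""
    chunk, mu, sigma = task
    return sum(normal_logpdf(x, mu, sigma) for x in chunk)


def parallel_loglik(data, mu, sigma, executor, n_chunks=4):
    """Split the data into chunks, evaluate them on the executor, add the partial sums."""
    size = max(1, math.ceil(len(data) / n_chunks))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = executor.map(chunk_loglik, [(c, mu, sigma) for c in chunks])
    return sum(partials)


rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(10_000)]  # simulated data
sample_mean = sum(data) / len(data)

mu = 0.0
with ThreadPoolExecutor(max_workers=4) as pool:
    for iteration in range(3):  # serial: each step needs the previous mu
        loglik = parallel_loglik(data, mu, 2.0, pool)  # parallel within the step
        print(iteration, mu, loglik)
        mu = 0.5 * (mu + sample_mean)  # hypothetical update, for illustration only
```

The per-chunk sums are independent of one another, so only the final addition of the partial results needs to happen in one place.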
This "low-hanging fruit" is valuable because it offers a path to easy parallelism with minimal complexity.
The bigger the problem, the more scope there is for parallelism
Amdahl's law says that the speedup from parallelization is bounded by the irreducibly serial part of the algorithm: if a fraction s of the run time cannot be parallelized, the speedup can never exceed 1/s, however many processors are used. For big data analysis, however, Gustafson's law is more relevant. It observes that we are nearly always interested in increasing the size of the parallelizable part, so the ratio of parallelizable to irreducibly serial code is not a static quantity but grows with the data size. For example, Gibbs sampling is irreducibly serial across iterations, but for large samples each iteration may be able to perform PDF evaluations in parallel for a very large number of data points.
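The two laws can be compared numerically with a short sketch using their standard textbook forms. The function names and the choice of a 95% parallel fraction are assumptions made for illustration. Note that the two fractions are measured differently: for Amdahl's law it is the parallelizable share of the serial run time, and for Gustafson's law it is the parallel share of the run time on the parallel system.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Fixed problem size: speedup = 1 / (s + p / n), with s = 1 - p."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def gustafson_speedup(parallel_fraction, n_workers):
    """Problem size grows with the workers: scaled speedup = s + p * n."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * n_workers


p = 0.95  # assumed parallel fraction, for illustration only
for n in (10, 100, 1000):
    print(n, amdahl_speedup(p, n), gustafson_speedup(p, n))
```

With a fixed problem size, the Amdahl bound levels off below 1/(1 - p) = 20 no matter how many workers are added, whereas the scaled speedup keeps growing roughly in proportion to the number of workers.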
sklearn
pymc3
pystan
target=parallel in numba.vectorize and numba.guvectorize
openmp with cython.parallel, cython.prange and cython.nogil
concurrent.futures
multiprocessing
ipyparallel within Jupyter
memmap
HDF5 and h5py
dask
blaze
pyspark