The goal is to design parallel programs that are flexible, efficient and simple.
Step 0: Start by profiling a serial program to identify bottlenecks (see the profiling sketch after the step list)
Step 1: Are there opportunities for parallelism?
Step 2: What is the nature of the parallelism?
Step 3: What is the granularity?
Step 4: Choose an algorithm
Organize by tasks
Organize by data
Organize by flow
Step 5: Map to program and data structures
Step 6: Map to parallel environment
Step 7: Execute, debug, tune in parallel environment
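As a minimal illustration of Step 0, the sketch below profiles a hypothetical serial function with the standard-library cProfile module to see where time is spent; the slow_pipeline, simulate, and summarize names are invented for this example.

```python
import cProfile
import pstats
import random


def simulate(n):
    # stand-in for an expensive, data-parallel computation
    return [random.gauss(0, 1) ** 2 for _ in range(n)]


def summarize(xs):
    # stand-in for a cheap, irreducibly serial reduction
    return sum(xs) / len(xs)


def slow_pipeline():
    data = simulate(200_000)
    return summarize(data)


cProfile.run("slow_pipeline()", "pipeline.prof")
stats = pstats.Stats("pipeline.prof")
stats.sort_stats("cumulative").print_stats(10)  # show the top 10 bottlenecks
```

If most of the cumulative time sits in a function like simulate, that is the part worth parallelizing.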
Many statistical problems are embarrassingly parallel and can easily be decomposed into independent tasks or data sets, for example Monte Carlo simulation, bootstrap resampling, cross-validation, and fitting the same model to many independent data sets (see the sketch below).
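For instance, the bootstrap is embarrassingly parallel because each resample is independent of the others. A minimal sketch using the standard-library multiprocessing module might look like this (the synthetic data and the boot_mean helper are invented for illustration):

```python
import multiprocessing as mp

import numpy as np


def boot_mean(args):
    """One bootstrap replicate: resample with replacement and return the mean."""
    data, seed = args
    rng = np.random.default_rng(seed)
    sample = rng.choice(data, size=len(data), replace=True)
    return sample.mean()


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(loc=5, scale=2, size=10_000)
    tasks = [(data, seed) for seed in range(1000)]     # 1000 independent replicates
    with mp.Pool() as pool:
        boot_means = pool.map(boot_mean, tasks)        # each replicate runs in a worker process
    print(np.percentile(boot_means, [2.5, 97.5]))      # bootstrap confidence interval for the mean
```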
Other problems are serial at a small scale but can be parallelized at large scales. For example, EM and MCMC iterations are inherently serial, since each iteration depends on the previous state; but within a single iteration there may be many thousands of density calculations (one per data point when evaluating the likelihood), and these form an embarrassingly parallel problem within the iteration.
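As a sketch of this within-iteration parallelism, the per-point log-density evaluations that dominate a single EM or MCMC iteration can be spread across cores, for example with a numba target='parallel' ufunc; the Gaussian norm_logpdf function here is just an illustrative stand-in for whatever density the model uses.

```python
import math

import numpy as np
from numba import vectorize


@vectorize(["float64(float64, float64, float64)"], target="parallel")
def norm_logpdf(x, mu, sigma):
    # one independent log-density evaluation per data point
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


x = np.random.default_rng(0).normal(size=1_000_000)
loglik = norm_logpdf(x, 0.0, 1.0).sum()   # the reduction each iteration actually needs
```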
These "low hanging fruits" are great because they offer a path to easy parallelism with minimal complexity.
The bigger the problem, the more scope there is for parallelism
Amdahl's law says that the speedup from parallelization is bounded by the ratio of parallelizable to irreducibly serial code in the algorithm. For big data analysis, however, Gustafson's law is more relevant: we are nearly always interested in growing the size of the parallelizable part, so the ratio of parallelizable to irreducibly serial work is not a static quantity but depends on the data size. For example, Gibbs sampling is irreducibly serial across iterations, but with large samples each iteration may be able to perform its PDF evaluations in parallel over zillions of data points.
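The two laws reduce to simple formulas: with parallel fraction p and n workers, Amdahl's law gives speedup 1 / ((1 - p) + p/n) for a fixed problem size, while Gustafson's law gives scaled speedup (1 - p) + p*n when the parallel workload grows with n. A quick sketch comparing them (the 0.95 parallel fraction is an arbitrary example):

```python
def amdahl(p, n):
    """Speedup on n workers for a fixed-size problem with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson(p, n):
    """Scaled speedup when the parallel workload grows with the number of workers."""
    return (1.0 - p) + p * n


for n in (4, 16, 64, 256):
    print(f"n={n:4d}  Amdahl: {amdahl(0.95, n):6.1f}x  Gustafson: {gustafson(0.95, n):6.1f}x")
```

Amdahl's speedup saturates near 20x no matter how many workers are added, while Gustafson's scaled speedup keeps growing with n.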
sklearn
pymc3
pystan
target='parallel' in numba.vectorize and numba.guvectorize
openmp with cython.parallel, cython.prange and cython.nogil
concurrent.futures
multiprocessing
ipyparallel within Jupyter
memmap
HDF5 and h5py
dask
blaze
pyspark
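As one example from this list, dask lets the same kind of array computation run in parallel over chunks that need not all fit in memory at once; a minimal sketch (the array size, chunking, and computation are arbitrary choices for illustration):

```python
import dask.array as da

# 10**8 standard normals, split into 10 chunks that can be processed in parallel
x = da.random.normal(0, 1, size=(100_000_000,), chunks=(10_000_000,))

# build a lazy task graph, then execute it across the available cores
result = (x ** 2).mean().compute()
print(result)   # close to 1, the variance of a standard normal
```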