The problem should be able to
The execution framework should be able to
The compute resource might be
Given $p$ processors, the speedup $S(p)$ is defined as the ratio of the time it takes to run the program on a single processor to the time it takes to run it on $p$ processors.
In [1]:
# A program takes 30 seconds to run on a single-core machine,
# and 15 seconds to run on a dual-core machine
ts = 30
tp = 15
S = ts / tp
print(S)
Theoretical Max: Let $f$ be the fraction of the program that is not parallelizable, and assume no communication overhead. Then the speedup on $p$ processors is bounded by $$S(p) = \frac{p}{(p-1)f + 1} \leq \frac{1}{f}.$$
This is known as Amdahl's Law.
The efficiency $E$ is then defined as the ratio of the speedup to the number of processors: $E = \frac{S(p)}{p}$.
In [16]:
# Suppose that 4% of my application is serial.
# What is my predicted speedup according to Amdahl’s Law on 5 processors?
f = 0.04
p = 5
E = 1 / ((p - 1) * f + 1)   # efficiency, since E = S(p)/p
print(E)                    # about 0.862
print(E * p)                # speedup, about 4.31
In [4]:
# Suppose that I get a speedup of 8 when I run my application
# on 10 processors. According to Amdahl's Law: what portion is serial?
# What is the speedup on 20 processors? What is the efficiency?
# What is the best speedup that I could hope for?
S, p = 8, 10
f = (p / S - 1) / (p - 1)        # solve S = p/((p-1)f + 1) for f
S20 = 20 / ((20 - 1) * f + 1)
print(f, S20, S20 / 20, 1 / f)   # about 0.028, 13.09, 0.65, and 36.0
Since $S(p)=\frac{p}{(p-1)f + 1}$, we have $S(p) \leq p$, with equality only when $f = 0$.
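The saturation implied by Amdahl's Law is easy to see numerically. The sketch below (my own helper, not part of the original notes) evaluates $S(p) = \frac{p}{(p-1)f + 1}$ for a fixed serial fraction and an increasing processor count:

```python
# Amdahl's Law: speedup saturates at 1/f no matter how many processors.

def amdahl_speedup(p, f):
    """Predicted speedup on p processors when a fraction f is serial."""
    return p / ((p - 1) * f + 1)

f = 0.04  # 4% of the program is serial
for p in (1, 2, 5, 10, 100, 1000):
    print(p, round(amdahl_speedup(p, f), 2))

# The speedup can never exceed 1/f = 25, regardless of p.
print(1 / f)
```

Even at 1000 processors the predicted speedup (about 24.4) remains below the $1/f = 25$ ceiling.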
Limiting factors:
Superlinear speedup: $S(p)>p$
- Threads (pthread) – programmer manages all parallelism
- OpenMP: Compiler extensions handle parallelization through in-code markers
- Vendor libraries (e.g. Intel Math Kernel Library)
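To illustrate the "programmer manages all parallelism" style of the pthread bullet, here is a minimal sketch using Python's `threading` module (the function and variable names are my own; note that CPython's GIL means this demonstrates the programming model, not actual CPU parallelism):

```python
# Pthread-style parallel sum: the programmer splits the work,
# launches the threads, and handles all synchronization explicitly.
import threading

def partial_sum(data, start, end, results, idx):
    # Each worker sums its own slice and writes to its own slot.
    results[idx] = sum(data[start:end])

data = list(range(1000))
n_threads = 4
chunk = len(data) // n_threads
results = [0] * n_threads
threads = []
for i in range(n_threads):
    start = i * chunk
    end = len(data) if i == n_threads - 1 else (i + 1) * chunk
    t = threading.Thread(target=partial_sum,
                         args=(data, start, end, results, i))
    threads.append(t)
    t.start()
for t in threads:
    t.join()  # the programmer is responsible for joining every thread
print(sum(results))  # 499500
```

An OpenMP version of the same loop would replace all of this bookkeeping with a single `#pragma omp parallel for reduction(+:total)` marker, which is exactly the trade-off the two bullets describe.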
- Processing unit on graphics cards, originally designed for graphics rendering, now used for general numerical computation
- Significant advantage for certain classes of scientific problems
- CUDA – Library developed by NVIDIA for their GPUs
- OpenACC – Standard developed by NVIDIA, Cray, and the Portland Group (PGI).
- C++ AMP – Extensions to Visual C++ (Microsoft) to direct computation to the GPU
- OpenCL – Open standard maintained by the Khronos Group (the group behind OpenGL)
- Dynamically reconfigurable circuit board
- Expensive, difficult to program
- Power efficient, low heat
- Scales well
- Commodity parts
- Expandable
- Heterogeneous
- MPI: standardized message passing library
- MPI + OpenMP (hybrid model)
- MapReduce programming model
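The MapReduce model in the last bullet can be sketched in a few lines of plain Python (a toy illustration of my own, not a real distributed implementation): map emits (key, value) pairs, a shuffle step groups values by key, and reduce aggregates each group.

```python
# Word count, the canonical MapReduce example.
from collections import defaultdict

def map_phase(text):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # Group all values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real framework such as Hadoop, the map and reduce functions run on many machines and the shuffle moves data across the network; only the two user-supplied functions look like the ones above.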
- HPL (High-Performance LINPACK: solves a dense linear system of equations)
- DGEMM (Double-precision General Matrix Multiply)
- STREAM (memory bandwidth)
- PTRANS (Parallel Matrix Transpose, to measure inter-processor communication)
- RandomAccess (random memory updates)
- FFT (double-precision complex discrete Fourier transform)
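As a reference point for the DGEMM entry above, here is a naive pure-Python sketch (my own, for illustration only) of the operation that benchmark times; a real DGEMM computes the more general $C = \alpha AB + \beta C$ using heavily tuned vendor kernels:

```python
# Naive double-precision general matrix multiply: C = A * B.
def dgemm(A, B):
    n, m, k = len(A), len(B[0]), len(B)
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for l in range(k):
                s += A[i][l] * B[l][j]
            C[i][j] = s
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(dgemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The benchmark reports floating-point throughput (roughly $2n^3$ flops for $n \times n$ matrices divided by the elapsed time), which is why DGEMM is a good proxy for a machine's peak compute rate.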
- Communication bandwidth and latency
- Non-traditional systems (GPU)
- I/O Performance of MapReduce/Hadoop Distributed File System