Introduction to Parallel and Distributed Computing

Linh B. Ngo

What is Parallel Computing?

*https://computing.llnl.gov/tutorials/parallel_comp/*
  • Problem
  • Execution framework
  • Compute resource

The problem should be able to

  • Be broken apart into discrete pieces of work that can be solved simultaneously
  • Be solved in less time with multiple compute resources than with a single compute resource

The execution framework should be able to

  • Execute multiple program instructions concurrently at any moment in time

The compute resource might be

  • A single computer with multiple processors
  • An arbitrary number of computers connected by a network
  • A combination of both
  • A special computational component inside a single computer, separate from the main processors (e.g., a GPU)

Progress of Parallel and Distributed Computing

Single computer, single core

*http://www.intel.com/pressroom/kits/pentiumee/*
Single site, single computer, multiple cores

*http://www.intel.com/pressroom/kits/pentiumee/*
Single site, multiple computers, multiple cores

Multiple sites, multiple computers, multiple cores

Multiple sites, multiple computers, multiple cores, virtual unified domain

Distributed Computing Systems

"A collection of individual computing devices that can communicate with each other." (Attiya and Welch, 2004)

Quantification of Performance Improvement

Can we just throw more compute resources at the problem?

Parallel Speedup: How much faster the program runs once additional computing resources are added

Parallel Efficiency: The performance improvement gained per individual unit of computing resource

Given $p$ processors, the speedup $S(p)$ is calculated as the ratio of the time it takes to run the program using a single processor to the time it takes to run the program using $p$ processors.


$S(p) = \frac{\textrm{Sequential runtime}}{\textrm{Parallel runtime}} = \frac{t_{s}}{t_{p}}$

In [1]:
# A program takes 30 seconds to run on a single-core machine,
# and 20 seconds to run on a dual-core machine
ts = 30
tp = 20
S = ts / tp
print(S)


1.5

Theoretical Max: Let $f$ be the fraction of the program that is not parallelizable. Assume no communication overhead.

${t_{p}} = f{t_{s}} + \frac{(1-f)t_{s}}{p}$


$S(p) = \frac{t_{s}}{f{t_{s}} + \frac{(1-f)t_{s}}{p}} = \frac{p}{pf+1-f} = \frac{p}{(p-1)f + 1}$

This is known as Amdahl's Law
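
To see the bound this formula implies, the short sketch below (an illustration only; the value of $f$ is made up for the example) evaluates $S(p)$ for increasing $p$ and compares it with the limit $\frac{1}{f}$, which no number of added processors can exceed.

In [ ]:
# Illustrative value: 10% of the program is not parallelizable (f = 0.1)
f = 0.1

def amdahl_speedup(p, f):
    # Amdahl's Law: S(p) = p / ((p - 1) * f + 1)
    return p / ((p - 1) * f + 1)

for p in [1, 2, 4, 8, 16, 64, 1024]:
    print(p, round(amdahl_speedup(p, f), 2))

# The speedup approaches, but never exceeds, 1 / f
print("limit:", 1 / f)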

The efficiency $E$ is then defined as the ratio of the speedup to the number of processors, $p$.


$E = \frac{t_{s}}{t_{p} \times p} = \frac{S(p)}{p} \times 100\% = \frac{1}{(p-1)f + 1} \times 100\%$

In [16]:
# Suppose that 4% of my application is serial.
# What is my predicted speedup according to Amdahl's Law on 5 processors?
# What is the corresponding efficiency?
f = 0.04
p = 5
S = p / ((p - 1) * f + 1)
E = S / p
print(round(S, 2))
print(round(E, 2))


4.31
0.86

In [4]:
# Suppose that I get a speedup of 8 when I run my application on
# 10 processors.
# According to Amdahl's Law, what portion is serial?
# What is the speedup on 20 processors? What is the efficiency?
# What is the best speedup that I could hope for?
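
A worked sketch of these questions, using nothing beyond Amdahl's Law as stated above:

In [ ]:
# S(p) = p / ((p - 1) * f + 1).  With S(10) = 8, solve for f:
#   8 * (9 * f + 1) = 10   =>   f = 0.25 / 9  (about 2.8% serial)
f = 0.25 / 9

S20 = 20 / (19 * f + 1)   # predicted speedup on 20 processors
E20 = S20 / 20            # parallel efficiency on 20 processors

print(round(f, 4))        # ~0.0278
print(round(S20, 2))      # ~13.09
print(round(E20, 2))      # ~0.65
print(round(1 / f, 2))    # ~36.0, the best speedup as p grows without bound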

Since $S(p)=\frac{p}{(p-1)f + 1}$ and $f \geq 0$, the denominator is at least 1, so $S(p) \leq p$.

Limiting factors:

  • Non-parallelizable code
  • Communication overhead

Superlinear speedup: $S(p)>p$

  • Poor sequential reference implementation
  • Memory caching
  • I/O Blocking

Types of Distributed Computing Systems

**Flynn's Taxonomy** *http://www.slideshare.net/hlshih/high-performance-computing-building-blocks-production-perspective*
  • Streaming SIMD extensions for x86 architectures
  • Shared Memory
  • Distributed Shared Memory
  • Heterogeneous Computing (Accelerators)
  • Message Passing
**Streaming SIMD**
https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
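
As a rough illustration of the data-parallel style that SIMD units exploit (a minimal sketch; NumPy is an assumption here and is not part of the original slides, and whether its loops actually compile down to SSE/AVX instructions depends on the build), one vectorized expression replaces an explicit Python loop over the elements:

In [ ]:
import numpy as np

# A single logical operation applied to many elements at once: the kind
# of data parallelism that SSE/AVX vector units are designed for.
a = np.arange(1_000_000, dtype=np.float32)
b = np.arange(1_000_000, dtype=np.float32)

c = a * b + 2.0       # one vectorized expression over all elements
print(c[:5])
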
**Shared Memory (Distributed Shared Memory)**

*https://computing.llnl.gov/tutorials/parallel_comp/*
  • One process, multiple threads
  • All threads have read/write access to the same memory
  • Programming models (see the Python sketch after this list):
    - Threads (pthread) – programmer manages all parallelism
    - OpenMP: compiler extensions handle parallelization through in-code markers
    - Vendor libraries (e.g., Intel Math Kernel Library)
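
As a rough illustration of the shared-memory model in the notebook's own language (a minimal sketch using Python's standard threading module; it is not pthread or OpenMP code, and CPython's global interpreter lock limits how much true CPU parallelism it achieves), several threads read and write the same variable:

In [ ]:
import threading

counter = 0                  # shared state, visible to every thread
lock = threading.Lock()      # protects the read-modify-write on counter

def work(n):
    global counter
    for _ in range(n):
        with lock:           # without the lock, concurrent updates could be lost
            counter += 1

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)               # 40000: all four threads updated the same memory
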
**Heterogeneous Computing (Accelerators)**

*http://www.nvidia.com/docs/IO/143716/how-gpu-acceleration-works.png*
  • GPU (Graphics Processing Units) – see the sketch after this list
      - Processor unit on graphics cards designed to support graphics rendering (numerical manipulation)
      - Significant advantage for certain classes of scientific problems
      - CUDA – Programming platform and libraries developed by NVIDIA for their GPUs
      - OpenACC – Standard developed by NVIDIA, Cray, and the Portland Group (PGI)
      - C++ AMP – Extensions to Visual C++ (Microsoft) to direct computation to the GPU
      - OpenCL – Set of standards from the Khronos Group (the group behind OpenGL)
  • FPGA (field-programmable gate array)
      - Dynamically reconfigurable integrated circuit
      - Expensive, difficult to program
      - Power efficient, low heat
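
As a rough illustration of offloading numerical work to a GPU from Python (a minimal sketch; the cupy package and an NVIDIA GPU with CUDA installed are assumptions here, not part of the original slides):

In [ ]:
import cupy as cp            # NumPy-like arrays that live in GPU memory

# Build two large matrices on the GPU and multiply them there.
a = cp.random.random((2000, 2000))
b = cp.random.random((2000, 2000))

c = a @ b                    # executed on the GPU through CUDA libraries
print(float(c.sum()))        # copy the scalar result back to the host
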
**Message Passing**
*https://computing.llnl.gov/tutorials/parallel_comp/*
  • Processes manage their own memory; data is passed between processes via messages.
      - Scales well
      - Commodity parts
      - Expandable
      - Heterogeneous
  • Programming Models:
      - MPI: standardized message-passing library (see the sketch after this list)
      - MPI + OpenMP (hybrid model)
      - MapReduce programming model
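
A rough flavor of the message-passing model in the notebook's own language (a minimal sketch; the mpi4py binding and an MPI launcher such as mpirun are assumptions here, not part of the original slides). Each process owns its memory, and data moves only through explicit send/receive calls:

In [ ]:
# Save as hello_mpi.py and launch with, e.g.:  mpirun -n 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()       # this process's ID within the communicator
size = comm.Get_size()       # total number of processes

if rank == 0:
    # Rank 0 sends a message to every other process.
    for dest in range(1, size):
        comm.send(f"hello, rank {dest}", dest=dest, tag=0)
else:
    msg = comm.recv(source=0, tag=0)   # each process receives into its own memory
    print(f"rank {rank} received: {msg}")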

Benchmarking

  • LINPACK (Linear Algebra Package): Dense Matrix Solver
  • HPCC: High-Performance Computing Challenge
      - HPL (LINPACK to solve linear system of equation) 
      - DGEMM (Double Precision General Matrix Multiply)
      - STREAM (Memory bandwidth)
      - PTRANS (Parallel Matrix Transpose to measure processors communication)
      - RandomAccess (Random memory updates)
      - FFT (double precision complex discrete Fourier transform)
      - Communication bandwidth and latency
  • SHOC: Scalable Heterogeneous Computing
      - Non-traditional systems (GPU)
  • TestDFSIO
      - I/O Performance of MapReduce/Hadoop Distributed File System

Ranking

  • TOP500: Rank the supercomputers based on their LINPACK score
  • GREEN500: Rank the supercomputers with emphasis on energy usage (LINPACK / power consumption)
  • GRAPH500: Rank systems based on benchmarks designed for data-intensive computing