Consider a multi-core CPU with p cores. The peak single-precision performance of each core is 10 GFlops/s. The peak bandwidth of the memory bus is 60 GB/s. Assume that you are evaluating a polynomial of the form $x = y^2 + z^3 + yz$, which can be implemented as follows:
float x[N], y[N], z[N];
for (i = 0; i < N; ++i)
    x[i] = y[i]*y[i] + z[i]*z[i]*z[i] + y[i]*z[i];
A. What is the arithmetic intensity (flops/byte) of each iteration of the loop? Think of how many bytes need to be transferred over the memory bus (both reads and writes) and how many floating point operations are performed.
$f = $ total # of floating point arithmetic (flop) operations
$m = $ total # of memory elements (words) moved between fast and slow memory
float x[N], y[N], z[N];
// read y(1:n) into fast memory
// read z(1:n) into fast memory
for (i = 0; i < N; ++i)
    x[i] = y[i]*y[i] + z[i]*z[i]*z[i] + y[i]*z[i];
// write x(1:n) back to slow memory
$f = (4$ multiplications $+ 2$ additions$) \times n = 6n$ flops
3 slow memory operations per iteration: read y[i], read z[i], write x[i]
float = 4 bytes
$m = (3$ words $\times 4$ bytes$) \times n = 12n$ bytes
$q = \frac{f}{m} = \frac{6n}{12n} = 0.5$ flops/byte
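The per-iteration counts above can be sanity-checked in a couple of lines (a sketch; the 6 flops and 3 words per iteration are read directly off the loop body):

```python
def arithmetic_intensity(word_bytes):
    """Flops per byte for x[i] = y[i]*y[i] + z[i]*z[i]*z[i] + y[i]*z[i]."""
    flops_per_iter = 6   # 4 multiplications + 2 additions
    words_per_iter = 3   # read y[i], read z[i], write x[i]
    return flops_per_iter / (words_per_iter * word_bytes)

q_single = arithmetic_intensity(4)  # float:  0.5 flops/byte
q_double = arithmetic_intensity(8)  # double: 0.25 flops/byte (part C)
```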
B. Plot the theoretical peak performance of the system (GFlops/s) as a function of the number of cores, for $1 \le p \le 8$. For what values of p would this computation’s performance be compute-bound and for what values would it be memory-bound?
The memory-bound ceiling is $q \times 60$ GB/s $= 30$ GFlops/s and the compute ceiling is $10p$ GFlops/s, so for p <= 3 the performance is compute-bound, and for p > 3 it is memory-bound.
In [24]:
FLOPS_PER_CORE = 10 # GFlops/s
MEMORY_BUS = 60 # GB/s
Q = .5 # Flops/byte
def tpp(cores):
    return min(FLOPS_PER_CORE * cores, Q * MEMORY_BUS)
plt.plot(range(1, 9), [tpp(i) for i in range(1, 9)], label="Theoretical peak performance")
plt.plot(range(1, 9), [FLOPS_PER_CORE * i for i in range(1, 9)], '--', label="CPU-bound peak performance")
plt.plot(range(1, 9), [Q * MEMORY_BUS for i in range(1, 9)], '--', label="Memory-bound peak performance")
plt.xlabel("number of cores")
plt.ylabel("theoretical peak performance (GFlops/s)")
plt.ylim((1, 40))
_ = plt.legend(loc='best')
C. Repeat A and B for the same computation, but this time in double precision (i.e. double x[N], y[N], z[N]). The peak double-precision performance per core is half of the single-precision peak: 5 GFlops/s.
double x[N], y[N], z[N];
// read y(1:n) into fast memory
// read z(1:n) into fast memory
for (i = 0; i < N; ++i)
    x[i] = y[i]*y[i] + z[i]*z[i]*z[i] + y[i]*z[i];
// write x(1:n) back to slow memory
$f = (4$ multiplications $+ 2$ additions$) \times n = 6n$ flops
3 slow memory operations per iteration: read y[i], read z[i], write x[i]
double = 8 bytes
$m = (3$ words $\times 8$ bytes$) \times n = 24n$ bytes
$q = \frac{f}{m} = \frac{6n}{24n} = 0.25$ flops/byte
The memory-bound ceiling drops to $q \times 60$ GB/s $= 15$ GFlops/s while the compute ceiling is $5p$ GFlops/s, so again for p <= 3 the performance is compute-bound, and for p > 3 it is memory-bound.
In [25]:
FLOPS_PER_CORE = 5 # GFlops/s
MEMORY_BUS = 60 # GB/s
Q = .25 # Flops/byte
def tpp(cores):
    return min(FLOPS_PER_CORE * cores, Q * MEMORY_BUS)
plt.plot(range(1, 9), [tpp(i) for i in range(1, 9)], label="Theoretical peak performance")
plt.plot(range(1, 9), [FLOPS_PER_CORE * i for i in range(1, 9)], '--', label="CPU-bound peak performance")
plt.plot(range(1, 9), [Q * MEMORY_BUS for i in range(1, 9)], '--', label="Memory-bound peak performance")
plt.xlabel("number of cores")
plt.ylabel("theoretical peak performance (GFlops/s)")
plt.ylim((1, 40))
_ = plt.legend(loc='best')
Consider the same architecture as above - p cores, each with 5 GFlops/s peak double-precision performance and 60 GB/s memory bandwidth. Let there be only a single level of cache with a total bandwidth of 300 GB/s (from cache to all cores). Now assume that your computational kernel has an arithmetic intensity of 0.125 flops/byte and a cache hit ratio of 50%. Plot the performance of the system as a function of the number of cores p. Denote the ranges of p where the peak performance is bound by the CPU, the memory bandwidth and the cache bandwidth (if applicable).
I modeled the theoretical peak performance as

$$ tpp(p) = \min \begin{cases} p \times \verb|FLOPS_PER_CORE| \\ Q \times \verb|L1_CACHE_BANDWIDTH| \times \verb|CACHE_HIT_RATIO| \end{cases} $$

In the first case the computation is CPU-bound, and in the second it is cache-bound.
The cache model is somewhat optimistic, since the effective bandwidth is lower whenever main memory has to be accessed, but it is still a valid upper bound: the system cannot be faster than that.
With $\verb|CACHE_HIT_RATIO| = 0.5$ the cache ceiling is $0.125 \times 300 \times 0.5 = 18.75$ GFlops/s, so the computation is CPU-bound for p <= 3 and cache-bound for p > 3.
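For comparison, the more pessimistic model that the caveat above alludes to would charge every cache miss at memory speed; the harmonic-mean formula below is my own assumption, not part of the assignment:

```python
MEMORY_BUS = 60           # GB/s
L1_CACHE_BANDWIDTH = 300  # GB/s
Q = 0.125                 # flops/byte

def effective_bandwidth(hit_ratio):
    # each byte is served from cache with probability hit_ratio, else from memory
    return 1.0 / (hit_ratio / L1_CACHE_BANDWIDTH + (1 - hit_ratio) / MEMORY_BUS)

pessimistic_ceiling = Q * effective_bandwidth(0.5)  # 12.5 GFlops/s
optimistic_ceiling = Q * L1_CACHE_BANDWIDTH * 0.5   # 18.75 GFlops/s, used above
```

Under this model the bandwidth ceiling drops to 12.5 GFlops/s and the crossover would move to p <= 2, which is exactly why I call the cache model above optimistic.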
In [40]:
FLOPS_PER_CORE = 5 # GFlops/s
MEMORY_BUS = 60 # GB/s
Q = .125 # Flops/byte
CACHE_HIT_RATIO = .5 #
L1_CACHE_BANDWIDTH = 300 # GB/s
def tpp(cores):
    return min(FLOPS_PER_CORE * cores, Q * (L1_CACHE_BANDWIDTH * CACHE_HIT_RATIO))
plt.plot(range(1, 9), [tpp(i) for i in range(1, 9)], label="Theoretical peak performance")
plt.plot(range(1, 9), [FLOPS_PER_CORE * i for i in range(1, 9)], '--', label="CPU-bound peak performance")
plt.plot(range(1, 9), [Q * MEMORY_BUS for i in range(1, 9)], '--', label="Memory-bound peak performance")
plt.plot(range(1, 9), [Q * L1_CACHE_BANDWIDTH for i in range(1, 9)], '--', label="Cache-bound peak performance")
plt.xlabel("number of cores")
plt.ylabel("theoretical peak performance (GFlops/s)")
plt.ylim((1, 40))
_ = plt.legend(loc='best')
In [38]:
FLOPS_PER_CORE = 5 # GFlops/s
MEMORY_BUS = 60 # GB/s
Q = .125 # Flops/byte
CACHE_HIT_RATIO = .7
L1_CACHE_BANDWIDTH = 300 # GB/s
def tpp(cores):
    return min(FLOPS_PER_CORE * cores, Q * (L1_CACHE_BANDWIDTH * CACHE_HIT_RATIO))
plt.plot(range(1, 9), [tpp(i) for i in range(1, 9)], label="Theoretical peak performance")
plt.plot(range(1, 9), [FLOPS_PER_CORE * i for i in range(1, 9)], '--', label="CPU-bound peak performance")
plt.plot(range(1, 9), [Q * MEMORY_BUS for i in range(1, 9)], '--', label="Memory-bound peak performance")
plt.plot(range(1, 9), [Q * L1_CACHE_BANDWIDTH for i in range(1, 9)], '--', label="Cache-bound peak performance")
plt.xlabel("number of cores")
plt.ylabel("theoretical peak performance (GFlops/s)")
plt.ylim((1, 40))
_ = plt.legend(loc='best')
In [41]:
FLOPS_PER_CORE = 5 # GFlops/s
MEMORY_BUS = 60 # GB/s
Q = .125 # Flops/byte
CACHE_HIT_RATIO = .9
L1_CACHE_BANDWIDTH = 300 # GB/s
def tpp(cores):
    return min(FLOPS_PER_CORE * cores, Q * (L1_CACHE_BANDWIDTH * CACHE_HIT_RATIO))
plt.plot(range(1, 9), [tpp(i) for i in range(1, 9)], label="Theoretical peak performance")
plt.plot(range(1, 9), [FLOPS_PER_CORE * i for i in range(1, 9)], '--', label="CPU-bound peak performance")
plt.plot(range(1, 9), [Q * MEMORY_BUS for i in range(1, 9)], '--', label="Memory-bound peak performance")
plt.plot(range(1, 9), [Q * L1_CACHE_BANDWIDTH for i in range(1, 9)], '--', label="Cache-bound peak performance")
plt.xlabel("number of cores")
plt.ylabel("theoretical peak performance (GFlops/s)")
plt.ylim((1, 40))
_ = plt.legend(loc='best')
Assume you are the director of Simulation University’s HPC center and you are given a $1 million budget (you may go over the budget by up to 2% if you have a good justification to do so) to purchase a new system. You did some market research and found out that the compute nodes you like cost $5,000/node and each interconnect link (together with the necessary network hardware) costs $1,000/piece. You did a survey among the HPC center users and (on average) they rated the utility of an additional compute node as 5 and the utility of increasing the bisection bandwidth of the network by one link as 3. Think of the network topologies we discussed in class - bus, ring, 2D/3D torus (or k-D torus), hypercube, tree or a fat-tree. How many nodes would you buy and how would you connect them? What if the users rate the compute vs. communicate utilities as 5 vs. 1?
How many nodes would you buy and how would you connect them?
What if the users rate the compute vs. communicate utilities as 5 vs 1?
When computation is valued more than communication, a bus or a ring allows buying the most nodes: with 170 nodes and 170 links the price is $1,020,000, which is exactly the $1 million budget plus the 2% overrun.
The real question is why the users rated it this way. They may underestimate the importance of communication, and once the system is deployed they will likely complain about slowness. Since I'm the director, they would blame me, even if I argued that I prioritized what they asked for.
I would recommend going with a 2-D torus of ~145 nodes instead: it has much better bisection bandwidth and diameter, and the node count is not much smaller.
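The arithmetic behind the two candidate configurations can be checked directly (node and link prices are from the problem statement; a ring with p nodes has p links, a 2-D torus has 2p):

```python
NODE_PRICE, LINK_PRICE = 5000, 1000

def system_cost(nodes, links):
    """Total price of a cluster: compute nodes plus interconnect links."""
    return nodes * NODE_PRICE + links * LINK_PRICE

ring_cost = system_cost(170, 170)       # ring: one link per node
torus_cost = system_cost(145, 2 * 145)  # 2-D torus: two links per node
# ring_cost  = $1,020,000 (exactly budget + 2%)
# torus_cost = $1,015,000
```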
In [158]:
kd_cost, _ = cost(kd_torus, 145)
print("2d torus cost for 145 nodes is $%d" % kd_cost)
ring_cost, ring_bisection = cost(ring, 170)
print("Ring cost for 170 nodes is $%d" % ring_cost)
In [159]:
cost_for_nodes(170)
Out[159]:
In [151]:
BUDGET = 1000000
MAX_BUDGET = BUDGET * 1.02
COMPUTE_NODE_PRICE = 5000
INTERCONNECT_LINK = 1000
ADDITIONAL_NODE = 5
INCREASE_BISECTION_BW = 3
# Each topology returns (diameter, bisection width, number of links) for p nodes.
def full_network(p):
    return 1, (p // 2) * ((p + 1) // 2), p * (p - 1) // 2
def bus(p):
    return p - 1, 1, p - 1
def ring(p):
    return p // 2, 2, p
def hypercube(p):
    d = int(math.log(p, 2))
    return d, p // 2, p * d // 2
def kd_torus(p, k=2):
    s = p ** (1. / k)  # nodes per dimension (assumes p is a k-th power)
    return k * math.floor(s / 2.), 2 * p / s, k * p
def tree(p):
    return 2 * math.log(p, 2), 1, p - 1  # binary tree: bisection width is 1
def fat_tree(p):
    pass  # not modeled
def cost(topology, nodes):
    diameter, bisection, connections = topology(nodes)
    return nodes * COMPUTE_NODE_PRICE + connections * INTERCONNECT_LINK, bisection
def cost_for_nodes(nodes):
    cost_for_topology = {}
    for topology in (full_network, bus, ring, hypercube, kd_torus):  # tree, fat_tree omitted
        if topology is kd_torus:
            for d in (2, 3, 5):
                kd = partial(topology, k=d)
                cost_for_topology['%dd torus' % d] = cost(kd, nodes)
        else:
            cost_for_topology[topology.__name__] = cost(topology, nodes)
    return sorted(cost_for_topology, key=cost_for_topology.get), cost_for_topology
The outer product is a commonly used tensor operation. The input is two vectors of length N and M. The result is a matrix of dimensions N by M. In code it would look as follows:
double x[N], y[M], A[N][M];
for (i = 0; i < N; ++i)
    for (j = 0; j < M; ++j)
        A[i][j] = x[i]*y[j];
A. Assume that there is only a single level of cache of size 64 KB, a cache line is 64 bytes and the Least Recently Used (LRU) policy is adopted for cache eviction. For N = M = 1024, how many cache misses would the implementation above incur?
With a 64-byte cache line holding 8 doubles, loading an array of n doubles takes n/8 misses.
x in bytes = y in bytes = A[i] in bytes = 1024 * 8 bytes = 8192 bytes
We only need to keep x, y and the current row A[i] in cache: 16384 bytes for x and y plus 8192 bytes for A[i] = 24576 bytes < 64 KB, so LRU keeps x and y resident; the lines it evicts belong to rows A[g] for g < i, which are never reused.
The misses are therefore only the compulsory ones: 1024/8 = 128 for x, 128 for y and 128 per row of A, for a total of 128 + 128 + 1024 * 128 = 131,328 cache misses.
B. Repeat A, but now for N = M = 10240.
Now x, y and each row A[i] are 10240 * 8 bytes = 81920 bytes each, so even y plus the current row (163840 bytes) no longer fits in the 64 KB cache. LRU evicts y's lines before they are reused, so y must be re-read in full on every outer iteration: 10240/8 = 1280 misses per row. A still incurs its 1280 compulsory misses per row, and x[i]'s line is evicted between outer iterations, adding roughly one miss per row. The total is about 2 * 10240 * 1280 + 10240 = 26,224,640 cache misses - roughly double the compulsory minimum.
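These counts can be sanity-checked with a small fully associative LRU cache simulator (a sketch: the back-to-back array layout and write-allocate policy are modeling assumptions, not given in the problem):

```python
from collections import OrderedDict

def outer_product_misses(N, cache_bytes=64 * 1024, line=64):
    """Count cache misses of the outer-product loop on a fully associative LRU cache."""
    capacity = cache_bytes // line
    cache = OrderedDict()  # keys are line addresses, ordered oldest -> most recent
    misses = 0

    def access(addr):
        nonlocal misses
        ln = addr // line
        if ln in cache:
            cache.move_to_end(ln)          # hit: mark as most recently used
        else:
            misses += 1
            cache[ln] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line

    D = 8                                  # sizeof(double)
    x0, y0, a0 = 0, N * D, 2 * N * D       # x, y, A laid out back to back
    for i in range(N):
        for j in range(N):
            access(x0 + i * D)             # read x[i]
            access(y0 + j * D)             # read y[j]
            access(a0 + (i * N + j) * D)   # write A[i][j] (write-allocate)
    return misses
```

Shrinking the cache to 1 KB (16 lines) reproduces both regimes cheaply: with N = 32 everything fits and the simulator reports exactly the compulsory misses, 4 + 4 + 32 * 4 = 136; with N = 128 the working set overflows and y is re-fetched every row.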
C. Based on your insights from 4.A and 4.B, give the pseudo-code for a high performance outer product implementation with a reduced number of cache misses.
I would block the inner (j) loop with a block size b chosen so that the block y[jj : jj+b] stays in cache across all outer iterations: b * 8 bytes, plus a cache line each for x[i] and A[i][jj : jj+b], must fit in 64 KB, so b up to a few thousand doubles works. Each block of y is then loaded once instead of once per row, bringing the miss count back down to roughly the compulsory misses of A.
This is similar to blocked matrix multiplication.
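The blocking scheme above as pseudo-code (Python here; the helper name and default block size are illustrative):

```python
def blocked_outer(x, y, A, b=4096):
    """Outer product A = x y^T with the j loop blocked for cache reuse."""
    N, M = len(x), len(y)
    for jj in range(0, M, b):              # y[jj:jj+b] stays in cache...
        for i in range(N):                 # ...while we sweep over all rows
            xi = x[i]
            for j in range(jj, min(jj + b, M)):
                A[i][j] = xi * y[j]
```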
In [160]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
import math
from functools import partial