Introduction to GPU computing

GPUs can calculate faster than CPUs for three architectural reasons:

  1. Many-core architecture allows for massive parallelism. Each core performs a very similar operation on a few elements of a large dataset.
  2. High memory bandwidth moves data quickly to and from the GPU compute cores.
  3. Oversubscription (more threads than cores) hides memory-transfer latency.

First, look at performance on the CPU

We create an array with ten million elements and do a silly operation on it to create some busy work. The function we choose combines square roots and sine functions to keep the cores occupied: $$ f(x) = \sqrt{x \bmod 13} + \sum_{n=50}^{53} \sin(x \bmod n) $$ (The mod operation returns the division remainder.)
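As a plain-Python reference (a sketch, separate from the notebook cells below), the formula can be written for a single scalar like this:

```python
import math

def f(x):
    """Scalar reference implementation of the busywork formula above."""
    y = math.sqrt(x % 13)
    for n in range(50, 54):  # n = 50, 51, 52, 53
        y += math.sin(x % n)
    return y
```

For example, f(0) is exactly 0, and f(13) reduces to 4*sin(13) because 13 mod 13 is 0 while 13 mod n is 13 for every n in 50..53.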


In [ ]:
import numpy as np
import math
from numba import jit, vectorize
from timeit import default_timer as timer

In [ ]:
N = 10000000
a = np.linspace(0, N, N) # N evenly spaced values from 0 to N (spacing is very nearly 1)
b = np.empty_like(a) # Allocated but uninitialized memory

In [ ]:
@jit
def silly_operation(a, b):
    """
    Some mathematical busywork to illustrate GPU vs CPU
    """
    for i in range(len(a)):
        b[i] = math.sqrt(a[i] % 13)
        for j in range(50, 54):
            b[i] += math.sin(a[i] % j)

In [ ]:
start = timer()
silly_operation(a,b)
end = timer()
print("function took %g s" % (end - start))

In [ ]:
b

We should note that this Python code runs on a single core, so we are missing a factor of 12x to 24x that we could gain by exploiting all CPU cores.

Vectorized CPU code

One performance improvement is to note that this function operates on each element independently, so it can be vectorized. This improves performance somewhat.


In [ ]:
@vectorize('float64(float64)')
def vectorized_silly_operation(a):
    """
    Some mathematical busywork to illustrate GPU vs CPU
    """
    b = math.sqrt(a % 13)
    for j in range(50, 54):
        b += math.sin(a % j)
    return b

In [ ]:
start = timer()
b = vectorized_silly_operation(a)
end = timer()
print("function took %g s" % (end - start))

In [ ]:
b

GPU Acceleration using CUDA

We can run this faster using GPU hardware.

The first step is to set up the arrays in GPU device memory. The input array needs to be copied over, and we need to allocate space for the output array on the GPU device. After the GPU operation, we will need to copy the output back to host memory.


In [ ]:
from numba import cuda

In [ ]:
dev_a = cuda.to_device(a) # Copy the input data to GPU (device) memory
dev_b = cuda.device_array_like(a) # Allocate but do not initialize output in GPU memory

In [ ]:
dev_a.alloc_size, dev_b.alloc_size # Ask how many bytes are allocated on the GPU device
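We can predict what those numbers should be: assuming the default float64 dtype, each element takes 8 bytes, so each array should occupy N * 8 = 80,000,000 bytes. A quick host-side check (a sketch, no GPU needed):

```python
import numpy as np

N = 10_000_000
a = np.linspace(0, N, N)
# Each element is a float64 (8 bytes), so both the host array and its
# device copy should occupy N * 8 = 80,000,000 bytes
print(a.nbytes)
```

The device allocation sizes reported above should match a.nbytes.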

Next we write a CUDA kernel. This looks a lot like the vectorized function. CUDA executes it across blocks of threads. In C++/CUDA, we'd need to set up the blocks and threads and do a bit of index bookkeeping ourselves. In Numba CUDA we can simply call cuda.grid(1) to turn the block and thread IDs into a flat index i.
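What cuda.grid(1) computes is simple index arithmetic; a pure-Python sketch of the equivalent calculation (runnable without a GPU):

```python
def flat_index(block_idx, block_dim, thread_idx):
    # The same arithmetic cuda.grid(1) performs: the global index of a
    # thread across all blocks in a one-dimensional launch
    return block_idx * block_dim + thread_idx
```

For example, thread 5 of block 2 with 32 threads per block handles element 2 * 32 + 5 = 69.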


In [ ]:
@cuda.jit
def cu_silly_operation(a, b):
    """
    Some mathematical busywork to illustrate GPU vs CPU
    """
    i = cuda.grid(1)
    if i < a.size: # Guard against threads past the end of the array
        b[i] = math.sqrt(a[i] % 13)
        for n in range(4):
            b[i] += math.sin(a[i] % (n + 50))

Numba CUDA kernels are normally launched with an explicit configuration, written kernel[blocks, threads_per_block](...). While we can work those numbers out on our own, the forall method conveniently derives a launch configuration from the number of tasks.
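The configuration arithmetic itself is just a round-up division: we need enough blocks that every element gets a thread. A sketch, assuming 32 threads per block:

```python
N = 10_000_000
threads_per_block = 32
# Round up so every element gets a thread, even when N is not an
# exact multiple of the block size (hence the bounds check in the kernel)
blocks = (N + threads_per_block - 1) // threads_per_block
print(blocks)
```

With N = 10,000,000 this gives exactly 312,500 blocks.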


In [ ]:
start = timer()
cu_silly_operation.forall(len(a))(dev_a, dev_b) # Let numba pick blocks and threads
cuda.synchronize() # Kernel launches are asynchronous; wait for completion before timing
end = timer()
print("function took %g s" % (end - start))

At the end of the calculation, use the copy_to_host method to copy the result back to the CPU array.


In [ ]:
dev_b.copy_to_host(b)
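Once b is back on the host, it is worth sanity-checking it against a CPU reference. A minimal sketch of such a check on a small array (the looped result here stands in for the copied-back GPU output):

```python
import math
import numpy as np

x = np.linspace(0, 100, 1_000)
# Reference computed directly with NumPy ufuncs on the whole array
reference = np.sqrt(x % 13) + sum(np.sin(x % n) for n in range(50, 54))
# Element-by-element loop, standing in for the array copied back from the GPU
result = np.array([math.sqrt(v % 13) + sum(math.sin(v % n) for n in range(50, 54))
                   for v in x])
print(np.allclose(result, reference))
```

In the notebook, the same np.allclose comparison would be applied to b after copy_to_host.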