GPUs can calculate faster than CPUs for three architectural reasons
We create an array with ten million elements and run a silly operation on it to generate some busy work. The operation involves square roots and sine functions: $$ f(x) = \sqrt{x \bmod 13} + \sum_{n=50}^{53} \sin(x \bmod n) $$ (The mod function returns the division remainder.)
In [ ]:
import numpy as np
import math
from numba import jit, vectorize
from timeit import default_timer as timer
In [ ]:
N = 10000000
a = np.linspace(0, N, N) # N evenly spaced values from 0 to N (roughly 0, 1, 2, ...)
b = np.empty_like(a) # Allocated but uninitialized memory
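As a quick sanity check of the formula, we can evaluate f for a single value in plain Python:
In [ ]:
x = 2.0
math.sqrt(x % 13) + sum(math.sin(x % n) for n in range(50, 54))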
In [ ]:
@jit
def silly_operation(a, b):
    """
    Some mathematical busywork to illustrate GPU vs CPU
    """
    for i in range(len(a)):
        b[i] = math.sqrt(a[i] % 13)
        for j in range(50, 54):
            b[i] += math.sin(a[i] % j)
In [ ]:
start = timer()
silly_operation(a, b)
end = timer()
print("function took %g s" % (end - start))
In [ ]:
b
In [ ]:
@vectorize('float64(float64)')
def vectorized_silly_operation(a):
    """
    Some mathematical busywork to illustrate GPU vs CPU
    """
    b = math.sqrt(a % 13)
    for j in range(50, 54):
        b += math.sin(a % j)
    return b
In [ ]:
start = timer()
b = vectorized_silly_operation(a)
end = timer()
print("function took %g s" % (end - start))
In [ ]:
b
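As an aside, @vectorize can also spread the work across multiple CPU cores via the target='parallel' option. A minimal sketch (the name parallel_silly_operation is ours; the speedup depends on how many cores the machine has):
In [ ]:
@vectorize('float64(float64)', target='parallel')
def parallel_silly_operation(a):
    """
    The same busywork, split across CPU threads by Numba
    """
    b = math.sqrt(a % 13)
    for j in range(50, 54):
        b += math.sin(a % j)
    return b

b = parallel_silly_operation(a)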
We can run this faster using GPU hardware.
The first step is to set up the arrays in GPU device memory. The input array needs to be copied over, and we need to allocate space for the output array on the GPU device. After the GPU operation, we will need to copy the output back to host memory.
In [ ]:
from numba import cuda
In [ ]:
dev_a = cuda.to_device(a) # Copy the input data to GPU (device) memory
dev_b = cuda.device_array_like(a) # Allocate but do not initialize output in GPU memory
In [ ]:
dev_a.alloc_size, dev_b.alloc_size # Ask how many bytes are allocated on the GPU device
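Each array holds N float64 values, so we expect roughly N × 8 bytes ≈ 80 MB per array, matching the host array:
In [ ]:
a.nbytes, 8 * N  # 8 bytes per float64 element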
Next we write a CUDA kernel. This looks a lot like the vectorized function. CUDA will run it with blocks of threads. In C++/CUDA, we'd need to set up the blocks and threads and do a bit of index bookkeeping ourselves. In Numba CUDA we can simply use cuda.grid(1) to turn the block and thread IDs into an index i.
In [ ]:
@cuda.jit
def cu_silly_operation(a, b):
    """
    Some mathematical busywork to illustrate GPU vs CPU
    """
    # cuda.grid(1) is shorthand for cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    i = cuda.grid(1)
    if i < a.size:  # guard against threads beyond the end of the array
        b[i] = math.sqrt(a[i] % 13)
        for n in range(50, 54):
            b[i] += math.sin(a[i] % n)
Numba CUDA also allows us to call the kernel without explicitly supplying the thread and block sizes. While we can do this on our own, it's convenient to let Numba sort it out from the array dimensions.
In [ ]:
start = timer()
#cu_silly_operation[N//32, 32](dev_a, dev_b)
cu_silly_operation(dev_a, dev_b)
cuda.synchronize()  # kernel launches are asynchronous; wait for completion before stopping the timer
end = timer()
print("function took %g s" % (end - start))
At the end of the calculation, use the copy_to_host method to copy the result back to the CPU array.
In [ ]:
dev_b.copy_to_host(b)
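Finally, we can confirm that the GPU result agrees with the CPU result:
In [ ]:
np.allclose(b, vectorized_silly_operation(a))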