```
In [ ]:
```import math
def hypot(x, y):
x = abs(x);
y = abs(y);
t = min(x, y);
x = max(x, y);
t = t / x;
return x * math.sqrt(1+t*t)

```
In [ ]:
```%%timeit
hypot(3.0, 4.0)

```
In [ ]:
```from numba import jit
import math
@jit
def hypot_jit(x, y):
x = abs(x);
y = abs(y);
t = min(x, y);
x = max(x, y);
t = t / x;
return x * math.sqrt(1+t*t)

```
In [ ]:
```%%timeit
hypot_jit(3.0, 4.0)

```
In [ ]:
```%%timeit
hypot_jit.py_func(3.0, 4.0)

```
In [ ]:
```%%timeit
math.hypot(3.0, 4.0)

```
In [ ]:
```hypot_jit.inspect_types()

```
In [ ]:
```from numba import vectorize
@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
return x + y

```
In [ ]:
```a = 12
b = 17
b_col = 11
c = 1
print('a+b:\n', add_ufunc(a, b))
print()
print('b_col + c:\n', add_ufunc(b_col, c))

This is to be expected because we have (deliberately) misused the GPU in several ways in this example:

**Our inputs are too small**: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.**Our calculation is too simple**: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.**We copy the data to and from the GPU**: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.**Our data types are larger than necessary**: Our example uses`int64`

when we probably don't need it. Scalar code using data types that are 32 and 64-bit run basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. NumPy defaults to 64-bit data types when creating arrays, so it is important to set the`dtype`

attribute or use the`ndarray.astype()`

method to pick 32-bit types when you need them.

Given the above, let's try an example that is faster on the GPU:

```
In [ ]:
```import numpy as np
import math # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy
from numba import vectorize
# This gets inlined at compile time
SQRT_2PI = np.float32((2*math.pi)**0.5)
@vectorize(['float32(float32, float32, float32)'], target='cuda')
# Probability Distribution Function
def gaussian_pdf(x, mean, sigma):
'''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)

```
In [ ]:
```# Evaluate the Gaussian a million times!
x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)
mean = np.float32(0.0)
sigma = np.float32(1.0)
# Quick test
gaussian_pdf(x[0], 0.0, 1.0)

```
In [ ]:
```import scipy.stats # for definition of gaussian distribution
norm_pdf = scipy.stats.norm
%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)

`device=True`

keeps the code on the GPU. A CPU-based kernel is not created.

```
In [ ]:
```from numba import cuda
@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
x = rho * math.cos(theta)
y = rho * math.sin(theta)
return x, y # This is Python, so let's return a tuple
@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def polar_distance(rho1, theta1, rho2, theta2):
x1, y1 = polar_to_cartesian(rho1, theta1)
x2, y2 = polar_to_cartesian(rho2, theta2)
return ((x1 - x2)**2 + (y1 - y2)**2)**0.5

```
In [ ]:
```n = 1000000
rho1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
theta1 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)
rho2 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
theta2 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)

```
In [ ]:
```polar_distance(rho1, theta1, rho2, theta2)

During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs. If you want to reduce the impact of host-to-device/device-to-host bandwidth, it is best to copy data to the GPU explicitly and leave it there to amortize the cost over multiple function calls. In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also be a performance improvement.

Let's create our example addition ufunc again:

```
In [ ]:
```from numba import vectorize
import numpy as np
@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
return x + y

```
In [ ]:
```n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x

```
In [ ]:
```%timeit add_ufunc(x, y) # Baseline performance with host arrays

`numba.cuda`

module includes a function that will copy host data to the GPU and return a CUDA device array:

```
In [ ]:
```from numba import cuda
x_device = cuda.to_device(x)
y_device = cuda.to_device(y)
print(x_device)
print(x_device.shape)
print(x_device.dtype)

Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:

```
In [ ]:
```%timeit add_ufunc(x_device, y_device)

`numba.cuda.device_array()`

function:

```
In [ ]:
```# Similar to np.empty()
## Just allocating memory buffer - not initializing data
out_device = cuda.device_array(shape=(n,), dtype=np.float32) # does not initialize the contents, like np.empty()

And then we can use a special `out`

keyword argument to the ufunc to specify the output buffer:

```
In [ ]:
```%timeit add_ufunc(x_device, y_device, out=out_device)

*much* faster than before. When we want to bring the device array back to the host memory, we can use the `copy_to_host()`

method:

```
In [ ]:
```out_host = out_device.copy_to_host()
print(out_host[:10])

That's a lot more typing than our ufunc example, and it is much more limited: only works on 1D arrays, doesn't verify input sizes match, etc. Most of the function is spent figuring out how to turn the block and grid indices and dimensions into unique offsets into the input arrays. The pattern of computing a starting index and a stride is a common way to ensure that your grid size is independent of the input size. The striding will maximize bandwidth by ensuring that threads with consecuitive indices are accessing consecutive memory locations as much as possible. Thread indices beyond the length of the input (`x.shape[0]`

, since `x`

is a NumPy array) automatically skip over the for loop.

Also note that we did not need to specify a type signature for the CUDA kernel. Unlike `@vectorize`

, Numba can infer the type signature from the inputs automatically, and much more reliably.

Let's create and run a function on some data:

```
In [ ]:
```from numba import cuda
@cuda.jit
def add_kernel(x, y, out):
tx = cuda.threadIdx.x # this is the unique thread ID within a 1D block
ty = cuda.blockIdx.x # Similarly, this is the unique block ID within the 1D grid
block_size = cuda.blockDim.x # number of threads per block
grid_size = cuda.gridDim.x # number of blocks in the grid
start = tx + ty * block_size
stride = block_size * grid_size
# assuming x and y inputs are same length
for i in range(start, x.shape[0], stride):
out[i] = x[i] + y[i]

```
In [ ]:
```import numpy as np
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
out = np.empty_like(x)
threads_per_block = 128
blocks_per_grid = 30
add_kernel[blocks_per_grid, threads_per_block](x, y, out)
print(out[:10])

The unusual syntax for calling the kernel function is designed to mimic the CUDA Runtime API in C, where the above call would look like:

`add_kernel<<<blocks_per_grid, threads_per_block>>>(x, y, out)`

The arguments within the square brackets define the size and shape of the thread grid, and the arguments with parentheses correspond to the kernel function arguments.

Note that, unlike the ufunc, the arguments are passed to the kernel as full NumPy arrays. The kernel can access any element in the array it wants, regardless of its position in the thread grid. This is why CUDA kernels are significantly more powerful that ufuncs. (But with great power, comes a greater amount of typing...)

Numba includes several helper functions to simplify the thread offset calculations above. You can write the function much more simply as:

```
In [ ]:
```@cuda.jit
def add_kernel(x, y, out):
start = cuda.grid(1) # 1 = one dimensional thread grid, returns a single value
stride = cuda.gridsize(1) # ditto
# assuming x and y inputs are same length
for i in range(start, x.shape[0], stride):
out[i] = x[i] + y[i]

```
In [ ]:
```x_device = cuda.to_device(x)
y_device = cuda.to_device(y)
out_device = cuda.device_array_like(x)

```
In [ ]:
```%timeit add_kernel[blocks_per_grid, threads_per_block](x, y, out)

```
In [ ]:
```%timeit add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device); out_device.copy_to_host()

*One extremely important caveat should be mentioned here*: CUDA kernel execution is designed to be asynchronous with respect to the host program. This means that the kernel launch (`add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)`

) returns immediately, allowing the CPU to continue executing while the GPU works in the background. Only host<->device memory copies or an explicit synchronization call will force the CPU to wait until previously queued CUDA kernels are complete.

When you pass host NumPy arrays to a CUDA kernel, Numba has to synchronize on your behalf, but if you pass device arrays, processing will continue. If you launch multiple kernels in sequence without any synchronization in between, they will be queued up to run sequentially by the driver, which is usually what you want. If you want to run multiple kernels on the GPU in parallel (sometimes a good idea, but beware of race conditions!), take a look at CUDA streams.

Here's some sample timings (using `%time`

, which only runs the statement once to ensure our measurement isn't affected by the finite depth of the CUDA kernel queue):

```
In [ ]:
```# CPU input/output arrays, implied synchronization for memory copies
%time add_kernel[blocks_per_grid, threads_per_block](x, y, out)

```
In [ ]:
```# GPU input/output arrays, no synchronization (but force sync before and after)
cuda.synchronize()
%time add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)
cuda.synchronize()

```
In [ ]:
```# GPU input/output arrays, include explicit synchronization in timing
cuda.synchronize()
%time add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device); cuda.synchronize()

**Always be sure to synchronize with the GPU when benchmarking CUDA kernels!**

CUDA, like many general purpose parallel execution frameworks, makes it possible to have race condtions in your code. A race condition in CUDA arises when threads read or write a memory location that might be modified by another independent thread. Generally speaking, you need to worry about:

- read-after-write hazards: One thread is reading a memory location at the same time another thread might be writing to it.
- write-after-write hazards: Two threads are writing to the same memory location, and only one write will be visible when the kernel is complete.

A common strategy to avoid both of these hazards is to organize your CUDA kernel algorithm such that each thread has exclusive responsibility for unique subsets of output array elements, and/or to never use the same array for both input and output in a single kernel call. (Iterative algorithms can use a double-buffering strategy if needed, and switch input and output arrays on each iteration.)

However, there are many cases where different threads need to combine results. Consider something very simple, like: "every thread increments a global counter." Implementing this in your kernel requires each thread to:

- Read the current value of a global counter.
- Compute
`counter + 1`

. - Write that value back to global memory.

However, there is no guarantee that another thread has not changed the global counter between steps 1 and 3. To resolve this problem, CUDA provides "atomic operations" which will read, modify and update a memory location in one, indivisible step. Numba supports several of these functions, described here.

Let's make our thread counter kernel:

```
In [ ]:
```@cuda.jit
def thread_counter_race_condition(global_counter):
global_counter[0] += 1 # This is bad
@cuda.jit
def thread_counter_safe(global_counter):
cuda.atomic.add(global_counter, 0, 1) # Safely add 1 to offset 0 in global_counter array

```
In [ ]:
```# This gets the wrong answer
global_counter = cuda.to_device(np.array([0], dtype=np.int32))
thread_counter_race_condition[64, 64](global_counter)
print('Should be %d:' % (64*64), global_counter.copy_to_host())

```
In [ ]:
```# This works correctly
global_counter = cuda.to_device(np.array([0], dtype=np.int32))
thread_counter_safe[64, 64](global_counter)
print('Should be %d:' % (64*64), global_counter.copy_to_host())

Another common error occurs when a CUDA kernel has an invalid memory access, typically caused by running off the end of an array. The full CUDA toolkit from NVIDIA (not the `cudatoolkit`

conda package) contain a utility called `cuda-memcheck`

that can check for a wide range of memory access mistakes in CUDA code.

Let's debug the following code:

Note the `debug=True`

flag

```
In [ ]:
```%%bash
cat /root/src/main/python/numba/histogram.py

```
In [ ]:
```%%bash
cuda-memcheck python /root/src/main/python/numba/histogram.py

We briefly mention in notebook #4 that the CUDA programming model organizes threads into a two-layer structure. A grid is composed of many blocks, which are composed of many threads. Threads within the same block can communicate much more easily than threads in different blocks. The main mechanism for this communication is *shared memory*. Shared memory is discussed extensively in the CUDA C Programming Guide, as well as many other books on CUDA programming. We will only describe it very briefly here, and focus mainly on the Python syntax for using it.

Shared memory is a section of memory that is visible at the block level. Different blocks cannot see each other's shared memory, and all the threads within a block see the same shared memory. It does not persist after a CUDA kernel finishes executing. Shared memory is scarce hardware resource, so should be used sparingly or side effects such as lower performance or even kernel launch failure (if you exceed the hardware limit of 48 kB per block) will occur.

Shared memory is good for several things:

- caching of lookup tables that will be randomly accessed
- buffering output from threads so it can be coalesced before writing it back to device memory.
- staging data for scatter/gather operations within a block

As an example of the power of shared memory, let's write a transpose kernel that takes a 2D array in row-major order and puts it in column-major order. (This is based on Mark Harris' blog post at: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/)

First, let's do the naive approach where we let each thread read and write individual elements independently:

```
In [ ]:
```TILE_DIM = 32
BLOCK_ROWS = 8
@cuda.jit
def transpose(a_in, a_out):
x = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.x
y = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.y
for j in range(0, TILE_DIM, BLOCK_ROWS):
a_out[x, y + j] = a_in[y + j, x]

```
In [ ]:
```size = 1024
a_in = cuda.to_device(np.arange(size*size, dtype=np.int32).reshape((size, size)))
a_out = cuda.device_array_like(a_in)
print(a_in.copy_to_host())

```
In [ ]:
```grid_shape = (int(size/TILE_DIM), int(size/TILE_DIM))
%timeit transpose[grid_shape,(TILE_DIM, BLOCK_ROWS)](a_in, a_out); cuda.synchronize()
print(a_out.copy_to_host())

```
In [ ]:
```import numba.types
TILE_DIM_PADDED = TILE_DIM + 1 # Read Mark Harris' blog post to find out why this improves performance!
@cuda.jit
def tile_transpose(a_in, a_out):
# THIS CODE ASSUMES IT IS RUNNING WITH A BLOCK DIMENSION OF (TILE_SIZE x TILE_SIZE)
# AND INPUT IS A MULTIPLE OF TILE_SIZE DIMENSIONS
tile = cuda.shared.array((TILE_DIM, TILE_DIM_PADDED), numba.types.int32)
x = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.x
y = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.y
for j in range(0, TILE_DIM, BLOCK_ROWS):
tile[cuda.threadIdx.y + j, cuda.threadIdx.x] = a_in[y + j, x] # transpose tile into shared memory
cuda.syncthreads() # wait for all threads in the block to finish updating shared memory
#Compute transposed offsets
x = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.x
y = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.y
for j in range(0, TILE_DIM, BLOCK_ROWS):
a_out[y + j, x] = tile[cuda.threadIdx.x, cuda.threadIdx.y + j];

```
In [ ]:
```a_out = cuda.device_array_like(a_in) # replace with new array
%timeit tile_transpose[grid_shape,(TILE_DIM, BLOCK_ROWS)](a_in, a_out); cuda.synchronize()
print(a_out.copy_to_host())

That's a 30% speed up!