This tutorial is meant to get you started with writing tests and tuning scripts using Kernel Tuner. We'll use a simple 2D Convolution kernel as an example, but as you will find out shortly, much of what you write with Kernel Tuner can be reused for testing and tuning other kernels.
Convolution operations are essential to signal and image processing applications and are the main operation in convolutional neural networks used for deep learning. A convolution operation computes, for each output pixel, the linear combination of the weights in a convolution filter and a range of pixels from the input image. A 2D convolution of an input image $I$ of size $(w\times h)$ with a convolution filter $F$ of size $(F_w\times F_h)$ computes an output image $O$ of size $((w-F_w)\times (h-F_h))$: \begin{equation}\nonumber O(x,y) = \sum\limits_{j=0}^{F_h-1} \sum\limits_{i=0}^{F_w-1} I(x+i,y+j)\times F(i,j) \end{equation}
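To make the formula concrete, here is a minimal, unoptimized NumPy sketch of the same computation. The function and variable names are ours, purely for illustration; this is not the kernel we will tune:

import numpy

def convolve_2d_reference(image, conv_filter):
    # image: 2D array of shape (h, w); conv_filter: 2D array of shape (F_h, F_w)
    F_h, F_w = conv_filter.shape
    h, w = image.shape
    output = numpy.zeros((h - F_h, w - F_w), dtype=image.dtype)
    for y in range(h - F_h):
        for x in range(w - F_w):
            # linear combination of the filter weights and an (F_h x F_w) window of the input
            output[y, x] = numpy.sum(image[y:y+F_h, x:x+F_w] * conv_filter)
    return output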
A naive CUDA kernel for 2D Convolution parallelizes the operation by creating one thread for each output pixel. Note that to avoid confusion around the term kernel, we refer to the convolution filter as a filter.
The code is shown in the following code block. Make sure you execute all code blocks in this tutorial by selecting them and pressing shift+enter:
In [ ]:
naive_kernel_string = """
__global__ void convolution_kernel(float *output, float *input, float *filter) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int i, j;
    float sum = 0.0;
    if (y < image_height && x < image_width) {
        for (j = 0; j < filter_height; j++) {
            for (i = 0; i < filter_width; i++) {
                sum += input[(y + j) * input_width + (x + i)] * filter[j * filter_width + i];
            }
        }
        output[y * image_width + x] = sum;
    }
}
"""
In [ ]:
import numpy
from kernel_tuner import run_kernel

filter_size = (17, 17)
output_size = (4096, 4096)
size = numpy.prod(output_size)

# the input image needs filter_size//2 extra pixels on each side (border_size is the total per dimension)
border_size = (filter_size[0]//2*2, filter_size[1]//2*2)
input_size = ((output_size[0]+border_size[0]) * (output_size[1]+border_size[1]))

# allocate the output image and fill the input image and filter with random data
output_image = numpy.zeros(size).astype(numpy.float32)
input_image = numpy.random.randn(input_size).astype(numpy.float32)
conv_filter = numpy.random.randn(filter_size[0]*filter_size[1]).astype(numpy.float32)
Now that we have our input and output data structures created, we can look at how to run our naive kernel on this data by calling run_kernel. The run_kernel function has the following signature:
run_kernel(kernel_name, kernel_source, problem_size, arguments, params, ...)
The ellipsis here indicates that there are many more optional arguments, which we won't need right now. If you're interested, the complete API documentation of run_kernel can be found here.
run_kernel has five required arguments, which we will go over one by one.
The kernel_name is simply a string with the name of the kernel in the code. The kernel_source can be a string containing the code, which we have already defined above as naive_kernel_string.
In [ ]:
kernel_name = "convolution_kernel"
kernel_source = naive_kernel_string
The problem_size is what Kernel Tuner uses to determine the grid dimensions of the kernel. Our naive kernel needs one thread for each pixel in the output image. As defined above, our output_size is $4096 \times 4096$.
Kernel Tuner computes the grid dimensions of a kernel by dividing the problem_size in each dimension by the grid divisors for that dimension. The grid divisors are, by default, simply the thread block dimensions, so for our naive kernel we do not need to specify any grid divisor lists.
In [ ]:
problem_size = output_size
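For example, with the $16 \times 16$ thread blocks we will use below and the default grid divisors, the grid dimensions work out as follows (illustrative arithmetic, not code from Kernel Tuner itself):

from math import ceil
# problem size 4096 x 4096 divided by 16 x 16 thread blocks
grid_x = ceil(4096 / 16)   # 256 thread blocks in x
grid_y = ceil(4096 / 16)   # 256 thread blocks in y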
The arguments is a list of the data used to call the kernel on the GPU. It should be specified as a list of Numpy objects (arrays and/or scalars) that correspond with the function arguments of our CUDA kernel. Our naive convolution kernel has the following signature:
__global__ void convolution_kernel(float *output, float *input, float *filter) { }
Therefore, our list of Numpy objects should contain the output image, the input image, and the convolution filter, in exactly that order, matching the types (32-bit floating-point arrays) and dimensions expected by the kernel.
In [ ]:
arguments = [output_image, input_image, conv_filter]
The final required argument is params, which is a dictionary with the user-defined parameters of the kernel. Remember that the user is you! You can specify anything here and Kernel Tuner will insert a C-preprocessor #define statement into the kernel with the value that you specify. For example, if you were to create a dictionary like so:
params = {"I_like_convolutions": 42}
Kernel Tuner will insert the following line into our naive convolution kernel:
#define I_like_convolutions 42
While we do like convolutions, this definition won't have much effect on the performance of our kernel. Unless of course somewhere in our kernel we are doing something differently depending on the value of this preprocessor token.
In addition to freely defined parameters there are a few special values. You may have noticed that we are about to call a CUDA kernel but we haven't specified any thread block dimensions yet. When using Kernel Tuner, thread block dimensions are basically just parameters to the kernel. Therefore, parameters with the names "block_size_x", "block_size_y", and "block_size_z" will be interpreted by Kernel Tuner as the thread block dimensions in x, y, and z.
Note that these are just the defaults; if you prefer to name your thread block dimensions differently, use the block_size_names= option.
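For example, if your kernel happens to call its thread block dimensions tb_x and tb_y (hypothetical names used only for this illustration), you could pass something along these lines; see the API documentation for the exact behavior of this option:

# illustrative sketch (not part of this tutorial's flow): rename the thread block parameters
run_kernel(kernel_name, kernel_source, problem_size, arguments,
           {"tb_x": 16, "tb_y": 16}, block_size_names=["tb_x", "tb_y"])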
Let's continue with the creation of our params dictionary such that we can run our naive convolution kernel. As thread block dimensions we will just select the trusty old $16 \times 16$:
In [ ]:
params = dict()
params["block_size_x"] = 16
params["block_size_y"] = 16
Finally, we specify some input dimensions that are required by our kernel. As you may have noticed, the kernel uses currently undefined constants, like image_height, image_width, filter_height, and filter_width. We also insert these values through the parameters in Kernel Tuner. Note that this is not required; we could also have specified them at runtime as arguments to the kernel.
In [ ]:
params["image_height"] = output_size[1]
params["image_width"] = output_size[0]
params["filter_height"] = filter_size[1]
params["filter_width"] = filter_size[0]
params["input_width"] = output_size[0] + border_size[0]
Now we have set up everything we need to call run_kernel:
In [ ]:
answer = run_kernel(kernel_name, kernel_source, problem_size, arguments, params)
print("Done")
If you execute the above cell, it will allocate GPU memory, move the contents of the arguments list to GPU memory, compile the kernel specified in kernel_source, and run the kernel named kernel_name with the thread block dimensions specified in params and the grid dimensions derived from the problem_size. After executing the kernel, run_kernel will also retrieve the results from GPU memory and free the GPU memory. The run_kernel function returns the data retrieved from the GPU in a list that we have named answer in the above example.
The answer list contains Numpy objects (arrays and/or scalars) in the same order and of the same type as the arguments list that we used to call the kernel with, but in contrast to arguments it contains the data that was stored in GPU memory after our naive convolution kernel finished executing. This feature is particularly useful for implementing tests for your GPU kernels: you can perform the same operation in Python and compare the results.
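For example, a quick check of the kernel's output could look something like the sketch below. It assumes scipy is available and reuses the arrays defined earlier; the reshapes, the use of correlate2d, and the tolerance are our own choices, and the reference computation on the full image may take a while:

from scipy import signal

# reshape the flat arrays into 2D so we can compare against a Python reference
input_2d = input_image.reshape(output_size[1] + border_size[1], output_size[0] + border_size[0])
filter_2d = conv_filter.reshape(filter_size[1], filter_size[0])

# 'valid' correlation computes the same sum of products as the CUDA kernel
reference = signal.correlate2d(input_2d, filter_2d, mode='valid')
output_gpu = answer[0].reshape(output_size[1], output_size[0])

# loose tolerance because the GPU accumulates in 32-bit floating point
print(numpy.allclose(output_gpu, reference, atol=1e-3))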
In many cases there are more tunable parameters than just the thread block dimensions. We have included a highly optimized 2D Convolution kernel that contains many parameterized code optimizations. It's a bit long to include here, so instead we just read it in from a file. You may need to adjust the path a little, depending on where you've stored the Kernel Tuner source code and where this notebook is executing.
In [ ]:
filename = "../examples/cuda/convolution.cu"
with open(filename, 'r') as f:
    convolution_kernel_string = f.read()
Tuning a kernel with Kernel Tuner is done using the tune_kernel function. The interface should look familiar, because it's exactly like run_kernel:
tune_kernel(kernel_name, kernel_source, problem_size, arguments, tune_params, ...)
The only difference is that the params dictionary is replaced by a tune_params dictionary that works similarly, but instead of a single value per parameter, tune_params should contain a list of possible values for that parameter.
Again, the ellipsis indicates that there are many more optional arguments, but we won't need those right now. If you're interested, the complete API documentation of tune_kernel can be found here.
We could create a tune_params dictionary in the following way:
In [ ]:
tune_params = dict()
tune_params["block_size_x"] = [16, 32, 64, 128]
tune_params["block_size_y"] = [8, 16]
Let's just try that out and see what happens:
In [ ]:
from kernel_tuner import tune_kernel
results, env = tune_kernel(kernel_name, convolution_kernel_string, problem_size, arguments, tune_params)
As you can see, Kernel Tuner takes the Cartesian product of all lists in tune_params and benchmarks a kernel for each possible combination of values of the tunable parameters. For such a small set of combinations, benchmarking all of them is not really a problem. However, if there are many tunable parameters with many different options, this can get problematic. Therefore, Kernel Tuner supports many different optimization strategies; how to use them is explained in the API documentation of tune_kernel.
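As an illustration, something along these lines would ask Kernel Tuner to benchmark only a random sample of the search space instead of every combination; the exact strategy names and options are listed in the API documentation, so treat this as a sketch rather than a recipe:

# illustrative sketch: benchmark a random 50% sample of all configurations
results, env = tune_kernel(kernel_name, convolution_kernel_string, problem_size, arguments, tune_params,
                           strategy="random_sample", strategy_options={"fraction": 0.5})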
Some combinations of values are illegal and will be skipped automatically. For example, thread block dimensions of $128 \times 16 = 2048$ threads exceed the limit of 1024 threads per block on all current CUDA devices. Configurations that fail for other, expected, reasons, such as using too much shared memory or requiring more registers than available on the device, will also be skipped silently by Kernel Tuner, unless you specify verbose=True as an optional argument to tune_kernel. Note that other errors, like an out-of-bounds memory access, will not be ignored.
The tune_kernel function returns two things. The first is results, a list of records that show the execution time of each benchmarked kernel and the parameters used to compile and run that specific kernel. Secondly, tune_kernel returns a dictionary that describes the environment in which the tuning took place. That means all the inputs to tune_kernel are recorded, along with the software versions of your CUDA installation, your OS, and so on, as well as GPU device information. This second dictionary can be stored along with the results so that you can always find out under what circumstances those results were obtained.
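A simple way to keep the two together is to write them to a file, for example as JSON. The sketch below is one way to do that; depending on your Kernel Tuner version you may need to convert some NumPy values to plain Python types before they can be serialized:

import json

# store the benchmark results together with the environment they were obtained in
with open("convolution_tuning.json", "w") as fp:
    json.dump({"results": results, "environment": env}, fp, indent=2)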
I promised that we would use more tunable parameters than just the thread block dimensions. Our 2D Convolution kernel also supports tiling factors in the x and y dimensions. A tiling factor indicates that the amount of work performed by each thread block in a particular dimension is increased by a certain factor.
In [ ]:
tune_params["tile_size_x"] = [1, 2, 4]
tune_params["tile_size_y"] = [1, 2, 4]
It's important to understand that if we increase the amount of work performed by every thread block, we also need fewer thread blocks, because the total amount of work stays the same. Remember that Kernel Tuner computes the grid dimensions (the number of thread blocks the kernel is executed with) from the problem_size and the thread block dimensions.
So now we need to tell Kernel Tuner that we have a tunable parameter that influences the way the grid dimensions are computed; for this we have the grid divisor lists. You may have noticed that we already have tunable parameters that influence the grid dimensions, namely the thread block dimensions called "block_size_x" and "block_size_y". We did not yet need to specify any grid divisor lists, because Kernel Tuner divides the problem size by the thread block dimensions by default. However, if we are going to use grid divisor lists, we need to list all tunable parameters that divide the problem size in a certain dimension to obtain the grid size in that dimension.
So to mimic the default behavior that we have been assuming so far, we would need to specify:
In [ ]:
grid_div_x = ["block_size_x"]
grid_div_y = ["block_size_y"]
Now we should add the tiling factors to the grid divisor lists because, as the tiling factor is increased, the number of thread blocks in that dimension should be decreased correspondingly.
In [ ]:
grid_div_x = ["block_size_x", "tile_size_x"]
grid_div_y = ["block_size_y", "tile_size_y"]
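With these grid divisor lists, the grid dimension in each direction becomes the problem size divided by the product of the divisors in that direction. For example (illustrative arithmetic, not Kernel Tuner code):

from math import ceil
# e.g. block_size_x=32 and tile_size_x=2: each thread block now covers 64 output pixels in x
grid_x = ceil(4096 / (32 * 2))   # 64 thread blocks in x
grid_y = ceil(4096 / (16 * 4))   # 64 thread blocks in y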
Now we are ready to call tune_kernel again with our expanded search space. Note that this may take a bit longer, since we have just increased our parameter space by a factor of 9.
In [ ]:
results, env = tune_kernel(kernel_name, convolution_kernel_string, problem_size, arguments, tune_params,
                           grid_div_x=grid_div_x, grid_div_y=grid_div_y)
And that's it for this tutorial! We have seen how to call run_kernel and tune_kernel for a 2D Convolution kernel using different thread block dimensions and other tunable parameters. You now know enough to be able to start tuning your own CUDA and/or OpenCL kernels!