Image Processing with Hybridizer

Image processing is most often an embarrassingly parallel problem, which makes it a natural fit for the GPU.

In this lab, we will study optimization techniques through the implementation and analysis of the median filter, a robust denoising filter.


Prerequisites

To get the most out of this lab, you should already be able to:

  • Write, compile, and run C# programs that both call CPU functions and launch GPU kernels.
  • Control parallel thread hierarchy using execution configuration.
  • Have a basic understanding of digital images

Objectives

By the time you complete this lab, you will be able to:

  • Accelerate image processing algorithms with Hybridizer and GPUs
  • Explore three different work distribution patterns for image processing
  • Allocate data in registers
  • Use profiling to analyze pipeline and cache usage

Median Filter

The median filter is a non-linear image filter, widely used as a robust way to remove noise from images. For a given window around each pixel, the median of the values within that window is used as the output value. Results vary with the window size; a 1x1 window outputs the image unchanged. For example, a 3x3 window containing {2, 3, 4, 5, 6, 7, 8, 9, 250} yields the median 6, discarding the outlier 250.

Unlike Gaussian and related filters, computing the median requires a sort, which complicates an efficient implementation. It is a very data-intensive filter with virtually no arithmetic: given the window data, the median is extracted by sorting the values in the window. Between one output pixel and an adjacent one, most of the window data is shared, and we will see how to make use of this overlap.


Working Set

In this lab, we will process a reference image onto which noise has been artificially added: white pixels have been randomly scattered over the input image.


First Naive Implementation

We start the implementation of the filter with a first naive approach as follows:

public static void NaiveCsharp(ushort[] output, ushort[] input, int width, int height)
{
    int windowCount = 2 * window + 1; // window is the filter radius (a class-level constant)
    var buffer = new ushort[windowCount * windowCount]; // scratch buffer holding one window
    for (int j = window; j < height - window; ++j)
    {
        for (int i = window; i < width - window; ++i)
        {
            for (int k = -window; k <= window; ++k)
            {
                for (int p = -window; p <= window; ++p)
                {
                    int bufferIndex = (k + window) * windowCount + p + window;
                    int pixelIndex = (j + k) * width + (i + p);
                    buffer[bufferIndex] = input[pixelIndex];
                }
            }

            Array.Sort(buffer, 0, windowCount * windowCount);
            output[j * width + i] = buffer[(windowCount * windowCount) / 2];
        }
    }
}

As written, this approach expresses no parallelism, yet each loop iteration is independent. We will focus on the core part of the calculation and leave border management out of the scope of this lab.

01-naive-csharp.cs (click the link to open the source file in another tab for editing) contains a program that already works. It loads the noisy input image, processes it, and saves the result in an output directory.

Use the below code cell to execute the program and display the output.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-01-naive 
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-01-naive 

!hybridizer-cuda ./01-naive/01-naive-csharp.cs graybitmap.cs -o ./01-naive/01-naive-csharp.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-01-naive/denoised.bmp')
img.save('./output-01-naive/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-01-naive/denoised.png", width=384, height=384)

Distributing on CPU Threads

To accelerate the calculation on the CPU, we can distribute the work across threads. For this, make use of the Parallel.For construct to process lines in parallel, as sketched below. Note that each thread may require a separate buffer.
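Should it help, here is a minimal sketch of this pattern (it assumes System.Threading.Tasks and the window class constant from the naive version):

public static void ParForCsharp(ushort[] output, ushort[] input, int width, int height)
{
    int windowCount = 2 * window + 1;
    Parallel.For(window, height - window, (j) =>
    {
        // allocate the buffer inside the lambda so each thread gets its own
        var buffer = new ushort[windowCount * windowCount];
        for (int i = window; i < width - window; ++i)
        {
            for (int k = -window; k <= window; ++k)
                for (int p = -window; p <= window; ++p)
                    buffer[(k + window) * windowCount + (p + window)] = input[(j + k) * width + (i + p)];

            Array.Sort(buffer, 0, windowCount * windowCount);
            output[j * width + i] = buffer[(windowCount * windowCount) / 2];
        }
    });
}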

Modify 01-parfor-csharp.cs to make use of CPU parallelism.

Should you need, have a look at the solution.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-02-parfor
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-02-parfor

!hybridizer-cuda ./02-parallel-for/01-parfor-csharp.cs graybitmap.cs -o ./02-parallel-for/01-parfor-csharp.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-02-parfor/denoised.bmp')
img.save('./output-02-parfor/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-02-parfor/denoised.png", width=384, height=384)

Running on the GPU

In order to run the filter on the GPU, three operations are needed:

  • Mark the method with an EntryPoint attribute to indicate it should run on the GPU
  • Launch the kernel through a HybRunner instance, created with the static method HybRunner.Cuda()
  • Replace Array.Sort, which is not a built-in mapped to existing device code; the sort is changed to a bitonic sort (this is out of the scope of this lab and is already done in the input file)

Creating an instance of HybRunner and wrapping an object is done as follows:

HybRunner runner = HybRunner.Cuda();
dynamic wrapper = runner.Wrap(new Program());

Note that the result of the Wrap method is a dynamic type generated on the fly by the runner. It exposes the same methods as the wrapped type, with the same signatures. Hence, launching the kernel is simply done by calling the method on the wrapper instance instead of the base instance (or with no instance for static methods).

We will start from the Parallel.For version of the code. This expression of parallelism is interpreted by Hybridizer and transformed into a grid-stride loop over threads and blocks. Hence, adding the EntryPoint attribute should suffice, as sketched below.
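As an illustration, a minimal sketch of the expected shape (the method name ParForGPU is illustrative):

[EntryPoint]
public static void ParForGPU(ushort[] output, ushort[] input, int width, int height)
{
    Parallel.For(window, height - window, (j) =>
    {
        // ... process line j exactly as in the CPU version ...
    });
}

// calling through the wrapper launches the method as a CUDA kernel:
wrapper.ParForGPU(output, input, width, height);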

Modify 01-naive-gpu.cs to make sure the method runs on GPU.

Should you need, refer to the solution.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-03-naive-gpu
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-03-naive-gpu

!hybridizer-cuda ./03-naive-gpu/01-naive-gpu.cs graybitmap.cs -intrinsics bitonicsort.cuh=./ -o ./03-naive-gpu/01-naive-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-03-naive-gpu/denoised.bmp')
img.save('./output-03-naive-gpu/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-03-naive-gpu/denoised.png", width=384, height=384)

Memory Allocation

The obtained performance appears to be very low. Let's investigate.

In the Parallel.For version, an array is allocated for each line of the image. On the GPU, a malloc performed by each thread is very expensive (given the execution configuration, which we will discuss later, there are thousands of such calls).

To avoid this, we do not allocate memory dynamically on the heap, but rather on the stack (local memory, or registers in the best case), which is allocated once at kernel startup. To this aim, we expose a class whose constructor is mapped to a C array declaration:

var buffer = new StackArray<byte>(size) ;

will be translated into:

unsigned char buffer[size] ;

For this declaration to be valid, size must be a compile-time constant. There are three ways of obtaining compile-time constants:

  • Literal constants, e.g. buffer = new StackArray<byte>(42)
  • Constants defined using the HybridConstant attribute on static data, or the IntrinsicConstant attribute on a property or method
  • Class constants, assuming the compiler replaces them during MSIL generation

Here is an example:

class Filter
{
    const int window = 5 ;
    const int windowCount = 2 * window + 1 ;

    [EntryPoint]
    public void F()
    {
        var buffer = new StackArray<ushort>(windowCount * windowCount) ; // maps to: uint16_t buffer[121]

        ushort[] contents = buffer.data ;
    }
}

This limits the variety of filter sizes, as the window size needs to be provided at compile time. We will discuss this point later on.

Modify 01-stack-gpu.cs to allocate data on the stack instead of the heap.

Should you need, refer to the solution.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-04-stack-gpu
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-04-stack-gpu

!hybridizer-cuda ./04-stack-gpu/01-stack-gpu.cs graybitmap.cs -intrinsics bitonicsort.cuh=./ -o ./04-stack-gpu/01-stack-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-04-stack-gpu/denoised.bmp')
img.save('./output-04-stack-gpu/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-04-stack-gpu/denoised.png", width=384, height=384)

Feeding the Beast

A modern GPU is made of thousands of CUDA cores, which need many operations in flight to hide latency. In our image-processing example, the work is distributed over lines, that is, a couple of thousand work items for the whole image. Each CUDA thread then processes a complete line, as illustrated in the following image:

Using the CUDA API, we may query the number of multiprocessors and the number of CUDA cores. HybRunner has a default execution configuration that is suited to most use cases.
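For reference, a minimal sketch of such a query (it assumes the cudaDeviceProp and cuda.GetDeviceProperties bindings from Hybridizer.Runtime.CUDAImports; 01-query-config.cs may differ):

cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);
Console.WriteLine("multiprocessors : {0}", prop.multiProcessorCount);
// the total number of CUDA cores is multiProcessorCount times the
// cores-per-SM of the architecture, which is not stored in cudaDeviceProp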

Run the code cell below to query information on the GPU and the execution configuration - see also cudaGetDeviceProperties.


In [ ]:
!hybridizer-cuda ./05-dice-gpu/01-query-config.cs -o ./05-dice-gpu/01-query-config.exe -run

Distributing 1960 lines over blocks of 128 threads results in 16 busy blocks (1960 / 128, rounded up), which occupies only a fraction of most GPUs and does not hide latency.

More parallelism can be extracted by dicing the image into small squares instead of stripes, as illustrated here:

The amount of work can then be distributed over up to 4 million entries, which is well above the number of working units on the GPU.

In order to enable this with little effort, the Parallel2D class exposes a static method For, very similar to System.Threading.Tasks.Parallel.For, that runs an action over a 2D domain:

[EntryPoint]
public static void Parallel2DStack(...)
{
    Parallel2D.For(fromI,toI, fromJ,toJ, (i,j) => 
    {
        ... // action to be executed for (i,j) domain
    });
}

To effectively dice the processing, the execution configuration needs to be modified with SetDistrib; both X and Y dimensions are used.

dim3 grid = new dim3(<nb blocks X>, <nb blocks Y>, <nb blocks Z - ignored>) ;
dim3 block = new dim3(<nb threads X>, <nb threads Y>, <nb threads Z>) ;

wrapper.SetDistrib (grid,block) ;
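For instance, a 16x16 grid of 16x16 thread blocks (illustrative values, worth experimenting with) would be set up as:

dim3 grid = new dim3(16, 16, 1);
dim3 block = new dim3(16, 16, 1);
wrapper.SetDistrib(grid, block);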

Modify 02-dice-gpu.cs to use the Parallel2D pattern. You may want to try different values for the X and Y dimensions.

Should you need, refer to the solution.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-05-dice-gpu
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-05-dice-gpu

!hybridizer-cuda ./05-dice-gpu/02-dice-gpu.cs graybitmap.cs -intrinsics bitonicsort.cuh=./ -o ./05-dice-gpu/02-dice-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-05-dice-gpu/denoised.bmp')
img.save('./output-05-dice-gpu/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-05-dice-gpu/denoised.png", width=384, height=384)

Profiling the Application

Profiling the application may be done with nvprof. The following execution box provides the command line. We are querying the following metrics:

  • local_{load,store}_transactions : to get the memory transactions on local memory
  • {gld,gst}_transactions : to get the memory transactions to global memory (cache misses)
  • l2_{read,write}_transactions : to get the memory transactions on the L2 cache

In [ ]:
!cd 05-dice-gpu/hybrid ; nvprof --profile-child-processes --metrics local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,l2_read_transactions,l2_write_transactions ./run.sh

Static Sort

We can see that the number of local memory transactions is high: the local stack buffer has not been placed in registers. Indeed, when sorting the buffer, the size of the buffer is not known statically, so array indices cannot be resolved at compile time and the buffer spills to local memory. We may change this with an alternate sort implementation: a static sort.

We can map the sort to a static sort template class using an intrinsic function:

public class StaticSort
{
    [IntrinsicFunction("::hybridizer::StaticSort<49>::sort<uint16_t>")]
    public static void Sort(ushort[] data)
    {
        Array.Sort(data);
    }
}

In the above code, when the Sort method is called while running C# on the .NET virtual machine, Array.Sort gets called. When Hybridizer processes this method, it replaces the call to StaticSort.Sort by a call to the intrinsic function with the same parameters, here ::hybridizer::StaticSort<49>::sort<uint16_t>, which is a template method of a template type.
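At the call site in the kernel, this might look as follows (a hedged sketch; the variable names are illustrative):

var buffer = new StackArray<ushort>(windowCount * windowCount);
// ... fill buffer.data with the 49 window values ...
StaticSort.Sort(buffer.data); // becomes ::hybridizer::StaticSort<49>::sort<uint16_t> on the GPU
ushort median = buffer.data[(windowCount * windowCount) / 2];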

Modify 01-regsort-gpu.cs to make use of static sort instead of the bitonic sort.

Should you need, refer to the solution.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-06-regsort-gpu
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-06-regsort-gpu

!hybridizer-cuda ./06-regsort-gpu/01-regsort-gpu.cs graybitmap.cs -o ./06-regsort-gpu/01-regsort-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-06-regsort-gpu/denoised.bmp')
img.save('./output-06-regsort-gpu/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-06-regsort-gpu/denoised.png", width=384, height=384)

In [ ]:
!cd 06-regsort-gpu/hybrid ; nvprof --profile-child-processes --metrics local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,l2_read_transactions,l2_write_transactions ./run.sh

Local load and store transactions should have been reduced to zero.

The displayed kernel time actually includes some memory transfers and synchronization. The real execution time is reported by the profiler when run in summary mode.


In [ ]:
!cd 06-regsort-gpu/hybrid ; nvprof -s --profile-child-processes  ./run.sh

Cache Coherence

GPUs are equipped with L1 and L2 caches, which are used automatically. There are other cache types, such as the texture cache and the constant cache. All of these help exploit data locality.

In the context of this application, the traversal of the data can be predicted. Instead of dicing the image and having every thread load around fifty values (49 in our example), we can arrange the calculations to reuse previously loaded data, reducing the loads per output pixel from 49 down to 7.

For this, we take a bit more control of the parallel loop using explicit work distribution:

for (int blockJ = blockIdx.x * chunk; blockJ < height; blockJ += gridDim.x * chunk)
{
    for (int blockI = threadIdx.x; blockI < width; blockI += blockDim.x)
    {
        <... process the shaft ...>
    }
}

The blockJ index distributes column segments across blocks, and blockI distributes columns across threads.

Each thread is in charge of a column segment of the image, iterating down the rows with only 7 new loads into the register table at each iteration, as sketched below.
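To make the idea concrete, here is a minimal CPU sketch of the rolling window (ProcessColumn is a hypothetical helper; the GPU version relies on the rolling-buffer intrinsic from intrinsics.cuh instead):

// Each step shifts the 7x7 window up one row and loads only the 7 new values,
// instead of reloading all 49. Border handling is the caller's responsibility
// (i and j must stay at least `window` pixels away from the image edges).
static void ProcessColumn(ushort[] output, ushort[] input, int width, int i, int fromJ, int toJ)
{
    const int window = 3, windowCount = 2 * window + 1; // 7x7 window
    var buffer = new ushort[windowCount * windowCount];
    var sorted = new ushort[windowCount * windowCount];

    // prime the window around the first output row
    for (int k = -window; k <= window; ++k)
        for (int p = -window; p <= window; ++p)
            buffer[(k + window) * windowCount + (p + window)] = input[(fromJ + k) * width + (i + p)];

    for (int j = fromJ; j < toJ; ++j)
    {
        Array.Copy(buffer, sorted, buffer.Length); // keep the window intact for the next step
        Array.Sort(sorted);
        output[j * width + i] = sorted[buffer.Length / 2];

        // roll: drop the oldest row (reusing 42 values), load the 7 pixels of the next row
        Array.Copy(buffer, windowCount, buffer, 0, buffer.Length - windowCount);
        for (int p = -window; p <= window; ++p)
            buffer[buffer.Length - windowCount + (p + window)] = input[(j + 1 + window) * width + (i + p)];
    }
}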

Finish the implementation of 01-cache-aware-gpu.cs. intrinsics.cuh defines an intrinsic type for the rolling buffer.

Should you need, refer to the solution.


In [ ]:
import platform
if platform.system() == "Windows" : # create directory on Windows
    !mkdir output-07-cache-aware-gpu
if platform.system() == "Linux" : # create directory on Linux
    !mkdir -p ./output-07-cache-aware-gpu

!hybridizer-cuda ./07-cache-aware-gpu/01-cache-aware-gpu.cs -intrinsics intrinsics.cuh=./ graybitmap.cs -o ./07-cache-aware-gpu/01-cache-aware-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./output-07-cache-aware-gpu/denoised.bmp')
img.save('./output-07-cache-aware-gpu/denoised.png', 'png')
from IPython.display import Image
Image(filename="./output-07-cache-aware-gpu/denoised.png", width=384, height=384)

In [ ]:
!cd 07-cache-aware-gpu/hybrid ; nvprof --profile-child-processes --metrics local_load_transactions,local_store_transactions,gld_transactions,gst_transactions,l2_read_transactions,l2_write_transactions ./run.sh

The number of gld_transactions and l2_read_transactions should be significantly reduced.


In [ ]:
!cd 07-cache-aware-gpu/hybrid ; nvprof -s --profile-child-processes  ./run.sh

It is also possible to query the utilization of the different pipelines. In this last version, we should see high single_precision_fu_utilization and low utilization of the other units. We are compute bound!


In [ ]:
!cd 07-cache-aware-gpu/hybrid ; nvprof --profile-child-processes --metrics ldst_fu_utilization,single_precision_fu_utilization ./run.sh