ZeroCopy on GPU

Processing large amounts of data on a GPU can be a problem: the data may not fit in GPU memory.

In this lab, we will study how to allocate data on the host and use it directly from the GPU, with a Sobel filter as the example.


Prerequisites

To get the most out of this lab, you should already be able to:

  • Write, compile, and run C# programs that both call CPU functions and launch GPU kernels.
  • Control the parallel thread hierarchy using an execution configuration.
  • Understand basic image concepts, such as pixels and image dimensions.

Objectives

By the time you complete this lab, you will be able to:

  • Accelerate image processing algorithms with Hybridizer and GPUs
  • Allocate data on the host and process it on the GPU

Working Set

In this lab, we will process a reference image (shown below).


First GPU Implementation

We start with a first GPU implementation of the filter, as follows:

[EntryPoint]
public static void ComputeSobel(byte[] outputPixel, byte[] inputPixel, int width, int height)
{
    // Grid-stride loops: each thread handles every (blockDim * gridDim)-th row and column.
    for (int i = threadIdx.y + blockIdx.y * blockDim.y; i < height; i += blockDim.y * gridDim.y)
    {
        for (int j = threadIdx.x + blockIdx.x * blockDim.x; j < width; j += blockDim.x * gridDim.x)
        {
            int output = 0;
            // Skip the one-pixel border, where the 3x3 neighborhood is incomplete.
            if (i != 0 && j != 0 && i != height - 1 && j != width - 1)
            {
                int pixelId = i * width + j;
                byte topl = inputPixel[pixelId - width - 1];
                byte top = inputPixel[pixelId - width];
                byte topr = inputPixel[pixelId - width + 1];
                byte l = inputPixel[pixelId - 1];
                byte r = inputPixel[pixelId + 1];
                byte botl = inputPixel[pixelId + width - 1];
                byte bot = inputPixel[pixelId + width];
                byte botr = inputPixel[pixelId + width + 1];

                // Horizontal and vertical gradients from the 3x3 Sobel kernels.
                int sobelx = (topl) + (2 * l) + (botl) - (topr) - (2 * r) - (botr);
                int sobely = (topl + 2 * top + topr - botl - 2 * bot - botr);

                int squareSobelx = sobelx * sobelx;
                int squareSobely = sobely * sobely;

                output = (int)Math.Sqrt((squareSobelx + squareSobely));

                if (output < 0)
                {
                    output = -output;
                }
                if (output > 255)
                {
                    output = 255;
                }

                outputPixel[pixelId] = (byte)output;
            }
        }
    }
}
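
Both for loops in the kernel are grid-stride loops: each thread starts at its global index and advances by the total number of threads along that axis, so any launch configuration covers the whole image. The Python sketch below simulates this on the CPU along a single axis. The function names and the block and grid sizes are hypothetical, chosen for illustration only; they are not the values used by the lab.

```python
def grid_stride_indices(thread_idx, block_idx, block_dim, grid_dim, n):
    """Indices visited by one thread along one axis, as in the kernel's for loops."""
    start = thread_idx + block_idx * block_dim
    step = block_dim * grid_dim
    return list(range(start, n, step))


def coverage(block_dim, grid_dim, n):
    """Count how many threads visit each index in 0..n-1."""
    counts = [0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            for k in grid_stride_indices(thread_idx, block_idx, block_dim, grid_dim, n):
                counts[k] += 1
    return counts


# Hypothetical launch: 2 blocks of 4 threads over 10 rows.
print(grid_stride_indices(1, 0, 4, 2, 20))
print(coverage(4, 2, 10))
```

Every entry returned by coverage is 1, meaning each row (or column) is processed by exactly one thread, even when there are fewer threads than rows.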

This approach is parallel and, as you can see in the source file, makes basic use of Hybridizer: the image is loaded on the host, and all of its data is copied to the GPU for processing.
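
To check the GPU output, it can help to have a small CPU reference. The following is a minimal pure-Python sketch that mirrors the kernel's arithmetic on a flat, row-major list of grey values in the range 0..255. The function name sobel_reference is hypothetical and is not part of the lab sources.

```python
import math


def sobel_reference(pixels, width, height):
    """CPU reference for the kernel: returns a flat list of width * height values."""
    out = [0] * (width * height)
    # Border pixels are skipped, as in the kernel, and stay at 0 here.
    for i in range(1, height - 1):
        for j in range(1, width - 1):
            p = i * width + j
            topl, top, topr = pixels[p - width - 1], pixels[p - width], pixels[p - width + 1]
            l, r = pixels[p - 1], pixels[p + 1]
            botl, bot, botr = pixels[p + width - 1], pixels[p + width], pixels[p + width + 1]

            sobelx = topl + 2 * l + botl - topr - 2 * r - botr
            sobely = topl + 2 * top + topr - botl - 2 * bot - botr

            # int() truncates like the (int) cast in the C# kernel; clamp to a byte.
            magnitude = int(math.sqrt(sobelx * sobelx + sobely * sobely))
            out[p] = min(255, magnitude)
    return out


# 3x3 image with a vertical edge on the right-hand column.
edge = [0, 0, 255] * 3
print(sobel_reference(edge, 3, 3))
```

For this 3x3 image, the horizontal gradient at the centre is -1020, so the centre pixel saturates at 255 while the border stays at 0.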


In [ ]:
!hybcc 01-parallel-gpu.cs -additionalDotNetAssemblies System.Drawing -o 01/01-parallel-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./01/hybrid/lena_highres_sobel.bmp')
img.save('./01/hybrid/lena_highres_sobel.png', 'png')
from IPython.display import Image
Image(filename="./01/hybrid/lena_highres_sobel.png", width=384, height=384)

Zero Copy implementation

Now we want to leave the data on the host instead of copying it to the GPU. This requires a few changes:

  • Lock the host memory of the source image and the result image with the Bitmap.LockBits method
  • Get a pointer to each image's pixel data with the BitmapData.Scan0 property
  • Pin the image memory so that CUDA can access it
  • Modify the kernel to take byte pointers instead of arrays
  • Finally, do not forget to unpin and unlock the memory
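
One detail to keep in mind when indexing through the Scan0 pointer: a locked bitmap stores each row on a stride that is rounded up to a four-byte boundary, so the row pitch can be larger than the image width. The Python sketch below illustrates the arithmetic only. It assumes a top-down bitmap with 8 bits per pixel, which the lab text does not state, and its helper names are hypothetical. In the C# code, the actual value should be read from BitmapData.Stride rather than recomputed.

```python
def padded_stride(width, bits_per_pixel):
    """Bytes per row once the row is rounded up to a 4-byte boundary."""
    return ((width * bits_per_pixel + 31) // 32) * 4


def pixel_offset(row, col, stride, bytes_per_pixel=1):
    """Byte offset of a pixel from the start of the locked buffer (top-down layout assumed)."""
    return row * stride + col * bytes_per_pixel


# Hypothetical 10-pixel-wide, 8 bpp image: rows are padded from 10 to 12 bytes.
stride = padded_stride(10, 8)
print(stride, pixel_offset(2, 5, stride))
```

When the width is already a multiple of four bytes, the stride equals the row size and the kernel's i * width + j indexing carries over to the pointer version unchanged.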

Modify 02-lock-gpu.cs to zero copy the image and process it on the GPU.

If you get stuck, you can refer to the solution.


In [ ]:
!hybcc 02-lock-gpu.cs -additionalDotNetAssemblies System.Drawing -o 02/02-lock-gpu.exe -run

# convert bmp to png to have interactive display
from PIL import Image
img = Image.open('./02/hybrid/lena_highres_sobel.bmp')
img.save('./02/hybrid/lena_highres_sobel.png', 'png')
from IPython.display import Image
Image(filename="./02/hybrid/lena_highres_sobel.png", width=384, height=384)