Stream on GPU

Vector Add

In the world of computing, the addition of two vectors is the standard "Hello World".

Given two arrays of scalar data, as in the image above, we want to compute their sum element by element.

We start by implementing the algorithm in plain C#.

Edit the file 01-naive-add.cs and implement this algorithm in plain C# until the program displays OK.

If you get stuck, you can refer to the solution.
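
As a point of comparison, a minimal single-threaded version might look like the sketch below. It is not the reference solution: the class name NaiveAdd, the in-place form a[i] += b[i], the vector size, and the test data are all assumptions made for illustration.

```csharp
using System;

public static class NaiveAdd
{
    // Adds b into a, element by element, on a single thread.
    // Assumption: the result is stored in place in a.
    public static void Add(float[] a, float[] b, int n)
    {
        for (int i = 0; i < n; ++i)
        {
            a[i] += b[i];
        }
    }

    public static void Main()
    {
        const int N = 1024; // illustrative size only
        float[] a = new float[N];
        float[] b = new float[N];
        for (int i = 0; i < N; ++i)
        {
            a[i] = i;
            b[i] = 1.0f;
        }

        Add(a, b, N);

        // Check every element against the expected sum.
        for (int i = 0; i < N; ++i)
        {
            if (a[i] != i + 1.0f)
            {
                Console.WriteLine("ERROR at index " + i);
                return;
            }
        }
        Console.WriteLine("OK");
    }
}
```

The check loop compares exact float values, which is safe here only because the test data are small integers.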

In [ ]:
!hybridizer-cuda ./01-naive-add/01-naive-add.cs -o ./01-naive-add/naive-add.exe -run

With Parallelism

As we can see in the solution, a plain scalar iterative approach uses only one thread, while modern CPUs typically have several cores (for example 4 cores and 8 hardware threads).

Fortunately, .NET and C# provide an intuitive construct to leverage parallelism: Parallel.For.

Modify 01-naive-add.cs to distribute the work among multiple threads.

If you get stuck, you can refer to the solution.
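
For orientation, here is a hedged sketch of how the loop could be handed to Parallel.For. It is not the reference solution; the class name ParallelAdd and the test data are assumptions. The rewrite is safe because each iteration touches only its own index i, so no two threads write to the same element.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelAdd
{
    // Adds b into a, letting the runtime spread the index range over worker threads.
    public static void Add(float[] a, float[] b, int n)
    {
        // Iterations are independent: iteration i reads b[i] and updates a[i] only.
        Parallel.For(0, n, i =>
        {
            a[i] += b[i];
        });
    }

    public static void Main()
    {
        const int N = 1 << 20; // illustrative size only
        float[] a = new float[N];
        float[] b = new float[N];
        for (int i = 0; i < N; ++i)
        {
            a[i] = 2.0f;
            b[i] = 3.0f;
        }

        Add(a, b, N);

        // Check every element against the expected sum.
        for (int i = 0; i < N; ++i)
        {
            if (a[i] != 5.0f)
            {
                Console.WriteLine("ERROR at index " + i);
                return;
            }
        }
        Console.WriteLine("OK");
    }
}
```

Parallel.For(0, n, body) covers the indices 0 to n - 1; the upper bound is exclusive, just like the condition i < n in the sequential loop.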

In [ ]:
!hybridizer-cuda ./01-naive-add/01-naive-add.cs -o ./01-naive-add/parallel-add.exe -run

Run Code on the GPU

Using Hybridizer to run the above code on a GPU is quite straightforward. We need to:

  • Decorate the methods we want to run on the GPU.
    This is done by adding the [EntryPoint] attribute to the methods of interest.
  • "Wrap" the current object into a dynamic object able to dispatch code to the GPU.
    This is done with the following boilerplate code:
    dynamic wrapped = HybRunner.Cuda().Wrap(new Program());
    The wrapped object has the same method signatures (static or instance) as the current object, but dispatches calls to the GPU.

Modify 02-gpu-add.cs so that the Add method runs on a GPU.

If you get stuck, you can refer to the solution.

In [ ]:
!hybridizer-cuda ./02-gpu-add/02-gpu-add.cs -o ./02-gpu-add/gpu-add.exe -run

Manage Memory

You can also manage memory yourself, including keeping your data resident on the device. Hybridizer provides everything needed to let you choose where your data is stored.

To do that, we need to:

  • Allow the use of unsafe code.
  • Create an IntPtr for the device buffer and allocate it:
    IntPtr d_a;
    // N is the number of elements in the array you want to allocate
    cuda.Malloc(out d_a, N * sizeof(datatype));
  • Use GCHandle to pin a C# array (Alloc and AddrOfPinnedObject):
    float[] a = new float[N];
    GCHandle handle_a = GCHandle.Alloc(a, GCHandleType.Pinned);
    IntPtr h_a = handle_a.AddrOfPinnedObject();
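  • Copy the data to the device using your device pointer and your pinned C# pointer:
    cuda.Memcpy(d_a, h_a,
                N * sizeof(float),
                cudaMemcpyKind.cudaMemcpyHostToDevice);
  • After launching the kernel, copy the device data back to the host:
    cuda.Memcpy(h_a, d_a,
                N * sizeof(float),
                cudaMemcpyKind.cudaMemcpyDeviceToHost);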
  • Make sure the device is synchronized before each copy between the host and the device.

  • Don't forget to release your GCHandle when you are done (Free).
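
The pinning step can be tried on its own, without a GPU. The sketch below pins an array, reads it back through the raw pointer, and releases the handle in a finally block so it is freed even if the copy throws. The class and method names are hypothetical, and Marshal.Copy is only a stand-in for the device copy: in the exercise, the pinned pointer would be passed to the CUDA copy call instead.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinnedArrayDemo
{
    // Pins source, copies its contents through the raw pointer, then unpins it.
    // Marshal.Copy stands in for the host-to-device copy (assumption for illustration).
    public static float[] CopyThroughPinnedPointer(float[] source)
    {
        float[] copy = new float[source.Length];

        // While pinned, the garbage collector cannot move the array,
        // so the address below stays valid.
        GCHandle handle = GCHandle.Alloc(source, GCHandleType.Pinned);
        try
        {
            IntPtr h_source = handle.AddrOfPinnedObject();
            Marshal.Copy(h_source, copy, 0, source.Length);
        }
        finally
        {
            // Always release the handle, otherwise the array stays pinned.
            handle.Free();
        }
        return copy;
    }

    public static void Main()
    {
        float[] a = { 1.0f, 2.0f, 3.0f, 4.0f };
        float[] result = CopyThroughPinnedPointer(a);
        Console.WriteLine(string.Join(", ", result));
    }
}
```

Once Free has been called, the pointer returned by AddrOfPinnedObject must no longer be used, so any copy that relies on it has to complete first.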


Modify 03-malloc-add.cs so that it allocates and uses device pointers.

If you get stuck, you can refer to the solution.

In [ ]:
!hybridizer-cuda ./03-malloc-add/03-malloc-add.cs -o ./03-malloc-add/maloc-add.exe -run


Use Streams

The purpose of this example is to show how to use streams with Hybridizer on one very large vector, without cutting it up. We will use 8 streams for this example.

  • Create a stream by declaring a cudaStream_t and calling cuda.StreamCreate(out yourStream).
  • To run a kernel on a stream, call SetStream(stream) on wrapped.
  • You can copy data asynchronously with cuda.MemcpyAsync:
    cuda.MemcpyAsync(IntPtr dst, IntPtr src, size_t size, cudaMemcpyKind kindOfCopy, cudaStream_t stream = 0);
  • You can block until a stream has finished its work with cuda.StreamSynchronize(stream).
  • Finally, destroy each stream with cuda.StreamDestroy(stream).

Modify 04-stream-add.cs so that it creates and uses multiple streams.
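
One way to keep a single large buffer while feeding several streams is to give each stream a contiguous slice, identified by an offset and a length. The sketch below covers only that arithmetic and runs without a GPU. The names StreamSlices, Offset, Length, and SlicePointer are hypothetical, and dividing the vector into nearly equal contiguous slices is an assumption about how the work is shared, not something the exercise prescribes.

```csharp
using System;

public static class StreamSlices
{
    // First element index of slice s when n elements are split into sliceCount slices.
    // The intermediate product is computed in 64 bits to avoid overflow on large vectors.
    public static int Offset(int n, int sliceCount, int s)
    {
        return (int)((long)n * s / sliceCount);
    }

    // Number of elements in slice s; slices differ in length by at most one element.
    public static int Length(int n, int sliceCount, int s)
    {
        return Offset(n, sliceCount, s + 1) - Offset(n, sliceCount, s);
    }

    // Address of a slice inside a float buffer that starts at basePtr.
    public static IntPtr SlicePointer(IntPtr basePtr, int elementOffset)
    {
        return new IntPtr(basePtr.ToInt64() + (long)elementOffset * sizeof(float));
    }

    public static void Main()
    {
        const int N = 1000;      // illustrative size only
        const int Streams = 8;   // stream count used in this example
        for (int s = 0; s < Streams; ++s)
        {
            Console.WriteLine("stream " + s
                + ": offset " + Offset(N, Streams, s)
                + ", length " + Length(N, Streams, s));
        }
    }
}
```

In the exercise, each (offset, length) pair would be matched with one stream: the slice pointers and a size of length * sizeof(float) go to the asynchronous copy, and the same stream is set on wrapped before launching the kernel on that slice.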

If you get stuck, you can refer to the solution.

In [ ]:
!hybridizer-cuda ./04-stream-add/04-stream-add.cs -o ./04-stream-add/stream-add.exe -run