In the world of computing, the addition of two vectors is the standard "Hello World".
Given two sets of scalar data, such as those shown in the image above, we want to compute their sum, element by element.
We start by implementing the algorithm in plain C#.
Edit the file 01-naive-add.cs and implement this algorithm in plain C# until it displays "OK".
If you get stuck, you can refer to the solution.
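For reference, a minimal sketch of the naive version might look like this (the names `a`, `b`, `result`, and `N`, as well as the initialization and check, are illustrative and not necessarily those used in 01-naive-add.cs):

```csharp
using System;

class Program
{
    // Naive element-wise vector addition: a single thread runs one sequential loop.
    static void Add(float[] result, float[] a, float[] b, int n)
    {
        for (int i = 0; i < n; ++i)
        {
            result[i] = a[i] + b[i];
        }
    }

    static void Main()
    {
        const int N = 1024 * 1024;
        float[] a = new float[N], b = new float[N], result = new float[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

        Add(result, a, b, N);

        // Spot check in the spirit of the "OK" the exercise displays.
        Console.WriteLine(result[42] == 126.0f ? "OK" : "FAILED");
    }
}
```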
In [ ]:
!hybridizer-cuda ./01-naive-add/01-naive-add.cs -o ./01-naive-add/naive-add.exe -run
As we can see in the solution, a plain scalar iterative approach uses only one thread, while modern CPUs typically have 4 cores and 8 hardware threads.
Fortunately, .NET and C# provide an intuitive construct to leverage parallelism: Parallel.For.
Modify 01-naive-add.cs to distribute the work among multiple threads.
If you get stuck, you can refer to the solution.
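A possible Parallel.For rewrite is sketched below (again with illustrative names; the exercise file may partition the work differently):

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    // Parallel.For splits the iteration range across worker threads from the thread pool.
    static void Add(float[] result, float[] a, float[] b, int n)
    {
        Parallel.For(0, n, i =>
        {
            result[i] = a[i] + b[i];
        });
    }

    static void Main()
    {
        const int N = 1024 * 1024;
        float[] a = new float[N], b = new float[N], result = new float[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

        Add(result, a, b, N);
        Console.WriteLine(result[42] == 126.0f ? "OK" : "FAILED");
    }
}
```

Each iteration writes a distinct element of `result`, so no locking is needed.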
In [ ]:
!hybridizer-cuda ./01-naive-add/01-naive-add.cs -o ./01-naive-add/parallel-add.exe -run
Using Hybridizer to run the above code on a GPU is quite straightforward. We need to add the [EntryPoint] attribute to the methods of interest, then wrap the object:
dynamic wrapped = HybRunner.Cuda().Wrap(new Program());
wrapped.mymethod(...);
The wrapped object has the same method signatures (static or instance) as the current object, but dispatches calls to the GPU. Modify 02-gpu-add.cs so the Add method runs on a GPU.
If you get stuck, you can refer to the solution.
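Putting the pieces together, the GPU version could be sketched as follows. This assumes the Hybridizer.Runtime.CUDAImports namespace and its CUDA-style intrinsics (threadIdx, blockIdx, blockDim, gridDim); the method and array names are illustrative:

```csharp
using Hybridizer.Runtime.CUDAImports;

class Program
{
    [EntryPoint]
    public static void Add(float[] result, float[] a, float[] b, int n)
    {
        // Hybridizer maps this grid-stride loop onto CUDA threads and blocks.
        for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
        {
            result[i] = a[i] + b[i];
        }
    }

    static void Main()
    {
        const int N = 1024 * 1024;
        float[] a = new float[N], b = new float[N], result = new float[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

        // Wrap the instance: calls on "wrapped" are dispatched to the GPU.
        dynamic wrapped = HybRunner.Cuda().Wrap(new Program());
        wrapped.Add(result, a, b, N);
    }
}
```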
In [ ]:
!hybridizer-cuda ./02-gpu-add/02-gpu-add.cs -o ./02-gpu-add/gpu-add.exe -run
You can also manage memory yourself, even keeping your data on the device: Hybridizer lets you choose where to store your data.
For that we need to:
Declare an IntPtr for the device and allocate it:
IntPtr d_a;
// N is the size of the array you want to allocate
cuda.Malloc(out d_a, N * sizeof(datatype));
Use a GCHandle to pin a C# array (Alloc and AddrOfPinnedObject):
float[] a = new float[N];
GCHandle handle_a = GCHandle.Alloc(a, GCHandleType.Pinned);
IntPtr h_a = handle_a.AddrOfPinnedObject();
Copy the data to the device using your device pointer and your pinned C# pointer:
cuda.Memcpy(d_a, h_a, N * sizeof(float), cudaMemcpyKind.cudaMemcpyHostToDevice);
After you launch the kernel, you can copy the device data back to the host:
cuda.Memcpy(h_a, d_a, N * sizeof(float), cudaMemcpyKind.cudaMemcpyDeviceToHost);
Before each copy between the host and the device, make sure the device is synchronized.
Don't forget to free your GCHandle:
handle_a.Free();
Modify 03-malloc-add.cs so you allocate and use device pointers.
If you get stuck, you can refer to the solution.
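The steps above, assembled into one sketch (assuming float data; `cuda.DeviceSynchronize` and `cuda.Free` are assumed to be the Hybridizer counterparts of the CUDA runtime calls of the same names):

```csharp
using System;
using System.Runtime.InteropServices;
using Hybridizer.Runtime.CUDAImports;

class Program
{
    static void Main()
    {
        const int N = 1024 * 1024;
        float[] a = new float[N];

        // Pin the managed array so the GC cannot move it while we copy.
        GCHandle handle_a = GCHandle.Alloc(a, GCHandleType.Pinned);
        IntPtr h_a = handle_a.AddrOfPinnedObject();

        // Allocate device memory.
        IntPtr d_a;
        cuda.Malloc(out d_a, N * sizeof(float));

        // Host -> device copy.
        cuda.Memcpy(d_a, h_a, N * sizeof(float), cudaMemcpyKind.cudaMemcpyHostToDevice);

        // ... launch the wrapped kernel on d_a here ...

        // Make sure the device is done before copying back.
        cuda.DeviceSynchronize();

        // Device -> host copy.
        cuda.Memcpy(h_a, d_a, N * sizeof(float), cudaMemcpyKind.cudaMemcpyDeviceToHost);

        // Release device memory and unpin the managed array.
        cuda.Free(d_a);
        handle_a.Free();
    }
}
```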
In [ ]:
!hybridizer-cuda ./03-malloc-add/03-malloc-add.cs -o ./03-malloc-add/malloc-add.exe -run
The purpose of this example is to use streams with the Hybridizer on one very big vector, without cutting it into separate arrays. We will use 8 streams for this example.
For that we need to:
Declare a cudaStream_t and create it with cuda.StreamCreate(out yourStream).
Use the SetStream(stream) function on the wrapped object:
wrapped.SetStream(stream).mymethod(...);
Use asynchronous copies between the host and the device:
cuda.MemcpyAsync(IntPtr dst, IntPtr src, size_t size, cudaMemcpyKind kindOfCopy, cudaStream_t stream = 0);
Synchronize each stream with cuda.StreamSynchronize(stream) and destroy it with cuda.StreamDestroy(stream).
Modify 04-stream-add.cs so you create and use multiple streams.
If you get stuck, you can refer to the solution.
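The stream pattern might be sketched as below, assuming 8 streams, a pinned host pointer h_a, a device pointer d_a, and a wrapped kernel as in the previous examples; the chunk arithmetic assumes N is a multiple of the stream count, and the kernel launch line is only indicative:

```csharp
const int nStreams = 8;
var streams = new cudaStream_t[nStreams];
for (int i = 0; i < nStreams; ++i)
    cuda.StreamCreate(out streams[i]);

int chunk = N / nStreams; // elements per stream; assumes N % nStreams == 0
for (int i = 0; i < nStreams; ++i)
{
    int offset = i * chunk * sizeof(float); // byte offset of this chunk

    // Asynchronous host -> device copy, queued on this stream.
    cuda.MemcpyAsync(IntPtr.Add(d_a, offset), IntPtr.Add(h_a, offset),
                     chunk * sizeof(float), cudaMemcpyKind.cudaMemcpyHostToDevice, streams[i]);

    // Queue the kernel on the same stream so it runs after its copy completes.
    // wrapped.SetStream(streams[i]).Add(/* arguments for this chunk */);

    // Asynchronous device -> host copy of this chunk's results.
    cuda.MemcpyAsync(IntPtr.Add(h_a, offset), IntPtr.Add(d_a, offset),
                     chunk * sizeof(float), cudaMemcpyKind.cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all queued work, then release the streams.
for (int i = 0; i < nStreams; ++i)
{
    cuda.StreamSynchronize(streams[i]);
    cuda.StreamDestroy(streams[i]);
}
```

Because each stream's copy-kernel-copy sequence is independent, copies on one stream can overlap kernel execution on another.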
In [ ]:
!hybridizer-cuda ./04-stream-add/04-stream-add.cs -o ./04-stream-add/stream-add.exe -run