Resident Array on GPU


Prerequisites

To get the most out of this lab, you should already be able to:

  • Write, compile, and run C# programs that both call CPU functions and launch GPU kernels.
  • Control parallel thread hierarchy using execution configuration.
  • Have some familiarity with basic linear algebra.

Conjugate Gradient Method

In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition. Large sparse systems often arise when numerically solving partial differential equations or optimization problems.
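
For reference, here is a minimal, CPU-only sketch of the conjugate gradient iteration for a system A x = b. The multiplyVector delegate stands in for the sparse matrix-vector product (played in this lab by the sparse matrix class) and the Dot helper for the scalar product; the names are illustrative, not the lab's code.

using System;

// Minimal CPU-only sketch of the conjugate gradient iteration for A x = b.
static float[] ConjugateGradient(Func<float[], float[]> multiplyVector, float[] b, int maxIter, float tol)
{
    int n = b.Length;
    var x = new float[n];                 // initial guess x0 = 0
    var r = (float[])b.Clone();           // residual r0 = b - A x0 = b
    var p = (float[])r.Clone();           // first search direction p0 = r0
    float rsOld = Dot(r, r);
    for (int k = 0; k < maxIter && rsOld > tol * tol; ++k)
    {
        float[] Ap = multiplyVector(p);
        float alpha = rsOld / Dot(p, Ap); // step length along p
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        float rsNew = Dot(r, r);
        float beta = rsNew / rsOld;       // makes the next direction A-conjugate to p
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rsOld = rsNew;
    }
    return x;
}

static float Dot(float[] a, float[] b)    // scalar product
{
    float s = 0.0f;
    for (int i = 0; i < a.Length; ++i) s += a[i] * b[i];
    return s;
}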


Working Set

In this lab, we will apply the conjugate gradient method to a 1D Laplacian matrix of size 1000000 × 1000000, which is symmetric and positive-definite.

Since this is a sparse matrix, we store it in the compressed sparse row (CSR) format. The CSR format stores a sparse m × n matrix in row form using three one-dimensional arrays (a small worked example follows the list).

  • The first array (data), of length nnz (the number of non-zero entries), holds all the non-zero values of the matrix, row by row.
  • The second (rows), of length number of rows + 1, stores for each row the index in data where that row starts; the last entry equals nnz.
  • The third (indices), of length nnz, stores the column index of each value in data.
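
As an illustration, here are the CSR arrays for a small 4 × 4 1D Laplacian matrix (the array names follow the description above; they are not taken from the lab's files):

// 4 x 4 1D Laplacian:
//  2 -1  0  0
// -1  2 -1  0
//  0 -1  2 -1
//  0  0 -1  2
float[] data    = { 2, -1, -1, 2, -1, -1, 2, -1, -1, 2 }; // non-zero values, row by row
int[]   indices = { 0,  1,  0, 1,  2,  1, 2,  3,  2, 3 }; // column index of each value
int[]   rows    = { 0, 2, 5, 8, 10 };                     // where each row starts in data; rows[4] = nnz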

Naive implementation

The first implementation has no parallelism. It uses the sparse matrix class SparseMatrixNaive and the main program 01-naive.cs; this already works and computes the conjugate gradient method.


In [ ]:
!hybridizer-cuda ./01-Naive/01-naive.cs ./Common_Files/SparseMatrixNaive.cs -o ./01-Naive/naive.exe -run

Parallelism

Now we can distribute the work across CPU threads. For this, make use of Parallel.For, and of ParallelEnumerable with the methods Range(Int32, Int32) and Sum, for example ParallelEnumerable.Range(begin, end).Sum(i => compute), wherever you would otherwise need an atomic add.
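
As a hedged sketch of both patterns (the arrays, sizes, and variable names below are illustrative, not the lab's data):

using System.Linq;
using System.Threading.Tasks;

int n = 1000000;
float[] x = new float[n], y = new float[n];
float alpha = 2.0f;

// Distribute independent iterations across CPU threads:
// each i writes only its own element, so no synchronization is needed.
Parallel.For(0, n, i =>
{
    y[i] += alpha * x[i];
});

// Replace an accumulation that would otherwise need an atomic add
// with a parallel reduction over the index range.
float dot = ParallelEnumerable.Range(0, n).Sum(i => x[i] * y[i]);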

Modify 01-parallelism.cs to make use of parallelism.

If you are stuck, have a look at the solution


In [ ]:
!hybridizer-cuda ./02-Parallel/01-parallelism.cs ./Common_Files/SparseMatrixNaive.cs -o ./02-Parallel/01-parallelism.exe -run

GPU

Now that you are familiar with the program, modify it to run on the GPU as in the previous labs. Note that we have modified the ScalarProd function so it can run with the Hybridizer. When you initialize the HybRunner, call the SetDistrib function like this:

const int BLOCK_DIM = 256;
runner = HybRunner.Cuda().SetDistrib(16 * prop.multiProcessorCount, 1, BLOCK_DIM, 1, 1, BLOCK_DIM * sizeof(float));

The second parameter must be set to 1 for the ScalarProd function to compute correctly.
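
In this call, prop holds the properties of the current GPU. A hedged sketch of the full initialization, following the pattern of the previous labs (the wrapped variable is illustrative), is:

using Hybridizer.Runtime.CUDAImports;

cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);        // query the properties of device 0
const int BLOCK_DIM = 256;
HybRunner runner = HybRunner.Cuda().SetDistrib(16 * prop.multiProcessorCount, 1, BLOCK_DIM, 1, 1, BLOCK_DIM * sizeof(float));
dynamic wrapped = runner.Wrap(new Program()); // wrapped exposes the [EntryPoint] methods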

The AtomicAdd method has the [IntrinsicFunction("atomicAdd")] attribute so that the CUDA atomicAdd function is used when it runs on the GPU.
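
For reference, such a method can be declared as in the sketch below; the CPU fallback body (a compare-and-swap loop) is illustrative and may differ from the lab's files.

using System.Threading;
using Hybridizer.Runtime.CUDAImports;

public static class GpuIntrinsics
{
    // On the GPU, the Hybridizer replaces calls to this method with CUDA's atomicAdd;
    // on the CPU, the managed body below runs instead.
    [IntrinsicFunction("atomicAdd")]
    public static float AtomicAdd(ref float location, float value)
    {
        float current = location;
        while (true)
        {
            float computed = current + value;
            float previous = Interlocked.CompareExchange(ref location, computed, current);
            if (previous == current)
                return computed;
            current = previous;
        }
    }
}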

Now modify 01-gpu.cs to launch the kernels on the GPU.

Should you need it, refer to the solution


In [ ]:
!hybridizer-cuda ./03-GPU/01-gpu.cs ./Common_Files/SparseMatrixNaive.cs -o ./03-GPU/01-gpu.exe -run

Resident Array

This is the final step. Your program now runs on the GPU, but it performs many transfers between host memory and device memory, which reduces its performance.

As you can see in the following image, the previous program performs many copies (circled in red) between host and device memory.

With the Hybridizer you can use resident arrays: choose IntResidentArray, DoubleResidentArray, or FloatResidentArray depending on the type of data you want to process. You access the data with the [] operator and choose when to move the data to the host (RefreshHost()) or to the device (RefreshDevice()). You can get the size of the resident array with the Count property.

FloatResidentArray a = new FloatResidentArray(N); //Create a float resident array of size N
a.RefreshDevice(); //Copy the data to the device
a[i] = 0.0f; //Access element i through the indexer
Console.Out.WriteLine(a.Count); //Print the size of the resident array
a.RefreshHost(); //Copy the data back to the host
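
Putting it together, a typical pattern looks like the following hedged sketch (the Scale kernel, the wrapped object, and N are illustrative, not the lab's actual code):

using Hybridizer.Runtime.CUDAImports;

[EntryPoint]
public static void Scale(FloatResidentArray a, int n, float factor)
{
    // Grid-stride loop over the resident array, accessed through the indexer.
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
    {
        a[i] *= factor;
    }
}

// Host side:
var a = new FloatResidentArray(N);
for (int i = 0; i < N; ++i) a[i] = 1.0f; // fill on the host
a.RefreshDevice();                       // copy host -> device once
wrapped.Scale(a, N, 2.0f);               // launch the kernel (wrapped = runner.Wrap(...))
a.RefreshHost();                         // copy device -> host only when the result is needed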

Now modify 01-resident.cs and SparseMatrixResident.cs by replacing the float and int arrays with resident arrays where needed.

If you are stuck, you can look at the resident-solution and the SparseMatrix-solution


In [ ]:
!hybridizer-cuda ./04-Resident/01-resident.cs ./Common_Files/SparseMatrixResident.cs -o ./04-Resident/01-resident.exe -run

Then, with resident arrays, when we profile the application we can see fewer transfers between host and device memory, which speeds up the computation.