In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition. Large sparse systems often arise when numerically solving partial differential equations or optimization problems.
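To make the following steps concrete, here is a minimal sequential sketch of the algorithm on a dense symmetric positive-definite matrix (the lab's actual implementation works on a sparse CSR matrix; Dot and MatVec are illustrative helpers, not the lab's code):

static float[] ConjugateGradient(float[,] A, float[] b, int maxIter, float tol)
{
    int n = b.Length;
    var x = new float[n];                  // initial guess x0 = 0
    var r = (float[])b.Clone();            // residual r0 = b - A*x0 = b
    var p = (float[])r.Clone();            // first search direction
    float rsOld = Dot(r, r);
    for (int k = 0; k < maxIter && rsOld > tol * tol; ++k)
    {
        float[] Ap = MatVec(A, p);
        float alpha = rsOld / Dot(p, Ap);  // step length along p
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        float rsNew = Dot(r, r);
        for (int i = 0; i < n; ++i)
            p[i] = r[i] + (rsNew / rsOld) * p[i]; // next conjugate direction
        rsOld = rsNew;
    }
    return x;
}

static float Dot(float[] a, float[] b)
{
    float s = 0f;
    for (int i = 0; i < a.Length; ++i) s += a[i] * b[i];
    return s;
}

static float[] MatVec(float[,] A, float[] v)
{
    var y = new float[v.Length];
    for (int i = 0; i < v.Length; ++i)
        for (int j = 0; j < v.Length; ++j)
            y[i] += A[i, j] * v[j];
    return y;
}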
Since the matrix here is sparse, we store it in the compressed sparse row (CSR) format. The CSR format stores a sparse m × n matrix in row form using three one-dimensional arrays.
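For example, here is a minimal sketch of the CSR layout together with the sparse matrix-vector product at the heart of the method (the array names are illustrative, not the fields of the lab's SparseMatrixNaive class):

// 4x4 matrix with nonzeros (0,0)=5, (1,1)=8, (2,2)=3, (3,1)=6
float[] values   = { 5f, 8f, 3f, 6f };   // the nonzero entries, row by row
int[]   colIndex = { 0, 1, 2, 1 };       // the column of each nonzero entry
int[]   rowStart = { 0, 1, 2, 3, 4 };    // where each row begins (length m + 1)

// Sparse matrix-vector product y = A * x in CSR form
static void Multiply(float[] y, float[] values, int[] colIndex, int[] rowStart, float[] x, int m)
{
    for (int i = 0; i < m; ++i)
    {
        float sum = 0f;
        for (int k = rowStart[i]; k < rowStart[i + 1]; ++k)
            sum += values[k] * x[colIndex[k]];
        y[i] = sum;
    }
}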
The first implementation has no parallelism: it uses the sparse matrix class SparseMatrixNaive and the main program 01-naive.cs. This version already works and computes the conjugate gradient method.
In [ ]:
!hybridizer-cuda ./01-Naive/01-naive.cs ./Common_Files/SparseMatrixNaive.cs -o ./01-Naive/naive.exe -run
Now we can distribute the work across the CPU threads. For this, make use of Parallel.For, and of ParallelEnumerable with the methods Range(Int32, Int32) and Sum(Func<Int32, Single>), like ParallelEnumerable.Range(begin, end).Sum(i => compute), whenever you would otherwise need an atomic add, as sketched below.
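Here is a minimal sketch of both patterns, assuming simple float-array signatures (the actual signatures are in 01-parallelism.cs):

using System.Linq;
using System.Threading.Tasks;

// Independent iterations: distribute the loop across CPU threads.
static void Scale(float[] y, float[] x, float alpha, int n)
{
    Parallel.For(0, n, i => { y[i] = alpha * x[i]; });
}

// Reduction: Sum accumulates the partial products safely across threads,
// which is what would otherwise require an atomic add.
static float ScalarProd(float[] x, float[] y, int n)
{
    return ParallelEnumerable.Range(0, n).Sum(i => x[i] * y[i]);
}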
Modify 01-parallelism.cs to make use of parallelism.
If you are stuck, have a look at the solution.
In [ ]:
!hybridizer-cuda ./02-Parallel/01-parallelism.cs ./Common_Files/SparseMatrixNaive.cs -o ./02-Parallel/01-parallelism.exe -run
Now that you are familiar with the program, modify it to run on the GPU as in the previous labs. Note that the ScalarProd function has been modified to allow it to run with the hybridizer.
When you initialize the HybRunner, call the SetDistrib function like this:
const int BLOCK_DIM = 256;
runner = HybRunner.Cuda().SetDistrib(16 * prop.multiProcessorCount, 1, BLOCK_DIM, 1, 1, BLOCK_DIM * sizeof(float));
The second parameter (the y-dimension of the grid) must be set to 1 for the ScalarProd function to compute correctly.
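For reference, a hedged sketch of how a kernel is then launched through the runner (the wrapped Program instance and the Multiply call are illustrative, not the lab's exact code):

dynamic wrapped = runner.Wrap(new Program()); // wrap the instance holding the [EntryPoint] methods
wrapped.Multiply(y, m, x);                    // runs as a CUDA kernel with the grid configured by SetDistrib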
The AtomicAdd method carries the [IntrinsicFunction("atomicAdd")] attribute, so the CUDA atomicAdd function is used when it runs on the GPU.
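A minimal sketch of such a declaration, assuming a ref-float overload and a compare-exchange loop as the CPU fallback (the body in the lab files may differ):

[IntrinsicFunction("atomicAdd")]
public static float AtomicAdd(ref float location, float value)
{
    // On the GPU this call is replaced by the CUDA atomicAdd intrinsic;
    // on the CPU the loop below emulates it, since Interlocked has no
    // overload for float addition.
    float initial, computed;
    do
    {
        initial = location;
        computed = initial + value;
    } while (System.Threading.Interlocked.CompareExchange(ref location, computed, initial) != initial);
    return initial;
}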
Now modify 01-gpu.cs to launch the kernels on the GPU.
Should you need it, refer to the solution.
In [ ]:
!hybridizer-cuda ./03-GPU/01-gpu.cs ./Common_Files/SparseMatrixNaive.cs -o ./03-GPU/01-gpu.exe -run
This is the final step. Your program now runs on the GPU, but it performs many exchanges between host memory and device memory, which reduces its performance.
As you can see in the following image, the previous program performs many copies (circled in red) between host and device memory.
With the hybridizer you can use resident arrays: choose IntResidentArray, DoubleResidentArray or FloatResidentArray depending on the type of data you want to compute. You access the data with the [] operator and choose when to move the data to the host (RefreshHost()) or to the device (RefreshDevice()). The Count property gives the size of the resident array.
FloatResidentArray a = new FloatResidentArray(N); // create a float resident array of size N
a[i] = 0.0f;                                      // read/write elements through the [] operator
a.RefreshDevice();                                // copy the data to the device
Console.Out.WriteLine(a.Count);                   // print the size of the resident array
a.RefreshHost();                                  // copy the data back to the host
Now modify 01-resident.cs and SparseMatrixResident.cs by replacing the float and int arrays with resident arrays where needed.
If you are stuck, you can look at the resident-solution and SparseMatrix-solution.
In [ ]:
!hybridizer-cuda ./04-Resident/01-resident.cs ./Common_Files/SparseMatrixResident.cs -o ./04-Resident/01-resident.exe -run
With resident arrays, profiling the application shows far fewer exchanges between host and device memory, which accelerates the computation.