Julia -- CUDA Native programming

Author: Dr. Rahul Remanan

  • A test notebook for the CUDAnative package in Julia.
  • Based on the Julia CUDAnative reduce example.
  • CUDA is NVIDIA's platform for general-purpose computing on GPUs.
  • Using CUDAnative, high-performance CUDA kernels can be written directly in Julia.
  • CUDAnative sits at the same abstraction level as CUDA C.
  • Tightly integrated with the Julia compiler and the LLVM framework.

Julia version check


In [1]:
versioninfo()


Julia Version 1.1.0-DEV.255
Commit 26b6a5811c (2018-09-13 21:44 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)

Part 01: Vector addition using the GPU -- Using CUDAdrv.jl

  • A small demo of GPU programming capabilities in Julia.
  • Uses functionality from CUDAdrv.jl
  • A user-friendly wrapper for interacting with CUDA hardware
  • Provides an array type: CuArray
  • Performs memory management through integration with Julia's garbage collector
  • Provides an @elapsed macro based on GPU events (see the sketch after this list)
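
A minimal sketch (not part of the original notebook) of the CuArray type and event-based timing. It assumes device 0 and that this version of CUDAdrv provides the event-based @elapsed macro; the measured time will vary by hardware.

using CUDAdrv

dev = CuDevice(0)
ctx = CuContext(dev)

x = rand(Float32, 2^20)
d_x = CuArray(x)                  # upload; device memory is reclaimed via Julia's GC
t = CUDAdrv.@elapsed Array(d_x)   # time a device-to-host copy with CUDA events
println("device-to-host copy took $t seconds")

destroy!(ctx)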

Import dependent libraries


In [2]:
using CUDAdrv, CUDAnative
using Test

Define a vector addition function as a CUDA kernel


In [3]:
function kernel_vadd(a, b, c)
    # from CUDAnative: (implicit) CuDeviceArray type,
    #                  and thread/block intrinsics
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]

    return nothing
end


Out[3]:
kernel_vadd (generic function with 1 method)

Specify target CUDA device and context


In [4]:
dev = CuDevice(0)
ctx = CuContext(dev)


Out[4]:
CuContext(Ptr{Nothing} @0x00005558ab9cb880, true, true)

Generate some data


In [5]:
len = 512
a = rand(Int, len)
b = rand(Int, len)


Out[5]:
512-element Array{Int64,1}:
  1151473784990668484
 -6692638737358339108
  5495901842879926995
 -4365724979076222052
  4132301730327605992
   409785175223872992
  4982314156775213577
  2058401323534442502
  4718958573725913407
  2205070514877316856
  4931703306446677228
  4204568393402667677
  1051147931774966124
                    ⋮
  4978556971122208675
  4612504113036714519
   102540236536864854
 -3423040325127731899
 -3536441245098491419
   730745698407999672
  6057859308388106197
   395626996951311529
  2107981004745222417
   585002564411366184
  2052220610761356901
  -657052027309837740

Allocate resources and upload the variables to the GPU


In [6]:
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)


Out[6]:
512-element CuArray{Int64,1}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

Evaluate vector addition on the CPU


In [7]:
cpu_vec_add = a + b


Out[7]:
512-element Array{Int64,1}:
 -5755846246263059736
  3442839440114678137
  7051898515014027772
 -8748231496797167349
  2534520827018767100
 -6952912482972500279
 -4523992071379349402
  1620950993980060908
  -518243859391571720
 -3893544391639956498
  4604472282854575839
  2972815275239018202
 -3390882875002673091
                    ⋮
 -3117296677987418979
  1936660235719613108
 -9202933128196141761
 -2308148397874079164
  5741125414648568738
 -2898776153425978701
 -4345102733848115266
 -7858494769315516813
  6958912400460114155
  -994407698588199230
 -7845301211386210242
  1223351249926011761

Run the code on the GPU and fetch the results


In [8]:
@cuda threads=(len) kernel_vadd(d_a, d_b, d_c)    # from CUDAnative.jl
gpu_vec_add = Array(d_c)
@test gpu_vec_add == cpu_vec_add


Out[8]:
Test Passed

Free up the GPU resources


In [9]:
destroy!(ctx)

Part 02: Parallel reduction using CUDA kernel -- Using CUDAnative.jl

  • CUDAnative.jl takes care of all things related to native GPU programming:

    1) interfacing with Julia: repurpose the compiler to emit GPU-compatible LLVM IR (no calls to CPU libraries, simplified exceptions, …)

    2) interfacing with LLVM (using LLVM.jl): optimize the IR, and compile to PTX

    3) interfacing with CUDA (using CUDAdrv.jl): compile PTX to SASS, and upload it to the GPU

  • These functionalities are hidden behind the call to @cuda

  • Intrinsics: special functions and macros that provide functionality hard or impossible to express using normal functions. For example, the {thread,block,grid}{Idx,Dim} functions provide access to the size and index of each level of work.

  • Creating local shared memory: @cuStaticSharedMem and @cuDynamicSharedMem macros

  • Display formatted strings from within a kernel function: @cuprintf

  • Math functions similar to those in the Julia standard library (a short kernel illustrating these intrinsics is sketched after this list)
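
As an illustration of the intrinsics listed above, here is a toy kernel (not part of the original notebook). It assumes the same CUDAnative/CUDAdrv setup as the surrounding cells, an active CUDA context, and a 64-bit platform for the @cuprintf format specifiers.

using CUDAdrv, CUDAnative

function demo_kernel(out)
    # 256 Float32 slots of block-local shared memory
    shared = @cuStaticSharedMem(Float32, 256)

    # global index computed from the thread/block intrinsics
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    shared[threadIdx().x] = Float32(i)
    sync_threads()

    # one thread per block prints a formatted message
    if threadIdx().x == 1
        @cuprintf("block %ld starts at value %f\n",
                  Int64(blockIdx().x), Float64(shared[1]))
    end

    # device math function, analogous to Base.sqrt
    out[i] = CUDAnative.sqrt(shared[threadIdx().x])
    return nothing
end

d_out = CuArray(zeros(Float32, 512))
@cuda blocks=2 threads=256 demo_kernel(d_out)
Array(d_out)[1:4]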

Import dependent libraries


In [10]:
using Test
using CUDAdrv, CUDAnative

In [11]:
#
# Main implementation
#

# Reduce a value across a warp
@inline function reduce_warp(op::F, val::T)::T where {F<:Function,T}
    offset = CUDAnative.warpsize() ÷ 2
    # TODO: this can be unrolled if warpsize is known...
    while offset > 0
        val = op(val, shfl_down(val, offset))
        offset ÷= 2
    end
    return val
end


Out[11]:
reduce_warp (generic function with 1 method)
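
To see why this halving loop reduces across a full warp, here is a host-side sketch (plain CPU Julia, not part of the original notebook) of the same pattern for a 32-lane warp: at each step the upper half of the active lanes is folded onto the lower half, so after log2(32) = 5 steps lane 1 holds the full result.

let vals = collect(1:32), offset = 16           # stand-ins for the 32 lane values
    while offset > 0
        for lane in 1:offset
            vals[lane] += vals[lane + offset]   # mimics op(val, shfl_down(val, offset))
        end
        offset ÷= 2
    end
    vals[1] == sum(1:32)                        # true: lane 1 ends up with 528
end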

In [12]:
# Reduce a value across a block, using shared memory for communication
@inline function reduce_block(op::F, val::T)::T where {F<:Function,T}
    # shared mem for 32 partial sums
    shared = @cuStaticSharedMem(T, 32)

    wid, lane = fldmod1(threadIdx().x, CUDAnative.warpsize())

    # each warp performs partial reduction
    val = reduce_warp(op, val)

    # write reduced value to shared memory
    if lane == 1
        @inbounds shared[wid] = val
    end

    # wait for all partial reductions
    sync_threads()

    # read from shared memory only if that warp existed
    @inbounds val = (threadIdx().x <= fld(blockDim().x, CUDAnative.warpsize())) ? shared[lane] : zero(T)

    # final reduce within first warp
    if wid == 1
        val = reduce_warp(op, val)
    end

    return val
end


Out[12]:
reduce_block (generic function with 1 method)

In [13]:
# Reduce an array across a complete grid
function reduce_grid(op::F, input::CuDeviceVector{T}, output::CuDeviceVector{T},
                     len::Integer) where {F<:Function,T}

    # TODO: neutral element depends on the operator (see Base's 2 and 3 argument `reduce`)
    val = zero(T)

    # reduce multiple elements per thread (grid-stride loop)
    # TODO: step range (see JuliaGPU/CUDAnative.jl#12)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    step = blockDim().x * gridDim().x
    while i <= len
        @inbounds val = op(val, input[i])
        i += step
    end

    val = reduce_block(op, val)

    if threadIdx().x == 1
        @inbounds output[blockIdx().x] = val
    end

    return
end


Out[13]:
reduce_grid (generic function with 1 method)
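
To make the grid-stride loop concrete, here is a small host-side sketch (illustrative only, not part of the original notebook) using a toy configuration of 2 blocks of 4 threads and len = 20: the stride is blockDim * gridDim = 8, and together the threads visit every index exactly once, however much larger len is than the grid.

covered = Int[]
for block in 1:2, thread in 1:4
    i = (block - 1) * 4 + thread        # this thread's starting index
    while i <= 20                       # len = 20 in this toy example
        push!(covered, i)
        i += 4 * 2                      # step = blockDim().x * gridDim().x
    end
end
sort(covered) == collect(1:20)          # true: full coverage, no duplicates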

In [14]:
"""
Reduce a large array.
Kepler-specific implementation, i.e. you need sm_30 or higher to run this code.
"""
function gpu_reduce(op::Function, input::CuVector{T}, output::CuVector{T}) where {T}
    len = length(input)

    # TODO: these values are hardware-dependent, with recent GPUs supporting more threads
    threads = 512
    blocks = min((len + threads - 1) ÷ threads, 1024)

    # the output array must have a size equal to or larger than the number of thread blocks
    # in the grid because each block writes to a unique location within the array.
    if length(output) < blocks
        throw(ArgumentError("output array too small, should be at least $blocks elements"))
    end

    @cuda blocks=blocks threads=threads reduce_grid(op, input, output, len)
    @cuda threads=1024 reduce_grid(op, output, output, blocks)
end


Out[14]:
gpu_reduce (generic function with 1 method)
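
A quick illustrative calculation (not part of the original notebook) for the 10^7-element input used below: with threads = 512, blocks = min((10^7 + 511) ÷ 512, 1024) = min(19532, 1024) = 1024. The first @cuda launch therefore writes 1024 per-block partial sums into output, and the second launch (a single block of 1024 threads) reduces those partial sums in place, leaving the grand total in output[1]; this is why the final result is read back with Array(gpu_output)[1] later in the notebook.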

In [15]:
if capability(device()) < v"3.0"
    @warn("this example requires a newer GPU")
    exit(0)
end

In [16]:
len = 10^7
input = ones(Int32, len)


Out[16]:
10000000-element Array{Int32,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

CPU Reduce example


In [17]:
cpu_val = reduce(+, input)


Out[17]:
10000000

Specify target CUDA device and context


In [18]:
dev = CuDevice(0)
ctx = CuContext(dev)


Out[18]:
CuContext(Ptr{Nothing} @0x00005558adc2eb80, true, true)

CUDAnative Reduce example


In [19]:
gpu_input = CuArray(input)
gpu_output = similar(gpu_input)
gpu_reduce(+, gpu_input, gpu_output)

In [20]:
Array(gpu_output)[1]


Out[20]:
10000000

Run the test


In [21]:
let
    gpu_input = CuArray(input)
    gpu_output = similar(gpu_input)
    gpu_reduce(+, gpu_input, gpu_output)
    gpu_val = Array(gpu_output)[1]
    @assert cpu_val == gpu_val
    Test.@test cpu_val == gpu_val
end


Out[21]:
Test Passed

Free up the GPU resources


In [22]:
destroy!(ctx)