Julia -- CUDA Native programming

Author: Dr. Rahul Remanan

  • A test notebook for the CUDAnative package in Julia.
  • Based on the Julia CUDAnative reduce example.
  • CUDA is NVIDIA's platform for general-purpose computing on GPUs.
  • Using CUDAnative, high-performance CUDA kernels can be written directly in Julia.
  • CUDAnative sits at the same abstraction level as CUDA C.
  • Tightly integrated with the Julia compiler and the LLVM framework.

Julia version check


In [1]:
versioninfo()


Julia Version 1.1.0-DEV.255
Commit 26b6a5811c (2018-09-13 21:44 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)

Part 01: Vector addition using the GPU -- Using CUDAdrv.jl

  • A small demo of GPU programming capabilities in Julia.
  • Uses functionality from CUDAdrv.jl
  • A user-friendly wrapper for interacting with CUDA hardware
  • Provides an array type: CuArray
  • Performs memory management through integration with Julia's garbage collector
  • Provides an @elapsed macro based on GPU events (see the sketch after this list)
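
A minimal sketch (not part of the original notebook) of the CuArray type and event-based timing. It assumes device 0 and that this version of CUDAdrv provides the event-based @elapsed macro; the measured time will vary by hardware.

using CUDAdrv

dev = CuDevice(0)
ctx = CuContext(dev)

x = rand(Float32, 2^20)
d_x = CuArray(x)                  # upload; device memory is reclaimed via Julia's GC
t = CUDAdrv.@elapsed Array(d_x)   # time a device-to-host copy with CUDA events
println("device-to-host copy took $t seconds")

destroy!(ctx)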

Import dependent libraries


In [2]:
using CUDAdrv, CUDAnative
using Test

Define a vector addition function as a CUDA kernel


In [3]:
function kernel_vadd(a, b, c)
    # from CUDAnative: (implicit) CuDeviceArray type,
    #                  and thread/block intrinsics
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]

    return nothing
end


Out[3]:
kernel_vadd (generic function with 1 method)

Specify target CUDA device and context


In [4]:
dev = CuDevice(0)
ctx = CuContext(dev)


Out[4]:
CuContext(Ptr{Nothing} @0x00005558ab9cb880, true, true)

Generate some data


In [5]:
len = 512
a = rand(Int, len)
b = rand(Int, len)


Out[5]:
512-element Array{Int64,1}:
  1151473784990668484
 -6692638737358339108
  5495901842879926995
 -4365724979076222052
  4132301730327605992
   409785175223872992
  4982314156775213577
  2058401323534442502
  4718958573725913407
  2205070514877316856
  4931703306446677228
  4204568393402667677
  1051147931774966124
                    ⋮
  4978556971122208675
  4612504113036714519
   102540236536864854
 -3423040325127731899
 -3536441245098491419
   730745698407999672
  6057859308388106197
   395626996951311529
  2107981004745222417
   585002564411366184
  2052220610761356901
  -657052027309837740

Allocate resources and upload the variables to the GPU


In [6]:
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)


Out[6]:
512-element CuArray{Int64,1}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

Evaluate vector addition on the CPU


In [7]:
cpu_vec_add = a + b


Out[7]:
512-element Array{Int64,1}:
 -5755846246263059736
  3442839440114678137
  7051898515014027772
 -8748231496797167349
  2534520827018767100
 -6952912482972500279
 -4523992071379349402
  1620950993980060908
  -518243859391571720
 -3893544391639956498
  4604472282854575839
  2972815275239018202
 -3390882875002673091
                    ⋮
 -3117296677987418979
  1936660235719613108
 -9202933128196141761
 -2308148397874079164
  5741125414648568738
 -2898776153425978701
 -4345102733848115266
 -7858494769315516813
  6958912400460114155
  -994407698588199230
 -7845301211386210242
  1223351249926011761

Run the code on the GPU and fetch the results


In [8]:
@cuda threads=(len) kernel_vadd(d_a, d_b, d_c)    # from CUDAnative.jl
gpu_vec_add = Array(d_c)
@test gpu_vec_add == cpu_vec_add


Out[8]:
Test Passed

Free up the GPU resources


In [9]:
destroy!(ctx)

Part 02: Parallel reduction using CUDA kernel -- Using CUDAnative.jl

  • CUDAnative.jl takes care of all things related to native GPU programming:

    1) interfacing with Julia: repurpose the compiler to emit GPU-compatible LLVM IR (no calls to CPU libraries, simplified exceptions, …)

    2) interfacing with LLVM (using LLVM.jl): optimize the IR, and compile to PTX

    3) interfacing with CUDA (using CUDAdrv.jl): compile PTX to SASS, and upload it to the GPU

  • These functionalities are hidden behind the call to @cuda

  • Intrinsics: special functions and macros that provide functionality hard or impossible to express using normal functions. For example, the {thread,block,grid}{Idx,Dim} functions provide access to the size and index of each level of work.

  • Creating local shared memory: @cuStaticSharedMem and @cuDynamicSharedMem macros

  • Display formatted strings from within a kernel function: @cuprintf

  • Math functions similar to those in the Julia standard library (a short kernel illustrating these intrinsics is sketched after this list)
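
As an illustration of the intrinsics listed above, here is a toy kernel (not part of the original notebook). It assumes the same CUDAnative/CUDAdrv setup as the surrounding cells, an active CUDA context, and a 64-bit platform for the @cuprintf format specifiers.

using CUDAdrv, CUDAnative

function demo_kernel(out)
    # 256 Float32 slots of block-local shared memory
    shared = @cuStaticSharedMem(Float32, 256)

    # global index computed from the thread/block intrinsics
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    shared[threadIdx().x] = Float32(i)
    sync_threads()

    # one thread per block prints a formatted message
    if threadIdx().x == 1
        @cuprintf("block %ld starts at value %f\n",
                  Int64(blockIdx().x), Float64(shared[1]))
    end

    # device math function, analogous to Base.sqrt
    out[i] = CUDAnative.sqrt(shared[threadIdx().x])
    return nothing
end

d_out = CuArray(zeros(Float32, 512))
@cuda blocks=2 threads=256 demo_kernel(d_out)
Array(d_out)[1:4]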

Import dependent libraries


In [10]:
using Test
using CUDAdrv, CUDAnative

In [11]:
#
# Main implementation
#

# Reduce a value across a warp
@inline function reduce_warp(op::F, val::T)::T where {F<:Function,T}
    offset = CUDAnative.warpsize() ÷ 2
    # TODO: this can be unrolled if warpsize is known...
    while offset > 0
        val = op(val, shfl_down(val, offset))
        offset ÷= 2
    end
    return val
end


Out[11]:
reduce_warp (generic function with 1 method)
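
To see why this halving loop reduces across a full warp, here is a host-side sketch (plain CPU Julia, not part of the original notebook) of the same pattern for a 32-lane warp: at each step the upper half of the active lanes is folded onto the lower half, so after log2(32) = 5 steps lane 1 holds the full result.

let vals = collect(1:32), offset = 16           # stand-ins for the 32 lane values
    while offset > 0
        for lane in 1:offset
            vals[lane] += vals[lane + offset]   # mimics op(val, shfl_down(val, offset))
        end
        offset ÷= 2
    end
    vals[1] == sum(1:32)                        # true: lane 1 ends up with 528
end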

In [12]:
# Reduce a value across a block, using shared memory for communication
@inline function reduce_block(op::F, val::T)::T where {F<:Function,T}
    # shared mem for 32 partial sums
    shared = @cuStaticSharedMem(T, 32)

    wid, lane = fldmod1(threadIdx().x, CUDAnative.warpsize())

    # each warp performs partial reduction
    val = reduce_warp(op, val)

    # write reduced value to shared memory
    if lane == 1
        @inbounds shared[wid] = val
    end

    # wait for all partial reductions
    sync_threads()

    # read from shared memory only if that warp existed
    @inbounds val = (threadIdx().x <= fld(blockDim().x, CUDAnative.warpsize())) ? shared[lane] : zero(T)

    # final reduce within first warp
    if wid == 1
        val = reduce_warp(op, val)
    end

    return val
end


Out[12]:
reduce_block (generic function with 1 method)

In [13]:
# Reduce an array across a complete grid
function reduce_grid(op::F, input::CuDeviceVector{T}, output::CuDeviceVector{T},
                     len::Integer) where {F<:Function,T}

    # TODO: neutral element depends on the operator (see Base's 2 and 3 argument `reduce`)
    val = zero(T)

    # reduce multiple elements per thread (grid-stride loop)
    # TODO: step range (see JuliaGPU/CUDAnative.jl#12)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    step = blockDim().x * gridDim().x
    while i <= len
        @inbounds val = op(val, input[i])
        i += step
    end

    val = reduce_block(op, val)

    if threadIdx().x == 1
        @inbounds output[blockIdx().x] = val
    end

    return
end


Out[13]:
reduce_grid (generic function with 1 method)
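
To make the grid-stride loop concrete, here is a small host-side sketch (illustrative only, not part of the original notebook) using a toy configuration of 2 blocks of 4 threads and len = 20: the stride is blockDim * gridDim = 8, and together the threads visit every index exactly once, however much larger len is than the grid.

covered = Int[]
for block in 1:2, thread in 1:4
    i = (block - 1) * 4 + thread        # this thread's starting index
    while i <= 20                       # len = 20 in this toy example
        push!(covered, i)
        i += 4 * 2                      # step = blockDim().x * gridDim().x
    end
end
sort(covered) == collect(1:20)          # true: full coverage, no duplicates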

In [14]:
"""
Reduce a large array.
Kepler-specific implementation, i.e. you need sm_30 or higher to run this code.
"""
function gpu_reduce(op::Function, input::CuVector{T}, output::CuVector{T}) where {T}
    len = length(input)

    # TODO: these values are hardware-dependent, with recent GPUs supporting more threads
    threads = 512
    blocks = min((len + threads - 1) ÷ threads, 1024)

    # the output array must have a size equal to or larger than the number of thread blocks
    # in the grid because each block writes to a unique location within the array.
    if length(output) < blocks
        throw(ArgumentError("output array too small, should be at least $blocks elements"))
    end

    @cuda blocks=blocks threads=threads reduce_grid(op, input, output, len)
    @cuda threads=1024 reduce_grid(op, output, output, blocks)
end


Out[14]:
gpu_reduce (generic function with 1 method)
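
A quick illustrative calculation (not part of the original notebook) for the 10^7-element input used below: with threads = 512, blocks = min((10^7 + 511) ÷ 512, 1024) = min(19532, 1024) = 1024. The first @cuda launch therefore writes 1024 per-block partial sums into output, and the second launch (a single block of 1024 threads) reduces those partial sums in place, leaving the grand total in output[1]; this is why the final result is read back with Array(gpu_output)[1] later in the notebook.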

In [15]:
if capability(device()) < v"3.0"
    @warn("this example requires a newer GPU")
    exit(0)
end

In [16]:
len = 10^7
input = ones(Int32, len)


Out[16]:
10000000-element Array{Int32,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

CPU Reduce example


In [17]:
cpu_val = reduce(+, input)


Out[17]:
10000000

Specify target CUDA device and context


In [18]:
dev = CuDevice(0)
ctx = CuContext(dev)


Out[18]:
CuContext(Ptr{Nothing} @0x00005558adc2eb80, true, true)

CUDAnative Reduce example


In [19]:
gpu_input = CuArray(input)
gpu_output = similar(gpu_input)
gpu_reduce(+, gpu_input, gpu_output)

In [20]:
Array(gpu_output)[1]


Out[20]:
10000000

Run the test


In [21]:
let
    gpu_input = CuArray(input)
    gpu_output = similar(gpu_input)
    gpu_reduce(+, gpu_input, gpu_output)
    gpu_val = Array(gpu_output)[1]
    @assert cpu_val == gpu_val
    Test.@test cpu_val == gpu_val
end


Out[21]:
Test Passed

Free up the GPU resources


In [22]:
destroy!(ctx)