Machine Representation of Numbers

Julia supports multiple floating-point types, including Float16 (half), Float32 (single), and Float64 (double precision). Half precision is intended primarily as a storage format: when a Float16 value is involved in a computation, it is automatically promoted to Float32.


In [1]:
@printf("size of Float64 1.0 is %d.\n", sizeof(1.0))
@printf("size of Float32 1.0 is %d.\n", sizeof(float32(1.0)))
@printf("size of Float16 1.0 is %d.\n", sizeof(float16(1.0)))


size of Float64 1.0 is 8.
size of Float32 1.0 is 4.
size of Float16 1.0 is 2.
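
As a quick check of the promotion rule (a minimal sketch; on the Julia version used in this notebook, Float16 arithmetic is carried out in Float32):

a = float16(1.0)
b = float16(2.0)
typeof(a + b)   # Float32: the Float16 operands are promoted before the add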

NaN and Inf (as well as -Inf) are special floating-point values (IEC 559 / IEEE 754); they can be converted to every floating-point type and used in arithmetic operations.


In [2]:
# both the %d and %lf format specifiers print Inf and NaN as-is
@printf("Inf * Inf + Inf = %d\n", Inf * Inf + Inf)
@printf("Inf + (-Inf)    = %d\n", Inf + (-Inf))
@printf("Inf * Inf + Inf = %lf\n", Inf * Inf + Inf)
@printf("Inf + (-Inf)    = %lf\n", Inf + (-Inf))


Inf * Inf + Inf = Inf
Inf + (-Inf)    = NaN
Inf * Inf + Inf = Inf
Inf + (-Inf)    = NaN
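
One more property worth noting (a small supplementary check, guaranteed by IEEE 754 semantics): NaN compares unequal to everything, including itself, so isnan is the right test.

NaN == NaN      # false
NaN != NaN      # true
isnan(NaN32)    # true: holds across all floating-point types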

The function eps gives the machine accuracy, the spacing between 1.0 and the next representable number. For single and double precision, eps is float32(2.0^-23) and 2.0^-52 respectively. Note that eps can also be applied to Float16; the result is itself a half-precision number.


In [3]:
@printf("%4.52lf\n", eps(Float16))
@printf("%4.52lf\n" ,eps(Float32))
@printf("%4.52lf\n", eps(Float64))


0.0009765625000000000000000000000000000000000000000000
0.0000001192092895507812500000000000000000000000000000
0.0000000000000002220446049250313080847263336181640625
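
We can confirm the powers of two behind these decimal expansions directly (a small check; eps(Float16) is 2^-10):

eps(Float16) == float16(2.0^-10)   # true
eps(Float32) == float32(2.0^-23)   # true
eps(Float64) == 2.0^-52            # true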

The default rounding mode is RoundNearest; to change the mode, we can use with_rounding. In the following example, 1.2000000000000001 cannot be represented exactly, so Julia rounds it to the nearest representable floating-point number.


In [4]:
# unable to represent, use default rounding
@printf("%4.16lf\n", 1.2000000000000001)

# rounding up
with_rounding(Float64, RoundUp) do 
    @printf("%4.16lf\n", 1.1 + 0.1)
end

# rounding down
with_rounding(Float64, RoundDown) do 
    @printf("%4.16lf\n", 1.1 + 0.1)
end


1.2000000000000002
1.2000000000000002
1.2000000000000000
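
Directed rounding can also bracket a result from both sides, a minimal form of interval arithmetic. A small sketch using the same with_rounding API:

lo = with_rounding(Float64, RoundDown) do
    1.1 + 0.1
end
hi = with_rounding(Float64, RoundUp) do
    1.1 + 0.1
end
# the exact sum of the two stored values lies in [lo, hi]
lo <= hi   # true, and hi - lo is one unit in the last place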

To understand machine numbers, let us take a look at the IEEE 754 layout for single precision (Float32): one sign bit, 8 exponent bits, and 23 fraction (significand) bits. Single precision ranges from -Inf32 to Inf32.


In [5]:
-typemax(Float32) == typemin(Float32)


Out[5]:
true
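
Besides the infinities, the largest and smallest positive finite (normalized) values are also exposed (a small supplement using the realmax and realmin functions of this Julia version):

realmax(Float32)   # 3.4028235f38, the largest finite Float32
realmin(Float32)   # 1.1754944f-38, the smallest positive normalized Float32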

Julia exposes the bit-level representation of a floating-point number through the bits function.


In [6]:
# The Float32 literal 1.000001f0 is first rounded to 1.0000009536743164
@printf("1.000001f0 is rounded to %4.16lf\n", 1.000001f0)

# And then its bit representation as 0 01111111 00000000000000000001000
@printf("its bits representation is %s\n", bits(1.000001f0))

# The representation reads: sign bit 0 (positive), exponent bits 01111111 = 127 in base 10,
# and fraction bits encoding 1 + 2^(-20).
# With exponent bias 127, the value is (1 + 2^(-20)) * 2^(127 - 127)
@printf("Convert to Float32 %4.16lf\n", float32(1 + 2.0^(-20)))


1.000001f0 is rounded to 1.0000009536743164
its bits representation is 00111111100000000000000000001000
Convert to Float32 1.0000009536743164
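
To make the 1/8/23 field split explicit, we can slice the string returned by bits (an illustrative sketch; parseint converts the exponent bits from base 2):

b = bits(1.000001f0)
b[1:1]                # sign bit: "0", positive
parseint(b[2:9], 2)   # exponent bits: 127, i.e. unbiased exponent 0
b[10:32]              # the 23 fraction bits, "...0001000" = 2^(-20)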

According to the IEEE standard, Inf32 is encoded as 0 11111111 00000000000000000000000; if the exponent bits are all ones and any of the 23 fraction bits is nonzero, the value is a NaN. Julia uses 0 11111111 10000000000000000000000 as its canonical NaN.


In [7]:
# bits of Inf32
@printf("%s\n", bits(Inf32))
# bits of NaN32
@printf("%s\n", bits(NaN32))


01111111100000000000000000000000
01111111110000000000000000000000
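
We can go the other way and build these values from raw bit patterns with reinterpret (a small sketch; any nonzero fraction under an all-ones exponent yields a NaN):

reinterpret(Float32, 0x7f800000)          # Inf32: all-ones exponent, zero fraction
isnan(reinterpret(Float32, 0x7f800001))   # true: a nonzero fraction makes a NaN
reinterpret(Uint32, NaN32) == 0x7fc00000  # true: Julia's canonical NaN payload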

Fixed-Precision Arithmetic

We are not going to discuss arbitrary-precision arithmetic here; there are many multiple-precision libraries available nowadays.

Summation over an Array

A simple accumulation loop like the following can be a bad idea. We shall discuss each algorithm's accuracy as well as its cost and stability.


In [8]:
# simple loop summation over an array (double precision)
function simple_sum(arr::AbstractArray, first::Int, last::Int)
    b = arr[first];
    for i = first + 1 : last
        @inbounds b += arr[i]
    end
    return b
end

function pair_sum(arr::AbstractArray, first::Int, last::Int)
    # pairwise (cascade) summation: divide and conquer,
    # falling back to the simple loop for small blocks
    if first + 1024 >= last
        return simple_sum(arr, first, last)
    end
    mid = (last + first) >>> 1;
    return pair_sum(arr, first, mid) + pair_sum(arr, mid + 1, last)
end

function kahan_sum(arr::AbstractArray)
    # compensated (Kahan) summation; the branch below is Neumaier's variant,
    # in which c accumulates the low-order bits lost at each addition
    n = length(arr)
    if n == 0
        return 0.
    end
    @inbounds s = arr[1]
    c = 0.
    for i = 2:n
        @inbounds arr_i = arr[i]
        t = s + arr_i
        if abs(s) >= abs(arr_i)
            c += ((s-t) + arr_i)
        else
            c += ((arr_i-t) + s)
        end
        s = t
    end
    return s + c
end


Out[8]:
kahan_sum (generic function with 1 method)
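
Before benchmarking, a pathological input shows what compensation buys (an illustrative check of the routines above; the branch in kahan_sum is Neumaier's variant, which also handles terms larger than the running sum):

catastrophic = [1.0, 1.0e100, 1.0, -1.0e100]   # exact sum is 2.0
simple_sum(catastrophic, 1, 4)   # 0.0: both 1.0s are absorbed and lost
kahan_sum(catastrophic)          # 2.0: the compensation term recovers them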

Consider the worst case of the routine simple_sum above: the accumulated error grows as O(eps n), where eps is the machine accuracy. On the other hand, the computational cost is fine, O(n). However, there are other summation algorithms, such as pairwise summation and Kahan summation. Pairwise summation is simply divide and conquer, yet it reduces the error growth to O(eps log n); Kahan summation reduces the error to the machine-accuracy level, O(eps), at the price of being slower than the other methods.

In terms of cost, Kahan summation is the most expensive; the others are all O(n). Pairwise (cascade) summation, however, can be parallelized.

The accuracy of each routine deserves a closer look. With Kahan's method, the running compensation keeps every intermediate sum very accurate, so it is the method of choice when performance is not the goal. For the simple summation method, by contrast, the ordering of the input array can play a great role in controlling the error.

First, consider why the summation of the harmonic series converges on a fixed-precision computer (possibly after a long time). When adding two numbers a + b, if |a| is much larger than |b|, then b may not contribute to the sum at all. In the following case, 10^(-15) is too small relative to log(10^15) to survive the addition.


In [9]:
log(10^15) + 0.000000000000001 - log(10^15) == 0.000000000000001


Out[9]:
false
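
The threshold for this absorption is set by eps: once |a| reaches roughly |b|/eps, adding b leaves a unchanged. A minimal Float64 check:

2.0^53 + 1.0 == 2.0^53           # true: 1.0 vanishes next to 2^53
(2.0^53 + 2.0) - 2.0^53 == 2.0   # true: 2.0 is still large enough to register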

Therefore, the ordering really matters here. If all numbers of the array are positive, then ascending order will produce less error than any other ordering, especially descending order.


In [10]:
rand_arr = rand(500000);

# ascending order gives a better result
sorted_rand_arr = sort(rand_arr,rev=false)

# simple loop 
@printf("simple sum:\n")
@time @inbounds ret_1 = simple_sum(rand_arr, 1, length(rand_arr))
@time @inbounds sorted_ret_1 = simple_sum(sorted_rand_arr, 1, length(rand_arr))


@printf("\npairwise sum:\n")
# pairwise sum; falls back to the simple loop for small blocks
@time @inbounds ret_2 = pair_sum(rand_arr, 1, length(rand_arr))
@time @inbounds sorted_ret_2 = pair_sum(sorted_rand_arr, 1, length(rand_arr))

# Julia's mapreduce
@printf("\nJulia's mapreduce:\n")
@time @inbounds ret_3 = sum(rand_arr)
@time @inbounds sorted_ret_3 = sum(sorted_rand_arr)

# Kahan 
@printf("\nKahan sum:\n")
@time @inbounds ret_4 = kahan_sum(rand_arr)
@time @inbounds sorted_ret_4 = kahan_sum(sorted_rand_arr)


@printf("\nunsorted simple   sum is %4.16lf.\n",ret_1);
@printf("\nsorted simple     sum is %4.16lf.\n",sorted_ret_1);
@printf("\nunsorted pairwise sum is %4.16lf.\n",ret_2);
@printf("\nsorted pairwise   sum is %4.16lf.\n",sorted_ret_2);
@printf("\nunsorted Julia's  sum is %4.16lf.\n",ret_3);
@printf("\nsorted Julia's    sum is %4.16lf.\n",sorted_ret_3);
@printf("\nunsorted Kahan    sum is %4.16lf.\n",ret_4);
@printf("\nsorted Kahan      sum is %4.16lf.\n",sorted_ret_4);


simple sum:
elapsed time: 0.005809853 seconds (106924 bytes allocated)
elapsed time: 0.000699786 seconds (160 bytes allocated)

pairwise sum:
elapsed time: 0.00277834 seconds (59772 bytes allocated)
elapsed time: 0.000565415 seconds (160 bytes allocated)

Julia's mapreduce:
elapsed time: 0.015840218 seconds (806376 bytes allocated)
elapsed time: 0.000323421 seconds (144 bytes allocated)

Kahan sum:
elapsed time: 0.00656717 seconds (181864 bytes allocated)
elapsed time: 0.001592985 seconds (144 bytes allocated)

unsorted simple   sum is 250069.8555895331955981.

sorted simple     sum is 250069.8555895321769640.

unsorted pairwise sum is 250069.8555895306926686.

sorted pairwise   sum is 250069.8555895307217725.

unsorted Julia's  sum is 250069.8555895299941767.

sorted Julia's    sum is 250069.8555895304889418.

unsorted Kahan    sum is 250069.8555895306926686.

sorted Kahan      sum is 250069.8555895306926686.

Comparison

Here we can see that Julia ships a faster summation algorithm, so let's look at how Julia deals with summation. The following is a simplified version of Julia's summation (see base/reduce.jl); the original code uses mapreduce to avoid temporary floating-point values. The technique used here is called loop unrolling: a code optimization that trades code size for performance by reducing the instructions that control the loop and hiding memory latency.

The following code unrolls the loop into a four-way summation and achieves roughly a 2x speedup. However, even though the performance is pleasant, the error growth remains at the O(eps n) level.


In [11]:
function julia_sum(a::AbstractArray, ifirst::Int, ilast::Int)
    # assumes at least two elements, i.e. ilast > ifirst
    @inbounds if ifirst + 6 >= ilast # length(a) < 8
        i = ifirst
        s = a[i] + a[i+1]
        i = i + 1
        while i < ilast
            s += a[i+=1]
        end
        return s
    else # length(a) >= 8, manual four-way unrolling
        @inbounds s1 = a[ifirst]     + a[ifirst + 4]
        @inbounds s2 = a[ifirst + 1] + a[ifirst + 5]
        @inbounds s3 = a[ifirst + 2] + a[ifirst + 6]
        @inbounds s4 = a[ifirst + 3] + a[ifirst + 7]
        i = ifirst + 8
        il = ilast - 3
        # main unrolled loop: four independent accumulators
        while i <= il
            @inbounds s1 += a[i]
            @inbounds s2 += a[i+1]
            @inbounds s3 += a[i+2]
            @inbounds s4 += a[i+3]
            i += 4
        end
        # mop up the remaining (at most 3) elements
        while i <= ilast
            @inbounds s1 += a[i]
            i += 1
        end
        return s1 + s2 + s3 + s4
    end
end


Out[11]:
julia_sum (generic function with 1 method)

In [19]:
rand_arr = rand(500000);

# ascending order gives a better result
sorted_rand_arr = sort(rand_arr,rev=false)

# simplified unrolled summation (julia_sum defined above)
@printf("\nunrolled julia_sum:\n")
@time @inbounds ret = julia_sum(rand_arr,1,500000)
@time @inbounds sorted_ret = julia_sum(sorted_rand_arr, 1, 500000)


unrolled julia_sum:
elapsed time: 0.000383906 seconds (96 bytes allocated)
elapsed time: 0.000350468 seconds (96 bytes allocated)

Although the compensated method seems to take much longer than the other methods, we still need it for high-accuracy computing such as long-term integration: a simple loop will blow up the accumulated error at some point, while Kahan summation never damages the result by much. The following example integrates sin(x) over [0, pi] using both the simple loop and Kahan summation. We can see the error of the simple loop begin to rise in the last case, which would eventually produce a U-shaped error plot, while Kahan summation keeps the error decreasing.


In [14]:
# Caution: this may take a while
f(x) = sin(x)

@time for i = 1:8
    SIZE = 10^i
    nodes = zeros(SIZE)
    
    # most of the time is spent evaluating the integrand
    for j = 1:SIZE
        @inbounds nodes[j] = f(j*pi/SIZE)*pi/SIZE
    end

    
    # the exact value of the integral is 2.0
    ret_1 = simple_sum(nodes,1,SIZE)
    ret_2 = kahan_sum(nodes)
    # unsorted julia summation
    ret_3 = julia_sum(nodes,1, SIZE)
    # sorted julia summation
    sort!(nodes, rev=false)
    ret_4 = julia_sum(nodes,1,SIZE)
    @printf("%8d\t%4.16lf \t%4.16lf\t%4.16lf\t%4.16lf\n", 
        i, log(abs(ret_1 - 2.0)), log(abs(ret_2 - 2.0)), log(abs(ret_3 - 2.0)),log(abs(ret_4 - 2.0)))
end


       1	-4.1058224322423200 	-4.1058224322423333	-4.1058224322423200	-4.1058224322423333
       2	-8.7126236199139360 	-8.7126236199139360	-8.7126236199139360	-8.7126236199125859
       3	-13.3178100916878037 	-13.3178100910128698	-13.3178100907428956	-13.3178100907428956
       4	-17.9229807587693628 	-17.9229804348006709	-17.9229805022941413	-17.9229805292915287
       5	-22.5280972348821109 	-22.5281512282301435	-22.5281431290420855	-22.5281228813589287
       6	-27.1296995302229007 	-27.1333376127911414	-27.1321234473805966	-27.1379378103230664
       7	-30.4341815939321947 	-31.7395882959129878	-31.3251545178220603	-30.4452314301187812
       8	-27.8666995208913768 	-36.0436533891171536	-29.3946688390923789	-30.5142243016057328
elapsed time: 16.852989512 seconds (888997872 bytes allocated, 0.37% gc time)

We can see the effect of the Kahan method above: while simple loop summation accumulates error, kahan_sum still drives the curve down. As for the comparison between sorted and unsorted Julia summation, the error curve can be flattened by sorting the data before operating on it.

Random Numbers

There are no truly random numbers in Julia; the pseudorandom generator uses the Double-precision SIMD-oriented Fast Mersenne Twister (dSFMT) (see the dSFMT paper), which is wrapped in base/dSFMT.jl. C++ users are probably more familiar with its relatives mt19937 and mt19937_64.
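
For reproducible streams, the global dSFMT state can be seeded (a minimal sketch using srand from this Julia version):

srand(1234)     # seed the global generator
x = rand()      # uniform sample in [0, 1)
srand(1234)     # reseeding replays the stream
x == rand()     # true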

Julia also provides a randn function, which uses the ziggurat algorithm, a rejection-sampling method. For more detailed information about RNGs, please refer to the chapter on probability.
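
A quick sanity check of randn (illustrative; with a large sample the empirical moments should approach those of the standard normal):

z = randn(10^6)    # standard normal samples drawn via the ziggurat method
mean(z), std(z)    # approximately (0.0, 1.0)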