3976 Assignment 1: Desigining a Float128 type

Please complete the following assignment in Jupyter, and email the completed .ipynb file to Sheehan.Olver@sydney.edu.au by midnight on 14 April 2016.

You are strongly encouraged to work out the problems on your own. If you use existing solutions, existing code, help online or collaborate with other students, you will still receive full credit provided that you make this clear and cite your sources. If you collaborate with other students, please submit individual solutions and be specific in giving credit for what each student contributed.

There are two ways to complete the assignment: using bits and string manipulation, or using <<, >> on UInt128s. Either approach will be accepted, though the latter is faster and hence recommended.

In this assignment you will construct a Float128 bits-type, represented by 128 bits:


In [4]:
import Base: show, bits, exponent, significand, rand, +, -, *, /, <, >, sign, BigFloat, inv, Float64

bitstype 128 Float128 <: AbstractFloat

For this types, we will interpret the 128-bits in data as: 1 sign bit, 63 exponent bits and 64 signficiand bits. That is, numbers will be represented by

$$x = \pm 2^{q-S}*(1.b_1b_2b_3b_4\ldots b_{64})_2$$

where $S = 2^62-1$, $q$ is an unsigned 63-bit integer in the 2nd through 64th bits and $b_1b_2\cdots b_{64}$ are in the last 64 bits.

We will not implement Inf, NaN or subnormal numbers. We will however implement both $±0$.

We will use a globally defined constant S to represent $S$:


In [5]:
const S=Int64(2)^Int64(62)-Int64(1)


Out[5]:
4611686018427387903

We will use the bits to get access to the bits. The following function defines bits for a Float8:


In [7]:
bits(x::Float128)=bits(reinterpret(UInt128,x))


Out[7]:
bits (generic function with 6 methods)

Exercise 1

(a) Complete the following exponent function, that returns $q-S$ (as an Int64).

(b) Check that

y=parse(UInt128,"01000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
x=reinterpret(Float128,y)
exponent(x)

returns 2.


In [60]:
function exponent(x::Float128)
    y=reinterpret(UInt128,x)
    
    ## TODO: interpret the bits and return an integer
end


Out[60]:
exponent (generic function with 6 methods)

Exercise 2

Explain the following significand_uint64 function, that returns the significand bits of a Float128 as a UInt64.


In [11]:
function significand_uint64(x::Float128)
    y=reinterpret(UInt128,x)
    y % UInt64 
end


Out[11]:
significand_uint64 (generic function with 1 method)

Exercise 3

Complete the following function sign to return the sign of a Float128:


In [14]:
function sign(x::Float128)
    #TODO: return the sign of x as +1 or -1
end


Out[14]:
sign (generic function with 8 methods)

Exercise 4

An important skill to learn is how to read code that uses functionality that you are not yet familiar with.

(a) What is a BigFloat? (Hint: use ?BigFloat)

(b) Add comments to the function BigFloat(x::Float128) and show(io::IO,x::Float128) explaining how it works. Why do I display precisely 20 digits in the definition of show?

(c) Check that

y=parse(UInt128,"00111111111111111111111111111111111111111111111111111111111111110100110011001100110011001100110011001100110011001101000000000000",2)
x=reinterpret(Float128,y)

returns 1.300000000000000044


In [18]:
function BigFloat(x::Float128)
    if reinterpret(UInt128,x)==0
        0.0
    elseif reinterpret(UInt128,x)==(UInt128(1) << 127)
        -0.0
    else
        ret=BigFloat(1)
        hf=BigFloat(0.5)
        sig = bits(significand_uint64(x))
        for k=1:length(sig)
            if sig[k]=='1'
                ret+=hf
            end
            hf/=2
        end
       sign(x)*2^BigFloat(exponent(x))*ret
    end
end

function show(io::IO,x::Float128)
    str=string(BigFloat(x))
    print(io,str[1:20])
    k=search(str,'e')           
    if k > 0
        print(io,str[k:end])
    end
end


Out[18]:
show (generic function with 106 methods)

Exercise 5

(a) Complete the function Float128(x::Float64) that converts a Float64 to a Float128

(b) Check that

Float128(2.5)

returns 2.500000000000000000 and

Float128(1.3)

returns 1.300000000000000044

(c) Explain why 2.5 was converted exactly but 1.3 was not.


In [26]:
function Float128(x::Float64)
    if x === 0.0
        reinterpret(Float128,UInt128(0))
    elseif x === -0.0
         reinterpret(Float128,UInt128(1) << 127)
    else
        #TODO: return a Float128 that equals the input
    end
end


Out[26]:
Float128

In [25]:
Float128(2.5),Float128(1.3)


Out[25]:
(2.500000000000000000,1.300000000000000044)

Exercise 6

(a) Complete the following definition Float64(x::Float128) that converts a Float128 to a Float64 by chopping

(b) Check that

Float64(Float128(1.3))

returns 1.3


In [28]:
function Float64(x::Float128)
    y=reinterpret(UInt128,x)
    
    if y == 0
        0.0
    elseif y == (UInt128(1) << 127)
        -0.0
    else
        #TODO: return a Float64 that is the input x with the extra bits chopped off
    end
end


Out[28]:
Float64

Exercise 7

In the remaining exercises, we can assume all Float128s are positive normal numbers.

(a) Complete the following addition function for two Float128s, by reducing the problem to addition with UInt128s. Round the computation towards zero.

(b) Check that

Float128(1.25)+Float128(2.5)

returns 3.750000000000000000


In [30]:
function +(x::Float128,y::Float128)
    qx=exponent(x)
    qy=exponent(y)
    
    if qx  qy
        qx=exponent(x)
        qy=exponent(y)


        newexp = UInt64(qx+S)

        two64 = UInt128(1) << 64

        sigx = UInt128(significand_uint64(x)) + two64
        sigy = UInt128(significand_uint64(y)) + two64 

        # TODO: add sigx and sigy and return the resulting Float128.  
        # Remember: sigy needs to be shifted to account for different exponents.  
    else
        y+x
    end
end


Out[30]:
+ (generic function with 172 methods)

Exercise 8

(a) Complete the following multiplication of two Float128s. Again, assume that x and y are positive normal numbers. Hint: (1+sigx)*(1+sigy) = 1 + sigx + sigy + sigx*sigy

(b) Check that Float128(1.25)*Float128(2.5) returns 3.125000000000000000.


In [61]:
function *(x::Float128,y::Float128)
    # TODO: Implement multiplication by reducing to multiplication with UInt128
end


Out[61]:
* (generic function with 139 methods)

Exercise 9

The following function computes division for Float128.

(a) Add comments explaining how the following functions work.

(b) Confirm that

Float128(1.3)/Float128(0.4)

returns 3.249999999999999930.


In [38]:
function inv(x::Float128)
    qx=exponent(x)

    sig = UInt128(significand_uint64(x))
    if sig == 0
        reinterpret(Float128,UInt128(S-qx)<<64)
    else
        newexp=UInt128(S-qx-1)<<64

        two64=UInt128(1) << 64
        sig = sig + two64
        two127=UInt128(1) << 127
        reinterpret(Float128,newexp+UInt128((div(two127,sig) << 2) % UInt64))
    end
end

function /(x::Float128,y::Float128)
    x*inv(y)
end


Out[38]:
/ (generic function with 49 methods)

Exercise 10

(a) Write a routine that calculates

$$\sum_{k=0}^{28} {1 \over k!}$$

using Float128 arithmetic, by using the / and + functions defined above.

(b) Compare the number of accurate digits to the higher accuracy approximation of e:

println(BigFloat(e))

In [ ]: