Float128 typePlease complete the following assignment in Jupyter, and email the completed .ipynb file to Sheehan.Olver@sydney.edu.au by midnight on 14 April 2016.
You are strongly encouraged to work out the problems on your own. If you use existing solutions, existing code, help online or collaborate with other students, you will still receive full credit provided that you make this clear and cite your sources. If you collaborate with other students, please submit individual solutions and be specific in giving credit for what each student contributed.
There are two ways to complete the assignment: using bits and string manipulation, or using <<, >> on UInt128s. Either approach will be accepted, though the latter is faster and hence recommended.
In this assignment you will construct a Float128 bits-type, represented by 128 bits:
In [4]:
import Base: show, bits, exponent, significand, rand, +, -, *, /, <, >, sign, BigFloat, inv, Float64
bitstype 128 Float128 <: AbstractFloat
For this types, we will interpret the 128-bits in data as: 1 sign bit, 63 exponent bits and 64 signficiand bits. That is, numbers will be represented by
where $S = 2^62-1$, $q$ is an unsigned 63-bit integer in the 2nd through 64th bits and $b_1b_2\cdots b_{64}$ are in the last 64 bits.
We will not implement Inf, NaN or subnormal numbers. We will however implement both $±0$.
We will use a globally defined constant S to represent $S$:
In [5]:
const S=Int64(2)^Int64(62)-Int64(1)
Out[5]:
We will use the bits to get access to the bits. The following function defines bits for a Float8:
In [7]:
bits(x::Float128)=bits(reinterpret(UInt128,x))
Out[7]:
Exercise 1
(a) Complete the following exponent function, that returns $q-S$ (as an Int64).
(b) Check that
y=parse(UInt128,"01000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
x=reinterpret(Float128,y)
exponent(x)
returns 2.
In [60]:
function exponent(x::Float128)
y=reinterpret(UInt128,x)
## TODO: interpret the bits and return an integer
end
Out[60]:
Exercise 2
Explain the following significand_uint64 function, that returns the significand bits of a Float128 as a UInt64.
In [11]:
function significand_uint64(x::Float128)
y=reinterpret(UInt128,x)
y % UInt64
end
Out[11]:
Exercise 3
Complete the following function sign to return the sign of a Float128:
In [14]:
function sign(x::Float128)
#TODO: return the sign of x as +1 or -1
end
Out[14]:
Exercise 4
An important skill to learn is how to read code that uses functionality that you are not yet familiar with.
(a) What is a BigFloat? (Hint: use ?BigFloat)
(b) Add comments to the function BigFloat(x::Float128) and show(io::IO,x::Float128) explaining how it works. Why do I display precisely 20 digits in the definition of show?
(c) Check that
y=parse(UInt128,"00111111111111111111111111111111111111111111111111111111111111110100110011001100110011001100110011001100110011001101000000000000",2)
x=reinterpret(Float128,y)
returns 1.300000000000000044
In [18]:
function BigFloat(x::Float128)
if reinterpret(UInt128,x)==0
0.0
elseif reinterpret(UInt128,x)==(UInt128(1) << 127)
-0.0
else
ret=BigFloat(1)
hf=BigFloat(0.5)
sig = bits(significand_uint64(x))
for k=1:length(sig)
if sig[k]=='1'
ret+=hf
end
hf/=2
end
sign(x)*2^BigFloat(exponent(x))*ret
end
end
function show(io::IO,x::Float128)
str=string(BigFloat(x))
print(io,str[1:20])
k=search(str,'e')
if k > 0
print(io,str[k:end])
end
end
Out[18]:
Exercise 5
(a) Complete the function Float128(x::Float64) that converts a Float64 to a Float128
(b) Check that
Float128(2.5)
returns 2.500000000000000000 and
Float128(1.3)
returns 1.300000000000000044
(c) Explain why 2.5 was converted exactly but 1.3 was not.
In [26]:
function Float128(x::Float64)
if x === 0.0
reinterpret(Float128,UInt128(0))
elseif x === -0.0
reinterpret(Float128,UInt128(1) << 127)
else
#TODO: return a Float128 that equals the input
end
end
Out[26]:
In [25]:
Float128(2.5),Float128(1.3)
Out[25]:
Exercise 6
(a) Complete the following definition Float64(x::Float128) that converts a Float128 to a Float64 by chopping
(b) Check that
Float64(Float128(1.3))
returns 1.3
In [28]:
function Float64(x::Float128)
y=reinterpret(UInt128,x)
if y == 0
0.0
elseif y == (UInt128(1) << 127)
-0.0
else
#TODO: return a Float64 that is the input x with the extra bits chopped off
end
end
Out[28]:
Exercise 7
In the remaining exercises, we can assume all Float128s are positive normal numbers.
(a) Complete the following addition function for two Float128s, by reducing the problem to addition with UInt128s. Round the computation towards zero.
(b) Check that
Float128(1.25)+Float128(2.5)
returns 3.750000000000000000
In [30]:
function +(x::Float128,y::Float128)
qx=exponent(x)
qy=exponent(y)
if qx ≥ qy
qx=exponent(x)
qy=exponent(y)
newexp = UInt64(qx+S)
two64 = UInt128(1) << 64
sigx = UInt128(significand_uint64(x)) + two64
sigy = UInt128(significand_uint64(y)) + two64
# TODO: add sigx and sigy and return the resulting Float128.
# Remember: sigy needs to be shifted to account for different exponents.
else
y+x
end
end
Out[30]:
Exercise 8
(a) Complete the following multiplication of two Float128s. Again, assume that x and y are positive normal numbers. Hint: (1+sigx)*(1+sigy) = 1 + sigx + sigy + sigx*sigy
(b) Check that Float128(1.25)*Float128(2.5) returns 3.125000000000000000.
In [61]:
function *(x::Float128,y::Float128)
# TODO: Implement multiplication by reducing to multiplication with UInt128
end
Out[61]:
Exercise 9
The following function computes division for Float128.
(a) Add comments explaining how the following functions work.
(b) Confirm that
Float128(1.3)/Float128(0.4)
returns 3.249999999999999930.
In [38]:
function inv(x::Float128)
qx=exponent(x)
sig = UInt128(significand_uint64(x))
if sig == 0
reinterpret(Float128,UInt128(S-qx)<<64)
else
newexp=UInt128(S-qx-1)<<64
two64=UInt128(1) << 64
sig = sig + two64
two127=UInt128(1) << 127
reinterpret(Float128,newexp+UInt128((div(two127,sig) << 2) % UInt64))
end
end
function /(x::Float128,y::Float128)
x*inv(y)
end
Out[38]:
Exercise 10
(a) Write a routine that calculates
$$\sum_{k=0}^{28} {1 \over k!}$$using Float128 arithmetic, by using the / and + functions defined above.
(b) Compare the number of accurate digits to the higher accuracy approximation of e:
println(BigFloat(e))
In [ ]: