Float128
typePlease complete the following assignment in Jupyter, and email the completed .ipynb
file to Sheehan.Olver@sydney.edu.au by midnight on 14 April 2016.
You are strongly encouraged to work out the problems on your own. If you use existing solutions, existing code, help online or collaborate with other students, you will still receive full credit provided that you make this clear and cite your sources. If you collaborate with other students, please submit individual solutions and be specific in giving credit for what each student contributed.
There are two ways to complete the assignment: using bits
and string manipulation, or using <<
, >>
on UInt128
s. Either approach will be accepted, though the latter is faster and hence recommended.
In this assignment you will construct a Float128
bits-type, represented by 128 bits:
In [4]:
import Base: show, bits, exponent, significand, rand, +, -, *, /, <, >, sign, BigFloat, inv, Float64
bitstype 128 Float128 <: AbstractFloat
For this types, we will interpret the 128-bits in data
as: 1 sign bit, 63 exponent bits and 64 signficiand bits. That is, numbers will be represented by
where $S = 2^62-1$, $q$ is an unsigned 63-bit integer in the 2nd through 64th bits and $b_1b_2\cdots b_{64}$ are in the last 64 bits.
We will not implement Inf
, NaN
or subnormal numbers. We will however implement both $±0$.
We will use a globally defined constant S
to represent $S$:
In [5]:
const S=Int64(2)^Int64(62)-Int64(1)
Out[5]:
We will use the bits
to get access to the bits. The following function defines bits
for a Float8
:
In [7]:
bits(x::Float128)=bits(reinterpret(UInt128,x))
Out[7]:
Exercise 1
(a) Complete the following exponent
function, that returns $q-S$ (as an Int64
).
(b) Check that
y=parse(UInt128,"01000000000000000000000000000000000000000000000000000000000000010010011001100110011001100110011001100110011001100110100000000000",2)
x=reinterpret(Float128,y)
exponent(x)
returns 2.
In [60]:
function exponent(x::Float128)
y=reinterpret(UInt128,x)
## TODO: interpret the bits and return an integer
end
Out[60]:
Exercise 2
Explain the following significand_uint64
function, that returns the significand bits of a Float128
as a UInt64
.
In [11]:
function significand_uint64(x::Float128)
y=reinterpret(UInt128,x)
y % UInt64
end
Out[11]:
Exercise 3
Complete the following function sign
to return the sign of a Float128
:
In [14]:
function sign(x::Float128)
#TODO: return the sign of x as +1 or -1
end
Out[14]:
Exercise 4
An important skill to learn is how to read code that uses functionality that you are not yet familiar with.
(a) What is a BigFloat
? (Hint: use ?BigFloat
)
(b) Add comments to the function BigFloat(x::Float128)
and show(io::IO,x::Float128)
explaining how it works. Why do I display precisely 20 digits in the definition of show
?
(c) Check that
y=parse(UInt128,"00111111111111111111111111111111111111111111111111111111111111110100110011001100110011001100110011001100110011001101000000000000",2)
x=reinterpret(Float128,y)
returns 1.300000000000000044
In [18]:
function BigFloat(x::Float128)
if reinterpret(UInt128,x)==0
0.0
elseif reinterpret(UInt128,x)==(UInt128(1) << 127)
-0.0
else
ret=BigFloat(1)
hf=BigFloat(0.5)
sig = bits(significand_uint64(x))
for k=1:length(sig)
if sig[k]=='1'
ret+=hf
end
hf/=2
end
sign(x)*2^BigFloat(exponent(x))*ret
end
end
function show(io::IO,x::Float128)
str=string(BigFloat(x))
print(io,str[1:20])
k=search(str,'e')
if k > 0
print(io,str[k:end])
end
end
Out[18]:
Exercise 5
(a) Complete the function Float128(x::Float64)
that converts a Float64
to a Float128
(b) Check that
Float128(2.5)
returns 2.500000000000000000
and
Float128(1.3)
returns 1.300000000000000044
(c) Explain why 2.5
was converted exactly but 1.3
was not.
In [26]:
function Float128(x::Float64)
if x === 0.0
reinterpret(Float128,UInt128(0))
elseif x === -0.0
reinterpret(Float128,UInt128(1) << 127)
else
#TODO: return a Float128 that equals the input
end
end
Out[26]:
In [25]:
Float128(2.5),Float128(1.3)
Out[25]:
Exercise 6
(a) Complete the following definition Float64(x::Float128)
that converts a Float128
to a Float64
by chopping
(b) Check that
Float64(Float128(1.3))
returns 1.3
In [28]:
function Float64(x::Float128)
y=reinterpret(UInt128,x)
if y == 0
0.0
elseif y == (UInt128(1) << 127)
-0.0
else
#TODO: return a Float64 that is the input x with the extra bits chopped off
end
end
Out[28]:
Exercise 7
In the remaining exercises, we can assume all Float128
s are positive normal numbers.
(a) Complete the following addition function for two Float128
s, by reducing the problem to addition with UInt128
s. Round the computation towards zero.
(b) Check that
Float128(1.25)+Float128(2.5)
returns 3.750000000000000000
In [30]:
function +(x::Float128,y::Float128)
qx=exponent(x)
qy=exponent(y)
if qx ≥ qy
qx=exponent(x)
qy=exponent(y)
newexp = UInt64(qx+S)
two64 = UInt128(1) << 64
sigx = UInt128(significand_uint64(x)) + two64
sigy = UInt128(significand_uint64(y)) + two64
# TODO: add sigx and sigy and return the resulting Float128.
# Remember: sigy needs to be shifted to account for different exponents.
else
y+x
end
end
Out[30]:
Exercise 8
(a) Complete the following multiplication of two Float128
s. Again, assume that x
and y
are positive normal numbers. Hint: (1+sigx)*(1+sigy) = 1 + sigx + sigy + sigx*sigy
(b) Check that Float128(1.25)*Float128(2.5)
returns 3.125000000000000000
.
In [61]:
function *(x::Float128,y::Float128)
# TODO: Implement multiplication by reducing to multiplication with UInt128
end
Out[61]:
Exercise 9
The following function computes division for Float128
.
(a) Add comments explaining how the following functions work.
(b) Confirm that
Float128(1.3)/Float128(0.4)
returns 3.249999999999999930
.
In [38]:
function inv(x::Float128)
qx=exponent(x)
sig = UInt128(significand_uint64(x))
if sig == 0
reinterpret(Float128,UInt128(S-qx)<<64)
else
newexp=UInt128(S-qx-1)<<64
two64=UInt128(1) << 64
sig = sig + two64
two127=UInt128(1) << 127
reinterpret(Float128,newexp+UInt128((div(two127,sig) << 2) % UInt64))
end
end
function /(x::Float128,y::Float128)
x*inv(y)
end
Out[38]:
Exercise 10
(a) Write a routine that calculates
$$\sum_{k=0}^{28} {1 \over k!}$$using Float128
arithmetic, by using the /
and +
functions defined above.
(b) Compare the number of accurate digits to the higher accuracy approximation of e
:
println(BigFloat(e))
In [ ]: