In [1]:
printred(x)=print("\x1b[31m"*x*"\x1b[0m")
printgreen(x)=print("\x1b[32m"*x*"\x1b[0m")
printblue(x)=print("\x1b[34m"*x*"\x1b[0m")
function printbits(x::Float16)
bts=bits(x)
printred(bts[1:1])
printgreen(bts[2:2+5-1]) # Float16 has 5 exponent bits
printblue(bts[2+5:end])
end
function printbits(x::Float32)
bts=bits(x)
printred(bts[1:1])
printgreen(bts[2:2+8-1])
printblue(bts[2+8:end])
end
function printbits(x::Float64)
bts=bits(x)
printred(bts[1:1])
printgreen(bts[2:2+11-1])
printblue(bts[2+11:end])
end
Out[1]:
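In Julia 1.0 and later, `bits` has been renamed `bitstring`. As a minimal sketch, assuming a recent Julia version, the three methods above can be collapsed into one helper that splits the bit string into its sign, exponent, and significand fields. The names `exponent_bits` and `splitbits` are introduced here for illustration only and are not part of Julia:

```julia
# Number of exponent bits for each IEEE floating point type
exponent_bits(::Type{Float16}) = 5
exponent_bits(::Type{Float32}) = 8
exponent_bits(::Type{Float64}) = 11

# Split the bit string of x into (sign, exponent, significand) substrings
function splitbits(x::T) where {T<:Union{Float16,Float32,Float64}}
    bts = bitstring(x)
    e = exponent_bits(T)
    return bts[1:1], bts[2:1+e], bts[2+e:end]
end

println(splitbits(1.0))
println(splitbits(Float32(1.0)))
println(splitbits(Float16(1.0)))
```

The three substrings correspond to the red, green, and blue parts printed by printbits.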
Float64 is a type representing real numbers using 64 bits, also known as double precision. We can create a Float64 by including a decimal point when writing the number: 1.0 is a Float64, while 1 is an Int64 or Int32, depending on the machine. We use printbits to see the bits of a few Float64 numbers.
First, let's check an integer-valued float. The format is very different from that of Int64/Int32:
In [2]:
printbits(1.0)
Even though 1.3 is representable with only two base-10 digits, it requires an infinite number of base-2 digits, which are cut off:
In [3]:
printbits(1.3)
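One way to see the effect of the cut-off is to convert the stored Float64 to a BigFloat, which reproduces its value exactly, and compare it with 1.3 parsed at BigFloat precision. This is a sketch using only `big` and the `big"..."` string literal:

```julia
exact  = big(1.3)    # the value actually stored in the Float64, converted exactly
target = big"1.3"    # 1.3 parsed directly at the (much higher) BigFloat precision

println(exact)
println(exact == target)         # false: the stored number is not exactly 1.3
println(big(1.0) == big"1.0")    # true: 1.0 has a finite base-2 expansion
```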
Float32 is another type representing real numbers, this time using 32 bits, and is known as single precision. Float64 is now the default format for scientific computing (on the Floating Point Unit, FPU). Float32 is the default format for graphics (on the Graphics Processing Unit, GPU), as the difference between 32 bits and 64 bits is indistinguishable to the eye.
In [4]:
printbits(Float32(1.3))
We will now explain the interpretation of this format.
In lectures we worked out the base-2 expansion of 1/3: $$ {1 \over 3} = (0.0101010101010\ldots)_2 $$ This representation is simply code for the infinite sum $$ 0 + {0 \over 2} + {1 \over 2^2} + {0 \over 2^3} + {1 \over 2^4} + {0 \over 2^5} + {1 \over 2^6} + \cdots $$ We can check this on a computer; however, we can only do a finite number of computations in practice:
In [5]:
1/2^2+1/2^4+1/2^6+1/2^8+1/2^10 # approximates 1/3
Out[5]:
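As a sketch of how the truncated sum behaves, the helper `partial_third` (a name introduced here for illustration) sums the first $n$ nonzero terms. Since this is a geometric series, the error after $n$ terms is $4^{-n}/3$ in exact arithmetic:

```julia
# n-term partial sum 1/2^2 + 1/2^4 + ... + 1/2^(2n)
partial_third(n) = sum(1 / 2^(2k) for k in 1:n)

for n in (5, 10, 20)
    s = partial_third(n)
    println(n, " terms: ", s, "   error: ", 1/3 - s)
end
```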
Floats are stored in the format $$ x=\pm 2^{q-S} \times (1.b_1b_2b_3\ldots b_P)_2 $$ where $S$ and $P$ are fixed constants that depend on the type, $q$ is an unsigned integer stored with a fixed number of bits, and $b_1b_2\ldots b_P$ are binary digits, stored as $P$ bits.
In the case of Float64, $S=1023$, $P=52$, and $q$ is stored with 11 bits.
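The format can be made concrete by pulling the three fields out of a Float64 with bit operations and then rebuilding the number from them. This is a sketch for normal numbers only; the names `decode` and `rebuild` are introduced here for illustration:

```julia
# Extract (sign, q, significand bits) from a Float64
function decode(x::Float64)
    u = reinterpret(UInt64, x)
    s = Int(u >> 63)                  # 1 sign bit
    q = Int((u >> 52) & 0x7ff)        # 11 exponent bits, as an unsigned integer
    f = u & 0x000fffffffffffff        # 52 significand bits b_1 ... b_P
    return s, q, f
end

# Rebuild ±2^(q-S) * (1.b_1 ... b_P)_2 with S = 1023, P = 52 (normal numbers only)
function rebuild(s, q, f)
    v = ldexp(1 + f / 2.0^52, q - 1023)
    return s == 1 ? -v : v
end

println(decode(1.0))                      # sign 0, q = 1023, all b_k zero
println(rebuild(decode(1.3)...) == 1.3)   # true
```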
Let's do an example:
In [23]:
printbits(100+1/3)
The red bit tells us that the number is positive. The green bits tell us $q$:
In [106]:
q=parse(Int,"10000000101",2)
Out[106]:
This tells us that the exponent is $1029-1023=6$, which we can check using the exponent command:
In [107]:
exponent(100+1/3)
Out[107]:
The remaining blue bits tell us the significand. Therefore $$ 100+1/3 = 2^6 \times \left(1+{1 \over 2} +{1 \over 2^4}+{1 \over 2^8} + {1\over 2^{10}} + {1 \over 2^{12}} + \cdots\right) $$ Let's check if that works out:
In [112]:
2^6*(1+1/2+1/2^4+1/2^8+1/2^10+1/2^12+1/2^14+1/2^16)
Out[112]:
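The same decomposition can be read off without summing bits by hand. As a sketch, `exponent` returns $q-S$ and `significand` returns the factor $(1.b_1b_2\ldots b_P)_2$, so multiplying them back together recovers the number exactly:

```julia
x = 100 + 1/3
e = exponent(x)        # unbiased exponent q - S
m = significand(x)     # value in [1, 2), i.e. (1.b_1 b_2 ... b_P)_2

println((e, m))
println(ldexp(m, e) == x)   # true: scaling by a power of two is exact here
```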
In [113]:
bits(0.0)
Out[113]:
The smallest positive normal number has $q=1$ and all $b_k$ zero. For a given floating point type, it can be found using realmin:
In [114]:
mn=realmin(Float64)
Out[114]:
In [115]:
2.0^(1-1023)
Out[115]:
In [34]:
printbits(mn)
If we divide by two, we get a subnormal number:
In [35]:
printbits(mn/2)
In [36]:
printbits(mn/4)
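Subnormal numbers have $q=0$ and, under the IEEE 754 convention, drop the implicit leading 1: they are read as $\pm 2^{1-S} \times (0.b_1b_2\ldots b_P)_2$. The following sketch checks this interpretation for half of the smallest normal Float64, building that number from its bit pattern so that the code does not depend on the Julia version:

```julia
mn  = reinterpret(Float64, 0x0010000000000000)  # q = 1, all b_k = 0, i.e. 2.0^(1-1023)
sub = mn / 2

u = reinterpret(UInt64, sub)
q = Int((u >> 52) & 0x7ff)           # exponent field
f = u & 0x000fffffffffffff           # significand field

value = ldexp(f / 2.0^52, 1 - 1023)  # 2^(1-S) * (0.b_1 ... b_P)_2, no implicit 1

println(q)                   # 0
println(value == sub)        # true
println(issubnormal(sub))    # true
println(issubnormal(mn))     # false
```

The smallest positive subnormal has only $b_{52}=1$, which gives $2^{-1022-52} = 2^{-1074}$.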
We have both $0.0$ and $-0.0$:
In [44]:
printbits(0.0)
In [43]:
printbits(-0.0)
In [119]:
1.0/0.0
Out[119]:
In [120]:
printbits(Inf)
Another special value is NaN, which stands for "not a number". For example, 0/0 is not defined, so it returns NaN:
In [122]:
0/0
Out[122]:
NaN is stored with $q=(11111111111)_2$ and at least one of the $b_k =1$:
In [123]:
printbits(NaN)
What happens if we change some other $b_k$ to be nonzero?
In [126]:
i=parse(UInt64,
"1111111111110000000000000000000010000001000000000010000000000000",2)
reinterpret(Float64,i)
Out[126]:
Thus, there is more than one NaN on a computer. How many are there?
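One way to count them for Float64, sketched under the description above: the exponent bits are fixed at all ones, the sign bit is free, and the 52 significand bits can take any pattern except all zeros (which encodes $\pm$Inf).

```julia
# 2 sign choices times (2^52 - 1) nonzero significand patterns
nan_count = 2 * (Int64(2)^52 - 1)
println(nan_count)   # 9007199254740990

# Exponent all ones with a single nonzero significand bit is a NaN ...
println(isnan(reinterpret(Float64, 0x7ff0000000000001)))   # true
# ... while exponent all ones with a zero significand is Inf
println(isnan(reinterpret(Float64, 0x7ff0000000000000)))   # false
```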
Arithmetic works differently on Inf and NaN:
In [127]:
Inf*0 # NaN
Inf+5 # Inf
(-1)*Inf # -Inf
1/Inf # 0
1/(-Inf) # -0
Inf-Inf # NaN
Inf==Inf # true
Inf==-Inf # false
Out[127]:
In [80]:
NaN*0 # NaN
NaN+5 # NaN
1/NaN # NaN
NaN==NaN # false
NaN!=NaN # true
Out[80]:
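Because NaN==NaN is false, a comparison with `==` cannot be used to detect a NaN. A short sketch of the alternatives, using `isnan` and `isequal`, together with the matching behaviour of the two signed zeros:

```julia
x = 0 / 0
println(x == x)               # false
println(isnan(x))             # true
println(isequal(NaN, NaN))    # true: isequal treats NaNs as equal

println(0.0 == -0.0)          # true
println(isequal(0.0, -0.0))   # false: isequal distinguishes the signed zeros
println(1 / -0.0)             # -Inf
```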
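Let's figure out the format for Float32. The exponent command returns $q-S$, and realmin has $q=1$, so $S$ equals 1 minus the exponent of realmin. We first confirm that this recovers $S=1023$ for Float64, then apply it to Float32: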
In [128]:
S_64=1-exponent(realmin(Float64))
Out[128]:
In [129]:
S_32=1-exponent(realmin(Float32))
Out[129]:
In [86]:
printbits(Float32(1.0))
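The Float32 fields can be extracted in the same way as for Float64. This sketch assumes the standard single precision layout of 1 sign bit, 8 exponent bits, and 23 significand bits; the name `decode32` is introduced here for illustration:

```julia
# Extract (sign, q, significand bits) from a Float32
function decode32(x::Float32)
    u = reinterpret(UInt32, x)
    s = Int(u >> 31)               # 1 sign bit
    q = Int((u >> 23) & 0xff)      # 8 exponent bits
    f = u & 0x007fffff             # 23 significand bits
    return s, q, f
end

println(decode32(Float32(1.0)))    # q equals S because 1.0 = 2^0 * 1.0

# Smallest normal Float32 (q = 1, all b_k = 0), built from its bit pattern
mn32 = reinterpret(Float32, 0x00800000)
println(1 - exponent(mn32))        # 127
```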
In [90]:
x=1.3
printbits(1.3) # 64 bits
In [91]:
printbits(Float32(1.3)) # 32 bits
Let's compare the difference in the significands. We can get the bits of the significand as follows:
In [130]:
x=1.3
str=bits(1.3)
bts64=str[13:end] # let's get the bits for the significand. Indexing with
# `end` takes all the characters of the string from position 13
# up to the last one
Out[130]:
In [131]:
x=1.3
str=bits(Float32(1.3))
bts32=str[10:end] # let's get the bits for the significand. Indexing with
# `end` takes all the characters of the string from position 10
# up to the last one
Out[131]:
In [99]:
bts64
Out[99]:
In [100]:
bts32
Out[100]:
We see from the fact that the last digit is zero that the rounding strategy is either round down, round towards zero, or round to nearest.
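A second example helps separate these candidates. As a sketch: if converting a positive number ever produces a larger Float32, then the strategy cannot be round down or round towards zero. The helper `err` is introduced here for illustration, and 1.3 in the code means the Float64 value being converted:

```julia
x32 = Float32(1.3)
println(x32 < 1.3)             # true: this conversion rounded down
println(Float32(0.1) > 0.1)    # true: this conversion rounded up

# With round to nearest, neither neighbouring Float32 is closer to the input
err(y) = abs(Float64(y) - 1.3)
println(err(x32) <= err(prevfloat(x32)) && err(x32) <= err(nextfloat(x32)))   # true
```

Both observations are consistent with round to nearest, which is the default rounding mode.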