ESE - Numerical Methods I: Basics of Computer Arithmetic

Floating point (FP) arithmetic $\neq$ arithmetic in ${\rm I\!R}$

The motivation for this session can be summed up by the following problem:


In [1]:
1.2-1. == 10.2-10.


Out[1]:
False

or, equivalently, the same comparison expressed in a more provocative manner:


In [2]:
1.2-1. != 10.2-10.


Out[2]:
True

Is that a bug in Python?

That would be a lot of people using a buggy language. Let's try this in Matlab:

Bummer - Matlab also returns 0 (remember the convention 0 == False and 1 == True). Apparently this has nothing to do with whether we use Python, Matlab, C++, etc. Let's try some reverse engineering to get closer to the root of the problem:


In [3]:
0.2-0.


Out[3]:
0.2

In [4]:
1.2-1.


Out[4]:
0.19999999999999996

In [5]:
10.2-10.


Out[5]:
0.1999999999999993

In [6]:
100.2-100.


Out[6]:
0.20000000000000284

In [7]:
1000.2-1000.


Out[7]:
0.20000000000004547

In [8]:
10000.2-10000.


Out[8]:
0.2000000000007276

The problem seems even trickier than we had thought. First, the answer is never exactly $0.2$ (except for the trivial case $0.2-0.$, which involves hardly any computation at all), and the deviation grows with the magnitude of the numbers involved. We must conclude that numerical calculations on binary systems such as computers contain errors that do not exist in calculus in ${\rm I\!R}$. Before we explore the nature and source of these errors, let's quantify them in terms of a 'correct' solution:

First, the absolute error:

\begin{equation} \epsilon = |\tilde{x}-x| \end{equation}

where $\tilde{x}$ is the computed solution and $x$ is the exact solution.

For the above problem, let's see how the absolute error scales with the magnitude of the numerical values involved.


In [9]:
%pylab inline
import numpy as np
import matplotlib.pyplot as plt


Populating the interactive namespace from numpy and matplotlib

In [10]:
# remember that in numpy, unless otherwise specified by the user, numbers are treated as floats
# it is good practice though to indicate what numbers you consider as floats by using the decimal point
ls = np.logspace(1.,25,25)
x = 0.2
# operators act elementwise on array
x_tilde = (ls+x)-ls
absolute_error = np.abs(x_tilde-x)

In [11]:
plt.plot(ls, absolute_error)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('$O(x)$')
plt.ylabel(r'$O(\epsilon)$')
plt.title('Absolute Error')
plt.show()


We can see that the absolute error, that is the absolute difference between $\tilde{x}$ and $x = 0.2$, increases with the magnitude of the numbers that we perform the arithmetic operation on. Above a certain magnitude, the error appears constant ... we'll come back to this.

Second, the relative error:

\begin{equation} \epsilon = |\frac{\tilde{x}-x}{x}| \end{equation}

Let's see if this provides more insight:


In [12]:
relative_error = np.abs((x_tilde-x)/x)

In [13]:
plt.plot(ls, relative_error)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('$O(x)$')
plt.ylabel(r'$O(\epsilon)$')
plt.title('Relative Error')
plt.ylim(None, 1.0E1)
plt.show()


To sum up our observations: the error in our little subtraction increases with the magnitude of the number in front of the decimal point, up to the point where it becomes constant, i.e., exactly as large as the subtracted value. In other words, our subtraction of $0.2$ has no effect for numbers beyond a certain magnitude. More on that later...

A pressing question by now should be: how can we do reliable computations if any operation on floating point numbers introduces errors? Particularly since we know that so many safety-critical applications rely on computer calculations. The answer can be summarized by:


In [14]:
1.2-1. == 1.2-1.


Out[14]:
True

This, simply put, means that even though errors are part of computations, these errors are not random (unlike in experimental measurement apparatuses). They are systematic, reproducible and well understood. The answer is also not that the errors are small; whether an error is small enough is very much a matter of context.
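We can even look at the stored values directly. A minimal sketch, using Python's decimal module only to print the value that actually sits behind the literal 0.1:

    from decimal import Decimal

    # the literal 0.1 is rounded once, to the nearest representable binary fraction,
    # and that rounded value is what every computation then reproducibly works with
    print(Decimal(0.1))       # 0.1000000000000000055511151231257827021181583404541015625
    print(0.1 + 0.2 == 0.3)   # False: both operands already carry this rounding error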

The two errors that significantly contribute to deviations from arithmetic in ${\rm I\!R}$ are due to mechanisms of:

  • rounding
  • truncation

In brief, errors result from the finite representation of numbers on binary systems and the rules for operating on them, which brings us to...

Binary number representation

In 1986, Link faced a problem: no matter his achievements, he could never accumulate more than 255 rupees in his wallet. This was not a consequence of reigning communism - he faced the same problem as Psy in the year 2014: the finite representation of numbers on binary systems.

Integer numbers

Before we learn how computer systems represent, i.e., store, numbers, let's look at a systematic way of representing numeric values, here the natural number $123$, using our base 10 numeral system:

$1\times10^2 + 2\times10^1 + 3\times10^0 = 123$

This can be arranged in columns, say, for an arbitrary choice of four columns:

$10^3$ $10^2$ $10^1$ $10^0$
$0$ $1$ $2$ $3$

so that, again, the expression yields:

$0\times10^3 +1\times10^2 + 2\times10^1 + 3\times10^0 = 123$

Since each factor ranges from 0 to 9 (10 values), we can represent all natural numbers in $[0,9999]$ this way. If we allocated more cells, say five:

$10^4$ $10^3$ $10^2$ $10^1$ $10^0$
$0$ $0$ $1$ $2$ $3$

it would be $[0,99999]$, etc...
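The positional scheme generalizes to any base. A minimal sketch, with a hypothetical helper positional_value introduced only for illustration:

    def positional_value(digits, base):
        """Interpret a list of digits, most significant first, in the given base."""
        value = 0
        for d in digits:
            value = value * base + d
        return value

    print(positional_value([0, 1, 2, 3], 10))  # 123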

Conventional computer systems, however, store data in terms of low voltage or high voltage, or 0 and 1, false and true, off and on. So the obvious base for a binary number system is two:

$2^7$ $2^6$ $2^5$ $2^4$ $2^3$ $2^2$ $2^1$ $2^0$
$0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$

Following the same rationale, we get:


In [15]:
0*2**7 + 1*2**6 + 1*2**5 + 1*2**4 + 1*2**3 + 0*2**2 + 1*2**1 + 1*2**0


Out[15]:
123

This concept is used by the operating system to allocate a fixed width of memory, here 8 bits, and to interpret its values as multipliers for decreasing powers of two, bitwise, from left to right, to compute a number from $0/1$ values. It immediately follows that the range of representable numbers is a function of the number of bits allocated to represent the integer value. Common bit chunk sizes and their limits are found here.
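These limits can also be queried programmatically, e.g. with NumPy's iinfo (a short sketch):

    import numpy as np

    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        print(dtype.__name__, info.min, info.max)
    # int8 spans [-128, 127], int16 spans [-32768, 32767], and so on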

This system has been extended to negative numbers, allowing for simple addition/subtraction operations - the two's complement, standard for integer representation and operations on modern computer systems. The system is simple, really:

$123$ is, as derived above, given as:

$0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$

but now the leftmost bit has a different meaning: it is the most significant bit, used as the sign bit. A $0$ indicates that the representation that follows is of a positive number. To get $-123$ we first invert all bits

$1$ $0$ $0$ $0$ $0$ $1$ $0$ $0$

and then add $1$ to get

$1$ $0$ $0$ $0$ $0$ $1$ $0$ $1$

the two's complement representation of $-123$, with the sign bit of $1$ indicating a negative number. Addition is performed very simply in this system. For $123+123$, we first need to allocate more memory to avoid overflow, since the result is not within $[-128,127]$. So, using 16 bits:

$0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$

The addition of int_1 + int_2 = result, where int_1 = $123$ and int_2 = $123$, can then be represented as

carry $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$
int_1 $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$
int_2 $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$
result $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$ $0$

where result = 246; check for yourself (we start at the first non-zero bit from the left):


In [16]:
1*2**7 + 1*2**6 + 1*2**5 + 1*2**4 + 0*2**3 + 1*2**2 + 1*2**1 + 0*2**0


Out[16]:
246

Subtraction is simply implemented by adding the negative value, so $123-123$ becomes $123+(-123)$:

carry $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$
int_1 $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $1$ $1$ $1$ $1$ $0$ $1$ $1$
int_2 $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $1$ $0$ $0$ $0$ $0$ $1$ $0$ $1$
result $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$ $0$

which yields $0$, as expected. This is all we will look at here in terms of arithmetic operations on binary systems. The take-home message of this section is that integers are represented on computers in two's complement form (a base-2 numeral system) and are limited in the range of representable numbers by the dedicated memory.
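The bit patterns above can be checked with NumPy's binary_repr, which returns the two's complement representation for a given bit width (a short sketch):

    import numpy as np

    print(np.binary_repr(123, width=8))    # '01111011'
    print(np.binary_repr(-123, width=8))   # '10000101'
    print(np.binary_repr(246, width=16))   # '0000000011110110'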

Floating point number representation

Like integers, floating point numbers are limited in the range of numbers that can be represented, a function of the allocated memory. But they also face another limitation: one of precision.

To illustrate the concept of a floating point, think about two applications:

  • a conventional hydrocarbon reservoir model on the km scale:

  • a pore model on the micron scale:

Numerical values describing the model geometry and the processes therein vary drastically in magnitude. Most likely, however, we are not concerned with mm-scale errors on the field scale, and we do not expect any values larger than mm on the pore scale. This motivates a numeral system that represents numbers with a relative precision, i.e., a floating point - as opposed to a fixed point. You are familiar with one system of this kind, namely scientific notation, a decimal (base 10) floating point notation. For example

$123000$ = 1.23E5 = 1.23$\times10^5$

$0.00123$ = 1.23E-3 = 1.23$\times10^{-3}$

A floating point representation has two components, a fraction or significand (here $1.23$) and an exponent (here $5$ and $-3$), combined with a base, here $10$ in the decimal system. It follows

$value = fraction \times base^{exponent}$

The standard bit representation of floating point numbers, here for single precision (32 bit), adheres to this concept:

The base, again, is $2$. The numeric value of the bit pattern follows as

\begin{equation} value = (-1)^{sign} \left( 1 + \sum_{i=1}^{23} b_{23-i}2^{-i} \right) \times 2^{(exponent-127)} \end{equation}

similar to

$value = sign \times fraction \times base^{exponent}$

where the $fraction$ is also referred to as the $mantissa$. The offset of $127$ in the exponent, also referred to as the exponent bias, exists mostly to allow for the representation of special values and of zero, but we won't go into the details here. The special values are

$NaN$ ... Not a Number

and

$-/+ \infty$

These values invoke non-standard arithmetic operations and significantly slow down computations compared to conventional binary numbers. Think about why it could make sense to initialize variables to NaN rather than to zero! Look up the order of magnitude that results in $\infty$.
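A minimal sketch, in plain Python, of how these special values arise and behave:

    inf = 1e308 * 10          # overflow in a float multiplication yields +inf, no exception
    print(inf)                # inf
    print(inf - inf)          # nan: the result of inf - inf is undefined
    nan = float('nan')
    print(nan == nan)         # False: NaN compares unequal even to itself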

The floating point format allows a wide range of numbers to be represented with relative precision. Both range and precision are obviously a function of the allocated memory, i.e., 16 bit (half-), 32 bit (single-) or 64 bit (double-precision). Specifically, the finiteness of the fraction limits the precision, and the finiteness of the exponent limits the range.
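The corresponding ranges and precisions can be queried with NumPy's finfo (a short sketch):

    import numpy as np

    for dtype in (np.float16, np.float32, np.float64):
        info = np.finfo(dtype)
        print(dtype.__name__, info.eps, info.max)
    # e.g. float64: eps ~ 2.22e-16, max ~ 1.80e+308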

Let's think about this, recalling our initial assessment of the relative error in the calculation

$1.2 - 1., 10.2-10., 100.2-100...$


In [17]:
plt.plot(ls, relative_error)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('$O(x)$')
plt.ylabel(r'$O(\epsilon)$')
plt.title('Relative Error')
plt.ylim(None, 1.0E1)
plt.show()


We recognize that our error is expressed relative to the result $0.2$. We see that for numbers of $10^{16}$ and greater the relative error is $1$, meaning the subtraction has no effect, i.e.,

$ 10^{17} - 0.2 = 10^{17}$

Let's express the error relative to the magnitude of the involved numbers.


In [18]:
relative_error_2 = np.abs((x_tilde-x)/ls)

In [19]:
plt.plot(ls, relative_error_2)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('$O(x)$')
plt.ylabel(r'$O(\epsilon)$')
plt.title('Error Relative to Magnitude')
plt.ylim(None, 1.0E1)
plt.annotate(r'$\epsilon_{machine}$', xy=(1.1E1, 1.1E-16), xycoords='data', xytext=(1E6, 1E-10), textcoords='data', 
            arrowprops=dict(arrowstyle="->", connectionstyle="arc3"), fontsize=20 )
plt.show()


Expressed this way, the relative error is constant up to $10^{16}$, on the order of $10^{-16}$.

The explanation for these observations is the following: floating point numbers are represented to a relative precision, denoted by

$\epsilon_{machine} = 2.220446049250313 \times 10^{-16}$ for 64-bit floating point numbers

It represents the smallest relative gap between representable numbers, so that, for example, all numbers in the interval

$[1,1+\epsilon_{machine}]$

are rounded to one of the two endpoints.
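NumPy exposes the absolute gap to the next representable number directly, via np.spacing (a small sketch):

    import numpy as np

    print(np.spacing(1.0))    # ~2.22e-16, i.e. epsilon_machine
    print(np.spacing(1e17))   # 16.0: the actual gap at this magnitude; the estimate
                              # 1e17 * epsilon_machine below slightly overestimates it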

We can now see why the absolute error of $(10^{17}+0.2) - 10^{17}$ is exactly $0.2$: a value of $0.2$ cannot be resolved at this magnitude, since the absolute gap between representable numbers is


In [20]:
1E17*2.220446049250313E-16


Out[20]:
22.20446049250313

This gap, in absolute terms, is much larger than $0.2$, the value we're trying to subtract.


In [21]:
1e17 - 0.2


Out[21]:
1e+17

This is also referred to as catastrophic cancellation: a devastating loss of precision when small numbers are computed from large numbers, which are themselves subject to round-off error. So is


In [22]:
1e17 - 2


Out[22]:
1e+17

but not


In [23]:
1e17 - 20


Out[23]:
9.999999999999998e+16

since here the subtracted value is large enough relative to the gap between representable numbers that the result rounds down to the next representable number below $10^{17}$, approximately $10^{17}-10^{17}\times\epsilon_{machine}$.

To summarize, computers use a floating point number representation, where range and precision are limited by the allocated memory (bits). The precision is relative and given by $\epsilon_{machine}$. A nice illustration of the absolute spacing between representable floating point numbers is shown below

for a reduced system with only three bits for the fraction and exponent. The numbers on the x-axis are powers of two, and the red lines indicate the scaling of gaps between representable numbers with increasing magnitude. The spacing is the absolute value of $\epsilon_{machine}$ in that region.
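A small sketch that enumerates the positive numbers of such a toy format; the 3-bit fraction follows the description above, while the exponent range 0..3 is an assumption made only for this illustration:

    # toy floating point format: value = (1 + f/8) * 2**e, with 3 fraction bits
    values = []
    for e in range(0, 4):        # assumed exponent range
        for f in range(8):       # 3 fraction bits -> 8 fraction patterns
            values.append((1 + f / 8.0) * 2 ** e)
    print(values)
    # the gap between neighbours doubles each time the exponent increases by one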

Hexadecimal (base 16) numeral system

Another numeral system encountered in computer engineering is the hexadecimal, short hex, system. In very simple terms, it extends the decimal digits

$0, 1, 2, 3, 4, 5, 6, 7, 8, 9$ by $A, B, C, D, E, F$ (case insensitive), to yield

$0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F$ (16 values)

so that the eight-digit (32-bit) hex number $0001E078$ can be represented in exponent format as (starting with the leftmost non-zero digit)

$(1 × 16^4) + (E × 16^3) + (0 × 16^2) + (7 × 16^1) + (8 × 16^0)$

equal to (using decimal counterparts)

$(1 × 16^4) + (14 × 16^3) + (0 × 16^2) + (7 × 16^1) + (8 × 16^0)$

to yield


In [24]:
1*16**4 + 14*16**3 + 0*16**2 + 7*16**1 + 8*16**0


Out[24]:
123000
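Python provides these conversions out of the box (a short sketch):

    print(int('0001E078', 16))   # 123000
    print(hex(123000))           # '0x1e078'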

Floating point (FP) arithmetic ≠ arithmetic in ${\rm I\!R}$, but...

We revisit the starting point of this module:


In [25]:
1.2-1. == 10.2-10.


Out[25]:
False

But we now know that the two results should agree to within some $\epsilon_{machine}$. So the comparison should not check whether two numbers are bitwise equal (which is what the '==' operator does), but whether they are close enough to account for floating point errors. Thus, the equivalence of numbers in FP is checked within a given tolerance. Fortunately, we are not the first ones to come across this (and chances are that for most problems you face in the near future, you aren't either). So there is a function out there for a more appropriate comparison:


In [26]:
np.allclose(1.2-1., 10.2-10., rtol=1e-016, atol=1e-15)


Out[26]:
True

This method checks whether $|a-b| \leq atol + rtol \times |b|$, with $a=1.2-1.$ and $b=10.2-10.$.
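Spelled out by hand, the same criterion looks like this (a minimal sketch):

    a = 1.2 - 1.
    b = 10.2 - 10.
    rtol, atol = 1e-16, 1e-15
    print(abs(a - b) <= atol + rtol * abs(b))   # True, matching np.allclose above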

Problems

Download the notebook by using the button in the nbviewer bar on the top left.

  1. The FP calculation $v^2$ results in $\infty$ for what $O(v)$ in double-precision?

  2. After how many iterations do you expect

    i = 0.
    while i != 1.:
     i = i+0.1

    to stop?

  3. Does the order of floating point operations matter? Check for example

    1e-16 + 1 - 1e-16 == 1e-16 - 1e-16 + 1
  4. Write a function that compares two numbers a, b to some relative tolerance rtol!

    def is_equal(a, b, rtol):
     pass # here your code
  5. What is the FP result of $\sqrt{1e-16 + 1} - 1$?

  6. With respect to FP arithmetic, what would be an advantage of using dimensionless numbers?

  7. You store a surface map in the form of a two-dimensional array with 1E6 x 1E6 height measurement points. How much memory do you need for half-, single- and double-precision? If the measurement accuracy is in the order of 20 cm, and minimum/maximum height measurements are in the order of 100 meters, what precision do you recommend?

The FP calculation $v^2$ results in $\infty$ for what $O(v)$ in double-precision?


In [34]:
v = 1.0E154
v**2 # close to the largest representable number


Out[34]:
1e+308

In [35]:
v = 1.0E155
v**2 # overflow


---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-35-6b63f347adbf> in <module>()
      1 v = 1.0E155
----> 2 v**2

OverflowError: (34, 'Result too large')

The largest representable number in double precision is of O(1E308), so that v = O(1E154).
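The exact limit can be queried from the standard library (a one-line sketch):

    import sys
    print(sys.float_info.max)   # 1.7976931348623157e+308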

After how many iterations do you expect

i = 0.
while i != 1.:
    i = i+0.1

to stop?

Never - as has been shown, floating point operations are only accurate to within some precision. Accumulating 0.1 (which has no exact binary representation) never produces exactly 1., so the condition i != 1. never becomes False and the loop never terminates.
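A bounded sketch makes the accumulated error visible without running the endless loop:

    i = 0.
    for _ in range(10):
        i = i + 0.1
    print(i)          # 0.9999999999999999, not 1.0
    print(i == 1.)    # False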

Does the order of floating point operations matter? Check for example


In [36]:
1e-16 + 1 - 1e-16 == 1e-16 - 1e-16 + 1


Out[36]:
False

Yes, order matters!
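Looking at the intermediate results shows where the two orderings diverge (a short sketch):

    print(1e-16 + 1)             # 1.0: 1e-16 is below epsilon_machine/2 and gets absorbed
    print((1e-16 + 1) - 1e-16)   # 0.9999999999999999
    print(1e-16 - 1e-16)         # 0.0
    print((1e-16 - 1e-16) + 1)   # 1.0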

Write a function that compares two numbers a, b to some relative tolerance rtol!


In [47]:
def is_equal(a, b, rtol):
    return abs(a - b) <= rtol * max(abs(a), abs(b))  # use magnitudes so it also works for negative numbers

In [48]:
is_equal(1.2-1., 10.2-10., 1.0E-14)


Out[48]:
True

What is the FP result of $\sqrt{1e-16 + 1} - 1$?


In [49]:
np.sqrt(1.0E-16 + 1.) - 1.


Out[49]:
0.0
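The reason is that the sum is already rounded before the square root is taken; a one-line check:

    print(1.0E-16 + 1. == 1.)   # True: 1e-16 is absorbed, so the expression reduces to sqrt(1.) - 1. = 0.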

With respect to FP arithmetic, what would be an advantage of using dimensionless numbers?

Dimensionless numbers are mostly $O(1)$, so floating point operations on numbers of vastly different orders of magnitude can be avoided. The limited range and precision of single precision is also less of a concern.

You store a surface map in the form of a two-dimensional array with 1E6 x 1E6 height measurement points. How much memory do you need for half-, single- and double-precision? If the measurement accuracy is in the order of 20 cm, and minimum/maximum height measurements are in the order of 100 meters, what precision do you recommend?

Memory:


In [62]:
#       numbers       bits    bits/byte
bytes = 1E6**2    *    64   /     8
kbytes =  bytes/1024
mbytes = kbytes/1024
gbytes = mbytes/1024
tbytes = gbytes/1024

In [63]:
print bytes
print kbytes
print mbytes
print gbytes
print tbytes


8e+12
7812500000.0
7629394.53125
7450.58059692
7.27595761418

Necessary precision:


In [65]:
epsilon_half_16 = 4.88e-04
epsilon_single_32 = 5.96e-08
epsilon_double_64 = 1.11e-16
order = 100.
print order*epsilon_half_16
print order*epsilon_single_32
print order*epsilon_double_64


0.0488
5.96e-06
1.11e-14

So half precision suffices to represent the numbers with an absolute precision down to about 5 [cm], which is finer than our absolute measurement accuracy of 20 [cm]. This way, we use


In [66]:
#       numbers       bits    bits/byte
bytes = 1E6**2    *    16   /     8
kbytes =  bytes/1024
mbytes = kbytes/1024
gbytes = mbytes/1024
tbytes = gbytes/1024

In [67]:
print bytes
print kbytes
print mbytes
print gbytes
print tbytes


2e+12
1953125000.0
1907348.63281
1862.64514923
1.81898940355

and use only a fourth of the memory.

