5. Basic Numerical Analysis

5.1. NumPy

5.1.1. The standard Python library for linear algebra is numpy. In this section, we cover just enough of the numpy API to implement a few algorithms in the following sections.

5.1.2. The basic object in numpy is ndarray, an $n$-dimensional generalization of Python's list.


In [1]:
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([[1, 2, 3, 4]])
c = np.array([[1], [2], [3], [4]])
d = np.array([[1, 2], [3, 4]])

print(a)
print('shape of a: {}'.format(a.shape))
print()

print(b)
print('shape of b: {}'.format(b.shape))
print()

print(c)
print('shape of c: {}'.format(c.shape))
print()

print(d)
print('shape of d: {}'.format(d.shape))


[1 2 3 4]
shape of a: (4,)

[[1 2 3 4]]
shape of b: (1, 4)

[[1]
 [2]
 [3]
 [4]]
shape of c: (4, 1)

[[1 2]
 [3 4]]
shape of d: (2, 2)

The shape of an ndarray gives us its dimensions. a is a one-dimensional array of length 4; b is a 1-by-4 matrix, or a row vector; c is a 4-by-1 matrix, or a column vector; d is a 2-by-2 matrix.

5.1.3. Entering a column vector requires many brackets, which can be tedious. We can instead use transpose.


In [2]:
print(b)
print('shape of b: {}'.format(b.shape))
print()

print(b.transpose())
print('shape of b.transpose(): {}'.format(b.transpose().shape))


[[1 2 3 4]]
shape of b: (1, 4)

[[1]
 [2]
 [3]
 [4]]
shape of b.transpose(): (4, 1)

Similarly, a matrix can be entered by first typing out its entries in a single row and then putting the array through reshape:


In [3]:
print(b)
print('shape of b: {}'.format(b.shape))
print()

print(b.reshape((2,2)))
print('shape of b.reshape((2,2)): {}'.format(b.reshape((2,2)).shape))
print()

print(b.reshape((4,1)))
print('shape of b.reshape((4,1)): {}'.format(b.reshape((4,1)).shape))


[[1 2 3 4]]
shape of b: (1, 4)

[[1 2]
 [3 4]]
shape of b.reshape((2,2)): (2, 2)

[[1]
 [2]
 [3]
 [4]]
shape of b.reshape((4,1)): (4, 1)

5.1.4. There are a number of pre-built functions for special kinds of vectors and matrices.


In [4]:
print(np.arange(5))
print()

print(np.arange(2, 8))
print()

print(np.arange(2, 15, 3))
print()

print(np.eye(1))
print()

print(np.eye(2))
print()

print(np.eye(3))
print()

print(np.zeros(1))
print()

print(np.zeros(2))
print()

print(np.zeros(3))
print()


[0 1 2 3 4]

[2 3 4 5 6 7]

[ 2  5  8 11 14]

[[1.]]

[[1. 0.]
 [0. 1.]]

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

[0.]

[0. 0.]

[0. 0. 0.]

5.1.5. Matrix multiplications are performed via dot:


In [5]:
x = np.array([[2, 3], [5, 7]])
y = np.array([[1, -1], [-1, 1]])

print(np.dot(x,y))
print()
print(x.dot(y))


[[-1  1]
 [-2  2]]

[[-1  1]
 [-2  2]]

If you are running Python 3.5 or higher, the binary operator @ may be used to denote matrix multiplication.


In [6]:
import sys

# compare version tuples, so that a hypothetical Python 4.0 also passes
if sys.version_info >= (3, 5):
    print(x @ y)
else:
    print('unsupported operation')


[[-1  1]
 [-2  2]]

5.1.6. ndarray supports coordinatewise operations. Observe, in particular, that * does not result in matrix multiplication.


In [7]:
print(np.array([1, 2, 3]) + np.array([3,4,5]))
print()

print(np.array([[4,3],[2,1]]) - np.array([[1,1],[1,1]]))
print()

print(np.array([[1, 2, 3], [4, 5, 6]]) * np.array([[1, 2, 1], [3, -1, -1]]))
print()

print(np.array([[1], [3]]) / np.array([[2], [2]]))


[4 6 8]

[[3 2]
 [1 0]]

[[ 1  4  3]
 [12 -5 -6]]

[[0.5]
 [1.5]]

Operations on arrays of mismatching dimensions are sometimes still legitimate, thanks to broadcasting. For example, scalar multiplication works fine:


In [8]:
3 * np.array([[2,4,3], [1,2,5], [-1, -1, -1]])


Out[8]:
array([[ 6, 12,  9],
       [ 3,  6, 15],
       [-3, -3, -3]])

Row-wise operations and column-wise operations are also possible:


In [9]:
x = np.array([5, -1, 3])
y = np.arange(9).reshape((3, 3))

print(y)
print()

print(x.reshape((3, 1)) + y)
print()

print(x.reshape((1, 3)) + y)


[[0 1 2]
 [3 4 5]
 [6 7 8]]

[[ 5  6  7]
 [ 2  3  4]
 [ 9 10 11]]

[[ 5  0  5]
 [ 8  3  8]
 [11  6 11]]
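Broadcasting is not unconditional: when no dimension-matching rule applies, the operation fails. A quick sketch of such a mismatch:


In [ ]:
x = np.array([5, -1, 3])          # shape (3,)
y = np.arange(8).reshape((2, 4))  # shape (2, 4)

try:
    x + y
except ValueError as err:
    print(err)  # operands could not be broadcast together with shapes (3,) (2,4)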

5.2. Floating-Point Arithmetic

5.2.1. The real and complex number systems that we use in mathematics unfortunately cannot be represented exactly on a computer. Instead, a computer works with a finite approximation known as the floating-point number system.

The floating-point number system in base $\beta$ with precision $p$, minimum exponent $e_\min$, and maximum exponent $e_\max$ is the discrete subset $\mathscr{F} = \mathscr{F}(\beta,p,e_\min,e_\max)$ consisting of 0, $\infty$, $-\infty$, NaN, and numbers of the form

$$\pm(d_0 + d_1 / \beta + \cdots + d_{p-1} / \beta^{p-1}) \beta^e,$$

where

  • $d_0,\ldots,d_{p-1},e$ are integers,
  • $1 \leq d_0 < \beta$,
  • $0 \leq d_i < \beta$ for all $1 \leq i \leq p-1$, and
  • $e_\min \leq e \leq e_\max$.

Given a floating-point number of the above form, we call the number $d_0 + d_1/\beta + \cdots + d_{p-1}/\beta^{p-1}$ the significand, or the mantissa. The number $e$ is called the exponent. The sign $\pm$ is called, naturally, the sign.

The condition $d_0 \geq 1$ makes the floating-point representation of a number unique. For this reason, a floating-point number with $d_0 \geq 1$ is called normalized.
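To get a feel for how small such a system is, we can enumerate one. The following sketch, with illustrative parameter values of our own choosing, lists the positive normalized numbers of the toy system $\mathscr{F}(2, 3, -1, 1)$:


In [ ]:
import itertools

beta, p, e_min, e_max = 2, 3, -1, 1  # a toy floating-point system

values = set()
for e in range(e_min, e_max + 1):
    for digits in itertools.product(range(beta), repeat=p):
        if digits[0] < 1:  # normalization: 1 <= d_0 < beta
            continue
        # significand d_0 + d_1/beta + ... + d_{p-1}/beta^(p-1)
        significand = sum(d / beta**i for i, d in enumerate(digits))
        values.add(significand * beta**e)

print(sorted(values))  # 12 numbers, from 0.5 up to 3.5

Observe how the representable numbers are evenly spaced within each exponent range, with the spacing doubling each time the exponent increases.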

5.2.2. How do we compute the floating-point representation $\operatorname{fl}(x)$ of a real number $x$?

It makes sense to define $\operatorname{fl}(x)$ as the floating-point number nearest to $x$. To approximate $\operatorname{fl}(x)$, we observe that $x$ admits a unique infinite-sum representation

$$x = \pm (d_0 + d_1/\beta + \cdots + d_{p-1}/\beta^{p-1} + d_p /\beta^p + \cdots) \beta^e,$$

provided that $d_0 \geq 1$ and $d_i \neq \beta - 1$ for infinitely many $i$. The infinite-sum representation of $x$ shows that $\operatorname{fl}(x)$ must lie between

$$\pm (d_0 + d_1/\beta + \cdots + d_{p-1}/\beta^{p-1}) \beta^e - \beta^{-(p-1)}\beta^e$$

and

$$\pm (d_0 + d_1/\beta + \cdots + d_{p-1}/\beta^{p-1}) \beta^e + \beta^{-(p-1)}\beta^e.$$

It follows that

$$\vert\operatorname{fl}(x) - x\vert \leq 2\beta^{-(p-1)}\beta^e.$$

To obtain an error bound independent of the choice of $x$, we normalize the difference:

$$\delta_x = \frac{\vert \operatorname{fl}(x) - x\vert}{\vert x \vert}.$$

Since $d_0$ in the infinite-series representation of $x$ must at least be 1, it follows that $\vert x \vert \geq \beta^e$. We now see that

$$\delta_x \leq \frac{2 \beta^{-(p-1)}\beta^e}{\beta^e} = 2 \beta^{-(p-1)}$$

regardless of the choice of $x$.

Since there exists at least one bound independent of the choice of $x$, we can find the smallest number $\epsilon_{\beta, p}$ such that

$$\delta_x \leq \epsilon_{\beta, p}$$

for every real number $x$. The bound $\epsilon_{\beta, p}$ is called the machine epsilon with respect to base $\beta$ and precision $p$.

Instead of the optimal bound, many references settle for the bound

$$\delta_x \leq \frac{1}{2} \beta^{-(p-1)},$$

which is still four times better than the crude bound we obtained above. It is common to see this quantity referred to as the machine epsilon.
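We can observe the bound concretely for a number whose binary expansion is infinite, such as $1/10$. The following sketch uses exact rational arithmetic; note that Fraction(0.1) recovers the exact value of the double nearest to $1/10$:


In [ ]:
from fractions import Fraction

x = Fraction(1, 10)   # the exact real number 1/10
fl_x = Fraction(0.1)  # the exact value of the double nearest to 1/10

delta = abs(fl_x - x) / x
print(float(delta))   # ~5.55e-17
print(2.0 ** -53)     # the bound (1/2) * beta^-(p-1) for beta = 2, p = 53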

5.2.3. We typically take the base $\beta$ to be 2. For notational convenience, let us write

$$(a_k \ldots a_1.b_1 \ldots b_m)_2$$

to denote the sum

$$\sum_{i=1}^k a_i 2^{i-1} + \sum_{j=1}^m b_j 2^{-j}.$$

A typical binary representation of a floating-point number is

$$\underbrace{s}_{\text{sign}} \mid \underbrace{e_1 \cdots e_{q}}_{\text{exponent}} \mid \underbrace{d_1 \cdots d_{p-1}}_{\text{significand}},$$

which codifies

$$\pm(1.d_1 \ldots d_{p-1})_2 \times 2^{(e_1 \ldots e_q)_2 - (2^{q-1} - 1)}.$$

Why is $d_0$ not included? Since $\beta = 2$, normalization implies that $d_0$ always equals 1. There is no need to record it.

The number of bits available for the exponent determines $e_\min$ and $e_\max$. Note that the exponent is stored in the biased form shown above, rather than in two's complement. If $e_1 = \cdots = e_q = 1$, then the above binary representation codifies $\infty$ if all significand bits are zero, and NaN otherwise.

5.2.4. The IEEE Standard for Floating-Point Arithmetic (IEEE 754) stipulates how a floating-point number system should be implemented in order to guarantee consistent computational results across machines.

The single format uses 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand, totalling 32 bits. The double format uses 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand, totalling 64 bits.
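To see the layout of §5.2.3 concretely, we can reinterpret the bits of a double with the standard struct module. This is a minimal sketch; the helper double_bits is our own illustrative name, not a library function.


In [ ]:
import struct

def double_bits(x):
    # reinterpret the 8 bytes of a double as a 64-bit unsigned integer
    (n,) = struct.unpack('>Q', struct.pack('>d', x))
    bits = format(n, '064b')
    return bits[0], bits[1:12], bits[12:]  # sign | exponent | significand

# -0.15625 = -(1.01)_2 * 2^-3, so the biased exponent is -3 + 1023 = 1020
sign, exponent, significand = double_bits(-0.15625)
print(sign)         # 1
print(exponent)     # 01111111100, i.e., 1020
print(significand)  # 01 followed by 50 zeros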

The floating-point representation of real numbers follows rounding modes specified in the standard. The most popular rounding mode is, of course, rounding to the nearest floating-point number (§5.2.2). The machine epsilon for the double format is approximately $2^{-52} \approx 2.2 \times 10^{-16}$. We can check this with a simple loop that halves $e$ until $1 + e/2$ rounds back to $1$:


In [10]:
e = 1.0
while (1.0 + 0.5 * e) != 1.0:
    e = 0.5 * e
print(e)


2.220446049250313e-16
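NumPy records these format parameters in np.finfo, which lets us cross-check the single and double formats directly:


In [ ]:
import numpy as np

print(np.finfo(np.float32).eps)  # 1.1920929e-07, i.e., 2^-23
print(np.finfo(np.float64).eps)  # 2.220446049250313e-16, i.e., 2^-52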

Another important aspect of IEEE 754 is its precision requirements on floating-point arithmetic operations. Specifically, the standard stipulates that

$$x \circledast y = \operatorname{fl}(x \star y)$$

whenever $x$ and $y$ are floating-point numbers, where $\star$ is addition, subtraction, multiplication, or division, and $\circledast$ denotes the floating-point implementation of $\star$. In other words, each operation must return the exact result, correctly rounded. From this stipulation follows the so-called fundamental axiom of floating-point arithmetic: every operation of floating-point arithmetic is exact up to a relative error of at most the machine epsilon.
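A familiar consequence: each individual operation below is correctly rounded, yet the rounded inputs already differ from the decimal literals they approximate:


In [ ]:
# 0.1 + 0.2 is the correctly rounded sum of the doubles nearest to
# 0.1 and 0.2, which turns out not to be the double nearest to 0.3
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# still, the relative error is on the order of the machine epsilon
print(abs((0.1 + 0.2) - 0.3) / 0.3 <= 2.0 ** -52)  # True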

5.2.5. The limitations of floating-point arithmetic imply that we cannot expect a computer program to produce an exact answer to every problem. To quantify this, we let $X$ and $Y$ be vector spaces and define a mathematical problem to be a function $f:X \to Y$ that maps each data point in $X$ to a solution in $Y$. The floating-point analogue of $f$, computed with floating-point operations, is denoted by $\tilde{f}$.

Recall that the normalized error

$$\delta_x = \frac{\vert \operatorname{fl}(x) - x\vert}{\vert x \vert}$$

for approximating a real number with a floating-point number is bounded by the machine epsilon $\epsilon$ (§5.2.2). Since $\tilde{f}$ can be thought of as a finite process of such approximations, we might hope for

$$\frac{\|\tilde{f}(x) - f(x)\|}{\|f(x)\|} = O(\epsilon)$$

for all $x \in X$, where $\|\cdot\|$ is an appropriate norm defined on $Y$.

This is nearly always too strict, so we settle for

$$\frac{\|\tilde{f}(x) - f(\tilde{x})\|}{\|f(\tilde{x})\|} = O(\epsilon)$$

for at least one $\tilde{x}$, depending on the choice of $x$, such that

$$\frac{\|\tilde{x} - x\|}{\|x\|} = O(\epsilon).$$

An algorithm that satisfies this condition is called stable.

Because we allow for rounding errors in the input data, it is possible that, for all $x \in X$, we have the equality

$$\tilde{f}(x) = f(\tilde{x})$$

for at least one $\tilde{x}$ such that

$$\frac{\|\tilde{x} - x\|}{\|x\|} = O(\epsilon).$$

Such an algorithm is called backward stable.
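
As an illustration of these notions (our own example, not part of the preceding definitions), consider computing $1 - \cos x$ for small $x$. Subtracting the two nearby numbers directly cancels nearly all significant digits, whereas the mathematically equivalent formula $2\sin^2(x/2)$ avoids the cancellation:


In [ ]:
import numpy as np

x = 1e-8  # the true value of 1 - cos(x) is approximately 5e-17

print(1.0 - np.cos(x))         # 0.0: catastrophic cancellation wipes out every digit
print(2.0 * np.sin(x / 2)**2)  # ~5e-17, accurate to machine precision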