This is a Jupyter notebook using Python. You can install Jupyter locally to edit and interact with this notebook.
Numerical analysis is the study of algorithms for the problems of continuous mathematics. -- L. N. Trefethen
This course is concerned with the approximation of functions $u(x)$ for $x$ in some domain $\Omega \subset \mathbb R^d$ (where the dimension $d \le 4$ for most problems that we will consider). We will demand that $u$ satisfy some equations in its interior and on the boundary $\partial \Omega$ of our domain. We will usually be solving well-posed problems, for which there is a unique exact solution $u_*(x)$. While such exact solutions are rarely available in applications, we can manufacture exact solutions to test our numerical schemes.
There is typically no known finite representation of the exact solution $u_*$ so we will use discrete approximations $u_n$ where $n$ is some measure of the cost. Given an approximation $u_n(x)$, we need a measure of discretization error.
In [33]:
%matplotlib notebook
import numpy
from matplotlib import pyplot
pyplot.style.use('ggplot')
def u_n(n):
    x = numpy.linspace(0,1,n)
    return x, 1 + x + x**2/2 + x**3/6

for n in (40, 20, 10):
    x, y = u_n(n)
    pyplot.plot(x, y, 'o', label='$u_{%d}(x)$' % n)
pyplot.plot(x, numpy.exp(x), label=r'$\exp(x)$')
pyplot.legend(loc='upper left')
Out[33]:
In [2]:
pyplot.figure()
for n in (10, 20, 40, 80):
    x, y = u_n(n)
    pyplot.plot(n, numpy.linalg.norm(y - numpy.exp(x)), 'o')
The norm of a vector $y$ of length $n$ is $$ \lVert y \rVert = \sqrt{y^T y} = \sqrt{\sum_{i=1}^n |y_i|^2} .$$
Clearly this depends on the number of sample points $n$. How can we measure norms of functions?
$$ \lVert u(x) \rVert_{L^2(\Omega)} = \sqrt{\langle u, u \rangle} = \sqrt{\int_{\Omega} u^2} .$$

A norm on a vector space $V$ is a functional $\lVert \cdot \rVert : V \to \mathbb R$ satisfying all of the following:
- $\lVert x + y \rVert \le \lVert x \rVert + \lVert y \rVert$ (triangle inequality)
- $\lVert \alpha x \rVert = |\alpha| \, \lVert x \rVert$ for all scalars $\alpha$ (homogeneity)
- $\lVert x \rVert \ge 0$, with equality only for $x = 0$ (positive definiteness)

If the last condition is not met, we call it a seminorm.
Can this norm be computed exactly for arbitrary functions?
If we have a discrete function $u_n(x_i)|_{i=1}^n$, what would we need to compute $$ \lVert u_n - u_* \rVert_{L^2(\Omega)} $$ accurately?
Discuss examples of catastrophic failure.
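One possible answer to the question above, sketched here: use a quadrature rule. The cell below is a minimal sketch (the names uh and w are introduced just for illustration) that approximates the $L^2((0,1))$ norm of the error in the cubic Taylor approximation of $\exp(x)$ with trapezoid weights; the next cell uses a simpler uniform scaling.

In [ ]:
import numpy

# Approximate || u_n - u_* ||_{L^2((0,1))} by trapezoid-rule quadrature on the grid.
n = 40
x = numpy.linspace(0, 1, n)
uh = 1 + x + x**2/2 + x**3/6          # cubic Taylor approximation of exp(x)
w = numpy.full(n, 1/(n-1))            # trapezoid weights: h at interior points ...
w[[0, -1]] *= 0.5                     # ... and h/2 at the two endpoints
print(numpy.sqrt(numpy.sum(w * (uh - numpy.exp(x))**2)))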
In [3]:
pyplot.figure()
for n in (10, 20, 40, 80, 160, 320):
    x, y = u_n(n)
    pyplot.semilogx(n, numpy.linalg.norm(y - numpy.exp(x))/numpy.sqrt(n), 'o')
pyplot.title('Scaled like trapezoid rule integration of $L^2((0,1))$')
Out[3]:
We will usually seek methods that are convergent in a norm,
$$ \lim_{n\to \infty} \lVert u_n - u_* \rVert = 0. $$

For scientific and engineering purposes, it may be sufficient for the method to be convergent in a seminorm; for example, it may be enough that the lift and drag coefficients of an airfoil converge.
We will be interested in the rate of convergence as $n$ is increased. For an approximation problem (PDE or otherwise) in $d$ dimensions, $\Omega \subset \mathbb R^d$, we will say that a method converges at order $p$ in the norm $ \lVert \cdot \rVert $ if
$$ \lVert u_n - u_*\rVert \le C n^{-p/d} $$

for some constant $C$ that is independent of resolution. This bound is often expressed in terms of a nominal grid spacing

$$ h = n^{-1/d} , $$

in which case we have

$$ \lVert u_h - u_* \rVert \le C h^p .$$
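As a sanity check on such rates, one can estimate the observed order $p$ by fitting $\log \lVert u_h - u_* \rVert$ against $\log h$. The sketch below uses piecewise-linear interpolation of $\exp(x)$ on $(0,1)$ as a hypothetical test case (the helper estimate_order and the test itself are illustrative, not part of the course code); we expect $p \approx 2$.

In [ ]:
import numpy

def estimate_order(hs, errs):
    """Fit log(err) ~ p*log(h) + log(C) by least squares and return the slope p."""
    p, logC = numpy.polyfit(numpy.log(hs), numpy.log(errs), 1)
    return p

hs, errs = [], []
for n in (10, 20, 40, 80, 160):
    xc = numpy.linspace(0, 1, n)               # interpolation nodes
    xf = numpy.linspace(0, 1, 10*n)            # fine grid on which to measure the error
    uh = numpy.interp(xf, xc, numpy.exp(xc))   # piecewise-linear interpolant of exp
    hs.append(1/(n-1))
    errs.append(numpy.max(numpy.abs(uh - numpy.exp(xf))))
print(estimate_order(hs, errs))                # should be close to 2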
Consider the boundary value problem

$$ \begin{gather} -\frac{d^2 u}{dx^2} = f(x) \quad x \in \Omega = (-1,1) \\ u(-1) = a \quad \frac{du}{dx}(1) = b . \end{gather} $$

Here $f(x)$ is the "forcing" term, and we have a Dirichlet boundary condition at the left end of the domain and a Neumann condition at the right end. We need to choose how to represent the approximate solution and how to enforce the equation and the boundary conditions.
Finite difference, finite volume, and finite element methods provide related but distinct frameworks for making these choices.
You have all seen basic linear algebra before, but this will summarize some different ways of thinking about the fundamental operations.
Linear algebra is the study of linear transformations on vectors, which represent points in a finite dimensional space. The matrix-vector product $y = A x$ is a linear combination of the columns of $A$. The familiar definition,
$$ y_i = \sum_j A_{i,j} x_j $$

can also be viewed as

$$ y = \Bigg[ A_{:,0} \Bigg| A_{:,1} \Bigg| \dotsm \Bigg] \begin{bmatrix} x_0 \\ x_1 \\ \vdots \end{bmatrix} = \Bigg[ A_{:,0} \Bigg] x_0 + \Bigg[ A_{:,1} \Bigg] x_1 + \dotsb . $$

The notation $A_{i,j}$ corresponds to the Python syntax A[i,j], and the colon : means the entire range (row or column). So $A_{:,j}$ is the $j$th column and $A_{i,:}$ is the $i$th row; the corresponding Python syntax is A[:,j] and A[i,:].
In [4]:
def matmult1(A, x):
    """Entries of y are dot products of rows of A with x"""
    y = numpy.zeros_like(A[:,0])
    for i in range(len(A)):
        row = A[i,:]
        for j in range(len(row)):
            y[i] += row[j] * x[j]
    return y
A = numpy.array([[1,2],[3,5],[7,11]])
x = numpy.array([10,20])
matmult1(A, x)
Out[4]:
In [5]:
def matmult2(A, x):
    """Same idea, but more compactly"""
    y = numpy.zeros_like(A[:,0])
    for i,row in enumerate(A):
        y[i] = row.dot(x)
    return y
matmult2(A, x)
Out[5]:
In [6]:
def matmult3(A, x):
    """y is a linear expansion of the columns of A"""
    y = numpy.zeros_like(A[:,0])
    for j,col in enumerate(A.T):
        y += col * x[j]
    return y
matmult3(A, x)
Out[6]:
In [7]:
# We will use this version
A.dot(x)
Out[7]:
$\DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\null}{null} $ If $A \in \mathbb{R}^{m\times m}$, which of these doesn't belong?
When we write $x = A^{-1} y$, we mean that $x$ is the unique vector such that $A x = y$. (It is rare that we explicitly compute a matrix $A^{-1}$, though it's not as "bad" as people may have told you.) A vector $y$ is equivalent to $\sum_i e_i y_i$ where $e_i$ are columns of the identity. Meanwhile, $x = A^{-1} y$ means that we are expressing that same vector $y$ in the basis of the columns of $A$, i.e., $\sum_i A_{:,i} x_i$.
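A small numerical check of this interpretation (the matrix and vector below are arbitrary choices for illustration):

In [ ]:
import numpy

A = numpy.array([[2., 1.], [1., 3.]])
y = numpy.array([1., 2.])
x = numpy.linalg.solve(A, y)        # coefficients of y in the basis of A's columns
print(x)
print(A[:,0]*x[0] + A[:,1]*x[1])    # recombining the columns of A recovers y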
In [8]:
B = numpy.array([[2, 3],[0, 4]])
print(B)
print(B.dot(B.T), B.T.dot(B))
Binv = numpy.linalg.inv(B)
Binv.dot(B), B.dot(Binv)
Out[8]:
The inner product $$ x^T y = \sum_i x_i y_i $$ of vectors (or columns of a matrix) tells us about their magnitudes and the angle between them. The norm is induced by the inner product, $$ \lVert x \rVert = \sqrt{x^T x} $$ and the angle $\theta$ is defined by $$ \cos \theta = \frac{x^T y}{\lVert x \rVert \, \lVert y \rVert} . $$ Inner products are bilinear, which means that they satisfy some convenient algebraic properties $$ \begin{split} (x + y)^T z &= x^T z + y^T z \\ x^T (y + z) &= x^T y + x^T z \\ (\alpha x)^T (\beta y) &= \alpha \beta x^T y \\ \end{split} . $$ The pairwise inner products between two sets of vectors can be expressed by collecting the sets as columns in matrices and writing $A = X^T Y$ where $A_{i,j} = x_i^T y_j$. It follows from this definition that $$ (X^T Y)^T = Y^T X .$$
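A quick numerical check of the last identity with random matrices (the sizes and seed are arbitrary):

In [ ]:
import numpy

numpy.random.seed(0)
X = numpy.random.randn(5, 2)                 # two vectors of length 5 as columns
Y = numpy.random.randn(5, 3)                 # three vectors of length 5 as columns
A = X.T.dot(Y)                               # A[i,j] = x_i^T y_j
print(numpy.linalg.norm(A.T - Y.T.dot(X)))   # should be exactly 0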
If $x^T y = 0$ then we say $x$ and $y$ are orthogonal (or "$x$ is orthogonal to $y$"). A vector is said to be normalized if $\lVert x \rVert = 1$. If $x$ is orthogonal to $y$ and $\lVert x \rVert = \lVert y \rVert = 1$ then we say $x$ and $y$ are orthonormal. A matrix with orthonormal columns is said to be an orthogonal matrix. We typically use $Q$ or $U$ and $V$ for matrices that are known/constructed to be orthogonal. Orthogonal matrices are always full rank -- the columns are linearly independent. The inverse of a square orthogonal matrix is its transpose: $$ Q^T Q = Q Q^T = I . $$ Orthogonal matrices are a powerful building block for robust numerical algorithms.
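For example, the $Q$ factor returned by numpy.linalg.qr (used later in this notebook) has orthonormal columns; a quick check on a square random matrix (an arbitrary choice):

In [ ]:
import numpy

numpy.random.seed(1)
Q, _ = numpy.linalg.qr(numpy.random.randn(4, 4))     # a random square orthogonal matrix
print(numpy.linalg.norm(Q.T.dot(Q) - numpy.eye(4)),
      numpy.linalg.norm(Q.dot(Q.T) - numpy.eye(4)))  # both near machine epsilon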
In [9]:
# Make some polynomials
x = numpy.linspace(-1,1)
A = numpy.vander(x, 4)
q0 = A.dot(numpy.array([0,0,0,.5])) # .5
q1 = A.dot(numpy.array([0,0,1,0])) # x
q2 = A.dot(numpy.array([0,1,0,0])) # x^2
pyplot.figure()
pyplot.plot(x, numpy.array([q0, q1, q2]).T)
Out[9]:
In [10]:
x
Out[10]:
In [11]:
# Inner products of even and odd functions
q0 = q0 / numpy.linalg.norm(q0)
q1.dot(q0), q2.dot(q0), q2.dot(q1)
Out[11]:
In [12]:
q0
Out[12]:
In [13]:
# What is the constant component of q2?
pyplot.figure()
pyplot.plot(x, q2.dot(q0)*q0)
Out[13]:
In [14]:
# Let's project that away so that q2 is orthogonal to q0
q2 = q2 - q2.dot(q0)*q0
Q = numpy.array([q0, q1, q2]).T
print(Q.T.dot(Q))
pyplot.figure()
pyplot.plot(x, Q)
Out[14]:
In [15]:
def gram_schmidt_naive(X):
    Q = numpy.zeros_like(X)
    R = numpy.zeros((len(X.T),len(X.T)))
    for i in range(len(Q.T)):
        v = X[:,i].copy()
        for j in range(i):
            r = v.dot(Q[:,j])
            R[j,i] = r
            v -= r * Q[:,j] # "modified Gram-Schmidt" - remove each component before next dot product
        R[i,i] = numpy.linalg.norm(v)
        Q[:,i] = v / R[i,i]
    return Q, R
x = numpy.linspace(-1,1,50)
k = 6
A = numpy.vander(x, k, increasing=True)
Q, R = gram_schmidt_naive(A)
print(numpy.linalg.norm(Q.T.dot(Q) - numpy.eye(k)))
print(numpy.linalg.norm(Q.dot(R)-A))
pyplot.figure()
pyplot.plot(x, Q)
A.shape
Out[15]:
In [16]:
Q, R = gram_schmidt_naive(numpy.vander(x, 4, increasing=True))
pyplot.figure()
pyplot.plot(x, Q)
Out[16]:
Consider a function $f: X \to Y$ and define the absolute condition number $$ \hat\kappa = \lim_{\delta \to 0} \max_{|\delta x| < \delta} \frac{|f(x + \delta x) - f(x)|}{|\delta x|} = \max_{\delta x} \frac{|\delta f|}{|\delta x|}. $$ If $f$ is differentiable, then $\hat\kappa = |f'(x)|$.
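A quick numerical illustration for a differentiable function (here $f(x) = \sqrt x$ at $x = 2$, an arbitrary choice): the ratio $|\delta f|/|\delta x|$ for a small perturbation approaches $|f'(x)|$.

In [ ]:
import numpy

f = numpy.sqrt
x, dx = 2.0, 1e-6
print(abs(f(x + dx) - f(x)) / dx)   # finite-difference estimate of the absolute condition number
print(1 / (2*numpy.sqrt(x)))        # |f'(x)| for f(x) = sqrt(x)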
Floating point arithmetic $x \circledast y := \text{float}(x * y)$ is exact within a relative accuracy $\epsilon_{\text{machine}}$. Formally, $$ x \circledast y = (x * y) (1 + \epsilon) $$ for some $|\epsilon| \le \epsilon_{\text{machine}}$.
In [17]:
eps = numpy.float32(1)
while numpy.float32(1) + eps > 1:
    eps /= numpy.float64(2)
eps_machine = 2*eps # We call this "machine epsilon"
print(eps_machine)
In [18]:
format((.2 - 1/3) + 2/15, '.20f')
Out[18]:
In [19]:
format(.1, '.20f')
Out[19]:
Given the relative nature of floating point arithmetic, it is more useful to discuss relative condition number, $$ \kappa = \max_{\delta x} \frac{|\delta f|/|f|}{|\delta x|/|x|} = \max_{\delta x} \Big[ \frac{|\delta f|/|\delta x|}{|f| / |x|} \Big] $$ or, if $f$ is differentiable, $$ \kappa = \max_{\delta x} |f'(x)| \frac{|x|}{|f|} . $$
How does a condition number get big?
The relative accuracy of the best-case algorithm will not be reliably better than $\epsilon_{\text{machine}}$ times the condition number. $$ \max_{\delta x} \frac{|\delta f|}{|f|} \ge \kappa \cdot \epsilon_{\text{machine}} $$
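As an illustration of how a condition number gets big: for $f(x) = \log x$, $\kappa = |f'(x)|\,|x|/|f(x)| = 1/|\log x|$ blows up as $x \to 1$ because the output approaches zero while the input does not. A quick check (the sample points are arbitrary), anticipating the log1p cell below:

In [ ]:
import numpy

# kappa(f; x) = |f'(x)| |x| / |f(x)| = 1/|log(x)| for f(x) = log(x)
for x in (2.0, 1.1, 1.0 + 1e-8):
    print(x, 1/abs(numpy.log(x)))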
In [20]:
numpy.log(1 + 1e-10) - numpy.log1p(1e-10)
Out[20]:
We use the notation $\tilde f(x)$ to mean a numerical algorithm for approximating $f(x)$. Additionally, $\tilde x = x (1 + \epsilon)$ is some "good" approximation of the exact input $x$.
"nearly the right answer to nearly the right question" $$ \frac{\lvert \tilde f(x) - f(\tilde x) \rvert}{| f(\tilde x) |} \in O(\epsilon_{\text{machine}}) $$ for some $\tilde x$ that is close to $x$
"exactly the right answer to nearly the right question" $$ \tilde f(x) = f(\tilde x) $$ for some $\tilde x$ that is close to $x$
Consider $f(x) = x + 1$ computed by the algorithm $$\tilde f(x) = \text{float}(x) \oplus 1 = [x(1+\epsilon_1) + 1](1 + \epsilon_2) = (x + 1 + x\epsilon_1)(1 + \epsilon_2) , $$ and express any $\tilde x = x(1 + \epsilon_3)$. To see if the algorithm is stable, we compute $$ \frac{\tilde f(x) - f(\tilde x)}{|f(\tilde x)|} = \frac{(x + 1 + x\epsilon_1)(1 + \epsilon_2) - [x(1+ \epsilon_3) + 1]}{\tilde x + 1} = \frac{(x + 1)\epsilon_2 + x(\epsilon_1 - \epsilon_3) + O(\epsilon^2)}{x + 1 + x\epsilon_3} . $$ If we can choose $\epsilon_3$ to make this small, then the method will be (forward) stable, and if we can make this expression exactly zero, then we'll have backward stability. Trying for the latter, we solve for $\epsilon_3$ by setting the numerator equal to zero, $$ \epsilon_3 = \frac{x + 1}{x}\epsilon_2 + \epsilon_1 + O(\epsilon^2)/x $$ which is small so long as $|x| \gg 0$, but the first term blows up as $x \to 0$. In other words, the fact that $\epsilon_2$ can produce a large error relative to the input causes this algorithm to not be backward stable. In contrast, this $x\to 0$ case is not a problem for forward stability because $\epsilon_3 = \epsilon_1$ yields error on the order of $\epsilon_2$.
Now consider the two-argument function $f(x,y) = x + y$ computed with the same floating point addition. We are interested in $$ \frac{\tilde f(x,y) - f(\tilde x,\tilde y)}{f(\tilde x,\tilde y)} $$ and we can vary both $\tilde x$ and $\tilde y$. Even in the case $y = 1$ from the example above, the ability to vary $\tilde y$ is powerful enough to ensure backward stability.
A backward stable algorithm for computing $f(x)$ has relative accuracy $$ \left\lvert \frac{\tilde f(x) - f(x)}{f(x)} \right\rvert \in O(\kappa(f) \epsilon_{\text{machine}}) . $$ This is a rewording of a statement made earlier -- backward stability is the best case.
We used x = numpy.linspace(-1,1), which uses $m=50$ points by default. The number 50 is arbitrary; as we use more points, our columns become better approximations of continuous functions and the vector inner product becomes an integral (up to scaling):
$$ \frac 2 m \sum_{i=1}^m p_i q_i \approx \int_{-1}^1 p(x) q(x) . $$
When we orthogonalize the monomials using this inner product, we get the Legendre Polynomials (up to scaling). These polynomials have important applications in physics and engineering, as well as playing an important role in approximation (which we will go into in more detail).
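A sketch of this claim, reusing gram_schmidt_naive from above and comparing against numpy.polynomial.legendre.legvander (the column normalization here is chosen just to make the two sets comparable):

In [ ]:
# Compare orthogonalized monomials with (rescaled) Legendre polynomials.
x = numpy.linspace(-1, 1, 50)
Q, R = gram_schmidt_naive(numpy.vander(x, 4, increasing=True))
P = numpy.polynomial.legendre.legvander(x, 3)   # P_0, ..., P_3 evaluated at x
P = P / numpy.linalg.norm(P, axis=0)            # normalize columns like Q's
pyplot.figure()
pyplot.plot(x, Q)                               # discrete orthogonalization
pyplot.plot(x, P, 'k--')                        # scaled Legendre polynomials (dashed)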
To solve $$ A x = b $$ we can compute $A = QR$ and then $$ x = R^{-1} Q^T b . $$
This also works for non-square systems ($m > n$), where it yields the least squares solution, as we will see below!
In [21]:
x1 = numpy.array([-0.9, 0.1, 0.5, 0.8]) # points where we know values
y = numpy.array([1, 2.4, -0.2, 1.3]) # values at those points
pyplot.figure()
pyplot.plot(x1, y, '*')
B = numpy.vander(x1, 4) # Vandermonde matrix at the known points
Q, R = gram_schmidt_naive(B)
p = numpy.linalg.solve(R, Q.T.dot(y)) # Compute the polynomial coefficients
print(p)
pyplot.plot(x, numpy.vander(x,4).dot(p)) # Plot the polynomial evaluated at all points
print('B =', B, '\np =', p)
In [22]:
m = 20
V = numpy.vander(numpy.linspace(-1,1,m), increasing=False)
Q, R = gram_schmidt_naive(V)
def qr_test(qr, V):
    Q, R = qr(V)
    m = len(Q.T)
    print(qr.__name__,
          numpy.linalg.norm(Q.dot(R) - V),
          numpy.linalg.norm(Q.T.dot(Q) - numpy.eye(m)))
qr_test(gram_schmidt_naive, V)
qr_test(numpy.linalg.qr, V)
In [23]:
def gram_schmidt_classical(X):
    Q = numpy.zeros_like(X)
    R = numpy.zeros((len(X.T),len(X.T)))
    for i in range(len(Q.T)):
        v = X[:,i].copy()
        R[:i,i] = Q[:,:i].T.dot(v)
        v -= Q[:,:i].dot(R[:i,i])
        R[i,i] = numpy.linalg.norm(v)
        Q[:,i] = v / R[i,i]
    return Q, R
qr_test(gram_schmidt_classical, V[:,:15])
# Q, R = numpy.linalg.qr(V)
#print(Q[:,0])
Classical Gram-Schmidt is highly parallel, but unstable, as evidenced by the lack of orthogonality in $Q$.
The implementations above have been "left-looking"; when working on column $i$, we compare it only to columns to the left (i.e., $j < i$). We can reorder the algorithm to look to the right by projecting $q_i$ out of all columns $j > i$. This algorithm is stable while being just as parallel as gram_schmidt_classical.
In [24]:
def gram_schmidt_modified(X):
    Q = X.copy()
    R = numpy.zeros((len(X.T), len(X.T)))
    for i in range(len(Q.T)):
        R[i,i] = numpy.linalg.norm(Q[:,i])
        Q[:,i] /= R[i,i]
        R[i,i+1:] = Q[:,i+1:].T.dot(Q[:,i])
        Q[:,i+1:] -= numpy.outer(Q[:,i], R[i,i+1:])
    return Q, R
qr_test(gram_schmidt_modified, V)
Gram-Schmidt methods perform triangular transformations to build an orthogonal matrix. As we have seen, $X = QR$ is satisfied accurately, but $Q$ may not be orthogonal when $X$ is ill-conditioned. Householder triangularization instead applies a sequence of orthogonal transformations to build a triangular matrix.
$$ \underbrace{Q_{n-1} \dotsb Q_0}_{Q^T} A = R $$

The structure of the algorithm is

$$ \underbrace{\begin{bmatrix} * & * & * \\ * & * & * \\ * & * & * \\ * & * & * \\ * & * & * \\ \end{bmatrix}}_{A} \to \underbrace{\begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \\ \end{bmatrix}}_{Q_0 A} \to \underbrace{\begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & * \\ 0 & 0 & * \\ \end{bmatrix}}_{Q_1 Q_0 A} \to \underbrace{\begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix}}_{Q_2 Q_1 Q_0 A} $$

where the elementary orthogonal matrices $Q_i$ are chosen to introduce zeros below the diagonal in the $i$th column of $R$. Each of these transformations will have the form $$Q_i = \begin{bmatrix} I_i & 0 \\ 0 & F \end{bmatrix}$$ where $F$ is a "reflection" that achieves $$ F x = \begin{bmatrix} \lVert x \rVert \\ 0 \\ 0 \\ \vdots \end{bmatrix} $$ where $x$ is the column of $R$ from the diagonal down. This transformation is a reflection across a plane with normal $v = Fx - x = \lVert x \rVert e_1 - x$.

The reflection, as depicted in Trefethen and Bau (1999), can be written $F = I - 2 \frac{v v^T}{v^T v}$.
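A quick numerical check of this formula (the vector $x$ below is an arbitrary example): the reflector should send $x$ to $\lVert x \rVert e_1$ and should itself be orthogonal.

In [ ]:
import numpy

x = numpy.array([3., 4., 0.])
v = numpy.linalg.norm(x) * numpy.eye(3)[:, 0] - x    # v = ||x|| e_1 - x
F = numpy.eye(3) - 2 * numpy.outer(v, v) / v.dot(v)  # F = I - 2 v v^T / (v^T v)
print(F.dot(x))                                      # approximately [5, 0, 0]
print(numpy.linalg.norm(F.dot(F.T) - numpy.eye(3)))  # F is orthogonal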
In [25]:
def householder_Q_times(V, x):
    """Apply orthogonal matrix represented as list of Householder reflectors"""
    y = x.copy()
    for i in reversed(range(len(V))):
        y[i:] -= 2 * V[i] * V[i].dot(y[i:])
    return y

def qr_householder1(A):
    "Compute QR factorization using naive Householder reflection"
    m, n = A.shape
    R = A.copy()
    V = []
    for i in range(n):
        x = R[i:,i]
        v = -x
        v[0] += numpy.linalg.norm(x)
        v = v/numpy.linalg.norm(v) # Normalized reflector plane
        R[i:,i:] -= 2 * numpy.outer(v, v.dot(R[i:,i:]))
        V.append(v) # Storing reflectors is equivalent to storing orthogonal matrix
    Q = numpy.eye(m, n)
    for i in range(n):
        Q[:,i] = householder_Q_times(V, Q[:,i])
    return Q, numpy.triu(R[:n,:])
qr_test(qr_householder1, numpy.array([[1.,2],[3,4],[5,6]]))
In [26]:
qr_test(qr_householder1, V)
qr_test(numpy.linalg.qr, V)
In [27]:
qr_test(qr_householder1, numpy.eye(1))
In [28]:
qr_test(qr_householder1, numpy.eye(3,2))
Inside qr_householder1, we have the lines

    x = R[i:,i]
    v = -x
    v[0] += numpy.linalg.norm(x)
    v = v/numpy.linalg.norm(v) # Normalized reflector plane
What happens when $$x = \begin{bmatrix}1 \\ 0 \end{bmatrix}$$ (i.e., the column of $R$ is already upper triangular)?
We are trying to define a reflector plane (via its normal vector) from the zero vector,
$$v = \lVert x \rVert e_0 - x .$$
When we try to normalize this vector, we divide zero by zero and the algorithm breaks down (nan). Maybe we just need to test for this special case and "skip ahead" when no reflection is needed? And if so, how would we define $Q$?
In [29]:
qr_test(qr_householder1, numpy.array([[1.,1], [2e-8,1]]))
print(qr_householder1(numpy.array([[1.,1], [2e-8,1]])))
The error $QR - A$ is still about $10^{-8}$ for this very well-conditioned matrix, so something else must be at play here.
In [30]:
def qr_householder2(A):
    "Compute QR factorization using Householder reflection"
    m, n = A.shape
    R = A.copy()
    V = []
    for i in range(n):
        v = R[i:,i].copy()
        v[0] += numpy.copysign(numpy.linalg.norm(v), v[0]) # Choose the further of the two reflections
        v = v/numpy.linalg.norm(v) # Normalized reflector plane
        R[i:,i:] -= 2 * numpy.outer(v, v.dot(R[i:,i:]))
        V.append(v) # Storing reflectors is equivalent to storing orthogonal matrix
    Q = numpy.eye(m, n)
    for i in range(n):
        Q[:,i] = householder_Q_times(V, Q[:,i])
    return Q, numpy.triu(R[:n,:])
qr_test(qr_householder2, numpy.eye(3,2))
qr_test(qr_householder2, numpy.array([[1.,1], [1e-8,1]]))
print(qr_householder2(numpy.array([[1.,1], [1e-8,1]])))
qr_test(qr_householder2, V)
We now have a usable implementation of Householder QR. There are some further concerns for factoring rank-deficient matrices. We will visit the concept of pivoting later, in the context of LU and Cholesky factorization.
We may have informally referred to a matrix as "ill-conditioned" when its columns are nearly linearly dependent, but let's make this concept more precise. Recall the definition of (relative) condition number from the Rootfinding notes,
$$ \kappa = \max_{\delta x} \frac{|\delta f|/|f|}{|\delta x|/|x|} . $$

We understood this definition for scalar problems, but it also makes sense when the inputs and/or outputs are vectors (or matrices, etc.) and absolute value is replaced by vector (or matrix) norms. Let's consider the case of matrix-vector multiplication, for which $f(x) = A x$.

$$ \kappa(A) = \max_{\delta x} \frac{\lVert A (x+\delta x) - A x \rVert/\lVert A x \rVert}{\lVert \delta x\rVert/\lVert x \rVert} = \max_{\delta x} \frac{\lVert A \delta x \rVert}{\lVert \delta x \rVert} \, \frac{\lVert x \rVert}{\lVert A x \rVert} = \lVert A \rVert \frac{\lVert x \rVert}{\lVert A x \rVert} . $$

There are two problems here:
- we have not yet defined the matrix norm $\lVert A \rVert$, and
- this expression depends on the vector $x$.
Vector norms are built into the linear space (and defined in terms of the inner product). Matrix norms are induced by vector norms, according to
$$ \lVert A \rVert = \max_{x \ne 0} \frac{\lVert A x \rVert}{\lVert x \rVert} . $$

Now we understand the formula for condition number, but it depends on $x$. Consider the matrix

$$ A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} . $$

Multiplication by this matrix is perfectly conditioned for $x = (1,0)^T$ but arbitrarily ill-conditioned for vectors close to $(0,1)^T$: the condition number of matrix-vector multiplication depends on the vector. The condition number of the matrix is the worst case (maximum) of the condition number over all vectors, i.e.,

$$ \kappa(A) = \max_{x \ne 0} \lVert A \rVert \frac{\lVert x \rVert}{\lVert A x \rVert} .$$

If $A$ is invertible, then we can rephrase this as

$$ \kappa(A) = \max_{x \ne 0} \lVert A \rVert \frac{\lVert A^{-1} (A x) \rVert}{\lVert A x \rVert} = \max_{A x \ne 0} \lVert A \rVert \frac{\lVert A^{-1} (A x) \rVert}{\lVert A x \rVert} = \lVert A \rVert \lVert A^{-1} \rVert . $$

Evidently multiplying by a matrix is just as ill-conditioned an operation as solving a linear system using that matrix.
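A numerical check of the last identity in the 2-norm (the Vandermonde matrix below is just a convenient, moderately ill-conditioned example):

In [ ]:
import numpy

A = numpy.vander(numpy.linspace(-1, 1, 6))
print(numpy.linalg.norm(A, 2) * numpy.linalg.norm(numpy.linalg.inv(A), 2))
print(numpy.linalg.cond(A, 2))     # should agree with the product above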
In [31]:
def R_solve(R, b):
    """Solve Rx = b using back substitution."""
    x = b.copy()
    m = len(b)
    for i in reversed(range(m)):
        x[i] -= R[i,i+1:].dot(x[i+1:])
        x[i] /= R[i,i]
    return x
x = numpy.linspace(-1,1,15)
A = numpy.vander(x, 4)
print(A.shape)
Q, R = numpy.linalg.qr(A)
b = Q.T.dot(A.dot(numpy.array([1,2,3,4])))
numpy.linalg.norm(R_solve(R, b) - numpy.linalg.solve(R, b))
R_solve(R, b)
Out[31]:
The dominant cost comes from the line

    R[i:,i:] -= 2 * numpy.outer(v, v.dot(R[i:,i:]))

where R[i:,i:] is an $(m-i)\times(n-i)$ matrix. This line performs $2(m-i)(n-i)$ operations in v.dot(R[i:,i:]), another $(m-i)(n-i)$ in the "outer" product, and $(m-i)(n-i)$ again in the subtraction. As written, the multiplication by 2 would be another $(m-i)(n-i)$ operations, but it is only $m-i$ operations if we rewrite as

    w = 2*v
    R[i:,i:] -= numpy.outer(w, v.dot(R[i:,i:]))
in which case the leading order cost is $4(m-i)(n-i)$. To compute the total cost, we need to sum over all columns $i$, $$\begin{split} \sum_{i=1}^n 4(m-i)(n-i) &= 4 \Big[ \sum_{i=1}^n (m-n)(n-i) + \sum_{i=1}^n (n-i)^2 \Big] \\ &= 4 (m-n) \sum_{i=1}^n i + 4 \sum_{i=1}^n i^2 \\ &\approx 2 (m-n) n^2 + 4 n^3/3 \\ &= 2 m n^2 - \frac 2 3 n^3 . \end{split}$$ Recall that Gram-Schmidt QR cost $2 m n^2$, so Householder costs about the same when $m \gg n$ and is markedly less expensive when $m \approx n$.
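The approximation in the last two steps of this sum can be checked numerically; a sketch for hypothetical sizes $m = 1000$, $n = 400$:

In [ ]:
# Compare the exact sum of 4(m-i)(n-i) with the estimate 2 m n^2 - (2/3) n^3.
m, n = 1000, 400
exact = sum(4*(m - i)*(n - i) for i in range(1, n + 1))
print(exact, 2*m*n**2 - 2*n**3/3)     # should agree to within a few percent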
A least squares problem takes the form: given an $m\times n$ matrix $A$ ($m \ge n$), find $x$ such that $$ \lVert Ax - b \rVert $$ is minimized. If $A$ is square and full rank, then this minimizer will satisfy $A x - b = 0$, but in general it will not, because $b$ may not be in the range of $A$. At the minimizer, the residual $A x - b$ must be orthogonal to the range of $A$.
In HW2, we showed that $QQ^T$ is an orthogonal projector onto the range of $Q$. If $QR = A$, $$ QQ^T (A x - b) = QQ^T(Q R x - b) = Q (Q^T Q) R x - QQ^T b = QR x - QQ^T b = A x - QQ^T b . $$ So if $b$ is in the range of $A$, we can solve $A x = b$. If not, we need only orthogonally project $b$ into the range of $A$.
Solve $R x = Q^T b$.
This method is stable and accurate.
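A minimal sketch of this procedure on a random overdetermined system, compared against numpy.linalg.lstsq (the sizes and data are arbitrary):

In [ ]:
import numpy

numpy.random.seed(2)
A = numpy.random.randn(10, 3)
b = numpy.random.randn(10)
Q, R = numpy.linalg.qr(A)                    # reduced QR: Q is 10x3, R is 3x3
x_qr = numpy.linalg.solve(R, Q.T.dot(b))     # solve R x = Q^T b
x_ls, *_ = numpy.linalg.lstsq(A, b, rcond=None)
print(numpy.linalg.norm(x_qr - x_ls))        # near machine epsilon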
The mathematically equivalent form $(A^T A) x = A^T b$ is called the normal equations. The solution process involves factoring the symmetric and positive definite $n\times n$ matrix $A^T A$.

The product $A^T A$ is ill-conditioned, $\kappa(A^T A) = \kappa(A)^2$, which can reduce the accuracy of a least squares solution.
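A quick check of this squaring with a moderately ill-conditioned Vandermonde matrix (chosen so that $\kappa(A)^2$ stays well below $1/\epsilon_{\text{machine}}$):

In [ ]:
import numpy

A = numpy.vander(numpy.linspace(-1, 1, 10))
print(numpy.linalg.cond(A)**2)
print(numpy.linalg.cond(A.T.dot(A)))   # should be roughly cond(A)^2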
Next, we will discuss a factorization $$ U \Sigma V^T = A $$ where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with nonnegative entries. The entries of $\Sigma$ are called singular values and this decomposition is the singular value decomposition (SVD). It may remind you of an eigenvalue decomposition $X \Lambda X^{-1} = A$, but
- the SVD exists for every matrix, including rectangular ones,
- $U$ and $V$ are orthogonal, whereas the eigenvector matrix $X$ generally is not, and
- the singular values are always real and nonnegative.
An alternative is to explicitly form the $n\times m$ pseudoinverse $A^\dagger = R^{-1} Q^T$ (at a cost of $mn^2$) at which point each right hand side costs $2 mn$. Why might we do this?
In [32]:
# Test accuracy of solver for an ill-conditioned square matrix
x = numpy.linspace(-1,1,19)
A = numpy.vander(x)
print('cond(A) = ',numpy.linalg.cond(A))
Q, R = numpy.linalg.qr(A)
print('cond(R^{-1} Q^T A) =', numpy.linalg.cond(numpy.linalg.solve(R, Q.T.dot(A))))
L = numpy.linalg.cholesky(A.T.dot(A))
print('cond(L^{-T} L^{-1} A^T A) =', numpy.linalg.cond(numpy.linalg.solve(L.T, numpy.linalg.solve(L, A.T.dot(A)))))
In [ ]:
The SVD is the decomposition $$ U \Sigma V^h = A $$ where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal and nonnegative. Evidently an arbitrary matrix consists of
- an orthogonal transformation ($V^h$),
- a nonnegative diagonal scaling ($\Sigma$), and
- another orthogonal transformation ($U$).
It is typical to order the singular values in descending order.
The inverse has the same behavior:
$$ A^{-1} = V \Sigma^{-1} U^h . $$

The matrix norm is the largest singular value,

$$ \lVert A \rVert = \sigma_{\max} . $$

The largest value in $\Sigma^{-1}$ is $\sigma_\min^{-1}$, so

$$ \lVert A^{-1} \rVert = \sigma_\min^{-1} . $$

We showed previously that

$$\kappa(A) = \lVert A \rVert \, \lVert A^{-1} \rVert , $$

so now we can also write

$$ \kappa(A) = \frac{\sigma_\max}{\sigma_\min} . $$

The SVD is a crucial tool in statistics and dimensionality reduction, often under names like principal component analysis (PCA) and proper orthogonal decomposition (POD).
In this context, one computes an SVD of a data matrix and finds that the spectrum $\Sigma$ decays rapidly and only the $k \ll n$ components larger than some threshold are "important". The dense matrix can thus be approximated as
$$ \hat U \hat \Sigma \hat V^h $$

by keeping only the $k$ largest singular values and the corresponding columns of $U$ and $V$. This reduces the storage from $mn$ entries to $mk + kn$ entries.
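A sketch of such a truncation on a synthetic, nearly low-rank matrix (the sizes, rank, and noise level are arbitrary choices): the 2-norm error of the rank-$k$ truncation equals the first discarded singular value.

In [ ]:
import numpy

numpy.random.seed(3)
m, n, k = 50, 40, 8
A = numpy.random.randn(m, k).dot(numpy.random.randn(k, n))   # a rank-k matrix ...
A += 1e-6 * numpy.random.randn(m, n)                         # ... plus a small perturbation
U, S, Vh = numpy.linalg.svd(A, full_matrices=False)
Ahat = U[:, :k].dot(numpy.diag(S[:k])).dot(Vh[:k, :])        # keep the k largest singular values
print(numpy.linalg.norm(A - Ahat, 2), S[k])                  # both about the noise level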
In [ ]: