Jupyter notebooks

This is a Jupyter notebook using Python. You can install Jupyter locally to edit and interact with this notebook.

Linear Algebra

You have all seen basic linear algebra before, but this will summarize some different ways of thinking about the fundamental operations. It also presents concepts in a different order than Sauer's book.

Linear algebra is the study of linear transformations on vectors, which represent points in a finite dimensional space. The matrix-vector product $y = A x$ is a linear combination of the columns of $A$. The familiar definition,

$$ y_i = \sum_j A_{i,j} x_j $$

can also be viewed as

$$ y = \Bigg[ A_{:,0} \Bigg| A_{:,1} \Bigg| \dotsm \Bigg] \begin{bmatrix} x_0 \\ x_1 \\ \vdots \end{bmatrix} = \Bigg[ A_{:,0} \Bigg] x_0 + \Bigg[ A_{:,1} \Bigg] x_1 + \dotsb . $$

The notation $A_{i,j}$ corresponds to the Python syntax A[i,j] and the colon : means the entire range (row or column). So $A_{:,j}$ is the $j$th column and $A_{i,:}$ is the $i$th row. The corresponding Python syntax is A[:,j] and A[i,:].
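For instance (a minimal sketch with a small hypothetical array), the slicing syntax extracts columns and rows directly:

    import numpy

    A = numpy.array([[1, 2], [3, 5], [7, 11]])  # a small 3x2 example
    print(A[:, 1])   # second column: [ 2  5 11]
    print(A[1, :])   # second row:    [3 5]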


In [83]:
%matplotlib notebook
import numpy
from matplotlib import pyplot

def matmult1(A, x):
    """Entries of y are dot products of rows of A with x"""
    y = numpy.zeros_like(A[:,0])
    for i in range(len(A)):
        row = A[i,:]
        for j in range(len(row)):
            y[i] += row[j] * x[j]
    return y

A = numpy.array([[1,2],[3,5],[7,11]])
x = numpy.array([10,20])
matmult1(A, x)


Out[83]:
array([ 50, 130, 290])

In [84]:
def matmult2(A, x):
    """Same idea, but more compactly"""
    y = numpy.zeros_like(A[:,0])
    for i,row in enumerate(A):
        y[i] = row.dot(x)
    return y

matmult2(A, x)


Out[84]:
array([ 50, 130, 290])

In [85]:
def matmult3(A, x):
    """y is a linear expansion of the columns of A"""
    y = numpy.zeros_like(A[:,0])
    for j,col in enumerate(A.T):
        y += col * x[j]
    return y

matmult3(A, x)


Out[85]:
array([ 50, 130, 290])

In [86]:
# We will use this version
A.dot(x)


Out[86]:
array([ 50, 130, 290])

Some common terminology

  • The range of $A$ is the space spanned by its columns. This definition coincides with the range of a function $f(x)$ when $f(x) = A x$.
  • The nullspace of $A$ is the space of vectors $x$ such that $A x = 0$.
  • The rank of $A$ is the dimension of its range.
  • A matrix has full rank if the nullspace of either $A$ or $A^T$ is trivial (contains only the 0 vector). Equivalently, all the columns of $A$ (or of $A^T$) are linearly independent.
  • A nonsingular (or invertible) matrix is a square matrix of full rank. We call the inverse $A^{-1}$ and it satisfies $A^{-1} A = A A^{-1} = I$.

$\DeclareMathOperator{\rank}{rank} \DeclareMathOperator{\null}{null} $ If $A \in \mathbb{R}^{m\times m}$, which of these doesn't belong?

  1. $A$ has an inverse $A^{-1}$
  2. $\rank (A) = m$
  3. $\null(A) = \{0\}$
  4. $A A^T = A^T A$
  5. $\det(A) \ne 0$
  6. $A x = 0$ implies that $x = 0$

When we write $x = A^{-1} y$, we mean that $x$ is the unique vector such that $A x = y$. (It is rare that we explicitly compute a matrix $A^{-1}$, though it's not as "bad" as people may have told you.) A vector $y$ is equivalent to $\sum_i e_i y_i$ where $e_i$ are columns of the identity. Meanwhile, $x = A^{-1} y$ means that we are expressing that same vector $y$ in the basis of the columns of $A$, i.e., $\sum_i A_{:,i} x_i$.
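As a small sketch (using an arbitrary invertible matrix), solving $A x = y$ with numpy.linalg.solve produces exactly the coefficients that expand $y$ in the column basis of $A$:

    import numpy

    A = numpy.array([[2., 3], [0, 4]])   # an arbitrary invertible example
    y = numpy.array([7., 8])
    x = numpy.linalg.solve(A, y)         # coefficients of y in the basis of the columns of A
    print(A[:, 0]*x[0] + A[:, 1]*x[1])   # reconstructs y = [7, 8]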


In [87]:
B = numpy.array([[2, 3],[0, 4]])
print(B)
print(B.dot(B.T), B.T.dot(B))
Binv = numpy.linalg.inv(B)
Binv.dot(B), B.dot(Binv)


[[2 3]
 [0 4]]
[[13 12]
 [12 16]] [[ 4  6]
 [ 6 25]]
Out[87]:
(array([[ 1.,  0.],
        [ 0.,  1.]]), array([[ 1.,  0.],
        [ 0.,  1.]]))

Vandermonde matrices

A Vandermonde matrix is one whose columns are functions evaluated at discrete points.


In [88]:
x = numpy.linspace(-1,1)
A = numpy.array([x**3, x**2, x, 1+0*x]).T
print('shape =', A.shape)  # This is a tall matrix with 4 columns
pyplot.style.use('ggplot')
pyplot.figure()
pyplot.plot(x, A)
pyplot.ylim((-1.1,1.1))
pyplot.show()


shape = (50, 4)
  • This type of matrix is very common; we can also create it with numpy.vander.
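As a quick sketch, numpy.vander reproduces the matrix built by hand above; by default the highest power comes first:

    import numpy

    x = numpy.linspace(-1, 1)
    A = numpy.array([x**3, x**2, x, 1+0*x]).T
    V = numpy.vander(x, 4)         # columns are x^3, x^2, x, 1 (highest power first)
    print(numpy.allclose(A, V))    # True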

We can evaluate polynomials using matrix-vector multiplication. For example, $$ 5x^3 - 3x = \Bigg[ x^3 \Bigg|\, x^2 \Bigg|\, x \,\Bigg|\, 1 \Bigg] \begin{bmatrix}5 \\ 0 \\ -3 \\ 0 \end{bmatrix} . $$


In [89]:
pyplot.figure()
p = numpy.array([5,0,-3,0])
pyplot.plot(x, A.dot(p))


Out[89]:
[<matplotlib.lines.Line2D at 0x7f1f00dabc50>]

Now suppose we know the value of a polynomial at a few points. We can use the Vandermonde matrix to find a polynomial through those points.


In [90]:
x1 = numpy.array([-0.9, 0.1, 0.5, 0.8])  # points where we know values
y = numpy.array([1, 2.4, -0.2, 1.3])     # values at those points
pyplot.figure()
pyplot.plot(x1, y, '*')
B = numpy.vander(x1)                     # Vandermonde matrix at the known points
p = numpy.linalg.solve(B, y)             # Compute the polynomial coefficients
print(p)
pyplot.plot(x, A.dot(p))                 # Plot the polynomial evaluated at all points
print('B =', B, '\np =', p)


[ 12.98319328  -1.74789916  -9.47605042   3.35210084]
B = [[-0.729  0.81  -0.9    1.   ]
 [ 0.001  0.01   0.1    1.   ]
 [ 0.125  0.25   0.5    1.   ]
 [ 0.512  0.64   0.8    1.   ]] 
p = [ 12.98319328  -1.74789916  -9.47605042   3.35210084]

Evidently $p(x) = 12.983 x^3 - 1.748 x^2 - 9.476 x + 3.352$ is the unique cubic polynomial that interpolates those points. Applying $B^{-1}$ converted the values at the marked points into the polynomial coefficients.

Inner products and orthogonality

The inner product $$ x^T y = \sum_i x_i y_i $$ of vectors (or columns of a matrix) tells us about their magnitudes and the angle between them. The norm is induced by the inner product, $$ \lVert x \rVert = \sqrt{x^T x} $$ and the angle $\theta$ is defined by $$ \cos \theta = \frac{x^T y}{\lVert x \rVert \, \lVert y \rVert} . $$ Inner products are bilinear, which means that they satisfy some convenient algebraic properties $$ \begin{split} (x + y)^T z &= x^T z + y^T z \\ x^T (y + z) &= x^T y + x^T z \\ (\alpha x)^T (\beta y) &= \alpha \beta x^T y \\ \end{split} . $$ The pairwise inner products between two sets of vectors can be expressed by collecting the sets as columns in matrices and writing $A = X^T Y$ where $A_{i,j} = x_i^T y_j$. It follows from this definition that $$ (X^T Y)^T = Y^T X .$$
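A short numerical sketch of these definitions, using arbitrary example vectors:

    import numpy

    x = numpy.array([1., 2, 2])
    y = numpy.array([3., 0, 4])
    print(numpy.sqrt(x.dot(x)), numpy.linalg.norm(x))              # two ways to get ||x||
    print(x.dot(y) / (numpy.linalg.norm(x)*numpy.linalg.norm(y)))  # cos(theta)
    X = numpy.array([x, y]).T     # collect vectors as columns
    Y = numpy.array([y, x]).T
    print(numpy.allclose((X.T.dot(Y)).T, Y.T.dot(X)))              # (X^T Y)^T = Y^T X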

Orthogonal matrices

If $x^T y = 0$ then we say $x$ and $y$ are orthogonal (or "$x$ is orthogonal to $y$"). A vector is said to be normalized if $\lVert x \rVert = 1$. If $x$ is orthogonal to $y$ and $\lVert x \rVert = \lVert y \rVert = 1$ then we say $x$ and $y$ are orthonormal. A matrix with orthonormal columns is said to be an orthogonal matrix. We typically use $Q$ or $U$ and $V$ for matrices that are known/constructed to be orthogonal. Orthogonal matrices are always full rank -- the columns are linearly independent. The inverse of a square orthogonal matrix is its transpose: $$ Q^T Q = Q Q^T = I . $$ Orthogonal matrices are a powerful building block for robust numerical algorithms.


In [91]:
# Make some polynomials
q0 = A.dot(numpy.array([0,0,0,.5]))  # .5
q1 = A.dot(numpy.array([0,0,1,0]))  # x
q2 = A.dot(numpy.array([0,1,0,0]))  # x^2

pyplot.figure()
pyplot.plot(x, numpy.array([q0, q1, q2]).T)


Out[91]:
[<matplotlib.lines.Line2D at 0x7f1f00afeb38>,
 <matplotlib.lines.Line2D at 0x7f1f00afe780>,
 <matplotlib.lines.Line2D at 0x7f1f00af8cc0>]

In [92]:
# Inner products of even and odd functions

q0 = q0 / numpy.linalg.norm(q0)
q1.dot(q0), q2.dot(q0), q2.dot(q1)


Out[92]:
(-7.4940054162198066e-16, 2.4532276081982256, -3.5527136788005009e-15)

In [93]:
# What is the constant component of q2?

pyplot.figure()
pyplot.plot(x, q2.dot(q0)*q0)


Out[93]:
[<matplotlib.lines.Line2D at 0x7f1f00e6c4a8>]

In [94]:
# Let's project that away so that q2 is orthogonal to q0

q2 = q2 - q2.dot(q0)*q0

Q = numpy.array([q0, q1, q2]).T
print(Q.T.dot(Q))
pyplot.figure()
pyplot.plot(x, Q)


[[  1.00000000e+00  -7.49400542e-16   1.66533454e-16]
 [ -7.49400542e-16   1.73469388e+01  -9.99200722e-16]
 [  1.66533454e-16  -9.99200722e-16   4.80888065e+00]]
Out[94]:
[<matplotlib.lines.Line2D at 0x7f1f00f49cc0>,
 <matplotlib.lines.Line2D at 0x7f1f00f49b70>,
 <matplotlib.lines.Line2D at 0x7f1f00fb14e0>]

Gram-Schmidt Orthogonalization

Given a collection of vectors (columns of a matrix), we can find an orthogonal basis by applying the above procedure one column at a time.


In [95]:
def gram_schmidt_naive(X):
    Q = numpy.zeros_like(X)
    R = numpy.zeros((len(X.T),len(X.T)))
    for i in range(len(Q.T)):
        v = X[:,i].copy()
        for j in range(i):
            r = v.dot(Q[:,j])
            R[j,i] = r
            v -= r * Q[:,j]    # "modified Gram-Schmidt" - remove each component before next dot product
        R[i,i] = numpy.linalg.norm(v)
        Q[:,i] = v / R[i,i]
    return Q, R

Q, R = gram_schmidt_naive(A)
print(Q.T.dot(Q))
print(numpy.linalg.norm(Q.dot(R)-A))
pyplot.figure()
pyplot.plot(x, Q)


[[  1.00000000e+00   2.91433544e-16   1.24900090e-16  -2.28983499e-16]
 [  2.91433544e-16   1.00000000e+00   1.38777878e-17  -4.30211422e-16]
 [  1.24900090e-16   1.38777878e-17   1.00000000e+00  -4.51028104e-17]
 [ -2.28983499e-16  -4.30211422e-16  -4.51028104e-17   1.00000000e+00]]
6.72043887442e-16
Out[95]:
[<matplotlib.lines.Line2D at 0x7f1f00cdcb00>,
 <matplotlib.lines.Line2D at 0x7f1f00cdc080>,
 <matplotlib.lines.Line2D at 0x7f1f00cdc400>,
 <matplotlib.lines.Line2D at 0x7f1f00cdc940>]

In [96]:
Q, R = gram_schmidt_naive(numpy.vander(x, 4, increasing=True))
pyplot.figure()
pyplot.plot(x, Q)


Out[96]:
[<matplotlib.lines.Line2D at 0x7f1f00cee4a8>,
 <matplotlib.lines.Line2D at 0x7f1f010ce550>,
 <matplotlib.lines.Line2D at 0x7f1f00fa32b0>,
 <matplotlib.lines.Line2D at 0x7f1f0104f9e8>]

Theorem: all full-rank $m\times n$ matrices ($m \ge n$) have a unique $Q R$ factorization with $R_{j,j} > 0$.
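numpy.linalg.qr does not enforce the $R_{j,j} > 0$ convention, but any QR factorization of a full-rank matrix can be normalized to it by flipping signs; a sketch with a hypothetical random matrix:

    import numpy

    numpy.random.seed(0)
    X = numpy.random.rand(5, 3)          # an (almost surely) full-rank tall matrix
    Q, R = numpy.linalg.qr(X)
    s = numpy.sign(numpy.diag(R))        # flip signs so the diagonal of R is positive
    Q, R = Q*s, s[:, None]*R             # Q R is unchanged
    print(numpy.all(numpy.diag(R) > 0), numpy.linalg.norm(Q.dot(R) - X))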

Orthogonal polynomials

We used x = numpy.linspace(-1,1), which uses $m=50$ points by default. The number 50 is arbitrary; as we use more points, our columns become better approximations of continuous functions and the vector inner product becomes an integral (up to scaling): $$ \frac 2 m \sum_{i=1}^m p_i q_i \approx \int_{-1}^1 p(x) q(x) \, dx . $$
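A quick sketch of this scaling: with many points, the discrete inner products of monomial columns approach the corresponding integrals.

    import numpy

    m = 1000
    x = numpy.linspace(-1, 1, m)
    print(2/m * numpy.sum(x * x))       # approximates the integral of x^2 on [-1,1], i.e. 2/3
    print(2/m * numpy.sum(x * x**2))    # approximates the integral of x^3 on [-1,1], i.e. 0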

When we orthogonalize the monomials using this inner product, we get the Legendre polynomials (up to scaling). These polynomials have important applications in physics and engineering, as well as playing an important role in approximation (which we will discuss in more detail later).

Solving equations using QR

To solve $$ A x = b $$ we can compute $A = QR$ and then $$ x = R^{-1} Q^T b . $$

This also works for non-square systems!


In [97]:
x1 = numpy.array([-0.9, 0.1, 0.5, 0.8])  # points where we know values
y = numpy.array([1, 2.4, -0.2, 1.3])     # values at those points
pyplot.figure()
pyplot.plot(x1, y, '*')
B = numpy.vander(x1, 2)                     # Vandermonde matrix at the known points
Q, R = gram_schmidt_naive(B)
p = numpy.linalg.solve(R, Q.T.dot(y))             # Compute the polynomial coefficients
print(p)
pyplot.plot(x, numpy.vander(x,2).dot(p))                 # Plot the polynomial evaluated at all points
print('B =', B, '\np =', p)


[-0.17147193  1.14643399]
B = [[-0.9  1. ]
 [ 0.1  1. ]
 [ 0.5  1. ]
 [ 0.8  1. ]] 
p = [-0.17147193  1.14643399]

In [98]:
m = 20
V = numpy.vander(numpy.linspace(-1,1,m), increasing=False)
Q, R = gram_schmidt_naive(V)

def qr_test(qr, V):
    Q, R = qr(V)
    m = len(Q.T)
    print(qr.__name__, numpy.linalg.norm(Q.dot(R) - V), numpy.linalg.norm(Q.T.dot(Q) - numpy.eye(m)))
    
qr_test(gram_schmidt_naive, V)
qr_test(numpy.linalg.qr, V)


gram_schmidt_naive 1.22046927257e-15 1.86469927987e-08
qr 5.89181489555e-15 2.97458960556e-15

In [99]:
def gram_schmidt_classical(X):
    Q = numpy.zeros_like(X)
    R = numpy.zeros((len(X.T),len(X.T)))
    for i in range(len(Q.T)):
        v = X[:,i].copy()
        R[:i,i] = Q[:,:i].T.dot(v)
        v -= Q[:,:i].dot(R[:i,i])
        R[i,i] = numpy.linalg.norm(v)
        Q[:,i] = v / R[i,i]
    return Q, R

qr_test(gram_schmidt_classical, V)


gram_schmidt_classical 1.11926691633e-16 2.97851950621

Classical Gram-Schmidt is highly parallel, but unstable, as evidenced by the lack of orthogonality in $Q$.

Right-looking algorithms

The implementations above have been "left-looking"; when working on column $i$, we compare it only to columns to the left (i.e., $j < i$). We can reorder the algorithm to look to the right by projecting $q_i$ out of all columns $j > i$. This algorithm is stable while being just as parallel as gram_schmidt_classical.


In [100]:
def gram_schmidt_modified(X):
    Q = X.copy()
    R = numpy.zeros((len(X.T), len(X.T)))
    for i in range(len(Q.T)):
        R[i,i] = numpy.linalg.norm(Q[:,i])
        Q[:,i] /= R[i,i]
        R[i,i+1:] = Q[:,i+1:].T.dot(Q[:,i])
        Q[:,i+1:] -= numpy.outer(Q[:,i], R[i,i+1:])
    return Q, R

qr_test(gram_schmidt_modified, V)


gram_schmidt_modified 1.22046927257e-15 1.86469927987e-08

Householder triangularization

Gram-Schmidt methods perform triangular transformations to build an orthogonal matrix. As we have seen, $X = QR$ is satisfied accurately, but $Q$ may not be orthogonal when $X$ is ill-conditioned. Householder triangularization instead applies a sequence of orthogonal transformations to build a triangular matrix.

$$ \underbrace{Q_{n-1} \dotsb Q_0}_{Q^T} A = R $$

The structure of the algorithm is

$$ \underbrace{\begin{bmatrix} * & * & * \\ * & * & * \\ * & * & * \\ * & * & * \\ * & * & * \\ \end{bmatrix}}_{A} \to \underbrace{\begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \\ \end{bmatrix}}_{Q_0 A} \to \underbrace{\begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & * \\ 0 & 0 & * \\ \end{bmatrix}}_{Q_1 Q_0 A} \to \underbrace{\begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \end{bmatrix}}_{Q_2 Q_1 Q_0 A} $$

where the elementary orthogonal matrices $Q_i$ are chosen to introduce zeros below the diagonal in the $i$th column of $R$. Each of these transformations will have the form $$Q_i = \begin{bmatrix} I_i & 0 \\ 0 & F \end{bmatrix}$$ where $F$ is a "reflection" that achieves $$ F x = \begin{bmatrix} \lVert x \rVert \\ 0 \\ 0 \\ \vdots \end{bmatrix} $$ where $x$ is the column of $R$ from the diagonal down. This transformation is a reflection across a plane with normal $v = Fx - x = \lVert x \rVert e_1 - x$.

The reflection, as depicted by Trefethen and Bau (1999), can be written $F = I - 2 \frac{v v^T}{v^T v}$.
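A small numerical sketch (with a hypothetical random vector) confirming that this $F$, built from $v = \lVert x \rVert e_1 - x$, maps $x$ onto the first coordinate axis:

    import numpy

    numpy.random.seed(1)
    x = numpy.random.rand(4)
    v = numpy.linalg.norm(x)*numpy.eye(4)[:, 0] - x     # normal of the reflection plane
    F = numpy.eye(4) - 2*numpy.outer(v, v)/v.dot(v)     # Householder reflection
    print(F.dot(x))                                     # approximately [||x||, 0, 0, 0]
    print(numpy.linalg.norm(x))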


In [101]:
def householder_Q_times(V, x):
    """Apply orthogonal matrix represented as list of Householder reflectors"""
    y = x.copy()
    for i in reversed(range(len(V))):
        y[i:] -= 2 * V[i] * V[i].dot(y[i:])
    return y

def qr_householder1(A):
    "Compute QR factorization using naive Householder reflection"
    m, n = A.shape
    R = A.copy()
    V = []
    for i in range(n):
        x = R[i:,i]
        v = -x
        v[0] += numpy.linalg.norm(x)
        v = v/numpy.linalg.norm(v)     # Normalized reflector plane
        R[i:,i:] -= 2 * numpy.outer(v, v.dot(R[i:,i:]))
        V.append(v)                    # Storing reflectors is equivalent to storing orthogonal matrix
    Q = numpy.eye(m, n)
    for i in range(n):
        Q[:,i] = householder_Q_times(V, Q[:,i])
    return Q, numpy.triu(R[:n,:])

qr_test(qr_householder1, numpy.array([[1.,2],[3,4],[5,6]]))


qr_householder1 1.83102671941e-15 3.23682852457e-16

In [102]:
qr_test(qr_householder1, V)
qr_test(numpy.linalg.qr, V)


qr_householder1 6.66577491755e-15 2.99974677314e-15
qr 5.89181489555e-15 2.97458960556e-15

Choice of two projections

It turns out our implementation has a nasty deficiency.


In [103]:
qr_test(qr_householder1, numpy.eye(1))


qr_householder1 nan nan
/usr/lib/python3.5/site-packages/ipykernel/__main__.py:17: RuntimeWarning: invalid value encountered in true_divide

In [104]:
qr_test(qr_householder1, numpy.eye(3,2))


qr_householder1 nan nan
/usr/lib/python3.5/site-packages/ipykernel/__main__.py:17: RuntimeWarning: invalid value encountered in true_divide

Inside qr_householder1, we have the lines

        x = R[i:,i]
        v = -x
        v[0] += numpy.linalg.norm(x)
        v = v/numpy.linalg.norm(v)     # Normalized reflector plane

What happens when $$x = \begin{bmatrix}1 \\ 0 \end{bmatrix}$$ (i.e., the column of $R$ is already upper triangular)?

We are trying to define a reflector plane (via its normal vector) from the zero vector, $$v = \lVert x \rVert e_0 - x .$$ When we try to normalize this vector, we divide zero by zero and the algorithm breaks down (nan). Maybe we just need to test for this special case and "skip ahead" when no reflection is needed? And if so, how would we define $Q$?


In [105]:
qr_test(qr_householder1, numpy.array([[1.,1], [2e-8,1]]))
print(qr_householder1(numpy.array([[1.,1], [2e-8,1]])))


qr_householder1 2.2044604925e-09 4.4408920985e-16
(array([[  1.00000000e+00,  -2.22044605e-08],
       [  2.22044605e-08,   1.00000000e+00]]), array([[ 1.        ,  1.00000002],
       [ 0.        ,  0.99999998]]))

The error $QR - A$ is still $10^{-8}$ for this very well-conditioned matrix, so something else must be at play here.


In [106]:
def qr_householder2(A):
    "Compute QR factorization using Householder reflection"
    m, n = A.shape
    R = A.copy()
    V = []
    for i in range(n):
        v = R[i:,i].copy()
        v[0] += numpy.sign(v[0])*numpy.linalg.norm(v) # Choose the further of the two reflections
        v = v/numpy.linalg.norm(v)     # Normalized reflector plane
        R[i:,i:] -= 2 * numpy.outer(v, v.dot(R[i:,i:]))
        V.append(v)                    # Storing reflectors is equivalent to storing orthogonal matrix
    Q = numpy.eye(m, n)
    for i in range(n):
        Q[:,i] = householder_Q_times(V, Q[:,i])
    return Q, numpy.triu(R[:n,:])

qr_test(qr_householder2, numpy.eye(3,2))
qr_test(qr_householder2, numpy.array([[1.,1], [1e-8,1]]))
print(qr_householder2(numpy.array([[1.,1], [1e-8,1]])))

qr_test(qr_householder2, V)


qr_householder2 0.0 0.0
qr_householder2 0.0 0.0
(array([[ -1.00000000e+00,   1.00000000e-08],
       [ -1.00000000e-08,  -1.00000000e+00]]), array([[-1.        , -1.00000001],
       [ 0.        , -0.99999999]]))
qr_householder2 6.34323888919e-15 2.31937137379e-15

We now have a usable implementation of Householder QR. There are some further concerns for factoring rank-deficient matrices. We will visit the concept of pivoting later, in the context of LU and Cholesky factorization.

Condition number of a matrix

We may have informally referred to a matrix as "ill-conditioned" when its columns are nearly linearly dependent, but let's make this concept more precise. Recall the definition of (relative) condition number from the Rootfinding notes,

$$ \kappa = \max_{\delta x} \frac{|\delta f|/|f|}{|\delta x|/|x|} . $$

We understood this definition for scalar problems, but it also makes sense when the inputs and/or outputs are vectors (or matrices, etc.) and absolute value is replaced by vector (or matrix) norms. Let's consider the case of matrix-vector multiplication, for which $f(x) = A x$.

$$ \kappa(A) = \max_{\delta x} \frac{\lVert A (x+\delta x) - A x \rVert/\lVert A x \rVert}{\lVert \delta x\rVert/\lVert x \rVert} = \max_{\delta x} \frac{\lVert A \delta x \rVert}{\lVert \delta x \rVert} \, \frac{\lVert x \rVert}{\lVert A x \rVert} = \lVert A \rVert \frac{\lVert x \rVert}{\lVert A x \rVert} . $$

There are two problems here:

  • I wrote $\kappa(A)$ but my formula depends on $x$.
  • What is that $\lVert A \rVert$ beastie?

Stack push: Matrix norms

Vector norms are built into the linear space (and defined in terms of the inner product). Matrix norms are induced by vector norms, according to

$$ \lVert A \rVert = \max_{x \ne 0} \frac{\lVert A x \rVert}{\lVert x \rVert} . $$
  • This equation makes sense for non-square matrices -- the vector norms of the input and output spaces may differ.
  • Due to linearity, all that matters is the direction of $x$, so it could equivalently be written
$$ \lVert A \rVert = \max_{\lVert x \rVert = 1} \lVert A x \rVert . $$
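We can estimate this induced norm by sampling directions (a rough sketch) and compare with numpy.linalg.norm(A, 2), which computes the induced 2-norm exactly via singular values:

    import numpy

    numpy.random.seed(2)
    A = numpy.random.rand(3, 2)                 # a hypothetical non-square example
    xs = numpy.random.randn(2, 1000)
    xs /= numpy.linalg.norm(xs, axis=0)         # normalize each column (direction)
    estimate = max(numpy.linalg.norm(A.dot(x)) for x in xs.T)
    print(estimate, numpy.linalg.norm(A, 2))    # the sampled estimate is a lower bound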

Stack pop

Now we understand the formula for condition number, but it depends on $x$. Consider the matrix

$$ A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} . $$
  • What is the norm of this matrix?
  • What is the condition number when $x = [1,0]^T$?
  • What is the condition number when $x = [0,1]^T$?

The condition number of matrix-vector multiplication depends on the vector. The condition number of the matrix is the worst case (maximum) of the condition number for any vector, i.e.,

$$ \kappa(A) = \max_{x \ne 0} \lVert A \rVert \frac{\lVert x \rVert}{\lVert A x \rVert} .$$

If $A$ is invertible, then we can rephrase as

$$ \kappa(A) = \max_{x \ne 0} \lVert A \rVert \frac{\lVert A^{-1} (A x) \rVert}{\lVert A x \rVert} = \max_{A x \ne 0} \lVert A \rVert \frac{\lVert A^{-1} (A x) \rVert}{\lVert A x \rVert} = \lVert A \rVert \lVert A^{-1} \rVert . $$

Evidently multiplying by a matrix is just as ill-conditioned an operation as solving a linear system using that matrix.
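A quick sketch confirming that numpy.linalg.cond agrees with $\lVert A \rVert \, \lVert A^{-1} \rVert$:

    import numpy

    A = numpy.array([[1., 1000], [0, 1]])       # an ill-conditioned example
    print(numpy.linalg.cond(A, 2))
    print(numpy.linalg.norm(A, 2) * numpy.linalg.norm(numpy.linalg.inv(A), 2))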


In [107]:
def R_solve(R, b):
    """Solve Rx = b using back substitution."""
    x = b.copy()
    m = len(b)
    for i in reversed(range(m)):
        x[i] -= R[i,i+1:].dot(x[i+1:])
        x[i] /= R[i,i]
    return x

Q, R = numpy.linalg.qr(A)
b = Q.T.dot(A.dot(numpy.array([1,2,3,4])))
numpy.linalg.norm(R_solve(R, b) - numpy.linalg.solve(R, b))
R_solve(R, b)


Out[107]:
array([ 1.,  2.,  3.,  4.])

Cost of Householder factorization

The dominant cost comes from the line

R[i:,i:] -= 2 * numpy.outer(v, v.dot(R[i:,i:]))

where R[i:,i:] is an $(m-i)\times(n-i)$ matrix. This line performs $2(m-i)(n-i)$ operations in v.dot(R[i:,i:]), another $(m-i)(n-i)$ in the "outer" product, and again in the subtraction. As written, the multiplication by 2 would be another $(m-i)(n-i)$ operations, but it costs only $m-i$ operations if we rewrite as

w = 2*v
R[i:,i:] -= numpy.outer(w, v.dot(R[i:,i:]))

in which case the leading order cost is $4(m-i)(n-i)$. To compute the total cost, we need to sum over all columns $i$, $$\begin{split} \sum_{i=1}^n 4(m-i)(n-i) &= 4 \Big[ \sum_{i=1}^n (m-n)(n-i) + \sum_{i=1}^n (n-i)^2 \Big] \\ &= 4 (m-n) \sum_{i=1}^n i + 4 \sum_{i=1}^n i^2 \\ &\approx 2 (m-n) n^2 + 4 n^3/3 \\ &= 2 m n^2 - \frac 2 3 n^3 . \end{split}$$ Recall that Gram-Schmidt QR cost $2 m n^2$, so Householder costs about the same when $m \gg n$ and is markedly less expensive when $m \approx n$.
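We can sanity-check the leading-order estimate by comparing the exact sum with $2 m n^2 - \frac 2 3 n^3$ for one (hypothetical) choice of $m$ and $n$:

    m, n = 5000, 1000
    exact = sum(4*(m-i)*(n-i) for i in range(1, n+1))
    estimate = 2*m*n**2 - 2*n**3/3
    print(exact / estimate)        # close to 1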

Least squares and the normal equations

A least squares problem takes the form: given an $m\times n$ matrix $A$ ($m \ge n$), find $x$ such that $$ \lVert Ax - b \rVert $$ is minimized. If $A$ is square and full rank, then this minimizer will satisfy $A x - b = 0$, but that is not the case in general because $b$ is not in the range of $A$. The residual $A x - b$ must be orthogonal to the range of $A$.

  • Is this the same as saying $A^T (A x - b) = 0$?
  • If $QR = A$, is it the same as $Q^T (A x - b) = 0$?

In HW2, we showed that $QQ^T$ is an orthogonal projector onto the range of $Q$. If $QR = A$, $$ QQ^T (A x - b) = QQ^T(Q R x - b) = Q (Q^T Q) R x - QQ^T b = QR x - QQ^T b = A x - QQ^T b . $$ So if $b$ is in the range of $A$, we can solve $A x = b$. If not, we need only orthogonally project $b$ into the range of $A$.
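A sketch checking numerically that both conditions hold at the least squares solution, using a hypothetical tall matrix and numpy.linalg.lstsq as the reference solver:

    import numpy

    numpy.random.seed(3)
    A = numpy.random.rand(10, 3)                   # tall, so b is generally not in range(A)
    b = numpy.random.rand(10)
    x = numpy.linalg.lstsq(A, b, rcond=None)[0]    # least squares solution
    r = A.dot(x) - b                               # residual
    Q, R = numpy.linalg.qr(A)
    print(numpy.linalg.norm(A.T.dot(r)), numpy.linalg.norm(Q.T.dot(r)))   # both ~ 0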

Solution by QR (Householder)

Solve $R x = Q^T b$.

  • QR factorization costs $2 m n^2 - \frac 2 3 n^3$ operations and is done once per matrix $A$.
  • Computing $Q^T b$ costs $4 (m-n)n + 2 n^2 = 4 mn - 2n^2$ (using the elementary reflectors, which are stable and lower storage than naive storage of $Q$).
  • Solving with $R$ costs $n^2$ operations. Total cost per right hand side is thus $4 m n - n^2$.

This method is stable and accurate.

Solution by Cholesky

The mathematically equivalent form $(A^T A) x = A^T b$ is known as the normal equations. The solution process involves factoring the symmetric and positive definite $n\times n$ matrix $A^T A$.

  • Computing $A^T A$ costs $m n^2$ flops, exploiting symmetry.
  • Factoring $A^T A = R^T R$ costs $\frac 1 3 n^3$ flops. The total factorization cost is thus $m n^2 + \frac 1 3 n^3$.
  • Computing $A^T b$ costs $2 m n$.
  • Solving with $R^T$ costs $n^2$.
  • Solving with $R$ costs $n^2$. Total cost per right hand side is thus $2 m n + 2 n^2$.

The product $A^T A$ is ill-conditioned: $\kappa(A^T A) = \kappa(A)^2$ and can reduce the accuracy of a least squares solution.
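A sketch of this normal-equations approach using numpy's Cholesky factorization (numpy.linalg.cholesky returns a lower-triangular $L$ with $A^T A = L L^T$, so two solves with $L$ and $L^T$ give $x$; for brevity the sketch uses the general numpy.linalg.solve rather than a dedicated triangular solver):

    import numpy

    numpy.random.seed(4)
    A = numpy.random.rand(10, 3)
    b = numpy.random.rand(10)
    L = numpy.linalg.cholesky(A.T.dot(A))               # A^T A = L L^T
    x = numpy.linalg.solve(L.T, numpy.linalg.solve(L, A.T.dot(b)))
    print(numpy.linalg.norm(A.T.dot(A.dot(x) - b)))     # residual orthogonal to range(A)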

Solution by Singular Value Decomposition

Next, we will discuss a factorization $$ U \Sigma V^T = A $$ where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with nonnegative entries. The entries of $\Sigma$ are called singular values and this decomposition is the singular value decomposition (SVD). It may remind you of an eigenvalue decomposition $X \Lambda X^{-1} = A$, but

  • the SVD exists for all matrices (including non-square and rank-deficient matrices)
  • $U,V$ have orthonormal columns (while $X$ can be arbitrarily ill-conditioned). Indeed, if a matrix is symmetric and positive definite (all positive eigenvalues), then $U=V$ and $\Sigma = \Lambda$.

Computing an SVD requires a somewhat complicated iterative algorithm, but a crude estimate of the cost is $2 m n^2 + 11 n^3$. Note that this is similar to the cost of $QR$ when $m \gg n$, but much more expensive for square matrices. Solving with the SVD involves

  • computing $U^T b$ at a cost of $2 m n$,
  • solving with the diagonal $n\times n$ matrix $\Sigma$ at a cost of $n$, and
  • applying $V$ at a cost of $2 m n$.

The total cost per right hand side is thus $4 m n$.
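A sketch of the SVD-based least squares solve with numpy.linalg.svd (reduced factorization), compared against numpy.linalg.lstsq:

    import numpy

    numpy.random.seed(5)
    A = numpy.random.rand(10, 3)
    b = numpy.random.rand(10)
    U, sigma, Vt = numpy.linalg.svd(A, full_matrices=False)   # reduced SVD
    x = Vt.T.dot(U.T.dot(b) / sigma)                           # x = V Sigma^{-1} U^T b
    print(numpy.linalg.norm(x - numpy.linalg.lstsq(A, b, rcond=None)[0]))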

Pseudoinverse

An alternative is to explicitly form the $n\times m$ pseudoinverse $A^\dagger = R^{-1} Q^T$ (at a cost of $mn^2$) at which point each right hand side costs $2 mn$. Why might we do this?

  • Lots of right hand sides
  • Real-time solution
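A sketch of forming this pseudoinverse for a (hypothetical) full-column-rank matrix, compared with numpy.linalg.pinv:

    import numpy

    numpy.random.seed(6)
    A = numpy.random.rand(10, 3)
    Q, R = numpy.linalg.qr(A)
    A_pinv = numpy.linalg.solve(R, Q.T)            # A^+ = R^{-1} Q^T when A has full column rank
    print(numpy.linalg.norm(A_pinv - numpy.linalg.pinv(A)))
    # Each right hand side now costs only a matrix-vector product: x = A_pinv.dot(b)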

In [108]:
# Test accuracy of solver for an ill-conditioned square matrix

x = numpy.linspace(-1,1,19)
A = numpy.vander(x)
print('cond(A) = ',numpy.linalg.cond(A))
Q, R = numpy.linalg.qr(A)
print('cond(R^{-1} Q^T A) =', numpy.linalg.cond(numpy.linalg.solve(R, Q.T.dot(A))))
L = numpy.linalg.cholesky(A.T.dot(A))
print('cond(L^{-T} L^{-1} A^T A) =', numpy.linalg.cond(numpy.linalg.solve(L.T, numpy.linalg.solve(L, A.T.dot(A)))))


cond(A) =  90847309.6146
cond(R^{-1} Q^T A) = 1.00000000931
cond(L^{-T} L^{-1} A^T A) = 1.45485451693

The Singular Value Decomposition

The SVD is the decomposition $$ U \Sigma V^T = A $$ where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal and nonnegative. Evidently the action of an arbitrary matrix can be decomposed into

  1. Orthogonal matrix -- rotation and reflection
  2. Diagonal scaling (transforms a sphere into an ellipsoid aligned to the coordinate axes)
  3. Orthogonal matrix -- rotation and reflection

It is typical to order the singular values in descending order.
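A small sketch checking that $A$ maps each right singular vector to the corresponding scaled left singular vector, $A v_i = \sigma_i u_i$, and that numpy.linalg.svd returns the singular values in descending order:

    import numpy

    numpy.random.seed(7)
    A = numpy.random.rand(4, 3)
    U, sigma, Vt = numpy.linalg.svd(A, full_matrices=False)
    print(sigma)                                         # descending
    print(numpy.linalg.norm(A.dot(Vt.T) - U*sigma))      # A V = U Sigma, so this is ~ 0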

The inverse has the same behavior:

$$ A^{-1} = V \Sigma^{-1} U^T $$

The matrix norm is the largest singular value

$$ \lVert A \rVert = \sigma_{\max} . $$

The largest value in $\Sigma^{-1}$ is $\sigma_\min^{-1}$, so

$$ \lVert A^{-1} \rVert = \sigma_\min^{-1} . $$

We showed previously that

$$\kappa(A) = \lVert A \rVert \, \lVert A^{-1} \rVert $$

so now we can also write

$$ \kappa(A) = \frac{\sigma_\max}{\sigma_\min} . $$
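A quick numerical sketch of these identities:

    import numpy

    numpy.random.seed(8)
    A = numpy.random.rand(4, 4)
    sigma = numpy.linalg.svd(A, compute_uv=False)            # singular values, descending
    print(numpy.linalg.norm(A, 2), sigma[0])                 # ||A|| = sigma_max
    print(numpy.linalg.cond(A, 2), sigma[0]/sigma[-1])       # kappa(A) = sigma_max/sigma_min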

The SVD is a crucial tool in statistics and dimensionality reduction, often under names like principal component analysis (PCA).

In this context, one computes an SVD of a data matrix and finds that the spectrum $\Sigma$ decays rapidly and only the $k \ll n$ components larger than some threshold are "important". The dense matrix can thus be approximated as

$$ \hat U \hat \Sigma \hat V^T $$

by keeping only the first (largest) $k$ columns. This reduces the storage from $mn$ entries to $mk + kn$ entries.
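A sketch of this truncated approximation with a hypothetical nearly low-rank matrix:

    import numpy

    numpy.random.seed(9)
    m, n, k = 100, 50, 5
    A = numpy.random.rand(m, k).dot(numpy.random.rand(k, n))   # a rank-k matrix ...
    A += 1e-6*numpy.random.rand(m, n)                           # ... plus a tiny perturbation
    U, sigma, Vt = numpy.linalg.svd(A, full_matrices=False)
    Ahat = (U[:, :k]*sigma[:k]).dot(Vt[:k, :])                  # keep the k largest components
    print(numpy.linalg.norm(A - Ahat))                          # small truncation error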

Tinkering in class 2016-10-10


In [109]:
m = 10
x = numpy.cos(numpy.linspace(0,numpy.pi,m))
f = 1.0*(x > 0) + (x < 0.5)
A = numpy.vander(x)
Q, R = numpy.linalg.qr(A)
p = numpy.linalg.solve(R, Q.T.dot(f))
y = numpy.linspace(-1,1,50)
g = numpy.vander(y, m).dot(p)
pyplot.figure()
pyplot.plot(x, f, '*')
pyplot.plot(y, g)
print(numpy.linalg.cond(A))


1179.25137554

In [110]:
'%10e' % numpy.linalg.cond(numpy.vander(numpy.linspace(-1,1,100),20))


Out[110]:
'7.206778e+06'
