In [1]:
import numpy as np

6. Singular Value Decomposition

6.1. Invertibility of Linear Operators and Matrices

6.1.1. We recall from finite-dimensional Hilbert space theory (§4.4) that eigenvalues and the corresponding (unit) eigenvectors are useful in producing a unitary diagonalization

$$A = PDP^*$$

of a square matrix $A$.

Observe that $(\lambda, v)$ is an eigenvalue-eigenvector pair of an $n$-by-$n$ matrix $A$ if and only if

$$(A-\lambda I)v = 0,$$

where $I$ is the identity matrix. Since $v \neq 0$ and $(A-\lambda I)0 = 0$, we see that the linear operator (§3.7.1) $x \mapsto (A-\lambda I)x$ fails to be bijective.

Conversely, if $x \mapsto (A-\lambda I)x$ is bijective, then the identity

$$(A - \lambda I)x = 0$$

holds only when $x = 0$, whence it follows that $\lambda$ is not an eigenvalue of $A$.

It therefore suffices to compute the values of $\lambda$ such that the linear operator $x \mapsto (A-\lambda I)x$ fails to be bijective. The goal of this section is to develop a computational measure of invertibility.
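
As a quick numerical illustration (a sketch only; we treat NumPy's np.linalg.eig as a black box here, well before we develop any machinery of our own), an eigenvalue-eigenvector pair of a small matrix produces a nonzero vector that $A - \lambda I$ sends to zero, so the associated linear operator cannot be injective.

In [ ]:
# A sketch: for an eigenvalue-eigenvector pair (lam, v) of A, the matrix A - lam*I
# sends the nonzero vector v to 0, so x -> (A - lam*I)x fails to be injective.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

lam = eigenvalues[0]
v = eigenvectors[:, 0]               # unit eigenvector associated with lam
print((A - lam * np.eye(2)) @ v)     # numerically the zero vector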

6.1.2. We recall that a basis (§3.10.1) of a vector space $V$ is a set $\{v^1,\ldots,v^n\}$ such that each vector $v \in V$ can be written uniquely as a sum of the form

$$v = a_1v^1 + \cdots + a_nv^n,$$

where $a_1,\ldots,a_n$ are scalars. A sum of the above form is called a linear combination of vectors $v^1,\ldots,v^n$.

Let us show that a linear operator $T:V \to W$ is bijective if and only if the image $T(\mathscr{B})$ of a basis $\mathscr{B} = \{v^1,\ldots,v^n\}$ of $V$ is a basis of $W$.

Suppose that $T$ is bijective. Given $w \in W$, we find the unique linear combination

$$T^{-1}w = a_1 v^1 + \cdots + a_n v^n.$$

It then follows that

$$w = TT^{-1}w = T(a_1v^1 + \cdots + a_nv^n) = a_1Tv^1 + \cdots + a_nTv^n,$$

and so there is at least one linear combination of $Tv^1,\ldots,Tv^n$ that equals $w$. If

$$w = b_1 Tv^1 + \cdots + b_n Tv^n$$

is an arbitrary linear combination of $Tv^1,\ldots,Tv^n$ that equals $w$, then

$$T^{-1}w = T^{-1}\left( b_1Tv^1 + \cdots + b_n Tv^n \right) = b_1 v^1 + \cdots + b_n v^n.$$

Since $a_1v^1 + \cdots + a_nv^n$ is the unique linear combination of $v^1,\ldots,v^n$ that equals $T^{-1}w$, we conclude that $a_i = b_i$ for all $1 \leq i \leq n$. It follows that $\{Tv^1,\ldots,Tv^n\}$ is a basis of $W$.

Conversely, we suppose that $\{Tv^1,\ldots,Tv^n\}$ is a basis of $W$. Since every vector in $W$ is a linear combination of $Tv^1,\ldots,Tv^n$, the linear operator $T$ is surjective. Moreover, if $Tx = Ty$, then writing $x = a_1v^1 + \cdots + a_nv^n$ and $y = b_1v^1 + \cdots + b_nv^n$ yields two linear-combination representations

$$a_1Tv^1 + \cdots + a_nTv^n = b_1Tv^1 + \cdots + b_nTv^n$$

of the same vector, and the uniqueness of such representations implies that $a_i = b_i$ for all $1 \leq i \leq n$, whence $x = y$. Therefore $T$ is injective. This concludes the proof that $T$ is bijective if and only if $\{Tv^1,\ldots,Tv^n\}$ is a basis of $W$.

6.1.3. It is a standard fact in linear algebra that all bases of a vector space must have the same cardinality. Every vector space has at least one basis, and so it is possible to assign a unique dimension to each vector space by simply counting the number of basis vectors. For now, we observe that a zero-dimensional vector space consists simply of the zero vector, and a one-dimensional vector space is of the form $\{av : a \in \mathbb{F}\}$ for some nonzero vector $v$.

Let $V$ be a vector space, and let $M$ be a linear subspace (§3.7.2) of $V$. We show that $M = V$ if and only if $\dim M = \dim V$. If $M = V$, then $\dim M = \dim V$ by the uniqueness of the dimension. Conversely, if $\dim M = \dim V$, then a basis of $M$ is already a basis of $V$, whence $M = V$.

6.1.4. We recall that, given a linear operator $T:V \to W$, the kernel

$$\ker T = \{v \in V : Tv = 0_W\}$$

is a linear subspace of $V$, and that the range

$$\operatorname{im} T = \{Tv : v \in V\}$$

is a linear subspace of $W$ (§3.7.2). We remark that the kernel of $T$ is also known as the null space of $T$.

By definition, $T$ is surjective if and only if $\operatorname{im} T = W$. Now, we have seen (§6.1.3) that $\operatorname{im} T = W$ if and only if

$$\dim \operatorname{im} T = \dim W.$$

Moreover, $T$ is injective if and only if $\ker T = \{0_V\}$, which is equivalent to the statement that

$$\dim \ker T = 0.$$

If $T$ is injective, there is no more than one vector that maps to $0_W$, and so $\ker T = \{0_V\}$. Conversely, if $\ker T = \{0_V\}$, then $Tx = Ty$ implies that $T(x-y) = 0_W$, and so $x-y = 0_V$.

Since the dimensions of the range and the null space play an important role in determining basic properties of $T$, we give them names. The rank of $T$ is the dimension of the range of $T$. The nullity of $T$ is the dimension of the null space of $T$.

The rank-nullity theorem characterizes the relationship between rank and nullity: every linear operator $T:V \to W$ satisfies the identity

$$\operatorname{rank} T + \operatorname{nullity} T = \dim V.$$

The rank-nullity theorem shows that $T$ cannot be injective if $\dim V > \dim W$, and that $T$ cannot be surjective if $\dim V < \dim W$. We say that $T$ is of full rank if the rank of $T$ equals $\min(\dim V, \dim W)$, the largest value permitted by the rank-nullity theorem and the dimension of $W$. Note that $T$ is bijective if and only if $\dim V = \dim W$ and $T$ is of full rank. A linear operator that fails to be of full rank is called rank deficient.
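
As a numerical aside (a sketch; np.linalg.matrix_rank is a NumPy routine not developed in this text, and we use it only as a black box until §6.1.7 gives our own rank computation), the full-rank and rank-deficient terminology can be checked on small examples.

In [ ]:
# A sketch: full rank versus rank deficient, using np.linalg.matrix_rank as a black box.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # a 2-by-3 matrix with independent rows
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])           # the second row is twice the first

print(np.linalg.matrix_rank(A), min(A.shape))  # 2 2: A is of full rank
print(np.linalg.matrix_rank(B), min(B.shape))  # 1 2: B is rank deficient
# by the rank-nullity theorem, the nullity of A is 3 - 2 = 1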

6.1.5. How do we compute the rank of a linear operator? It helps to look at the corresponding matrix representation, of course.

The span of a set of vectors $\{v^1,\ldots,v^n\}$ in a vector space $V$ is the set of all linear combinations (§6.1.2) of $v^1,\ldots,v^n$. The span is always a linear subspace (§3.1.1) of $V$.

Given a matrix $A$, we can extract the column vectors of $A$ by multiplying the standard coordinate vectors (§3.8.1). Indeed, $Ae^i$ is the $i$th column vector of $A$.

We now observe that the range of a linear operator $T:\mathbb{R}^n \to \mathbb{R}^m$ equals the span of the transformed coordinate vectors

$$Te^1,\ldots,Te^n.$$

Defining the column space of an $m$-by-$n$ matrix $A$ as the span of the column vectors $A e^1,\ldots, Ae^n$, we see that the column space of $A$ equals the range of the associated linear operator $v \mapsto Av$.

Therefore, the dimension of the column space of $A$, called the column rank of $A$, equals the rank of $v \mapsto Av$.

6.1.6. We now observe that

$$(e^1)^tA ,\ldots, (e^m)^t A$$

are the row vectors of matrix $A$. Here we recall that $v^t$ is the matrix transpose of $v$ (§4.5.9).

The span (§6.1.5) of the row vectors is called the row space of $A$. The row space of an $m$-by-$n$ matrix $A$ is a linear subspace (§3.1.1) of $\mathbb{R}^n$, and the dimension of the row space of $A$ is called the row rank of $A$.

It is a basic fact of linear algebra that the column rank and the row rank of a matrix are equal. Due to linear algebra's historical connections with systems of linear equations, we typically work with the row rank rather than the column rank.

6.1.7. To tease out the row rank of a matrix, we introduce an algorithm that explicitly constructs a basis of the row space.

The Gaussian elimination algorithm turns a matrix $A$ into its reduced row echelon form, $R = (r_{ij})$ satisfying the following properties:

  • every nonzero row is above all zero rows;
  • if $r_{ij}$ is the first nonzero entry in row $i$, then $r_{ij} = 1$ and $r_{i'j} = 0$ for all $i' \neq i$;
  • if $r_{ij}$ is the first nonzero entry in row $i$, then, for each $i' > i$, $r_{i'j'} = 0$ for all $j' \leq j$.

The nonzero rows in $R$ form a basis of the row space of $A$. Therefore, the row rank, hence the column rank, of $A$ is the number of nonzero rows in $R$.

Gaussian elimination can be implemented as follows.


In [2]:
def gaussian_elimination(A):
    A = A.astype(float)  # convert to a floating-point matrix to prevent integer division
    
    numrows, numcols = A.shape
    pivot_row = 0
    
    for j in range(numcols):
        if pivot_row == numrows:  # every row already holds a pivot; nothing left to do
            break
        # find the row index k such that the absolute value of A[k, j] is the largest of
        # the absolute values of all entries in the jth column, at or below the pivot row
        k = np.argmax(np.abs(A[pivot_row:, j])) + pivot_row
    
        if A[k, j] != 0:
            A[[k, pivot_row]] = A[[pivot_row, k]]  # swap row k and row pivot_row
        
            for i in range(numrows):
                if i != pivot_row:
                    A[i] = A[i] - (A[i, j] / A[pivot_row, j]) * A[pivot_row]
            
            A[pivot_row] = A[pivot_row] / A[pivot_row, j]
            pivot_row += 1
    return A

In [3]:
A = np.arange(16).reshape((4, 4))
print(A)
print(gaussian_elimination(A))


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[ 1.  0. -1.  0.]
 [ 0.  1.  2.  0.]
 [-0. -0. -0.  1.]
 [ 0.  0.  0.  0.]]

In [4]:
A = np.array([[0, 2, -9], [1, 3, 5]])
print(A)
print(gaussian_elimination(A))


[[ 0  2 -9]
 [ 1  3  5]]
[[ 1.   0.  18.5]
 [ 0.   1.  -4.5]]

In [5]:
A = np.array([[0, 2, -9], [2, 6, 10], [1, 3, 5], [-4, -12, -20]])
print(A)

print(gaussian_elimination(A))


[[  0   2  -9]
 [  2   6  10]
 [  1   3   5]
 [ -4 -12 -20]]
[[ 1.   0.  18.5]
 [ 0.   1.  -4.5]
 [ 0.   0.   0. ]
 [ 0.   0.   0. ]]
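
Per §6.1.7, the rank can be read off from the reduced row echelon form by counting its nonzero rows. The sketch below does so with the gaussian_elimination routine above; the tolerance used to declare a row numerically zero is our own choice, and, since floating-point roundoff can produce spurious pivots (the first example above, whose matrix has rank 2 yet shows three nonzero rows, illustrates this), the result should be treated with care.

In [ ]:
def rank_via_rref(A, tol=1e-10):
    # count the rows of the reduced row echelon form with at least one entry above tol
    R = gaussian_elimination(A)
    return int(np.sum(np.any(np.abs(R) > tol, axis=1)))

A = np.array([[0, 2, -9], [2, 6, 10], [1, 3, 5], [-4, -12, -20]])
print(rank_via_rref(A))  # 2, matching the two nonzero rows of the reduced form above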

6.1.8. Gaussian elimination (§6.1.7) does much more than the computation of the rank. In fact, Gaussian elimination already does all the work of constructing the inverse of a matrix, whenever it exists.

To reveal the inverse matrix of an $n$-by-$n$ matrix $A$, we construct an $n$-by-$n$ matrix $B$ such that

$$b_{ij} = \begin{cases} 1 & \mbox{ if } i = j \\ 0 & \mbox{ if } i \neq j. \end{cases}$$

We then apply each row operation performed in the Gaussian elimination of $A$ to $B$ as well. The resulting reduced form of $B$ is the inverse of $A$, provided that the reduced row echelon form of $A$ is the identity matrix.


In [6]:
def matrix_inverse(A):
    numrows, numcols = A.shape
    if numrows != numcols:  # nonsquare matrices do not have an inverse
        return None
    n = numrows  # this also equals numcols, as numrows == numcols
    
    A = A.astype(float)  # convert to a floating-point matrix to prevent integer division
    B = np.eye(n)  # identity matrix of size n
    pivot_row = 0
    
    for j in range(n):
        # find the row index k such that the absolute value of A[k, j] is the largest of
        # the absolute values of all entries in the jth column, at or below the pivot row
        k = np.argmax(np.abs(A[pivot_row:, j])) + pivot_row

        if A[k, j] != 0:
            A[[k, pivot_row]] = A[[pivot_row, k]]  # swap row k and row pivot_row
            B[[k, pivot_row]] = B[[pivot_row, k]]
            
            for i in range(n):
                if i != pivot_row:
                    scaling_factor = A[i, j] / A[pivot_row, j]
                    A[i] = A[i] - scaling_factor * A[pivot_row]
                    B[i] = B[i] - scaling_factor * B[pivot_row]

            scaling_factor = A[pivot_row, j]
            A[pivot_row] = A[pivot_row] / scaling_factor
            B[pivot_row] = B[pivot_row] / scaling_factor
        
            pivot_row += 1
    
    if sum(A[i, i] for i in range(n)) == n:  # check for invertibility
        return B
    else:
        return None

In [7]:
A = np.array([[3, -2, 4], [1, 0, 2], [0, 1, 0]])
print(A)

B = matrix_inverse(A)
print(B)

print()

print("B is the inverse of A, accurate up to e-14")
print(np.round(np.dot(A,B), 14))
print(np.round(np.dot(A,B), 15))


[[ 3 -2  4]
 [ 1  0  2]
 [ 0  1  0]]
[[ 1.  -2.   2. ]
 [ 0.   0.   1. ]
 [-0.5  1.5 -1. ]]

B is the inverse of A, accurate up to e-14
[[ 1.  0. -0.]
 [-0.  1. -0.]
 [ 0.  0.  1.]]
[[ 1.e+00  0.e+00 -1.e-15]
 [-0.e+00  1.e+00 -0.e+00]
 [ 0.e+00  0.e+00  1.e+00]]

6.2. Singular Values

6.2.1. We have seen that $\lambda$ is an eigenvalue of a matrix $A$ if and only if the perturbed matrix

$$A - \lambda I$$

fails to be invertible. In other words, $\lambda$ is an eigenvalue of $A$ if and only if $A - \lambda I$ is rank deficient (§6.1.4). While the Gaussian elimination algorithm (§6.1.7) can tell us the rank of $A - \lambda I$ for an explicit choice of $\lambda$, it is cumbersome to work out the reduced row echelon form with a variable $\lambda$. Moreover, Gaussian elimination is an unstable numerical algorithm (§5.2.5), and so care must be exercised in applying it to large matrices.

Moreover, the spectral theorem (§4.5.7) does not apply to non-normal matrices, and it is not clear how to obtain a unitary diagonalization of a non-normal matrix even when all of its eigenvalues are known.

We thus seek a more general method of diagonalization, applicable to all matrices. As a bonus, we shall find that the decomposition applies even to non-square matrices.

6.2.2. Given an arbitrary $m$-by-$n$ matrix $A$, the product $A^*A$ of the conjugate transpose (§4.5.3) and $A$ itself is an $n$-by-$n$ Hermitian matrix (§4.5.3). Indeed,

$$(A^*A)^* = A^*(A^*)^* = A^*A.$$

By the spectral theorem for Hermitian matrices (§4.5.5), there exists an $n$-by-$n$ unitary matrix (§4.5.5) $P = [p^1 \mid \cdots \mid p^n]$ and an $n$-by-$n$ diagonal matrix $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ with real entries such that

$$A^*A = PDP^*.$$

This, in particular, yields eigendecompositions

$$A^*Ap^i = \lambda_i p^i$$

for $1 \leq i \leq n$.

Since $P$ is unitary, $p^i$ is a unit vector, and so

$$(p^i)^* A^*Ap^i = \lambda_i (p^i)^* p^i = \lambda_i.$$

We now observe that

$$(p^i)^* A^* Ap^i = (Ap^i)^* (Ap^i) = (Ap^i) \cdot (Ap^i) \geq 0$$

by the positive-semidefiniteness of the dot product (§4.1.2). It follows that

$$\lambda_i \geq 0$$

for all $1 \leq i \leq n$.
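
As a quick numerical check (a sketch; np.linalg.eigvalsh is NumPy's eigenvalue routine for Hermitian matrices and is assumed here rather than developed), the eigenvalues of $A^*A$ for a randomly chosen complex matrix are indeed real and nonnegative.

In [ ]:
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))

eigenvalues = np.linalg.eigvalsh(A.conj().T @ A)  # eigenvalues of the Hermitian matrix A*A
print(eigenvalues)                                # real and, up to roundoff, nonnegative
print(np.all(eigenvalues >= -1e-12))              # True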

6.2.3. Reordering the eigenvalues (and the corresponding columns of $P$) if necessary, we may assume that there exists a $k$ such that $\lambda_i > 0$ for all $1 \leq i \leq k$ and $\lambda_i = 0$ for all $k+1 \leq i \leq n$. We shall find an $m$-by-$m$ unitary matrix $U$ and an $n$-by-$n$ unitary matrix $V$ such that

$$A = U \Sigma V^*,$$

where $\Sigma$ is the $m$-by-$n$ matrix with entries

$$\Sigma_{ij} = \begin{cases} \lambda_i^{1/2} & \mbox{ if } i = j \\ 0 & \mbox{ otherwise}. \end{cases}$$

This decomposition is called the singular value decomposition of $A$. The diagonal entries of $\Sigma$ are called the singular values of $A$, denoted by $\sigma_1(A),\ldots,\sigma_{\min(m,n)}(A)$. We note that the rank of $A$ is precisely the number of nonzero singular values of $A$. Indeed, $U$ and $V^*$ are unitary isomorphisms (§4.3.1) and do not have an effect on the rank. Therefore, the rank of $A$ equals the rank of $\Sigma$, which can be computed by counting the number of nonzero diagonal entries.

We now establish the existence of the singular value decomposition.

We suppose for now that $m = n$ and $k = n$, i.e., $A$ is square and all diagonal entries of $D$ are strictly positive. We define $\Sigma^{-1}$ to be the $n$-by-$n$ matrix with entries

$$\Sigma_{ij}^{-1} = \begin{cases} \lambda_i^{-1/2} & \mbox{ if } i = j, \\ 0 & \mbox{ otherwise}. \end{cases}$$

Observe that $\Sigma \Sigma^{-1} = \Sigma^{-1} \Sigma = I_n$, the $n$-by-$n$ identity matrix. We define $U = AP\Sigma^{-1}$, so that

$$U\Sigma P^* = AP\Sigma^{-1}\Sigma P^* = APP^* = A$$

by the unitarity of $P$.

Now,

$$\begin{align*} U^*U &= (\Sigma^{-1})^*P^*A^*AP\Sigma^{-1} \\ &= (\Sigma^{-1})^*P^*PDP^*P\Sigma^{-1} \\ &= (\Sigma^{-1})^*D\Sigma^{-1}. \end{align*} $$

by the unitarity of $P$. Since the diagonal entries of $D$ are strictly positive, $(\Sigma^{-1})^* D \Sigma^{-1} = I_n$, and so $U$ is unitary. It follows that

$$A = U\Sigma V^*,$$

where $U$ and $V = P$ are unitary.

We now drop these assumptions and treat the general case. We define $V_1 = [p^1 \mid \cdots \mid p^k]^*$ and, when $k < n$, $V_2 = [p^{k+1} \mid \cdots \mid p^n]^*$, so that $V_1$ is a $k$-by-$n$ matrix and $V_2$ is an $(n-k)$-by-$n$ matrix. Observe that

$$V_1A^*AV_1^* = V_1[\lambda_1 p^1 \mid \cdots \mid \lambda_k p^k] = \operatorname{diag}\left(\lambda_1,\ldots,\lambda_k\right).$$

Observe also that

$$V_2A^*AV_2^* = V_2[0 \mid \cdots \mid 0] = O_{n-k},$$

the $(n-k)$-by-$(n-k)$ zero matrix.

Define the $m$-by-$k$ matrix $U_1 = AV_1^*\operatorname{diag}\left(\lambda_1^{-1/2},\ldots,\lambda_k^{-1/2}\right)$, so that

$$\begin{align*} & U_1\operatorname{diag}\left(\lambda_1^{1/2},\ldots,\lambda_k^{1/2}\right)V_1 \\ &= AV_1^*\operatorname{diag}\left(\lambda_1^{-1/2},\ldots,\lambda_k^{-1/2}\right)\operatorname{diag}\left(\lambda_1^{1/2},\ldots,\lambda_k^{1/2}\right)V_1 \\ &= AV_1^*I_kV_1 \\ &= AV_1^*V_1 \\ &= A. \end{align*}$$

The final equality holds because $V_1^*V_1 = p^1(p^1)^* + \cdots + p^k(p^k)^*$ and $Ap^i = 0$ whenever $i > k$ (indeed, $\|Ap^i\|_2^2 = (p^i)^*A^*Ap^i = \lambda_i = 0$), so that $AV_1^*V_1 = A\left(p^1(p^1)^* + \cdots + p^n(p^n)^*\right) = APP^* = A$.

Now,

$$\begin{align*} U_1^*U_1 &= \operatorname{diag}\left(\overline{\lambda_1^{-1/2}},\ldots,\overline{\lambda_k^{-1/2}}\right) V_1 A^* A V_1^* \operatorname{diag}\left(\lambda_1^{-1/2},\ldots,\lambda_k^{-1/2}\right) \\ &= \operatorname{diag}\left(\overline{\lambda_1^{-1/2}},\ldots,\overline{\lambda_k^{-1/2}}\right) \left( V_1 PDP^* V_1^* \right) \operatorname{diag}\left(\lambda_1^{-1/2},\ldots,\lambda_k^{-1/2}\right) \\ &= \operatorname{diag}\left(\overline{\lambda_1^{-1/2}},\ldots,\overline{\lambda_k^{-1/2}}\right) \operatorname{diag}\left(\lambda_1,\ldots,\lambda_k\right) \operatorname{diag}\left(\lambda_1^{-1/2},\ldots,\lambda_k^{-1/2}\right) \\ &= I_k,\\ \end{align*}$$

and so the column vectors $u^1,\ldots,u^k$ of $U_1$ are orthonormal. We can extend $\{u^1,\ldots,u^k\}$ to an orthonormal basis $\{u^1,\ldots,u^k,u^{k+1},\ldots,u^m\}$ of $\mathbb{C}^m$. We set $U = [u^1 \mid \cdots \mid u^m]$, so that $U$ is unitary.

Since $A = U_1\operatorname{diag}\left(\lambda_1^{1/2},\ldots,\lambda_k^{1/2}\right)V_1$, it follows that

$$A = U \Sigma V^*,$$

where $U$ and $V = P$ are unitary.

We have thus shown that the singular value decomposition for an arbitrary matrix $A$ exists. Typically, we order the singular values of an $m$-by-$n$ matrix $A$ in decreasing order:

$$\sigma_1(A) \geq \sigma_2(A) \geq \cdots \geq \sigma_k(A) \geq 0,$$

where $k = \min(m,n)$.

We remark that the same argument can be made with $AA^*$ in place of $A^*A$. This, in particular, shows that $AA^*$ and $A^*A$ have the same non-zero eigenvalues.
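
In practice we do not build the factors by hand: NumPy's np.linalg.svd (assumed here as a black box) returns $U$, the singular values, and $V^*$ directly. The sketch below checks that the factors reconstruct $A$, that $U$ is unitary, and that the squared singular values are the eigenvalues of $A^*A$, in line with the construction above.

In [ ]:
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

U, s, Vh = np.linalg.svd(A)             # A = U @ Sigma @ Vh, singular values in s
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)    # embed the singular values into an m-by-n matrix

print(np.allclose(A, U @ Sigma @ Vh))          # True: A = U Sigma V*
print(np.allclose(U.conj().T @ U, np.eye(4)))  # True: U is unitary
print(np.allclose(s**2, np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))  # True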

6.2.4. Let $A$ be an $m$-by-$n$ matrix. We shall show that

$$\sigma_1(A) = \|A\|_2,$$

the operator 2-norm of $A$ (§3.8.4).

To see this, we invoke the spectral theorem for Hermitian matrices (§4.5.5) to obtain the eigendecomposition

$$A^*Ap^i = \lambda_i p^i, \hspace{1em} 1 \leq i \leq n.$$

Since the eigenvalues of $A^*A$ are real, we can assume that

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n.$$

We now fix $v \in \mathbb{C}^n$ and take the basis expansion

$$v = v_1 p^1 + \cdots + v_n p^n.$$

We observe that

$$\begin{align*} \|Av\|_2 &= \sqrt{(Av) \cdot (Av)} \\ &= \sqrt{v \cdot (A^*Av)} \\ &= \sqrt{(v_1 p^1 + \cdots + v_n p^n) \cdot (\lambda_1 v_1 p^1 + \cdots + \lambda_n v_n p^n)} \\ &= \sqrt{ v_1\overline{\lambda_1 v_1} + v_2 \overline{\lambda_2 v_2} + \cdots + v_n \overline{\lambda_n v_n}} \\ &\leq \sqrt{v_1\overline{\lambda_1 v_1} + v_2 \overline{\lambda_1 v_2} + \cdots + v_n \overline{\lambda_1 v_n}} \\ &= \sqrt{\lambda_1} \|v\|_2 = \sigma_1(A) \|v\|_2. \end{align*}$$

Therefore, $\|A\|_2 \leq \sigma_1(A)$. Conversely,

$$\begin{align*} \|Ap^1\|_2 &= \sqrt{(Ap^1) \cdot (Ap^1)} \\ &= \sqrt{(p^1) \cdot (A^*A p^1)} \\ &= \sqrt{(p^1) \cdot (\lambda_1 p^1)} \\ &= \sqrt{\lambda_1} = \sigma_1(A), \end{align*}$$

and so $\|A\|_2 \geq \sigma_1(A).$ We conclude that $\|A\|_2 = \sigma_1(A).$
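
A numerical check of the identity $\|A\|_2 = \sigma_1(A)$ (a sketch; np.linalg.norm with ord 2 computes the operator 2-norm, and np.linalg.svd supplies the singular values):

In [ ]:
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

sigma_1 = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value
print(sigma_1)
print(np.linalg.norm(A, 2))                      # operator 2-norm; the two values agree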

6.2.5. The spectral radius of an $n$-by-$n$ matrix $A$ is the quantity

$$\rho(A) = \max\{|\lambda| : \lambda \mbox{ is an eigenvalue of } A\}.$$

Note that $\sigma_1(A) = \sqrt{\rho(A^*A)}$, and so

$$\|A\|_2 = \sqrt{\rho(A^*A)}.$$

If $A$ is Hermitian, then the spectral theorem for Hermitian matrices (§4.5.5) yields the decomposition

$$A = PDP^*,$$

where $P$ is unitary and $D$ is a diagonal matrix with real entries. Observe that

$$A^2 = PDP^*PDP^* = PD^2P^*,$$

which implies that the eigenvalues of $A^2$ are the squares of the eigenvalues of $A$. Since $A^* = A$, we have $A^*A = A^2$. It now follows that

$$\|A\|_2 = \sigma_1(A) = \sqrt{\rho(A^*A)} = \sqrt{\rho(A)^2} = \rho(A),$$

as $\rho(A) \geq 0$.

The spectral radius is a lower bound for every matrix norm (§3.9.1) of $A$, and it is the greatest such lower bound, in the sense that

$$\rho(A) = \inf\{\|A\| : \|\cdot\| \mbox{ is a matrix norm on the space of all $n$-by-$n$ matrices}\}.$$

From this fact, we can deduce Gelfand's formula, which states that

$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$$

for any matrix norm $\|\cdot\|$.

To see this, we first observe that

$$\rho(A) = \rho(A^k)^{1/k} \leq \|A^k\|^{1/k}.$$

We fix $\epsilon > 0$. The above inequality implies that it suffices to show

$$\|A^k\|^{1/k} < \rho(A) + \epsilon$$

for all sufficiently large $k$. To this end, we invoke the minimality of the spectral radius to find a matrix norm $\|\cdot\|_*$ that satisfies the estimate

$$\|A\|_* < \rho(A) + \frac{\epsilon}{2}.$$

By the equivalence of norms on a finite-dimensional space (§4.10.3), we can find a constant $C > 0$ such that $\|M\| \leq C\|M\|_*$ for every $n$-by-$n$ matrix $M$. Since $\|\cdot\|_*$ is a matrix norm, we also have $\|A^k\|_* \leq \|A\|_*^k$, whence the estimate

$$\|A^k\|^{1/k} \leq C^{1/k}\|A^k\|_*^{1/k} \leq C^{1/k}\|A\|_* < C^{1/k}\left(\rho(A) + \frac{\epsilon}{2}\right).$$

Now, $C^{1/k} \to 1$ as $k \to \infty$, and so there exists a constant $K$ such that

$$C^{1/k}\left(\rho(A) + \frac{\epsilon}{2}\right) < \rho(A) + \epsilon$$

for all $k \geq K$. It then follows that $\|A^k\|^{1/k} < \rho(A) + \epsilon$ for all $k \geq K$, as was to be shown.
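
The convergence in Gelfand's formula can be observed numerically. The sketch below uses the Frobenius norm, one admissible matrix norm, and NumPy's eigenvalue routine for the spectral radius; both are assumed as black boxes.

In [ ]:
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
print(spectral_radius)

for k in [1, 5, 20, 80]:
    # ||A^k||^(1/k) approaches the spectral radius as k grows
    print(k, np.linalg.norm(np.linalg.matrix_power(A, k), 'fro') ** (1 / k))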

6.2.6. The argument used to show that $\sigma_1(A) = \rho(A)$ for a Hermitian matrix $A$ (§6.2.5) can be generalized to all singular values. Indeed, if $A$ is Hermitian, then the eigenvalues of $A^*A$ are the squares of the eigenvalues of $A$. It follows that the singular values of a Hermitian matrix $A$ are the absolute values of the eigenvalues of $A$.
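
A quick check of this statement on a randomly generated Hermitian matrix (a sketch; np.linalg.eigvalsh and np.linalg.svd are assumed):

In [ ]:
rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B + B.conj().T                     # Hermitian by construction

singular_values = np.linalg.svd(A, compute_uv=False)
absolute_eigenvalues = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]
print(np.allclose(singular_values, absolute_eigenvalues))   # True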

6.2.7. Let $A$ be an $m$-by-$n$ matrix. We shall show that the Frobenius norm $\|A\|_F$ (§3.9.1) of $A$ satisfies the identity

$$\|A\|_F = \sqrt{\sigma_1(A)^2 + \cdots + \sigma_{\min(m, n)}(A)^2}.$$

To see this, we take the singular value decomposition

$$A = U\Sigma V^*.$$

By the trace formula

$$\|M\|_F = \sqrt{\operatorname{tr}(M^*M)}$$

for the Frobenius norm (§3.9.2),

$$\begin{align*} \|A\|_F &= \sqrt{\operatorname{tr}\left((U \Sigma V^*)^*(U \Sigma V^*)\right)} \\ &= \sqrt{\operatorname{tr}\left(V \Sigma^* U^* U \Sigma V^*\right)} \\ &= \sqrt{\operatorname{tr}\left(V \Sigma^* \Sigma V^*\right)} \\ &= \sqrt{\operatorname{tr}\left( (\Sigma V^*)^* (\Sigma V^*) \right)}\\ &= \|\Sigma V^*\|_F. \end{align*}$$

We now apply the $MM^*$ form of the trace formula

$$\|M\|_F = \sqrt{\operatorname{tr}(MM^*)}$$

to carry out analogous computations, concluding that

$$\|A\|_F = \|\Sigma\|_F.$$

The desired result now follows from the definition of the Frobenius norm.
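
Numerically (a sketch; np.linalg.norm with the 'fro' option computes the Frobenius norm):

In [ ]:
rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))

s = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A, 'fro'))   # Frobenius norm of A
print(np.sqrt(np.sum(s**2)))      # square root of the sum of the squared singular values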

6.2.8. The Eckart–Young theorem allows us to construct low-rank approximations of a matrix using singular values.

Given the SVD $A = U \Sigma V^*$ with unitary matrices $U = [u^1 \mid \cdots \mid u^m]$ and $V = [v^1 \mid \cdots \mid v^n]$, the sum

$$\sum_{i=1}^{\min(m,n)} \sigma_i(A) u^i (v^i)^*$$

precisely equals $A$. If the rank of $A$ is $r$, then $\sigma_i(A) = 0$ for all $i > r$, and so

$$A = \sum_{i=1}^r \sigma_i(A) u^i (v^i)^*.$$

The Eckart–Young theorem generalizes this construction. Given a matrix $A$ of rank $r$ and a positive integer $k < r$,

$$A_k = \sum_{i=1}^k \sigma_i(A) u^i (v^i)^*$$

is the best rank-$k$ approximation of $A$, i.e.,

$$\min_{\operatorname{rank}(B) = k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}(A)$$

and

$$\min_{\operatorname{rank}(B) = k} \|A-B\|_F = \|A - A_k\|_F = \sqrt{\sigma_{k+1}(A)^2 + \cdots + \sigma_r(A)^2}.$$
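
The sketch below (assuming np.linalg.svd) forms the truncated sum $A_k$ and checks the 2-norm error formula $\|A - A_k\|_2 = \sigma_{k+1}(A)$ on a random matrix.

In [ ]:
rng = np.random.default_rng(6)
A = rng.standard_normal((6, 5))
k = 2

U, s, Vh = np.linalg.svd(A)
# best rank-k approximation: keep only the k largest singular values
A_k = sum(s[i] * np.outer(U[:, i], Vh[i, :]) for i in range(k))

print(np.linalg.matrix_rank(A_k))   # k
print(np.linalg.norm(A - A_k, 2))   # equals sigma_{k+1}(A), up to roundoff ...
print(s[k])                         # ... the (k+1)st singular value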

6.2.9. The Moore–Penrose pseudoinverse of an $m$-by-$n$ matrix $\Sigma$ whose non-diagonal entries are zero is the $n$-by-$m$ matrix $\Sigma^\dagger$, defined by setting

$$\Sigma^\dagger_{ij} = \begin{cases} \Sigma_{ji}^{-1} & \mbox{ if } i = j \mbox{ and } \Sigma_{ji} \neq 0; \\ 0 & \mbox{ otherwise}.\end{cases}$$

In general, we define the Moore–Penrose pseudoinverse of an $m$-by-$n$ matrix $A$ to be the $n$-by-$m$ matrix

$$A^\dagger = V\Sigma^\dagger U^*,$$

where $A = U\Sigma V^*$ is the singular value decomposition of $A$.

The Moore–Penrose pseudoinverse satisfies the following properties:

  • $AA^\dagger A = A$;
  • $A^\dagger AA^\dagger = A^\dagger$;
  • $(A^\dagger A)^* = A^\dagger A$;
  • $(AA^\dagger)^* = AA^\dagger$.

In addition, if $A$ is invertible, then $A^\dagger = A^{-1}$.

As a quick application, we note that

$$\min_x \|Ax - b\|_2 = \|A(A^\dagger b) - b\|_2$$

even when $A$ fails to be invertible. The minimum is achieved by vectors of the form

$$A^\dagger b - (I - A^\dagger A)y,$$

where $y$ can be any vector.
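
NumPy's np.linalg.pinv computes the Moore–Penrose pseudoinverse directly (via the SVD). The sketch below checks two of the defining properties and the least-squares application on a rank-deficient matrix; the comparison with np.linalg.lstsq, NumPy's least-squares solver, is an extra assumption of ours.

In [ ]:
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # rank 1: the second column is twice the first
b = np.array([1.0, 0.0, 1.0])

A_pinv = np.linalg.pinv(A)
print(np.allclose(A @ A_pinv @ A, A))              # A A† A = A
print(np.allclose(A_pinv @ A @ A_pinv, A_pinv))    # A† A A† = A†

x = A_pinv @ b                                     # minimizes ||Ax - b||_2
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]     # NumPy's least-squares solver agrees
print(np.allclose(x, x_lstsq))                     # True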

6.2.10. The computation of singular values is a complex topic that requires the development of several preliminary numerical algorithms. We refer the reader to Trefethen–Bau or Golub–van Loan.