Lecture 6: Matrix decompositions

Syllabus

Week 1: Python intro
Week 2: Matrices, vectors, norms, ranks
Week 3: Linear systems, eigenvectors, eigenvalues
Week 4: Matrix decompositions: LU, QR, SVD + test

Recap of the previous lecture

  • Eigenvectors and their applications (Schrodinger + PageRank)
  • Gershgorin circles
  • Computing eigenvectors using power method, $x := Ax$
  • Schur theorem, $A = U T U^*$
  • Normal matrices $AA^* = A^* A$.

Today's lecture

Today we will talk about matrix factorizations:

  • LU decomposition and Gaussian elimination
  • QR decomposition and Gram-Schmidt orthogonalization

What is the LU factorization

From the title it is clear that the LU factorization is the representation of a given matrix $A$ as a product
of a lower triangular matrix $L$ and an upper triangular matrix $U$.
The matrix $L$ is non-singular with ones on the diagonal, and the matrix $U$ is non-singular.

The main goal of the LU decomposition is to solve linear systems, because $$ A^{-1} f = (L U)^{-1} f = U^{-1} L^{-1} f, $$ so the solution reduces to two triangular systems $$ L y = f \quad \text{and} \quad U x = y. $$ Solving linear systems with triangular matrices is easy: you solve the equation with one unknown, substitute the result into the next equation, and so on. The computational cost of this algorithm is $\mathcal{O}(n^2)$ arithmetic operations, once the $LU$ factorization is known.

Note that the inverse of a triangular matrix is again a triangular matrix.
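
Here is a minimal sketch of the two triangular solves (forward and backward substitution); the names solve_lower and solve_upper are ours, introduced only for illustration:

import numpy as np

def solve_lower(L, f):
    # Forward substitution for L y = f, L lower triangular
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] = (f[i] - L[i, :i].dot(y[:i])) / L[i, i]
    return y

def solve_upper(U, y):
    # Backward substitution for U x = y, U upper triangular
    n = U.shape[0]
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i+1:].dot(x[i+1:])) / U[i, i]
    return x

Each solve touches every entry of the triangular factor once, hence the $\mathcal{O}(n^2)$ cost.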

LU decomposition: the algorithm

We can do some symbolic computations using the Python sympy package for our Hilbert matrix, to illustrate what the LU decomposition looks like.


In [39]:
import numpy as np
import sympy
from sympy.matrices import Matrix
import IPython
sympy.init_printing(use_latex=True)
n = 5
w = Matrix(n, n, lambda i, j: 1/(i + j +  sympy.Integer(1)/2))
L, U, tmp = w.LUdecomposition()
#Generate the final expression
expr = (L + U)
fn = '%s \\times %s = %s' % (sympy.latex(L), sympy.latex(U), sympy.latex(w))
IPython.display.Math(fn)


Out[39]:
$$\left[\begin{matrix}1 & 0 & 0 & 0 & 0\\\frac{5}{7} & 1 & 0 & 0 & 0\\\frac{5}{9} & \frac{14}{11} & 1 & 0 & 0\\\frac{5}{11} & \frac{189}{143} & \frac{9}{5} & 1 & 0\\\frac{5}{13} & \frac{84}{65} & \frac{198}{85} & \frac{44}{19} & 1\end{matrix}\right] \times \left[\begin{matrix}\frac{2}{5} & \frac{2}{7} & \frac{2}{9} & \frac{2}{11} & \frac{2}{13}\\0 & \frac{8}{441} & \frac{16}{693} & \frac{24}{1001} & \frac{32}{1365}\\0 & 0 & \frac{128}{127413} & \frac{128}{70785} & \frac{256}{109395}\\0 & 0 & 0 & \frac{512}{8690825} & \frac{2048}{15011425}\\0 & 0 & 0 & 0 & \frac{32768}{9256590525}\end{matrix}\right] = \left[\begin{matrix}\frac{2}{5} & \frac{2}{7} & \frac{2}{9} & \frac{2}{11} & \frac{2}{13}\\\frac{2}{7} & \frac{2}{9} & \frac{2}{11} & \frac{2}{13} & \frac{2}{15}\\\frac{2}{9} & \frac{2}{11} & \frac{2}{13} & \frac{2}{15} & \frac{2}{17}\\\frac{2}{11} & \frac{2}{13} & \frac{2}{15} & \frac{2}{17} & \frac{2}{19}\\\frac{2}{13} & \frac{2}{15} & \frac{2}{17} & \frac{2}{19} & \frac{2}{21}\end{matrix}\right]$$

A simple implementation of the LU factorization can be written with three nested loops:


In [40]:
L = Matrix(n, n, lambda i, j: 0)
U = Matrix(n, n, lambda i, j: 0)
a = w.copy()  # Our matrix
for k in range(n):  # Eliminate entries below the diagonal in column k
    L[k, k] = 1
    for i in range(k+1, n):
        L[i, k] = a[i, k] / a[k, k]  # the multiplier
        for j in range(k+1, n):
            a[i, j] = a[i, j] - L[i, k] * a[k, j]  # update the trailing submatrix
    for j in range(k, n):
        U[k, j] = a[k, j]  # the k-th row of U

In [41]:
U


Out[41]:
$$\left[\begin{matrix}\frac{2}{5} & \frac{2}{7} & \frac{2}{9} & \frac{2}{11} & \frac{2}{13}\\0 & \frac{8}{441} & \frac{16}{693} & \frac{24}{1001} & \frac{32}{1365}\\0 & 0 & \frac{128}{127413} & \frac{128}{70785} & \frac{256}{109395}\\0 & 0 & 0 & \frac{512}{8690825} & \frac{2048}{15011425}\\0 & 0 & 0 & 0 & \frac{32768}{9256590525}\end{matrix}\right]$$

Strictly regular matrices and LU decomposition

A matrix $A$ is called strictly regular if all of its leading principal submatrices (i.e., the submatrices formed by the first $k$ rows and the first $k$ columns) are non-singular. In this case, the LU-decomposition always exists. The converse is also true.

Proof. If there is an LU-decomposition, then the matrix is strictly regular: a leading minor of $A$ equals the product of the corresponding submatrices of $L$ and $U$, which are non-singular (non-singularity of a triangular matrix is equivalent to its diagonal elements being non-zero).

The other direction can be proven by induction. Suppose that for all $(n-1) \times (n-1)$ matrices with non-singular leading minors the LU-decomposition exists.

Then, consider the block form $$ A = \begin{bmatrix} a & c^{\top} \\ b & D \end{bmatrix}, $$ where $D$ is $(n-1) \times (n-1)$.

Then we do "block elimination": $$ \begin{bmatrix} 1 & 0 \\ -z & I \end{bmatrix} \begin{bmatrix} a & c^{\top} \\ b & D \end{bmatrix}= \begin{bmatrix} a & c^{\top} \\ 0 & A_1 \end{bmatrix}, $$ where $z = \frac{b}{a}, \quad A_1 = D - \frac{1}{a} b c^{\top}$. We can show that $A_1$ is also strictly regular, thus it has (by induction) the LU-decomposition:
$$ A_1 = L_1 U_1, $$

Then, setting $$ L = \begin{bmatrix} 1 & 0 \\ z & L_1 \end{bmatrix}, \quad U = \begin{bmatrix} a & c^{\top} \\ 0 & U_1 \end{bmatrix}, $$ we get $A = LU$, and $L$ and $U$ have the required triangular structure.
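
To make the induction concrete, here is a minimal numerical sketch of the recursive block construction (the function block_lu below is ours, written for illustration and assuming strict regularity, not a library routine):

import numpy as np

def block_lu(A):
    # Recursive LU following the block-elimination proof; assumes strict regularity
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0]]), A.copy()
    a = A[0, 0]
    b = A[1:, 0:1]                  # column below the pivot
    c = A[0:1, 1:]                  # row to the right of the pivot
    D = A[1:, 1:]
    z = b / a
    A1 = D - b.dot(c) / a           # Schur complement
    L1, U1 = block_lu(A1)
    L = np.eye(n)
    L[1:, 0:1] = z
    L[1:, 1:] = L1
    U = np.zeros((n, n))
    U[0, 0] = a
    U[0:1, 1:] = c
    U[1:, 1:] = U1
    return L, U

A = np.array([[1.0/(i + j + 0.5) for i in range(4)] for j in range(4)])
L, U = block_lu(A)
print(np.linalg.norm(L.dot(U) - A))  # should be at machine-precision level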

When LU fails

What happens if the matrix is not strictly regular, or if the pivots in the Gaussian elimination are very small? There is a classical $2 \times 2$ example of a bad LU decomposition. The matrix we look at is
$$ A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix}. $$ If $\varepsilon$ is sufficiently small, we might fail.


In [47]:
import numpy as np
eps = 1.1 * 1e-16
a = [[eps, 1],[1,  1]]
a = np.array(a)
a0 = a.copy()
n = a.shape[0]
L = np.zeros((n, n))
U = np.zeros((n, n))
for k in range(n):  # Eliminate entries below the diagonal in column k
    L[k, k] = 1
    for i in range(k+1, n):
        L[i, k] = a[i, k] / a[k, k]
        for j in range(k+1, n):
            a[i, j] = a[i, j] - L[i, k] * a[k, j]
    for j in range(k, n):
        U[k, j] = a[k, j]

print('L * U - A:')
print(np.dot(L, U) - a0)
print(L)


L * U - A:
[[ 0.  0.]
 [ 0.  1.]]
[[  1.00000000e+00   0.00000000e+00]
 [  9.09090909e+15   1.00000000e+00]]

The concept of pivoting

We can do pivoting, i.e. permute rows (and columns) to maximize the pivot $a_{kk}$ that we divide by.
The simplest but effective strategy is row pivoting (partial pivoting): at each step, select the element of the current column that is maximal in modulus and swap its row onto the diagonal.

It gives us the decomposition

$$A = P L U,$$

where $P$ is a permutation matrix.

What makes pivoting good?
The fact that

$$ | L_{ij}| \leq 1, $$

but the elements of $U$ can still grow, up to a factor of $2^{n-1}$! (in practice, such growth is encountered very rarely).
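
A minimal sketch of LU with row pivoting (the function plu below is ours, not a library routine), revisiting the $\varepsilon$-example above:

import numpy as np

def plu(A):
    # LU with partial pivoting: returns P, L, U with A = P L U
    a = A.astype(float).copy()
    n = a.shape[0]
    piv = np.arange(n)
    L = np.zeros((n, n))
    for k in range(n):
        p = k + np.argmax(np.abs(a[k:, k]))  # row with the largest pivot in column k
        a[[k, p], :] = a[[p, k], :]
        L[[k, p], :] = L[[p, k], :]
        piv[[k, p]] = piv[[p, k]]
        L[k, k] = 1
        for i in range(k+1, n):
            L[i, k] = a[i, k] / a[k, k]
            a[i, k+1:] -= L[i, k] * a[k, k+1:]
            a[i, k] = 0.0
    U = np.triu(a)
    P = np.eye(n)[:, piv]  # permutation matrix such that A = P L U
    return P, L, U

A = np.array([[1.1e-16, 1.0], [1.0, 1.0]])
P, L, U = plu(A)
print(np.linalg.norm(P.dot(L).dot(U) - A))  # small residual now
print(L)                                    # entries of L are bounded by 1 in modulus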

Cholesky factorization

A very important class of matrices is the class of symmetric positive definite matrices. By positive definiteness we mean that the associated quadratic form is always positive: $$ (Ax, x) > 0 \quad \text{for all } x \neq 0, $$ which is denoted by $A > 0$.

In this case, the LU-factorization always exists and, moreover, it can be written in the form of the Cholesky factorization
$$ A = R R^*, $$ where $R$ is lower triangular. And we can always select pivots on the diagonal!
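
A quick numerical illustration, using nothing beyond NumPy's built-in np.linalg.cholesky (which returns the lower triangular factor):

import numpy as np

n = 5
B = np.random.randn(n, n)
A = B.dot(B.T) + n * np.eye(n)         # symmetric positive definite by construction
R = np.linalg.cholesky(A)              # lower triangular factor, A = R R^T
print(np.linalg.norm(R.dot(R.T) - A))  # should be at machine-precision level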

Summary on the LU for dense matrices

  • If all leading principal submatrices are non-singular, the LU decomposition exists
  • Pivoting is needed to avoid error growth
  • Complexity is $\mathcal{O}(n^3)$ for the factorization step (it can be sped up using Strassen-type ideas) and $\mathcal{O}(n^2)$ for a solve

LU-factorization for sparse matrices

The $\mathcal{O}(n^3)$ cost can be reduced if the matrix $A$ is sparse. For example, for a tridiagonal matrix the factors $L$ and $U$ are bidiagonal and can be computed at $\mathcal{O}(n)$ cost. Some demo...


In [26]:
import numpy as np
import scipy as sp
from scipy.sparse import csc_matrix, linalg as sla
%matplotlib inline 
import matplotlib.pyplot as plt
plt.xkcd()
#A = csc_matrix([[1,2,0,4],[1,0,0,1],[1,0,2,1],[2,2,1,0.]])

n = 10
ex = np.ones(n);
lp1 = sp.sparse.spdiags(np.vstack((ex,  -2*ex, ex)), [-1, 0, 1], n, n, 'csc'); 
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
A = csc_matrix(A)

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].spy(A)
ax[0].set_title('Sparsity pattern of a matrix')
lu = sla.splu(A)
ax[1].spy(lu.L)
ax[1].set_title('Sparsity of an L factor')


Out[26]:
[Figure: sparsity pattern of the matrix (left) and sparsity of the L factor (right)]


QR-decomposition

The next decomposition is the QR decomposition. Again, from the name it is clear that a matrix is represented as a product
$$ A = Q R, $$ where $Q$ is an orthogonal (unitary) matrix and $R$ is upper triangular.
The matrix sizes: $Q$ is $n \times m$, $R$ is $m \times m$.

This algorithm plays a crucial role in many problems:

  • Computing orthogonal bases for a linear space
  • Used in the preprocessing step for the SVD
  • QR-algorithm for the computation of eigenvectors and eigenvalues (one of the 10 most important algorithms of the 20th century) is based on the QR-decomposition

Theorem

Every rectangular $n \times m$ matrix with $n \geq m$ has a QR-decomposition. There are several ways to prove it and to compute it.

  • (mathematical) Using the Gram matrices and Cholesky factorization
  • (geometrical) Using the Gram-Schmidt orthogonalization
  • (practical) Using Householder/Givens transformations

Mathematical way

If
$$A = QR,$$

then $A^* A = R^* Q^* Q R = R^* R$. The matrix $A^* A$ is called the Gram matrix, and its elements are the scalar products of the columns of $A$.

It is simple to show that $A^* A$ is positive semi-definite: $$ (A^* A y, y) = (Ay, Ay) = \Vert Ay \Vert^2 \geq 0, $$ and it is positive definite if $A$ has full column rank. Therefore, the Cholesky factorization $A^* A = R^* R$ exists. Then the matrix $A R^{-1}$ is orthogonal:
$$ (A R^{-1})^* (AR^{-1})= R^{-*} A^* A R^{-1} = R^{-*} R^* R R^{-1} = I. $$

Gram matrices are not good for numerical stability!

Let us show this on a Hilbert-type matrix (a random matrix is left commented out in the code) :) The Gram-matrix method may lead to a loss of orthogonality.


In [50]:
import numpy as np
n = 10
r = 5
#a = np.random.randn(n, r)
a = [[1.0/(i + j + 0.5) for i in range(r)] for j in range(n)]
a = np.array(a)
q, Rmat = np.linalg.qr(a)
e = np.eye(r)
print('Built-in QR orth', np.linalg.norm(np.dot(q.T, q) - e))
gram_matrix = a.T.dot(a)
Rmat1 = np.linalg.cholesky(gram_matrix)  # lower triangular factor of the Gram matrix
q1 = np.dot(a, np.linalg.inv(Rmat1.T))   # Q = A R^{-1}
print('Via Gram matrix:', np.linalg.norm(np.dot(q1.T, q1) - e))


Built-in QR orth 8.10893700925e-16
Via Gram matrix: 1.78151575967e-08

Second way: Gram-Schmidt orthogonalization

The QR-decomposition is a mathematical way of writing down the Gram-Schmidt orthogonalization process.
Given a sequence of vectors $a_1, \ldots, a_m$ we want to find an orthonormal basis $q_1, \ldots, q_m$ such that every $a_i$ is a linear combination of $q_1, \ldots, q_i$.

Gram-Schmidt:

  1. $q_1 := a_1/\Vert a_1 \Vert$
  2. $q_2 := a_2 - (a_2, q_1) q_1, \quad q_2 := q_2/\Vert q_2 \Vert$
  3. $q_3 := a_3 - (a_3, q_1) q_1 - (a_3, q_2) q_2, \quad q_3 := q_3/\Vert q_3 \Vert$
  4. And so on.

Note that the transformation from $Q$ to $A$ has triangular structure, since from the $k$-th vector we subtract only the previous ones.

Gram-Schmidt is unstable

Gram-Schmidt can be very unstable (i.e., the produced vectors will not be orthogonal, especially if $q_k$ has a small norm).
This is called loss of orthogonality.
There is a remedy, called the modified Gram-Schmidt method. Instead of computing $$q_k := a_k - (a_k, q_1) q_1 - \ldots - (a_k, q_{k-1}) q_{k-1}$$ all at once, we do it step by step. First we set $q_k := a_k$ and orthogonalize sequentially: $$ q_k := q_k - (q_k, q_1)q_1, \quad q_k := q_{k} - (q_k,q_2)q_2, \quad \ldots $$ In exact arithmetic, it is the same. In floating point arithmetic it is absolutely different! A comparison is sketched in the code below.

Note that the complexity is $\mathcal{O}(nm^2)$ operations.
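
A minimal sketch (the functions cgs and mgs below are ours, written for illustration) comparing the loss of orthogonality of classical and modified Gram-Schmidt on the same kind of Hilbert-type matrix:

import numpy as np

def cgs(A):
    # Classical Gram-Schmidt: project the original column onto all previous q's at once
    n, m = A.shape
    Q = np.zeros((n, m))
    for k in range(m):
        q = A[:, k] - Q[:, :k].dot(Q[:, :k].T.dot(A[:, k]))
        Q[:, k] = q / np.linalg.norm(q)
    return Q

def mgs(A):
    # Modified Gram-Schmidt: orthogonalize the remaining columns one q at a time
    n, m = A.shape
    Q = A.astype(float)
    for k in range(m):
        Q[:, k] = Q[:, k] / np.linalg.norm(Q[:, k])
        for j in range(k+1, m):
            Q[:, j] -= Q[:, k] * Q[:, k].dot(Q[:, j])
    return Q

n, r = 100, 10
A = np.array([[1.0/(i + j + 0.5) for i in range(r)] for j in range(n)])
for name, Q in [('classical', cgs(A)), ('modified', mgs(A))]:
    print(name, np.linalg.norm(Q.T.dot(Q) - np.eye(r)))  # classical typically loses much more orthogonality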

QR-decomposition: the (almost) practical way

If $A = QR$, then
$$ R = Q^* A, $$ and we need to find an orthogonal matrix $Q$ that brings the matrix into upper triangular form.
For simplicity, we will look for an $n \times n$ matrix $Q$ such that $$ Q^* A = \begin{bmatrix} R \\ 0_{(n-m) \times m} \end{bmatrix}, $$ where $R$ is an $m \times m$ upper triangular matrix.

We will do it column by column.
First, we find a Householder matrix $H_1 = I - 2 uu^{\top}$, $\Vert u \Vert = 1$, such that (we illustrate on a $4 \times 3$ matrix)

$$ H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \end{bmatrix} $$

Then, $$ H_2 H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & * \end{bmatrix}, $$ where $$ H_2 = \begin{bmatrix} 1 & 0 \\ 0 & H'_2 \end{bmatrix} $$ and $H'_2$ is a $3 \times 3$ Householder matrix.

Finally, $$ H_3 H_2 H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & 0 \end{bmatrix}, $$ so that $R = H_3 H_2 H_1 A$ and $Q = H_1 H_2 H_3$.

A simple implementation of the Householder method in Python looks as follows.


In [28]:
#Initialization
import numpy as np
n = 4
r = 3
a = np.random.randn(n, r)
k = 0

In [31]:
def householder(x):
    # Build a unit vector u such that (I - 2 u u^T) x is proportional to e_1
    u = x.copy()
    u = u / np.linalg.norm(u)
    u[0] = u[0] - 1  # or plus; in practice the sign is chosen to avoid cancellation
    u = u / np.linalg.norm(u)
    u = np.reshape(u, (-1, 1))
    return u


x = a[:, k:k+1]
u0 = householder(a[k:, k:k+1])
u = np.zeros((n, 1))
u[k:] = u0
H = np.eye(n) - 2 * u.dot(u.T)  # Householder reflection acting on rows k..n-1
a = H.T.dot(a)                  # zero out the k-th column below the diagonal
print(a)
k = k + 1                       # re-run this cell to process the next column


[[  2.55804239e+00  -3.02487125e-01   1.22104370e-01]
 [  2.18724250e-16   1.12507545e+00  -3.09474748e-01]
 [ -2.64480936e-17  -3.45664708e-16   4.80083115e-01]
 [  2.76406866e-17   1.07140777e-16  -5.55111512e-17]]
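
Putting the reflections into a loop gives a complete, non-optimized Householder QR sketch (the function householder_qr below is ours, written for illustration):

import numpy as np

def householder_qr(A):
    # Non-optimized Householder QR: returns Q (n x n) and R (n x m) with A = Q R
    n, m = A.shape
    Q = np.eye(n)
    R = A.astype(float)
    for k in range(min(n, m)):
        x = R[k:, k]
        e1 = np.zeros_like(x)
        e1[0] = 1.0
        # choose the sign to avoid cancellation when forming u
        u = x + np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x) * e1
        if np.linalg.norm(u) == 0:
            continue
        u = u / np.linalg.norm(u)
        H = np.eye(n)
        H[k:, k:] -= 2.0 * np.outer(u, u)
        R = H.dot(R)   # H is symmetric and orthogonal
        Q = Q.dot(H)
    return Q, R

A = np.random.randn(4, 3)
Q, R = householder_qr(A)
print(np.linalg.norm(Q.dot(R) - A), np.linalg.norm(Q.T.dot(Q) - np.eye(4)))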

The QR algorithm

The QR algorithm was proposed independently by Kublanovskaya and Francis in 1961. Do not confuse the QR algorithm with the QR decomposition: the QR decomposition is a representation of a matrix, whereas the QR algorithm uses the QR decomposition to compute the eigenvalues!

It computes the eigenvalues and eigenvectors of a matrix as follows.

  • Initialization: $$A_0 := A$$
  • Iteration step: $$Q_k R_k = A_k, \quad A_{k+1} = R_k Q_k,$$ so we compute the QR-factorization of a matrix, and then multiply them in the opposite order.
    Good news: the spectrum of $A_k$ is the same as that of $A_0$, and $A_k$ converges to a triangular matrix!
    It is too tempting not to implement it :)

In [32]:
import numpy as np
n = 4
a = [[1.0/(i + j + 0.5) for i in range(n)] for j in range(n)]
a = np.array(a)
niters = 10
for k in range(niters):
    q, rmat = np.linalg.qr(a)
    a = rmat.dot(q)
    print('current a:')
    print(a)


current a:
[[  2.40047183e+00   1.43485636e-01   4.99605047e-03  -5.56291523e-05]
 [  1.43485636e-01   3.59286592e-01   1.60534687e-02  -1.94225770e-04]
 [  4.99605047e-03   1.60534687e-02   1.60692682e-02  -2.80885670e-04]
 [ -5.56291523e-05  -1.94225770e-04  -2.80885670e-04   2.40684296e-04]]
current a:
[[  2.41031150e+00   2.09437993e-02   3.18722176e-05   5.45279213e-09]
 [  2.09437993e-02   3.50196113e-01   6.87806955e-04   1.28079871e-07]
 [  3.18722176e-05   6.87806955e-04   1.53250849e-02   4.17732654e-06]
 [  5.45279205e-09   1.28079871e-07   4.17732654e-06   2.35678648e-04]]
current a:
[[  2.41051991e+00   3.04114601e-03   2.02621961e-07  -5.33227597e-13]
 [  3.04114601e-03   3.49989111e-01   3.00995528e-05  -8.62074652e-11]
 [  2.02621961e-07   3.00995528e-05   1.53236760e-02  -6.42429491e-08]
 [ -5.33147641e-13  -8.62074827e-11  -6.42429491e-08   2.35677492e-04]]
current a:
[[  2.41052431e+00   4.41545673e-04   1.28806658e-09   1.32059800e-16]
 [  4.41545673e-04   3.49984720e-01   1.31786000e-06   5.80333420e-14]
 [  1.28806664e-09   1.31786000e-06   1.53236733e-02   9.88053919e-10]
 [  5.21260158e-17   5.80510077e-14   9.88053906e-10   2.35677492e-04]]
current a:
[[  2.41052440e+00   6.41081268e-05   8.18816607e-12  -7.99356474e-17]
 [  6.41081268e-05   3.49984627e-01   5.77009674e-08  -2.14109052e-17]
 [  8.18822358e-12   5.77009674e-08   1.53236733e-02  -1.51962427e-11]
 [ -5.09637195e-21  -3.90911829e-17  -1.51962302e-11   2.35677492e-04]]
current a:
[[  2.41052440e+00   9.30787458e-06   5.19949264e-14   7.99300814e-17]
 [  9.30787458e-06   3.49984626e-01   2.52637036e-09  -1.76560777e-17]
 [  5.20524342e-14   2.52637031e-09   1.53236733e-02   2.33729939e-13]
 [  4.98273388e-25   2.63237618e-20   2.33717423e-13   2.35677492e-04]]
current a:
[[  2.41052440e+00   1.35141258e-06   2.73389069e-16  -7.99300126e-17]
 [  1.35141258e-06   3.49984625e-01   1.10614263e-10   1.76826923e-17]
 [  3.30896669e-16   1.10614211e-10   1.53236733e-02  -3.60708065e-15]
 [ -4.87162969e-29  -1.77262591e-23  -3.59456477e-15   2.35677492e-04]]
current a:
[[  2.41052440e+00   1.96211922e-07  -5.54040645e-17   7.99300027e-17]
 [  1.96211922e-07   3.49984625e-01   4.84316731e-12  -1.76827548e-17]
 [  2.10350596e-18   4.84311567e-12   1.53236733e-02   6.78001457e-17]
 [  4.76300288e-33   1.19367537e-26   5.52842647e-17   2.35677492e-04]]
current a:
[[  2.41052440e+00   2.84880569e-08  -5.74941943e-17  -7.99300013e-17]
 [  2.84880568e-08   3.49984625e-01   2.12101875e-13   1.76827614e-17]
 [  1.33719609e-20   2.12050235e-13   1.53236733e-02  -1.33661508e-17]
 [ -4.65679822e-37  -8.03813647e-30  -8.50269816e-19   2.35677492e-04]]
current a:
[[  2.41052440e+00   4.13618795e-09  -5.75074807e-17   7.99300011e-17]
 [  4.13618792e-09   3.49984625e-01   9.33601458e-15  -1.76827623e-17]
 [  8.50053870e-23   9.28437503e-15   1.53236733e-02   1.25289581e-17]
 [  4.55296169e-41   5.41283161e-33   1.30771163e-20   2.35677492e-04]]

Take home message

  • LU decomposition and Cholesky factorization: complexity is $\mathcal{O}(n^3)$; used to solve dense linear systems
  • QR decomposition: finds an orthogonal basis; the basic building block for the computation of eigenvalues and eigenvectors

Next time

  • How do we compute LU factorizations for large-scale problems?
  • And: the SVD (with some examples)

In [51]:
import scipy.linalg
n = 100
a = np.random.randn(n, n)
%timeit np.linalg.qr(a)
%timeit scipy.linalg.lu(a)


1000 loops, best of 3: 777 µs per loop
1000 loops, best of 3: 218 µs per loop

Questions?
