Instead of solving a linear system, we can minimize the residual:
$$R(x) = \Vert A x - f \Vert_2.$$The first-order optimality condition for this functional gives the normal equations
$$A^* A x = A^* f,$$whose matrix has the squared condition number of $A$, so direct minimization of the residual by standard optimization methods is rarely used.
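The squaring of the condition number is easy to observe numerically (a minimal sketch with a random illustrative matrix):
In [ ]:
import numpy as np

np.random.seed(0)
A = np.random.randn(100, 100)            # illustrative random matrix
# In the 2-norm, cond(A^* A) equals cond(A)^2
print(np.linalg.cond(A), np.linalg.cond(A.T @ A))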
For the symmetric positive definite case there is a much simpler functional.
Let $A = A^* > 0$, then the following functional
$$\Phi(x) = (Ax, x) - 2(f, x)$$is strictly convex, and its global optimum satisfies
$$A x_* = f.$$Indeed,
$$\delta \Phi = \delta (Ax, x) - 2(f, \delta x) = (Ax, \delta x) + (A \delta x, x) - 2(f, \delta x) = ((A + A^{*})x, \delta x) - 2(f, \delta x) = 2(A x - f, \delta x) = (\nabla \Phi, \delta x),$$where we used $A = A^*$ in the next-to-last equality.
Thus, $$\nabla \Phi = 2(Ax - f),$$so gradient descent on $\Phi$ is exactly the simple (Richardson) iteration, and the stationary point $\nabla \Phi = 0$ yields $$A x_* = f.$$
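As a small sanity check (a minimal sketch with an illustrative SPD matrix and a conservative hand-picked step size), gradient descent on $\Phi$ indeed converges to the solution of $Ax = f$:
In [ ]:
import numpy as np

n = 10
# Illustrative SPD matrix: tridiagonal 1D Laplacian
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)

x = np.zeros(n)
tau = 1.0 / np.linalg.norm(A, 2)         # step size below 2 / lambda_max
for _ in range(5000):
    x = x - tau * (A @ x - f)            # gradient step: nabla Phi / 2 = A x - f
print(np.linalg.norm(A @ x - f))         # residual becomes small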
We assume that we can only multiply the matrix by a vector as a black box, in a fast way (say, in $\mathcal{O}(N)$ operations).
Nothing else!
The most general situation: we compute
$$y_k = A x_k, \quad k = 1, \ldots, M$$for some input vectors $x_k$, and then we have a linear subspace spanned by these $M$ vectors $y_k$.
Given a linear $M$-dimensional subspace, we want to find an approximate solution of
$$A x \approx f, \quad x = \sum_{k=1}^M \widehat x_k y_k,$$where $\widehat x$ is the vector of coefficients.
In the symmetric positive definite case we need to minimize
$$(Ax, x) - 2(f, x)$$subject to $$x = Y \widehat x,$$
where $Y=[y_1,\dots,y_M]$ is $n \times M$ and vector $\widehat x$ has length $M$.
Using the representation of $x$, we have the following minimization for $\widehat x$:
$$\widehat{\Phi}(\widehat x) = (A Y \widehat x, Y \widehat x) - 2(f, Y \widehat x) = (Y^* A Y \widehat x, \widehat x) - 2(Y^* f, \widehat x).$$Note that this is the same type of functional, but for the Galerkin projection of $A$; its minimizer satisfies
$$Y^* A Y \widehat x = Y^* f,$$which is an $M \times M$ linear system with a symmetric positive definite matrix, provided $Y$ has full column rank.
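A minimal sketch of this projection step (the function name and the dense solve are illustrative; $Y$ is assumed to have full column rank):
In [ ]:
import numpy as np

def galerkin_solve(A, f, Y):
    """Approximate solution of A x = f in the column span of Y (illustrative sketch)."""
    A_small = Y.conj().T @ (A @ Y)       # M x M Galerkin projection Y^* A Y
    rhs_small = Y.conj().T @ f           # projected right-hand side Y^* f
    xhat = np.linalg.solve(A_small, rhs_small)
    return Y @ xhat                      # lift back: x = Y xhat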
Instead of multiplying $A$ by arbitrary vectors $x_k$, in the Krylov subspace approach we generate the whole subspace from a single vector $f$:
$$y_0\equiv k_0 = f, \quad y_1\equiv k_1 = A f, \quad y_2\equiv k_2 = A^2 f, \ldots, \quad y_{M-1}\equiv k_{M-1} = A^{M-1} f.$$This gives the Krylov subspace
$$K_M(A, f) = \mathrm{Span}(f, Af, \ldots, A^{M-1} f).$$It is known to be a quasi-optimal subspace when only the matrix-by-vector product is available.
The natural basis in the Krylov subspace is very ill-conditioned, since
$$k_i = A^i f \approx \lambda_\max^i v \quad \text{for large } i,$$where $v$ is the eigenvector corresponding to the maximal eigenvalue of $A$,
i.e. the vectors $k_i$ become more and more collinear.
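The growth of the condition number of the natural (power) basis is easy to observe numerically (a small sketch with an illustrative SPD matrix; even normalizing each vector does not help):
In [ ]:
import numpy as np

np.random.seed(0)
n = 100
B = np.random.randn(n, n)
A = B @ B.T + n * np.eye(n)              # illustrative SPD matrix
f = np.random.randn(n)

K = [f / np.linalg.norm(f)]
for i in range(9):
    v = A @ K[-1]
    K.append(v / np.linalg.norm(v))      # normalized power basis k_i = A^i f
for m in range(2, 11):
    print(m, np.linalg.cond(np.vstack(K[:m]).T))   # grows rapidly with m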
Solution: compute an orthogonal basis of the Krylov subspace.
To ensure stability, we first orthogonalize the vectors of the Krylov subspace using the Gram-Schmidt orthogonalization process (i.e., a QR factorization).
$$K_j = \begin{bmatrix} f & Af & A^2 f & \ldots & A^{j-1} f\end{bmatrix} = Q_j R_j, $$and the solution will be approximated as $$x \approx Q_j \widehat{x}_j.$$
The orthogonal basis $Q_j$ satisfies an important recurrence relation (called the Arnoldi relation)
$$A Q_j = Q_j H_j + h_{j, j-1} q_j e^{\top}_{j-1},$$where $H_j$ is upper Hessenberg, and $q_j$ is the last column of $Q_{j+1} = [q_0,\dots,q_j]$.
Let us prove it (given for $j = 3$ for simplicity):
$$A \begin{bmatrix} k_0 & k_1 & k_2 \end{bmatrix} = \begin{bmatrix} k_1 & k_2 & k_3 \end{bmatrix} = \begin{bmatrix} k_0 & k_1 & k_2 \end{bmatrix} \begin{bmatrix} 0 & 0 & \alpha_0 \\ 1 & 0 & \alpha_1 \\ 0 & 1 & \alpha_2 \\ \end{bmatrix} + \begin{bmatrix} 0 & 0 & k_3 - \alpha_0 k_0 - \alpha_1 k_1 - \alpha_2 k_2 \end{bmatrix}, $$where $\alpha_s$ will be selected later. Denote $\widehat{k}_3 = k_3 - \alpha_0 k_0 - \alpha_1 k_1 - \alpha_2 k_2$.
In the matrix form,
$$A K_3 = K_3 Z + \widehat{k}_3 e^{\top}_2,$$ where $Z$ is the lower shift matrix whose last column is $(\alpha_0,\alpha_1,\alpha_2)^\top$, and $e_2$ is the last column of the $3 \times 3$ identity matrix.
Let $$K_3 = Q_3 R_3$$ be the QR-factorization. Then,
$$A Q_3 R_3 = Q_3 R_3 Z + \widehat{k}_3 e^{\top}_2,$$
$$ A Q_3 = Q_3 R_3 Z R_3^{-1} + \widehat{k}_3 e^{\top}_2 R_3^{-1}.$$
Note that
$$e^{\top}_2 R_3^{-1} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \end{bmatrix} = \gamma e^{\top}_2,$$and
$$R_3 Z R_3^{-1} = \begin{bmatrix} * & * & * \\ * & * & * \\ 0 & * & * \end{bmatrix};$$
in the general case it will be an upper Hessenberg matrix $H$, i.e. a matrix
such that $$H_{ij} = 0, \quad \mbox{if } i > j + 1.$$
Let $Q_j$ be the orthogonal basis of the Krylov subspace; then we have almost the Arnoldi relation
$$A Q_j = Q_j H_j + \widehat{k}_j e^{\top}_{j-1},$$where $H_j$ is an upper Hessenberg matrix, and
$$\widehat{k}_j = k_j - \sum_{s=0}^{j-1} \alpha_s k_s.$$We select the coefficients $\alpha_s$ in such a way that
$$Q^*_j \widehat{k}_j = 0.$$Then $\widehat{k}_j = h_{j, j-1} q_j$, where $q_j$ is the last column of $Q_{j+1}$, and we recover the Arnoldi relation exactly.
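A small sketch of the Arnoldi process (plain Gram-Schmidt against all previous basis vectors, with illustrative sizes), checking the relation $A Q_j = Q_j H_j + h_{j,j-1} q_j e^{\top}_{j-1}$ numerically:
In [ ]:
import numpy as np

np.random.seed(0)
n, j = 50, 10
A = np.random.randn(n, n)                # Arnoldi works for a general (non-symmetric) matrix
f = np.random.randn(n)

Q = np.zeros((n, j + 1))
H = np.zeros((j + 1, j))
Q[:, 0] = f / np.linalg.norm(f)
for m in range(j):
    w = A @ Q[:, m]
    for i in range(m + 1):               # orthogonalize against all previous vectors
        H[i, m] = Q[:, i] @ w
        w = w - H[i, m] * Q[:, i]
    H[m + 1, m] = np.linalg.norm(w)
    Q[:, m + 1] = w / H[m + 1, m]

# A Q_j - (Q_j H_j + h_{j,j-1} q_j e_{j-1}^T) = A Q[:, :j] - Q @ H, should be ~ 0
print(np.linalg.norm(A @ Q[:, :j] - Q @ H))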
If $A = A^*$, then $H_j = Q^*_j A Q_j$ is both upper Hessenberg and Hermitian, hence tridiagonal; we denote it by $T_j$ (this is the Lanczos relation).
In order to get $q_j$, we only need to compute the last column of $A Q_j - Q_j T_j$:
$$T_{j, j-1} q_j = (A Q_j - Q_j T_j) e_{j-1} = A q_{j-1} - T_{j-1, j-1} q_{j-1} - T_{j-2, j-1} q_{j-2}. $$The coefficients $\alpha_j = T_{j-1, j-1}$ and $\beta_j = T_{j-2, j-1}$
can be recovered from the orthogonality constraints
$(q_j, q_{j-1}) = 0, \quad (q_j, q_{j-2}) = 0.$
All the other constraints will be satisfied automatically!!
And we only need to store two vectors to get the new one.
We can now get from the Lanczos recurrence to the famous conjugate gradient method.
We have for $A = A^* > 0$
$$A Q_j = Q_j T_j + T_{j, j-1} q_j e^{\top}_{j-1}.$$The approximate solution of $Ax \approx f$ with $x_j = Q_j \widehat{x}_j$ can be found by solving the small system
$$Q^*_j A Q_j \widehat x_j = T_j \widehat{x}_j = Q^*_j f .$$Since $f$ is the first vector of the Krylov subspace (recall that the first column of $Q_j$ is $f / \Vert f \Vert_2$), we have $$Q^*_j f = \Vert f \Vert_2 e_0 = \gamma e_0.$$
We have a tridiagonal system of equations for $\widehat x$:
$$T_j \widehat{x}_j = \gamma e_0$$and $x_j = Q_j \widehat{x}_j$.
Since $A$ is positive definite, $T_j$ is also positive definite, and it admits an LU decomposition
$T_j = L_j U_j$, where $L_j$ is a lower bidiagonal matrix with ones on the diagonal and $U_j$ is an upper bidiagonal matrix.
We need to determine the subdiagonal of $L_j$ (with elements $c_1, \ldots, c_{j-1}$), the main diagonal of $U_j$ (with elements $d_0, \ldots, d_{j-1}$) and the superdiagonal of $U_j$ (with elements $b_1, \ldots, b_{j-1}$, which coincides with the superdiagonal of $T_j$).
They have convenient recurrences:
$$c_i = b_i/d_{i-1}, \quad d_i = \begin{cases} a_0, & \mbox{if } i = 0, \\ a_i - c_i b_i, & \mbox{if } i > 0, \end{cases}$$where $a_0, \ldots, a_{j-1}$ are the diagonal entries of $T_j$. For the solution we have
$$x_j = Q_j T^{-1}_j \gamma e_0 = \gamma Q_j (L_j U_j)^{-1} e_0 = \gamma Q_j U^{-1}_j L^{-1}_j e_0.$$We introduce two new quantities:
$$P_j = Q_j U^{-1}_j, \quad z_j = \gamma L^{-1}_j e_0.$$Due to the recurrence relations, we have
$$P_j = \begin{bmatrix} P_{j-1} & p_j \end{bmatrix}, $$and
$$z_j = \begin{bmatrix} z_{j-1} \\ \xi_{j} \end{bmatrix}.$$For $p_j$ and $\xi_j$ we have short-term recurrence relations (due to bidiagonal structure)
$$p_j = \frac{1}{d_j}\left(q_j - b_j p_{j-1} \right), \quad \xi_j = -c_j \xi_{j-1}.$$Thus, we arrive at a short-term recurrence for $x_j$:
$$x_j = P_j z_j = P_{j-1} z_{j-1} + \xi_j p_j = x_{j-1} + \xi_j p_j,$$where the vectors $q_j$ are found from the Lanczos relation above (a small check of the LU recurrences is given below).
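A small sketch checking the bidiagonal LU recurrences against a dense factorization of an illustrative symmetric tridiagonal matrix (here $a_i$ and $b_i$ denote the diagonal and off-diagonal entries of $T_j$):
In [ ]:
import numpy as np

np.random.seed(0)
j = 6
a = 2.0 + np.random.rand(j)              # diagonal of T_j (diagonally dominant, hence positive definite)
b = np.random.rand(j - 1)                # off-diagonal of T_j
T = np.diag(a) + np.diag(b, 1) + np.diag(b, -1)

d = np.zeros(j)                          # main diagonal of U_j
c = np.zeros(j - 1)                      # subdiagonal of L_j
d[0] = a[0]
for i in range(1, j):
    c[i - 1] = b[i - 1] / d[i - 1]       # c_i = b_i / d_{i-1}
    d[i] = a[i] - c[i - 1] * b[i - 1]    # d_i = a_i - c_i b_i

L = np.eye(j) + np.diag(c, -1)           # unit lower bidiagonal
U = np.diag(d) + np.diag(b, 1)           # upper bidiagonal (superdiagonal of U_j equals b_i)
print(np.linalg.norm(L @ U - T))         # should be ~ 0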
The resulting method for solving linear systems is called the direct Lanczos method; it is closely related to the conjugate gradient method.
In the direct Lanczos method we store
$$p_{j-1}, \quad q_j, \quad x_{j-1}$$to get the new iterate $x_j$.
The main problem is with $q_j$: we have the three-term recurrence, but in floating-point arithmetic the orthogonality can be lost, leading to numerical errors.
Let us do some demo.
In [9]:
import numpy as np
import scipy as sp
import scipy.sparse

n = 128
ex = np.ones(n)
# Symmetric tridiagonal (1D Laplacian-like) matrix
A = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr')
rhs = np.ones(n)
nit = 64

# Lanczos-type process: each new vector is orthogonalized
# only against the two previous ones (three-term recurrence)
q1 = rhs / np.linalg.norm(rhs)
q2 = A.dot(q1)
q2 = q2 - np.dot(q2, q1) * q1
q2 = q2 / np.linalg.norm(q2)
qall = [q1, q2]
for i in range(nit):
    qnew = A.dot(qall[-1])
    qnew = qnew - np.dot(qnew, qall[-1]) * qall[-1]
    qnew = qnew - np.dot(qnew, qall[-2]) * qall[-2]
    qnew = qnew / np.linalg.norm(qnew)
    qall.append(qnew)

# In exact arithmetic Q^* Q = I; in floating point the orthogonality is gradually lost
qall_mat = np.vstack(qall).T
print(np.linalg.norm(qall_mat.T.dot(qall_mat) - np.eye(qall_mat.shape[1])))
Instead of $q_j$ (the last vector in the modified Gram-Schmidt process), it is more convenient to work with the residual
$$r_j = f - A x_j.$$The resulting recurrence has the form
$x_j = x_{j-1} + \alpha_{j-1} p_{j-1}$
$r_j = r_{j-1} - \alpha_{j-1} A p_{j-1}$
$p_j = r_j + \beta_j p_{j-1}$.
Hence the name conjugate gradient: the new direction $p_j$ is the gradient (residual) $r_j$ corrected so that the search directions are mutually $A$-conjugate.
We have orthogonality of residuals (check!):
$$(r_i, r_j) = 0, \quad i \ne j$$and A-orthogonality of conjugate directions (check!):
$$ (A p_i, p_j) = 0, \quad i \ne j,$$both of which can be verified from the definitions.
The equations for $\alpha_j$ and $\beta_j$ can be now defined explicitly from these two properties.
We have $(r_{j}, r_{j-1}) = 0 = (r_{j-1} - \alpha_{j-1} A p_{j-1}, r_{j-1})$,
thus
$$\alpha_{j-1} = \frac{(r_{j-1}, r_{j-1})}{(A p_{j-1}, r_{j-1})} = \frac{(r_{j-1}, r_{j-1})}{(A p_{j-1}, p_{j-1})},$$where the last equality uses $r_{j-1} = p_{j-1} - \beta_{j-1} p_{j-2}$ and the $A$-orthogonality of the directions. In a similar way, we have
$$\beta_{j} = \frac{(r_j, r_j)}{(r_{j-1}, r_{j-1})}.$$Recall that
$x_j = x_{j-1} + \alpha_{j-1} p_{j-1}$
$r_j = r_{j-1} - \alpha_{j-1} A p_{j-1}$
$p_j = r_j + \beta_j p_{j-1}$.
Only one matrix-by-vector product per iteration.
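Putting the pieces together, here is a compact sketch of the resulting iteration (a plain textbook-style CG with an illustrative SPD matrix, not a production implementation):
In [ ]:
import numpy as np

def cg(matvec, f, niter):
    """Conjugate gradient for A x = f with A = A^* > 0, given only the map x -> A x."""
    x = np.zeros_like(f)
    r = f.copy()                         # r_0 = f - A x_0 with x_0 = 0
    p = r.copy()
    rr = r @ r
    for _ in range(niter):
        Ap = matvec(p)                   # the only matrix-by-vector product per iteration
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
    return x

n = 128
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)    # illustrative SPD matrix
f = np.ones(n)
x = cg(lambda v: A @ v, f, 2 * n)
print(np.linalg.norm(A @ x - f))         # residual after 2n iterations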
More details here: https://www.siam.org/meetings/la09/talks/oleary.pdf
When Hestenes worked on conjugate bases in 1936, he was advised by a Harvard professor that the topic was too obvious for publication.
We need to store 3 vectors.
Since it generates an $A$-orthogonal sequence $p_1, \ldots, p_N$, after $N$ steps it should stop (i.e., $p_{N+1} = 0$).
In practice it does not have this property in finite precision, thus after its invention in 1952 by Hestenes and Stiefel it was labeled unstable.
In fact, it is a brilliant iterative method.
The CG method computes the $x_k$ that minimizes the energy functional over the Krylov subspace $K_k(A, f)$, i.e. $x_k = p(A)f$, where $p$ is a polynomial of degree at most $k-1$, so
$$\Vert x_k - x_* \Vert_A = \inf\limits_{p} \Vert \left(p(A) - A^{-1}\right) f \Vert_A. $$Using eigendecomposition of $A$ we have
$$A = U \Lambda U^*, \quad g = U^* f,$$and
$$\Vert x_k - x_* \Vert^2_A = \inf_p \Vert \left(p(\Lambda) - \Lambda^{-1}\right) g \Vert_\Lambda^2 = \inf_p \sum_{i=1}^n \frac{(\lambda_i p(\lambda_i) - 1)^2 g^2_i}{\lambda_i} = \inf_{q,\, q(0) = 1} \sum_{i=1}^n \frac{q(\lambda_i)^2 g^2_i}{\lambda_i},$$where $q(\lambda) = 1 - \lambda p(\lambda)$.
Selection of the optimal $q$ depends on the eigenvalue distribution.
We have $$\Vert x_k - x_* \Vert^2_A \leq \sum_{i=1}^n \frac{g^2_i}{\lambda_i} \inf_{q, q(0)=1} \max_{j} q({\lambda_j})^2.$$
The first factor is just $$\sum_{i=1}^n \frac{g^2_i}{\lambda_i} = (A^{-1} f, f) = \Vert x_* \Vert^2_A.$$
This gives the relative error bound
$$\frac{\Vert x_k - x_* \Vert_A }{\Vert x_* \Vert_A} \leq \inf_{q, q(0)=1} \max_{j} |q({\lambda_j})|.$$So if the matrix has only two distinct eigenvalues, there exists a polynomial $q$ of degree 2 with $q(0) = 1$ and $q({\lambda_1}) = q({\lambda_2}) = 0$, and in this case CG converges in 2 iterations.
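This two-eigenvalue case is easy to verify numerically (a small sketch: two CG steps written out by hand, for an illustrative diagonal matrix with only two distinct eigenvalues):
In [ ]:
import numpy as np

np.random.seed(0)
n = 100
lam = np.where(np.arange(n) < n // 2, 1.0, 10.0)   # only two distinct eigenvalues: 1 and 10
A = np.diag(lam)
f = np.random.randn(n)

x = np.zeros(n); r = f.copy(); p = r.copy()
for _ in range(2):                                 # exactly two CG steps
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p
    r = r_new
print(np.linalg.norm(A @ x - f))                   # residual at machine-precision level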
If the eigenvalues are clustered and there are $l$ outliers, then after the first $\mathcal{O}(l)$ iterations CG converges as if there were no outliers (and hence with a smaller effective condition number).
The intuition behind this fact is that after $\mathcal{O}(l)$ iterations the polynomial has degree larger than $l$ and can therefore zero out the $l$ outliers.
Let us derive another useful upper bound on the convergence. Since
$$ \inf_{q, q(0)=1} \max_{j} |q({\lambda_j})| \leq \inf_{q, q(0)=1} \max_{\lambda\in[\lambda_\min,\lambda_\max]} |q({\lambda})|, $$and the right-hand side is exactly the quantity that appears in Chebyshev acceleration, the same upper convergence bound holds:
$$\frac{\Vert x_k - x_* \Vert_A }{\Vert x_* \Vert_A} \leq \gamma \left( \frac{\sqrt{\mathrm{cond}(A)}-1}{\sqrt{\mathrm{cond}(A)}+1}\right)^k.$$So far we have discussed symmetric positive definite systems. What happens if $A$ is non-symmetric?
We can still orthogonalize the Krylov subspace using Arnoldi process, and get
$$A Q_j = Q_j H_j + h_{j,j-1}q_j e^{\top}_{j-1}.$$Let us rewrite the latter expression as
$$ A Q_j = Q_j H_j + h_{j,j-1}q_j e^{\top}_{j-1} = Q_{j+1} \widetilde H_j, \quad \widetilde H_j = \begin{bmatrix} h_{0,0} & h_{0,1} & \dots & h_{0,j-2} & h_{0,j-1} \\ h_{1,0} & h_{1,1} & \dots & h_{1,j-2} & h_{1,j-1} \\ 0 & h_{2,1} & \dots & h_{2,j-2} & h_{2,j-1} \\ 0 & 0 & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & h_{j-1,j-2} & h_{j-1,j-1} \\ 0 & 0 & \dots & 0 & h_{j,j-1}\end{bmatrix}.$$Then, if we need to minimize the residual over the Krylov subspace, we have
$$x_j = Q_j \widehat{x}_j, $$and $\widehat{x}_j$ has to be selected from the minimization problem
$$ \Vert A Q_j \widehat{x_j} - f \Vert_2 \rightarrow \min.$$Using the Arnoldi recursion, we have
$$ \Vert Q_{j+1} \widetilde H_j \widehat{x}_j - f \Vert_2 \rightarrow \min.$$Since the columns of $Q_{j+1}$ are orthonormal and $f$ lies in their span, the 2-norm is preserved and we get
$$ \Vert \widetilde H_j \widehat{x_j} - \gamma e_0 \Vert_2 \rightarrow \min,$$where we have used that $Q^*_{j+1} f = \gamma e_0.$
This is just a linear least-squares problem with $(j+1)$ equations and $j$ unknowns.
The matrix $\widetilde H_j$ is upper Hessenberg, so its QR factorization can be computed very cheaply (e.g., with Givens rotations).
This gives $\widehat{x}_j$. The resulting method is called GMRES (generalized minimal residual).
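A minimal GMRES sketch along these lines, reusing the Arnoldi loop from the earlier sketch and solving the small least-squares problem with a dense solver (for real applications one would rather use, e.g., scipy.sparse.linalg.gmres):
In [ ]:
import numpy as np

def gmres_sketch(A, f, j):
    """Plain GMRES: Arnoldi + dense least squares with the small Hessenberg matrix."""
    n = f.shape[0]
    Q = np.zeros((n, j + 1))
    H = np.zeros((j + 1, j))             # this is \tilde H_j
    gamma = np.linalg.norm(f)
    Q[:, 0] = f / gamma
    for m in range(j):
        w = A @ Q[:, m]
        for i in range(m + 1):
            H[i, m] = Q[:, i] @ w
            w = w - H[i, m] * Q[:, i]
        H[m + 1, m] = np.linalg.norm(w)
        Q[:, m + 1] = w / H[m + 1, m]
    rhs = np.zeros(j + 1)
    rhs[0] = gamma                       # gamma e_0
    xhat = np.linalg.lstsq(H, rhs, rcond=None)[0]
    return Q[:, :j] @ xhat

np.random.seed(0)
n = 200
A = 2 * np.eye(n) + np.random.randn(n, n) / np.sqrt(n)   # illustrative non-symmetric matrix
f = np.random.randn(n)
x = gmres_sketch(A, f, 50)
print(np.linalg.norm(A @ x - f) / np.linalg.norm(f))     # relative residual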