Lecture 10: Iterative methods

Syllabus

Week 1: Python intro
Week 2: Matrices, vectors, norms, ranks
Week 3: Linear systems, eigenvectors, eigenvalues
Week 4: Singular value decomposition + test + homework seminar
Week 5: Sparse & structured matrices
Week 6: Iterative methods, preconditioners, matrix functions

Recap of the previous lecture

  • Shift-invariant structure: Toeplitz matrices
  • Circulant matrices and FFT
  • What is an inverse of a Toeplitz matrix?
  • Displacement rank (and Gohberg-Semencul formula)
  • Multilevel matrices

Today's lecture

Today we will talk about iterative methods. The list of topics (we may not cover all of them) includes:

  • Concept of iterative methods
  • Richardson iteration, Chebyshev acceleration
  • Jacobi/Gauss-Seidel methods
  • Idea of Krylov methods
  • Conjugate gradient methods for symmetric positive definite matrices
  • Generalized minimal residual methods
  • The Zoo of iterative methods: BiCGStab, QMR, IDR, ...
  • The idea of preconditioners
  • Jacobi/Gauss-Seidel as preconditioner, successive overrelaxation
  • Incomplete LU

Iterative methods: basic idea

The basic idea of an iterative method is simple. Suppose we have a sparse (or structured) matrix $A$ for which the matrix-by-vector product $Ax$ can be computed fast. A typical direct inversion algorithm costs $\mathcal{O}(N^3)$ operations, so maybe we can just multiply the matrix by a vector many times instead.
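For scale, here is a minimal sketch with a hypothetical tridiagonal test matrix: one sparse matrix-by-vector product costs $\mathcal{O}(\mathrm{nnz}(A))$ operations, while a dense factorization would cost $\mathcal{O}(N^3)$.

In [ ]:
import numpy as np
import scipy.sparse as sp

N = 100000
ex = np.ones(N)
# Sparse tridiagonal test matrix: nnz(A) is about 3N
A = sp.spdiags(np.vstack((ex, -2 * ex, ex)), [-1, 0, 1], N, N).tocsr()
x = np.random.randn(N)
# One matrix-by-vector product: O(nnz(A)) = O(N) operations here
y = A.dot(x)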

Richardson iteration

The simplest idea is the "simple iteration method", or Richardson iteration.

$$Ax = f,$$ $$\tau (f - Ax) = 0,$$ $$x = x + \tau (f - Ax),$$ $$x_{k+1} = x_k - \tau (Ax_k - f).$$

Convergence of the Richardson method

Let $x_*$ be the solution and introduce the error $e_k = x_{k} - x_*$. Then
$$ e_{k+1} = (I - \tau A) e_k, $$ so the iteration converges if $\Vert I - \tau A \Vert < 1$. In the symmetric positive definite case it is always possible to select such a $\tau$.

Optimal parameter choice

The optimal choice for $\tau$ for $A = A^* > 0$ is
$$ \tau = \frac{2}{\lambda_{\min} + \lambda_{\max}}. $$
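With this choice the norm of the iteration matrix, and hence the error reduction factor per step, is
$$ \Vert I - \tau A \Vert_2 = \max_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} |1 - \tau \lambda| = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\mathrm{cond}(A) - 1}{\mathrm{cond}(A) + 1}, $$
so convergence degrades as the condition number grows.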

Different time steps

Suppose we change $\tau$ every step, i.e. $$ x_{k+1} = x_k - \tau_k (A x_k - f). $$

Then, $$e_{k+1} = (I - \tau_k A) e_k = (I - \tau_k A) (I - \tau_{k-1} A) e_{k-1} = \ldots = p(A) e_0, $$ where $p(A)$ is a matrix polynomial (the simplest matrix function),
$$ p(A) = (I - \tau_k A) \ldots (I - \tau_0 A), $$ and $p(0) = 1$.

Chebyshev polynomials

For $A = A^*$ the problem reduces to finding a polynomial $p(w)$ such that $p(0) = 1$ and $p$ is as close to $0$ as possible on $[\lambda_{\min}, \lambda_{\max}]$.
The answer on $[-1, 1]$ is given by Chebyshev polynomials, and on a general interval by their scaled and shifted version.

Plot some Chebyshev polynomials


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-1, 1, 128)
p = np.polynomial.Chebyshev((0, 0, 0, 0, 0, 0, 0, 1), (-1, 1))  # a Chebyshev series on [-1, 1]; a prototype of the "chebfun" system in MATLAB
plt.plot(x, p(x))


Out[1]:
[<matplotlib.lines.Line2D at 0x10e33a310>]

In [5]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg as spla

n = 20
ex = np.ones(n)
# 1D finite-difference Laplacian (a negative definite tridiagonal matrix)
lp1 = sp.sparse.spdiags(np.vstack((ex, -2 * ex, ex)), [-1, 0, 1], n, n, 'csr')
rhs = np.ones(n)

# Estimate the extreme eigenvalues to choose the optimal parameter
ev1, vec = spla.eigs(lp1, k=2, which='LR')
ev2, vec = spla.eigs(lp1, k=2, which='SR')
lam_max = ev1[0].real
lam_min = ev2[0].real

tau_opt = 2.0 / (lam_max + lam_min)

niters = 1000
x = np.zeros(n)
res_all = []
for i in range(niters):
    rr = lp1.dot(x) - rhs
    x = x - tau_opt * rr
    res_all.append(np.linalg.norm(rr))

# Convergence of the ordinary Richardson iteration (with the optimal parameter)
plt.semilogy(res_all)


Out[5]:
[<matplotlib.lines.Line2D at 0x10fbe9d90>]
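For comparison, here is a minimal sketch of Chebyshev acceleration (not part of the original demo): the single parameter is replaced by a cycle of parameters $\tau_k$ whose reciprocals are the roots of the Chebyshev polynomial mapped to $[\lambda_{\min}, \lambda_{\max}]$. We use a symmetric positive definite 1D Laplacian whose spectral bounds are known analytically.

In [ ]:
import numpy as np
import scipy.sparse as sp
import matplotlib.pyplot as plt

n = 20
ex = np.ones(n)
# SPD 1D Laplacian: eigenvalues are 4 sin^2(k pi / (2(n+1))), k = 1..n
A = sp.spdiags(np.vstack((-ex, 2 * ex, -ex)), [-1, 0, 1], n, n).tocsr()
f = np.ones(n)
lam_min = 4 * np.sin(np.pi / (2 * (n + 1))) ** 2
lam_max = 4 * np.sin(n * np.pi / (2 * (n + 1))) ** 2

# Chebyshev time steps: reciprocals of the roots of the degree-m Chebyshev
# polynomial mapped from [-1, 1] to [lam_min, lam_max]
m = 16
roots = np.cos(np.pi * (2 * np.arange(m) + 1) / (2 * m))
taus = 1.0 / (0.5 * (lam_max + lam_min) + 0.5 * (lam_max - lam_min) * roots)

x = np.zeros(n)
res_cheb = []
for i in range(10 * m):   # cycle through the parameters (no reordering for roundoff here)
    rr = A.dot(x) - f
    x = x - taus[i % m] * rr
    res_cheb.append(np.linalg.norm(rr))
plt.semilogy(res_cheb)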

Chebyshev and beyond

The Chebyshev method is efficient when you know that the spectrum lies on a single interval.
But what happens if the spectrum does not have such a nice structure?
For example, if the spectrum is contained in two intervals, the solution can be
written in terms of Zolotarev polynomials and elliptic functions, and for three intervals
you have to go to hyperelliptic functions.

Krylov subspaces

The Chebyshev method produces an approximation lying in the Krylov subspace
$$ \mathcal{K}_n(A, f) = \mathrm{Span}(f, Af, A^2 f, \ldots, A^{n-1} f). $$ Then we can minimize the residual

$$\Vert f - Ax \Vert$$

over all $x \in \mathcal{K}_n(A, f)$, where the norm can be chosen in different (and even strange) ways.
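As a minimal (and deliberately naive) sketch of this idea, one can form the Krylov basis explicitly and minimize the residual by least squares; the monomial basis $f, Af, A^2 f, \ldots$ quickly becomes ill-conditioned, which is exactly why practical Krylov methods orthogonalize it.

In [ ]:
import numpy as np

# Small SPD test matrix (1D Laplacian), purely for illustration
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)

# Explicit (monomial) Krylov basis K = [f, Af, ..., A^{m-1} f]
m = 10
K = np.zeros((n, m))
K[:, 0] = f
for j in range(1, m):
    K[:, j] = A.dot(K[:, j - 1])

# Minimize ||f - A K y||_2 over y, then take x = K y
y = np.linalg.lstsq(A.dot(K), f, rcond=None)[0]
x = K.dot(y)
print(np.linalg.norm(f - A.dot(x)))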

Conjugate gradient method

The conjugate gradient (CG) method is the best method for solving symmetric positive definite systems when the only information available about the system is the matrix-by-vector product.

Idea of the CG method

For $A = A^* > 0$ we can introduce the $A$-norm as $$ \Vert x \Vert_A = \sqrt{(Ax, x)} $$ and minimize the $A$-norm of the error. Let $x_*$ be the true solution of $Ax = f$; then we want to minimize $$ \Vert x_n - x_* \Vert_A $$ over the $n$-th Krylov subspace.

CG formulas

$$x_n = x_0 + y_n, \qquad y_n = \arg\min_{y \in \mathcal{K}_n} \Vert x_0 + y - x_* \Vert_A.$$

By the Pythagorean theorem, the minimizer is characterized by

$$(x_n, y)_A = (x_*, y)_A \quad \text{for all } y \in \mathcal{K}_n, \qquad \text{i.e.} \quad r_n = Ax_n - f \perp \mathcal{K}_n.$$

CG-method: formulas

$$ \alpha_n = \frac{(r_{n-1}, r_{n-1})}{(Ap_n, p_n)}, $$
$$ x_n = x_{n-1} + \alpha_n p_n, $$
$$ r_n = r_{n-1} - \alpha_n A p_n, $$
$$ \beta_n = \frac{(r_n, r_n)}{(r_{n-1}, r_{n-1})}, $$
$$ p_{n+1} = r_n + \beta_n p_n. $$

The vectors $p_i$ constitute an $A$-orthogonal basis in $\mathcal{K}_n$.
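A minimal NumPy sketch of these recurrences (assuming a symmetric positive definite $A$; a plain illustration, not a tuned implementation):

In [ ]:
import numpy as np

def cg(A, f, niters=100, tol=1e-10):
    """Conjugate gradient method following the recurrences above."""
    x = np.zeros_like(f)
    r = f - A.dot(x)          # residual
    p = r.copy()              # first search direction
    for _ in range(niters):
        Ap = A.dot(p)
        alpha = r.dot(r) / p.dot(Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = r_new.dot(r_new) / r.dot(r)
        p = r_new + beta * p
        r = r_new
        if np.linalg.norm(r) < tol:
            break
    return x

# Usage on a small SPD system (1D Laplacian)
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)
x = cg(A, f)
print(np.linalg.norm(A.dot(x) - f))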

CG-method: history

  1. It was proposed by Hestenes and Stiefel in 1952.
  2. In exact arithmetic it gives the exact solution in at most $n$ iterations.
  3. In floating-point arithmetic it did not, so people thought it was unstable compared to Gaussian elimination.
  4. Only decades later was it realized that it is a wonderful iterative method.

CG-method: properties

  1. Convergence estimate: $$ \Vert e_i \Vert_A \leq 2 \Big( \frac{1 - \sqrt{\frac{m}{M}}}{1 + \sqrt{\frac{m}{M}}} \Big)^i \Vert e_0 \Vert_A, $$ where $m = \lambda_{\min}(A)$ and $M = \lambda_{\max}(A)$.

  2. Clustering of eigenvalues: if there are $k$ "outliers", then in $k$ iterations they are "removed" (we can illustrate it on the board)

GMRES

For non-symmetric matrices, there are two methods that are typically used: GMRES and BiCGStab.

The GMRES idea is very simple: we construct an orthogonal basis of the Krylov subspace.
Given the basis $q_1, \ldots, q_n$, we compute the residual $r_n = Ax_n - f$ and orthogonalize it against the basis.

A Python realization is available as scipy.sparse.linalg.gmres (although quite buggy).
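A minimal sketch of the orthogonal basis construction that underlies GMRES (the Arnoldi process); this is only an illustration of the idea, not the scipy implementation:

In [ ]:
import numpy as np

def arnoldi(A, f, m):
    """Build an orthonormal basis q_1, ..., q_{m+1} of the Krylov subspace and
    the (m+1) x m Hessenberg matrix H with A Q_m = Q_{m+1} H."""
    n = len(f)
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    Q[:, 0] = f / np.linalg.norm(f)
    for j in range(m):
        w = A.dot(Q[:, j])
        for i in range(j + 1):            # orthogonalize against previous basis vectors
            H[i, j] = Q[:, i].dot(w)
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)   # a zero here would mean a "lucky breakdown"
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H

# GMRES then minimizes ||f - A x|| over x in the span of Q_m, which reduces to
# a small (m+1) x m least-squares problem with the Hessenberg matrix H.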

Disadvantage of GMRES

The main disadvantage of GMRES is that we have to store all the basis vectors, so the memory cost grows with each step.
We can do restarts (i.e. take the current residual and build a new Krylov subspace). The BiCGStab method avoids this by using "short recurrences", as in the CG method.

Idea of BiCGStab

A naive idea is to use the CG method for $A^{\top} A x = A^{\top} f$; however, the condition number is then squared, so a certain stabilization was proposed by Van der Vorst et al. BiCGStab is used a lot in practice for solving non-symmetric linear systems; each iteration requires two matrix-vector products with $A$ (and, unlike BiCG, no products with $A^{\top}$).

Now we can do a short demo


In [11]:
import numpy as np
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg
import matplotlib.pyplot as plt

n = 100
ex = np.ones(n)
# 1D Laplacian (positive definite after the sign flip)
lp1 = -sp.sparse.spdiags(np.vstack((ex, -2 * ex, ex)), [-1, 0, 1], n, n, 'csr')
rhs = np.ones(n)

res_all = []
res_all_bicg = []

def my_print(r):
    # gmres calls the callback with the (preconditioned) residual norm
    res_all.append(r)

def my_print2(x):
    # bicgstab calls the callback with the current iterate, so compute the residual norm ourselves
    res_all_bicg.append(np.linalg.norm(lp1.dot(x) - rhs))

sol = scipy.sparse.linalg.gmres(lp1, rhs, callback=my_print, restart=100)
plt.semilogy(res_all, marker='x', label='GMRES')
sol2 = scipy.sparse.linalg.bicgstab(lp1, rhs, callback=my_print2)
plt.xlabel('Iteration number')
plt.ylabel('Residual')
plt.semilogy(res_all_bicg, label='BiCGStab', marker='o')
plt.legend(loc='best')


Out[11]:
<matplotlib.legend.Legend at 0x1113e4a50>

In [8]:
scipy.sparse.linalg.gmres?

Concept of preconditioning

If $\mathrm{cond}(A)$ is large, we need to use preconditioners.

Instead of solving $Ax = f$ we solve the right-preconditioned system

$$ A P^{-1} y = f, \qquad x = P^{-1} y, $$

where systems with $P$ are easy to solve (i.e. $P$ is diagonal, triangular, circulant, ...).

Another possibility is to use a left preconditioner:
$$ P^{-1} A x = P^{-1} f.$$

Jacobi method (as preconditioner)

We can use some standard iterative methods as preconditioners. For example, the Jacobi method uses
$$P = \mathrm{diag}(A)$$ as the preconditioner. The Jacobi method also has applications in multigrid.
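A minimal sketch of the Jacobi (diagonal) preconditioner passed to scipy's gmres, on a small test matrix introduced here only for illustration:

In [ ]:
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small SPD test matrix (1D Laplacian) and right-hand side
n = 100
ex = np.ones(n)
A = sp.spdiags(np.vstack((-ex, 2 * ex, -ex)), [-1, 0, 1], n, n).tocsr()
f = np.ones(n)

# Jacobi preconditioner: P = diag(A), so applying P^{-1} is a division by the diagonal
d = A.diagonal()
M = spla.LinearOperator(A.shape, lambda z: z / d)

x, info = spla.gmres(A, f, M=M)
print(info, np.linalg.norm(A.dot(x) - f))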

Gauss-Seidel (as preconditioner)

The Gauss-Seidel method: given $A = A^{\top} > 0$ we write
$$A = L + D + L^{\top},$$ where $D$ is the diagonal of $A$ and $L$ is its strictly lower-triangular part; the Gauss-Seidel preconditioner applies $(L + D)^{-1}$.
It can also be shown that $\rho(I - (L+D)^{-1} A) < 1$, i.e. the Gauss-Seidel iteration converges for any symmetric positive definite $A$.

Successive overrelaxation (as preconditioner)

$$(D + w L)\, x_{k+1} = w b - (w U + (w-1) D)\, x_k,$$ where $U$ is the strictly upper-triangular part of $A$. The error propagates with the iteration matrix
$$T = -(D + w L)^{-1} (w U + (w-1) D),$$ and the corresponding preconditioner is $P = \frac{1}{w}(D + w L)$.

Optimal selection of $w$ is not trivial.
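A minimal dense NumPy sketch of the SOR sweep written above (the test matrix and the choice $w = 1.5$ are only for illustration; $w = 1$ recovers Gauss-Seidel):

In [ ]:
import numpy as np

def sor_sweep(A, b, x, w=1.5):
    """One SOR iteration: solve (D + w L) x_new = w b - (w U + (w - 1) D) x."""
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)
    U = np.triu(A, k=1)
    rhs = w * b - (w * U + (w - 1) * D).dot(x)
    return np.linalg.solve(D + w * L, rhs)   # a real code would use a triangular solve

# Usage: a few sweeps on a small SPD system (1D Laplacian)
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)
for _ in range(100):
    x = sor_sweep(A, b, x, w=1.5)
print(np.linalg.norm(A.dot(x) - b))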

Incomplete LU (as preconditioner)

Finally, incomplete LU is typically the method of choice if you do not know anything about your sparse matrix.

You start your Gaussian elimination and either:

  1. Throw away small elements (ILUT)
  2. Throw away fill-in elements outside the sparsity pattern of the original matrix (ILU($0$))
  3. Keep only fill-in within a level-$k$ neighbourhood of the sparsity graph (ILU($k$))

ILU for Laplace2D: short demo


In [17]:
import numpy as np
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg
import matplotlib.pyplot as plt

n = 250
ex = np.ones(n)
lp1 = -sp.sparse.spdiags(np.vstack((ex, -2 * ex, ex)), [-1, 0, 1], n, n, 'csr')
ee = scipy.sparse.eye(n, n)
res_all = []
res_all_bicg = []
# 2D Laplacian built via Kronecker products
lp2 = scipy.sparse.kron(lp1, ee) + scipy.sparse.kron(ee, lp1)
rhs = np.ones(n * n)

def my_print(r):
    # gmres calls the callback with the (preconditioned) residual norm
    res_all.append(r)

def my_print2(x):
    # bicgstab calls the callback with the current iterate, so compute the residual norm ourselves
    res_all_bicg.append(np.linalg.norm(lp2.dot(x) - rhs))

# Incomplete LU factorization (spilu expects CSC format)
M = scipy.sparse.linalg.spilu(lp2.tocsc(), drop_tol=1e-3)
# Wrap the ILU solve as a LinearOperator to use it as a preconditioner
lo = scipy.sparse.linalg.LinearOperator(lp2.shape, lambda x: M.solve(x))
sol = scipy.sparse.linalg.gmres(lp2, rhs, callback=my_print, M=lo, restart=50)
plt.semilogy(res_all, marker='x', label='GMRES')
sol2 = scipy.sparse.linalg.bicgstab(lp2, rhs, callback=my_print2, M=lo)
plt.xlabel('Iteration number')
plt.ylabel('Residual')
plt.semilogy(res_all_bicg, label='BiCGStab', marker='o')
plt.legend(loc='best')


Out[17]:
<matplotlib.legend.Legend at 0x1125cbfd0>

Some other possibilities

  • Algebraic multigrid
  • Sparse approximate inverse
  • Domain decomposition
  • For Toeplitz: Circulant!

Take home message

  • Krylov subspace methods are the methods of choice
  • CG, BiCGStab and GMRES are the methods to be used
  • Incomplete LU is readily available for sparse matrices (in standard software) and typically works "not bad"

Next lecture

  • Matrix functions: what is that?
  • Matrix exponential: the "queen" of matrix functions
  • Methods to compute matrix functions (dense and sparse)
  • (Some) applications
Questions?
