Jupyter notebooks

This is a Jupyter notebook using Python. You can install Jupyter locally to edit and interact with this notebook.

Algebraic solvers

When solving elliptic boundary value problems, as well as when using implicit methods for transient PDEs, we need to solve algebraic systems of equations. We will write such systems as $$ F(u) = 0 $$ where $u$ is a vector of state variables and $F(u)$ is a vector of residuals of the same length. We will primarily be interested in defect correction methods of the form \begin{gather} A \delta u = - F(u) \\ u \gets u + \gamma \delta u \end{gather} where $A$ is a matrix and $\gamma$ is a scalar parameter often found using a line search.

  • If $A = I$, this is a Richardson iteration, which is related to gradient descent. Such methods are usually quite slow unless $F(u)$ is especially "nice".
  • If $A = \partial F/\partial u$, this is a Newton method and $\gamma=1$ can often be used.

Newton-Raphson methods for systems

The Jacobian of $F$ is $$ J(u) = \frac{\partial F}{\partial u}(u) = \begin{bmatrix} \frac{\partial F_0}{\partial u_0} & \frac{\partial F_0}{\partial u_1} & \dotsb \\ \frac{\partial F_1}{\partial u_0} & \frac{\partial F_1}{\partial u_1} & \\ \vdots & & \ddots \end{bmatrix}(u) . $$ The method can be derived by taking the Taylor expansion of $F(u)$ at $u$, $$ F(u + \delta u) = F(u) + \frac{\partial F}{\partial u}(u) (\delta u) + \frac{\partial^2 F}{\partial u^2}(u) (\delta u \otimes \delta u) / 2 + \dotsb $$ Note that each higher term is a higher rank tensor, thus computationally unwieldy. If we truncate the series after the linear term and set it equal to zero, we have a linear equation for $\delta u$ $$ \frac{\partial F}{\partial u}(u) \delta u = - F(u) $$ which will hopefully make $F(u + \delta u) \approx 0$. This is Newton's method.

  • Each iteration requires evaluating $F(u)$ -- almost any method will have this property.
  • Each iteration requires evaluating the Jacobian matrix $J(u)$ -- this either requires custom code, algorithmic differentiation, or a finite difference approximation (we'll revisit this later).
  • Each iteration requires solving a linear system with the matrix $J(u)$. This may be expensive.

In [28]:
%matplotlib inline
import numpy
from matplotlib import pyplot
pyplot.style.use('ggplot')

def fsolve_newton(F, J, u0, rtol=1e-10, maxit=50, verbose=False):
    u = u0.copy()
    Fu = F(u)
    norm0 = numpy.linalg.norm(Fu)
    # Track the error against the known solution (1,1) of the Rosenbrock test below
    enorm_last = numpy.linalg.norm(u - numpy.array([1,1]))
    for i in range(maxit):
        du = -numpy.linalg.solve(J(u), Fu)   # Newton step: solve J(u) du = -F(u)
        u += du
        Fu = F(u)
        norm = numpy.linalg.norm(Fu)
        if verbose:
            enorm = numpy.linalg.norm(u - numpy.array([1,1]))
            print('Newton {:d} anorm {:6.2e} rnorm {:6.2e} eratio {:6.2f}'.
                  format(i+1, norm, norm/norm0, enorm/enorm_last**2))
            enorm_last = enorm
        if norm < rtol * norm0:
            break
    return u, i

def rostest(a,b):
    # Gradient (F) and Hessian (J) of the Rosenbrock function
    # f(x,y) = (a-x)^2 + b*(y-x^2)^2, which has its minimum at (a, a^2)
    def F(u):
        x = u[0]; y = u[1]
        return numpy.array([-2*(a-x) + 4*b*x**3 - 4*b*x*y,
                           2*b*(y-x**2)])
    def J(u):
        x = u[0]; y = u[1]
        return numpy.array([[2 + 12*b*x**2 - 4*b*y, -4*b*x],
                           [-4*b*x, 2*b]])
    return F, J

F, J = rostest(1,3)
fsolve_newton(F, J, numpy.array([0, 1.]), verbose=True)


Newton 1 anorm 2.51e+00 rnorm 3.96e-01 eratio   1.56
Newton 2 anorm 9.91e+00 rnorm 1.57e+00 eratio   0.56
Newton 3 anorm 3.83e-01 rnorm 6.05e-02 eratio   0.22
Newton 4 anorm 5.11e-01 rnorm 8.08e-02 eratio   0.25
Newton 5 anorm 5.24e-04 rnorm 8.28e-05 eratio   0.36
Newton 6 anorm 9.76e-07 rnorm 1.54e-07 eratio   0.21
Newton 7 anorm 3.61e-15 rnorm 5.72e-16 eratio   0.27
Out[28]:
(array([ 1.,  1.]), 6)
  • Can the iteration break down? How?
  • How does the method depend on the initial guess?
  • It turns out that Newton's method has locally quadratic convergence to simple roots, $$\lim_{i \to \infty} |e_{i+1}|/|e_i^2| < \infty .$$
  • "The number of correct digits doubles each iteration."
  • Now that we know how to make a good guess accurate, the effort lies in getting a good guess.

Matrix-free Jacobian via finite differencing

It can be error-prone and complicated to implement the Jacobian function J(u). In such cases, we can use the approximation

$$ J(u) v \approx \frac{F(u+\epsilon v) - F(u)}{\epsilon} $$

where $\epsilon$ is some "small" number. Now we can't access individual entries of $J$, but we can apply its action to an arbitrary vector $v$.

We know that this approximation is first order accurate in $\epsilon$, $$ \left\lVert J(u) v - \frac{F(u+\epsilon v) - F(u)}{\epsilon} \right\rVert \in O(\epsilon) . $$ But if $\epsilon$ is too small, we will lose accuracy due to rounding error. If $F$ has been scaled such that its norm is of order 1, then $\epsilon = \sqrt{\epsilon_{\text{machine}}}$ is a good default choice.


In [31]:
import scipy.sparse.linalg as splinalg

def fsolve_newtonkrylov(F, u0, epsilon=1e-8, rtol=1e-10, maxit=50, verbose=False):
    u = u0.copy()
    Fu = F(u)
    norm0 = numpy.linalg.norm(Fu)
    for i in range(maxit):
        def Ju_fd(v):
            # Matrix-free action of the Jacobian by forward differencing
            return (F(u + epsilon*v) - Fu) / epsilon
        Ju = splinalg.LinearOperator((len(Fu),len(u)), matvec=Ju_fd)
        # Solve J(u) du = F(u) with (unpreconditioned) GMRES
        du, info = splinalg.gmres(Ju, Fu)
        if info != 0:
            raise RuntimeError('GMRES failed to converge: {:d}'.format(info))
        u -= du
        Fu = F(u)
        norm = numpy.linalg.norm(Fu)
        if verbose:
            print('Newton {:d} anorm {:6.2e} rnorm {:6.2e}'.format(i, norm, norm/norm0))
        if norm < rtol * norm0:
            break
    return u, i

fsolve_newtonkrylov(F, numpy.array([0.,1]), rtol=1e-6, verbose=True)


Newton 0 anorm 2.51e+00 rnorm 3.96e-01
Newton 1 anorm 9.91e+00 rnorm 1.57e+00
Newton 2 anorm 3.83e-01 rnorm 6.05e-02
Newton 3 anorm 5.11e-01 rnorm 8.08e-02
Newton 4 anorm 5.24e-04 rnorm 8.28e-05
Newton 5 anorm 9.76e-07 rnorm 1.54e-07
Out[31]:
(array([ 1.        ,  0.99999992]), 5)

Sparse and iterative linear algebra

Many matrices in applications, particularly the study of physical systems and graphs/networks, have entries that are mostly equal to zero. We can more efficiently store such systems by storing only the nonzero elements. We will discuss storage and optimized implementations later. Many of the methods for sparse systems apply to solving systems with matrices $A$ that can be applied to a vector ($y \gets A x$) in significantly less than $O(n^2)$ complexity, or that are "well-conditioned" such that an iterative method converges in significantly less than $n$ iterations.

PETSc, the Portable, Extensible Toolkit for Scientific Computation, is an open source software package for the parallel solution of algebraic and differential-algebraic equations. This includes linear algebra, for which PETSc offers a broad variety of implementations. For general information about PETSc, I refer to this primer.

NumPy does not provide sparse matrix support, so we will use SciPy in this course.
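
As a quick illustration (a sketch, not part of the original notes), the cell below assembles a tridiagonal operator in compressed sparse row (CSR) format with scipy.sparse and applies it to a vector; only the nonzeros are stored and the matrix-vector product costs $O(\mathrm{nnz})$ work.

In [ ]:
import numpy
import scipy.sparse as sp

n = 10
# Tridiagonal [-1, 2, -1] operator assembled by diagonals, stored in CSR format
A = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format='csr')
x = numpy.ones(n)
y = A @ x                      # sparse matrix-vector product, O(nnz) work
print(A.nnz, 'stored nonzeros out of', n*n, 'entries')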

Direct solves

The complexity of this solve is potentially dominant, so we should understand its cost in terms of the problem size. The standard method for a direct solve is $LU$ (or Cholesky) factorization. Given a $2\times 2$ block matrix, the algorithm proceeds as $$ \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} L_A & \\ C U_A^{-1} & I \end{bmatrix} \begin{bmatrix} U_A & L_A^{-1} B \\ & S \end{bmatrix} $$ where $L_A U_A = A$ and $S = D - C A^{-1} B$ is the Schur complement.
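
The cell below is a small numerical check of the Schur complement identity (a sketch with arbitrarily chosen block sizes; the $A$ block is kept unfactored rather than split into $L_A U_A$).

In [ ]:
import numpy

numpy.random.seed(2)
A = numpy.random.rand(3, 3) + 3*numpy.eye(3)   # keep the (1,1) block comfortably invertible
B = numpy.random.rand(3, 2)
C = numpy.random.rand(2, 3)
D = numpy.random.rand(2, 2) + 3*numpy.eye(2)

S = D - C @ numpy.linalg.solve(A, B)           # Schur complement
L = numpy.block([[numpy.eye(3), numpy.zeros((3, 2))],
                 [C @ numpy.linalg.inv(A), numpy.eye(2)]])
U = numpy.block([[A, B],
                 [numpy.zeros((2, 3)), S]])
M = numpy.block([[A, B], [C, D]])
print(numpy.linalg.norm(L @ U - M))            # should be near machine epsilon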

For a sparse operator, the complexity depends on the ordering of degrees of freedom.

  • "natural" ordering
  • low-bandwidth ordering
  • nested dissection ordering

For a structured grid, the "natural" ordering is the ordering of the original operator.

A sparse direct solve of this system will result in fill up to the bandwidth.

These plots can be produced in PETSc using -mat_view draw and -pc_type lu -pc_factor_mat_ordering_type natural -mat_factor_view draw (e.g., with -draw_pause 2 to wait 2 seconds for the viewer).

The Reverse Cuthill-McKee (rcm) ordering applies a breadth-first search to produce a low-bandwidth ordering.

The nested dissection (nd) ordering recursively bisects the domain.

The asymptotic costs differ: for a 2D structured grid with $n$ degrees of freedom, a banded (natural or RCM) ordering leads to $O(n^{3/2})$ fill and $O(n^2)$ factorization work, while nested dissection reduces this to $O(n \log n)$ fill and $O(n^{3/2})$ work.
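
The sketch below illustrates the dependence of fill on ordering using SuperLU through SciPy. SuperLU's column orderings are not the RCM or nested dissection orderings shown by PETSc, but comparing its 'NATURAL' ordering with the fill-reducing 'COLAMD' ordering on a 2D Laplacian shows the same effect.

In [ ]:
import scipy.sparse as sp
import scipy.sparse.linalg as splinalg

m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A2d = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsc()   # 2D Laplacian on an m x m grid
print('nnz(A) =', A2d.nnz)
for spec in ['NATURAL', 'COLAMD']:
    lu = splinalg.splu(A2d, permc_spec=spec)
    print('{:8s} fill: nnz(L) + nnz(U) = {:d}'.format(spec, lu.L.nnz + lu.U.nnz))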

Convergence of stationary iterative methods

Richardson iteration

The simplest iterative method is Richardson's method, which solves $A x = b$ by the iteration $$ x_{k+1} = x_k + \omega (b - A x_k) $$ where $\omega > 0$ is a damping parameter and $x_0$ is an initial guess (possibly the zero vector). If $b = A x_*$, this iteration is equivalent to $$ x_{k+1} - x_* = (x_k - x_*) - \omega A (x_k - x_*) = (I - \omega A) (x_k - x_*) .$$ It is convenient for convergence analysis to identify the "error" $e_k = x_k - x_*$, in which this becomes $$ e_{k+1} = (I - \omega A) e_k $$ or $$ e_k = (I - \omega A)^k e_0 $$ in terms of the initial error. Evidently powers of the iteration matrix $I - \omega A$ tell the whole story. Suppose that the eigendecomposition $$ X \Lambda X^{-1} = I - \omega A $$ exists. Then $$ (I - \omega A)^k = (X \Lambda X^{-1})^k = X \Lambda^k X^{-1} $$ and the convergence (or divergence) rate depends only on the largest magnitude eigenvalue. This analysis isn't great for two reasons:

  1. Not all matrices are diagonalizable.
  2. The matrix $X$ may be very ill-conditioned.

We can repair these weaknesses by using the Schur decomposition $$ Q R Q^h = I - \omega A $$ where $R$ is right-triangular and $Q$ is unitary (i.e., orthogonal if real-valued; $Q^h$ is the Hermitian conjugate of $Q$). The Schur decomposition always exists and $Q$ has a condition number of 1.

  • Where are the eigenvalues in $R$?

Evidently we must find $\omega$ to minimize the maximum eigenvalue of $I - \omega A$. We can do this if $A$ is well conditioned, but not in general.
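
Below is a minimal sketch of Richardson iteration on a small SPD system. The choice $\omega = 2/(\lambda_{\min} + \lambda_{\max})$, which minimizes the spectral radius of $I - \omega A$ for SPD $A$, is supplied by hand here since the eigenvalues of the test matrix are known.

In [ ]:
import numpy

def richardson(A, b, omega, rtol=1e-8, maxit=500):
    x = numpy.zeros_like(b)
    norm0 = numpy.linalg.norm(b)
    for i in range(maxit):
        r = b - A @ x                       # residual
        if numpy.linalg.norm(r) < rtol * norm0:
            break
        x += omega * r                      # damped correction
    return x, i

A = numpy.array([[2., -1], [-1, 2]])        # eigenvalues 1 and 3
b = numpy.array([1., 0])
x, its = richardson(A, b, omega=2/(1+3))
print(x, its, numpy.linalg.solve(A, b))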

Preconditioning

Preconditioning is the act of creating an "affordable" operation "$P^{-1}$" such that $P^{-1} A$ (or $A P^{-1}$) is well-conditioned or otherwise has a "nice" spectrum. We then solve the system

$$ P^{-1} A x = P^{-1} b \quad \text{or}\quad A P^{-1} \underbrace{(P x)}_y = b $$

in which case the convergence rate depends on the spectrum of the iteration matrix $$ I - \omega P^{-1} A . $$

  • The preconditioner must be applied on each iteration.
  • It is not merely about finding a good initial guess.

There are two complementary techniques necessary for efficient iterative methods:

  • "accelerators" or Krylov methods, which use orthogonality to adaptively converge faster than Richardson
  • preconditioners that improve the spectrum of the preconditioned operator

Although there is ongoing research in Krylov methods and they are immensely useful, I would say preconditioning is 90% of the game for practical applications, particularly as a research area.

Krylov subspaces

All matrix iterations work with approximations in a Krylov subspace, which is spanned by the columns of the matrix

$$ K_n = \big[ b \big| Ab \big| A^2 b \big| \dotsm \big| A^{n-1} b \big] . $$

This matrix is horribly ill-conditioned and cannot stably be computed as written. Instead, we seek an orthogonal basis $Q_n$ that spans the same space as $K_n$. We could write this as a factorization

$$ K_n = Q_n R_n $$

where the first column is $q_0 = b / \lVert b \rVert$. The triangular factor $R_n$ is not needed and is itself hopelessly ill-conditioned, so a slightly different procedure is used.
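
The sketch below illustrates the point: the raw Krylov matrix $K_n$ becomes numerically ill-conditioned after only a few columns, while an orthogonal basis for the same space stays perfectly conditioned.

In [ ]:
import numpy

numpy.random.seed(0)
m, n = 50, 10
A = numpy.random.rand(m, m)
b = numpy.random.rand(m)
K = numpy.zeros((m, n))
K[:, 0] = b
for k in range(1, n):
    K[:, k] = A @ K[:, k-1]                 # columns b, Ab, A^2 b, ...
Q, _ = numpy.linalg.qr(K)
print('cond(K_n) = {:.2e}, cond(Q_n) = {:.2e}'.format(
    numpy.linalg.cond(K), numpy.linalg.cond(Q)))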

Arnoldi iteration

The Arnoldi iteration applies orthogonal similarity transformations to reduce $A$ to Hessenberg form, starting from the vector $q_0 = b/\lVert b \rVert$,

$$ A = Q H Q^h . $$

Let's multiply on the right by $Q$ and examine the first $n$ columns,

$$ A Q_n = Q_{n+1} H_n $$

where $H_n$ is an $(n+1) \times n$ Hessenberg matrix.
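
A minimal Arnoldi sketch follows (the helper arnoldi below is written here for illustration, not taken from a library); it builds $Q_{n+1}$ and the $(n+1) \times n$ Hessenberg matrix $H_n$ and checks the recurrence $A Q_n = Q_{n+1} H_n$.

In [ ]:
import numpy

def arnoldi(A, b, n):
    m = len(b)
    Q = numpy.zeros((m, n+1))
    H = numpy.zeros((n+1, n))
    Q[:, 0] = b / numpy.linalg.norm(b)
    for j in range(n):
        v = A @ Q[:, j]
        for i in range(j+1):                # modified Gram-Schmidt against previous columns
            H[i, j] = Q[:, i] @ v
            v -= H[i, j] * Q[:, i]
        H[j+1, j] = numpy.linalg.norm(v)
        Q[:, j+1] = v / H[j+1, j]           # assumes no breakdown (H[j+1, j] != 0)
    return Q, H

numpy.random.seed(1)
A = numpy.random.rand(20, 20)
b = numpy.random.rand(20)
Q, H = arnoldi(A, b, 5)
print(numpy.linalg.norm(A @ Q[:, :5] - Q @ H))  # should be near machine epsilon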

Conditioning

This representation is well-conditioned because $Q$ is orthogonal and

$$ \lVert H_n \rVert \le \lVert Q_{n+1}^h \rVert \lVert A \rVert \lVert Q_n \rVert \le \lVert A \rVert . $$

For a lower bound, we have

$$ \sigma_{\min}(A)^2 \le x^h A^h A x $$

for all $x$ of norm 1. It must also be true for any $x = Q_n y$ where $\lVert y\rVert = 1$, thus

$$ \sigma_{\min}(A)^2 \le y^h Q_n^h A^h A Q_n y = y^h H_n^h Q_{n+1}^h Q_{n+1} H_n y = y^h H_n^h H_n y . $$

GMRES

GMRES (Generalized Minimum Residual) minimizes $$ \lVert A x - b \rVert $$ over the subspace spanned by the columns of $Q_n$, i.e., $x = Q_n y$ for some $y$. By the recurrence above, this is equivalent to minimizing $$ \lVert Q_{n+1} H_n y - b \rVert $$ which can be solved by minimizing $$ \lVert H_n y - Q_{n+1}^h b \rVert . $$ Since $q_0 = b/\lVert b \rVert$, the least squares problem is to minimize $$ \Big\lVert H_n y - \lVert b \rVert e_0 \Big\rVert . $$ The solution of this least squares problem is achieved by incrementally updating a $QR$ factorization of $H_n$.
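
The sketch below carries out this least squares solve directly with numpy.linalg.lstsq on the small $(n+1) \times n$ problem (rather than the incremental $QR$ update a production GMRES would use); it reuses the illustrative arnoldi helper from the sketch above.

In [ ]:
import numpy

numpy.random.seed(3)
m, n = 20, 10
A = numpy.random.rand(m, m) + 4*numpy.eye(m)     # keep the system well-conditioned
b = numpy.random.rand(m)
Q, H = arnoldi(A, b, n)                          # from the Arnoldi sketch above
rhs = numpy.zeros(n+1)
rhs[0] = numpy.linalg.norm(b)                    # ||b|| e_0
y, *_ = numpy.linalg.lstsq(H, rhs, rcond=None)   # minimize ||H_n y - ||b|| e_0||
x = Q[:, :n] @ y                                 # x_n = Q_n y
print('GMRES({:d}) residual: {:.2e}'.format(n, numpy.linalg.norm(A @ x - b)))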

Notes

  • GMRES minimizes the 2-norm of the residual $\lVert r_n \rVert$ which is equivalent to the $A^T A$ norm of the error $\lVert x_n - x_* \rVert_{A^T A}$.
  • The solution $x_n$ constructed by GMRES at iteration $n$ is not explicitly available. If a solution is needed, it must be constructed by solving the $(n+1)\times n$ least squares problem and forming the solution as a linear combination of the $n$ vectors $Q_n$. The leading cost is $2mn$ where $m \gg n$.
  • The residual vector $r_n = A x_n - b$ is not explicitly available in GMRES. To compute it, first build the solution $x_n$ as above.
  • GMRES needs to store the full $Q_n$, which is unaffordable for large $n$ (many iterations). The standard solution is to choose a "restart" $k$ and to discard $Q_n$ and start over with an initial guess $x_k$ after each $k$ iterations. This algorithm is called GMRES(k). PETSc's default solver is GMRES(30) and the restart can be controlled using the run-time option -ksp_gmres_restart.
  • Most implementations of GMRES use classical Gram-Schmidt because it is much faster in parallel (one reduction per iteration instead of $n$ reductions per iteration). The PETSc option -ksp_gmres_modifiedgramschmidt can be used when you suspect that classical Gram-Schmidt may be causing instability.
  • There is a very similar (and older) algorithm called GCR that maintains $x_n$ and $r_n$. This is useful, for example, if a convergence tolerance needs to inspect individual entries. GCR requires $2n$ vectors instead of $n$ vectors, and can tolerate a nonlinear preconditioner. FGMRES is a newer algorithm with similar properties to GCR.

Lanczos iteration: the symmetric case

If $A$ is symmetric, then $H = Q^T A Q$ is also symmetric. Since $H$ is Hessenberg, this means $H$ is tridiagonal. Instead of storing $Q_n$, it is sufficient to store only the last two columns since the iteration satisfies a 3-term recurrence. The analog of GMRES for the symmetric case is called MINRES and is also useful for solving symmetric indefinite problems.
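
A short sketch using SciPy's MINRES on a symmetric indefinite system, where CG would not be appropriate because $A$ does not define a norm:

In [ ]:
import numpy
import scipy.sparse.linalg as splinalg

A = numpy.diag([-3., -1, 1, 2, 4])      # symmetric but indefinite
b = numpy.ones(5)
x, info = splinalg.minres(A, b)
print(info, numpy.linalg.norm(A @ x - b))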

Conjugate Gradients

Instead of minimizing the $A^T A$ norm of the error, the Conjugate Gradient method minimizes the $A$ norm of the error. For $A$ to induce a norm, it must be symmetric positive definite. Jonathan Shewchuk's guide to CG, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, is an excellent resource.
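
A short sketch applying SciPy's CG to an SPD system (the same tridiagonal operator used in the sparse storage sketch above):

In [ ]:
import numpy
import scipy.sparse as sp
import scipy.sparse.linalg as splinalg

n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')   # SPD tridiagonal
b = numpy.ones(n)
x, info = splinalg.cg(A, b)
print(info, numpy.linalg.norm(A @ x - b))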

Preconditioning

Classical methods

We have discussed the Jacobi preconditioner $$ P_{\text{Jacobi}}^{-1} = D^{-1} $$ where $D$ is the diagonal of $A$. Gauss-Seidel is $$ P_{GS}^{-1} = (L+D)^{-1} $$ where $L$ is the (strictly) lower triangular part of $A$. The (strictly) upper triangular part $U$ may be used instead, or the forward and backward sweeps may be combined into symmetric Gauss-Seidel, $$ P_{SGS}^{-1} = (U+D)^{-1} + \Big( I - (U+D)^{-1} A \Big) (L+D)^{-1} = (U+D)^{-1} D (L+D)^{-1} . $$
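
A sketch applying these as preconditioners through SciPy's GMRES: the M argument expects the action of $P^{-1}$, supplied here as LinearOperators for the Jacobi and Gauss-Seidel choices. (For this constant-diagonal test operator, Jacobi is only a rescaling; it is included to show the mechanics.)

In [ ]:
import numpy
import scipy.sparse as sp
import scipy.sparse.linalg as splinalg

n = 50
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = numpy.ones(n)
D = A.diagonal()
LD = sp.tril(A, k=0).tocsr()             # L + D, lower triangle including the diagonal

P_jacobi = splinalg.LinearOperator((n, n), matvec=lambda r: r / D)
P_gs = splinalg.LinearOperator((n, n),
        matvec=lambda r: splinalg.spsolve_triangular(LD, r, lower=True))

for label, M in [('unpreconditioned', None), ('Jacobi', P_jacobi), ('Gauss-Seidel', P_gs)]:
    x, info = splinalg.gmres(A, b, M=M, restart=n)
    print('{:16s} info {} residual {:.2e}'.format(label, info, numpy.linalg.norm(A @ x - b)))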

Domain decomposition

Given a linear operator $A : V \to V$, suppose we have a collection of prolongation operators $P_i : V_i \to V$. The columns of $P_i$ are "basis functions" for the subspace $V_i$. The Galerkin operator $A_i = P_i^T A P_i$ is the action of the original operator $A$ in the subspace.

Define the subspace projection

$$ S_i = P_i A_i^{-1} P_i^T A . $$
  • $S_i$ is a projection: $S_i^2 = S_i$
  • If $A$ is SPD, $S_i$ is SPD with respect to the $A$ inner product $x^T A y$
  • $I - S_i$ is $A$-orthogonal to the range of $P_i$

These projections may be applied additively

$$ I - \sum_{i=0}^n S_i, $$

multiplicatively

$$ \prod_{i=0}^n (I - S_i), $$

or in some hybrid manner, such as

$$ (I - S_0) (I - \sum_{i=1}^n S_i) . $$

In each case above, the action is expressed in terms of the error iteration operator.
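
The cell below is a sketch constructing two such projections with $P_i$ taken as columns of the identity (the block Jacobi case described in the examples below), checking the projection property and forming the additive error iteration operator.

In [ ]:
import numpy

numpy.random.seed(4)
n = 8
R = numpy.random.rand(n, n)
A = R @ R.T + n*numpy.eye(n)                 # SPD test matrix
I = numpy.eye(n)
P = [I[:, :n//2], I[:, n//2:]]               # two non-overlapping "subdomains"

S = []
for Pi in P:
    Ai = Pi.T @ A @ Pi                       # Galerkin operator A_i = P_i^T A P_i
    S.append(Pi @ numpy.linalg.solve(Ai, Pi.T @ A))

print('projection check:', [numpy.linalg.norm(Si @ Si - Si) for Si in S])
E = I - sum(S)                               # additive error iteration operator
print('spectral radius of I - sum_i S_i:', max(abs(numpy.linalg.eigvals(E))))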

Examples

  • Jacobi corresponds to the additive preconditioner with $P_i$ as the $i$th column of the identity
  • Gauss-Seidel is the multiplicative preconditioner with $P_i$ as the $i$th column of the identity
  • Block Jacobi corresponds to labeling "subdomains" and $P_i$ as the columns of the identity corresponding to non-overlapping subdomains
  • Overlapping Schwarz corresponds to overlapping subdomains
  • $P_i$ are eigenvectors of $A$
  • A domain is partitioned into interior $V_I$ and interface $V_\Gamma$ degrees of freedom. $P_I$ is the embedding of the interior degrees of freedom, while $P_\Gamma$ is the "harmonic extension" of the interface degrees of freedom. Consider the multiplicative combination $(I - S_\Gamma)(I - S_I)$.

Convergence theory

The formal convergence is beyond the scope of this course, but the following estimates are useful. We let $h$ be the element diameter, $H$ be the subdomain diameter, and $\delta$ be the overlap, each normalized such that the global domain diameter is 1. We express the convergence in terms of the condition number $\kappa$ for the preconditioned operator.

  • (Block) Jacobi: $\delta=0$, $\kappa \sim H^{-2} H/h = (Hh)^{-1}$
  • Overlapping Schwarz: $\kappa \sim H^{-2} H/\delta = (H \delta)^{-1}$
  • 2-level overlapping Schwarz: $\kappa \sim H/\delta$

In [ ]: