Hierarchical and low-rank techniques

Some processes are represented by dense matrices that nonetheless admit structured, data-sparse representations. The most common technique is to approximate blocks of the matrix using a truncated SVD $U S V^T$.

Inverses of differential operators sometimes have this property.


In [12]:
%matplotlib notebook
import numpy

m = 101
A = numpy.eye(m) - .5*numpy.eye(m,k=1) - .5*numpy.eye(m,k=-1)
#A[-1,-1] = 0.5
Ainv = numpy.linalg.inv(A)

B = Ainv[:m//2,m//2+1:]
U, S, VT = numpy.linalg.svd(B)  # note: numpy returns V^T as the third factor

from matplotlib import pyplot
pyplot.style.use('ggplot')
pyplot.figure()
pyplot.plot(S, '.')  # singular values of the off-diagonal block decay rapidly


Out[12]:
[<matplotlib.lines.Line2D at 0x7f50e1321940>]

In [13]:
pyplot.figure()
pyplot.plot(VT.T[:,0])  # leading right singular vector of B


Out[13]:
[<matplotlib.lines.Line2D at 0x7f50e134df98>]

Particle interactions

The gravitational (or electrostatic) force between two particles is proportional to $1/|r|^2$, where $r$ is the vector between them. The force is directed along $r$, with magnitude proportional to the product of the masses (or charges). Let's consider two well-separated groups of particles.


In [22]:
m, n = 100, 200
x = numpy.random.randn(m,3)  # first cluster of particle locations
y = numpy.random.randn(n,3)  # second cluster of particle locations
y[:,0] += 100 # second cluster is translated away

A = numpy.zeros((3*m,n))
for i in range(m):
    r = y.T - numpy.outer(x[i,:], numpy.ones(n))
    A[3*i:3*i+3, :] = r / numpy.linalg.norm(r, axis=0)**3

U, S, V = numpy.linalg.svd(A, full_matrices=False)

pyplot.figure()
pyplot.semilogy(S, '.')  # singular values decay rapidly for well-separated clusters


Out[22]:
[<matplotlib.lines.Line2D at 0x7f50e0ae47b8>]

Treecodes to FMM

Treecodes were introduced by Barnes and Hut in 1986 and have been a staple in cosmology simulation ever since, as exemplified by Warren's SC13 paper. Each particle "sees" close neighbors directly, while distant neighbors are approximated by their multipole expansions on increasingly coarse "boxes". Tuning parameters include the box size, the degree of the multipole expansion, and the threshold angle $\theta$ (distant boxes must subtend less than this angle to be treated by a multipole expansion -- thus it controls the coarsening rate). Treecodes have $O(n \log n)$ complexity and typically small constants for modest accuracy requirements (the constants increase as the accuracy tolerance becomes stricter).
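As a crude illustration of the far-field idea (a single monopole rather than a full multipole expansion or tree), the sketch below compares direct evaluation of the potential $1/|r|$ that a distant cluster induces on a nearby cluster against lumping all of the distant charge at its centroid. The cluster geometry mirrors the cell above; unit charges and the use of the potential rather than the force are simplifying assumptions.


In [ ]:
# Sketch: monopole (center-of-charge) approximation of a distant cluster,
# compared against direct summation of the 1/|r| potential.
import numpy

numpy.random.seed(0)
xs = numpy.random.randn(100, 3)        # target cluster near the origin
ys = numpy.random.randn(200, 3)
ys[:, 0] += 100                        # well-separated source cluster

# Direct: phi(x_i) = sum_j 1 / |x_i - y_j|   (unit charges)
phi_direct = numpy.array([numpy.sum(1 / numpy.linalg.norm(ys - xi, axis=1))
                          for xi in xs])

# Monopole: all source charge lumped at the centroid of the source cluster
centroid = ys.mean(axis=0)
phi_mono = len(ys) / numpy.linalg.norm(xs - centroid, axis=1)

# Relative error is small because the dipole term vanishes about the centroid
print(numpy.max(numpy.abs(phi_direct - phi_mono) / numpy.abs(phi_direct)))
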

The Fast Multipole Method (FMM) was introduced by Greengard and Rokhlin in 1987 and is an $O(n)$ algorithm that often outperforms treecodes for strict accuracy requirements. It differs in that particles do not directly interact with distant particles; instead, the (outward-looking) multipole expansions interact and are used to create (inward-looking) "local" expansions. Yokota and Barba (2012) provide a prospectus for FMM optimization on emerging architectures.

Classical FMM uses analytic descriptions of the "kernel" (the force relationship) and requires analytic methods to compute the multipole and local interactions. At some (typically modest) overhead, these can be computed numerically using only evaluations of the kernel, e.g., $K(r) = 1/r$, via the "kernel-independent FMM" introduced by Anderson (1992) and refined by Ying, Biros, Zorin, and Langston (2003). In these methods, a set of "check points" is used to map from the original particles to a set of "equivalent points", to upscale those to coarser multipole-like expansions, and to form local expansions.

Tradeoffs

Design tradeoffs affecting performance and accuracy of treecodes and FMMs include:

  • Number of particles per box
    • More particles - more direct force evaluation, less tree work
  • Angle $\theta$ that defines "far field"
    • Smaller angle requires interaction between more neighbors (versus deferring to parents in the tree)
  • Expansion basis
    • Cartesian Taylor series
    • Spherical harmonics
    • Plane waves
    • Equivalent charges

Yokota provides a detailed comparison of these tradeoffs.

Structured matrices

Low-rank matrices (or blocks thereof) are typically represented using a reduced SVD

$$ A \approx U \Sigma V^T $$

where $U$ has only $k \ll n$ columns. We are often given an operator that we can apply, but for which we don't have direct access to the entries (or direct access would be too slow). An example is a block of the inverse of a PDE operator, e.g., applied via multigrid.

Randomized SVD

Given an $m\times k$ matrix $Q$ with orthonormal columns that captures most of the energy of the $m \times n$ matrix $A$, in the sense that $A \approx Q Q^T A$, we can compute an approximate SVD of $A$ using the following steps:

  1. Compute $B = Q^T A$, a $k\times n$ matrix
  2. Compute the SVD $B = \hat U \Sigma V^T$
  3. Let $U = Q \hat U$

This factorization satisfies

$$ U \Sigma V^T = Q (\hat U \Sigma V^T) = Q Q^T A \approx A . $$

But how do we identify the subspace $Q$? One popular method with an elegant theory (e.g., Halko, Martinsson, and Tropp (2011)) is to choose a random $n \times k$ matrix $X$ and run a few steps of power iteration on the normal equations:

  1. Compute $Y = A (A^T A)^q X$
  2. Compute the reduced QR factorization $Y = QR$ (discard $R$ and use $Q$ in the algorithm above)
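A minimal sketch of this procedure, assuming we can only apply the operator and its transpose (the function and argument names here are illustrative, not from a library):


In [ ]:
# Randomized SVD using only the action of A and A^T (names are illustrative).
import numpy

def randomized_svd(apply_A, apply_At, m, n, k, q=1):
    """Approximate rank-k SVD of an m-by-n operator given only its action."""
    X = numpy.random.randn(n, k)             # random test matrix
    Y = apply_A(X)                           # builds Y = A (A^T A)^q X
    for _ in range(q):                       # by alternating applications
        Y = apply_A(apply_At(Y))
    Q, _ = numpy.linalg.qr(Y)                # orthonormal basis for the range
    Bk = apply_At(Q).T                       # Bk = Q^T A  (k-by-n)
    Uhat, S, Vt = numpy.linalg.svd(Bk, full_matrices=False)
    return Q @ Uhat, S, Vt                   # U = Q Uhat

# Example: recover the low-rank off-diagonal block from the first cell
A = numpy.eye(101) - .5*numpy.eye(101, k=1) - .5*numpy.eye(101, k=-1)
B = numpy.linalg.inv(A)[:50, 51:]
U, S, Vt = randomized_svd(lambda X: B @ X, lambda Y: B.T @ Y, *B.shape, k=10)
print(S[:5])
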

Factorization preconditioners

Schur complements in PDE solvers often have low-rank structure away from the diagonal. Since factoring these dense "supernodes" is the bottleneck for sparse direct solvers, an attractive option is to approximate them using structured matrix representations. STRUMPACK is a package available in PETSc that is based on these principles.

  • -pc_type lu -pc_factor_mat_solver_package strumpack
  • -pc_type ilu -pc_factor_mat_solver_package strumpack

Rank-revealing QR

A (truncated) rank-revealing QR factorization of the matrix $A$ is $$ A P \approx Q \begin{bmatrix} R_L & R_R \end{bmatrix} $$ where $P$ is a permutation matrix, $Q$ is an $m\times k$ matrix with orthonormal columns, and $R_L$ is a $k\times k$ invertible upper triangular matrix. The rank-revealing QR (RRQR) factorization can be computed efficiently using randomized methods.
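SciPy's column-pivoted QR (not a strong RRQR, but usually adequate) exposes this structure; the test matrix below is a synthetic choice, numerically rank 10 plus small noise.


In [ ]:
# Column-pivoted QR as a (weak) rank-revealing factorization.
import numpy
from scipy import linalg

numpy.random.seed(1)
B = numpy.random.randn(100, 10) @ numpy.random.randn(10, 80)   # rank 10
B += 1e-8 * numpy.random.randn(*B.shape)                       # small noise

Q, R, piv = linalg.qr(B, mode='economic', pivoting=True)
print(numpy.abs(numpy.diag(R))[:14])   # sharp drop after 10 entries reveals the rank
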

Interpolative decomposition

The permutation in RRQR identifies the columns of $A$ that span the important part of its range. If $P_k$ denotes the first $k$ columns of $P$, then these columns are $A P_k = Q R_L$. The interpolative decomposition is

$$ A \approx (AP_k) Z $$

where $Z = \begin{bmatrix} I & R_L^{-1} R_R \end{bmatrix} P^T$. The interpolative decomposition enables preservation of sparsity and other structure and has many applications, including butterfly algorithms, which are a sort of generalized FFT.
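Continuing with `Q`, `R`, and `piv` from the pivoted QR cell above, a sketch of the interpolative decomposition and its reconstruction error:


In [ ]:
# Interpolative decomposition B ~ (B P_k) Z built from the pivoted QR above.
k = 10
RL, RR = R[:k, :k], R[:k, k:]
Z = numpy.empty((k, B.shape[1]))
Z[:, piv] = numpy.hstack([numpy.eye(k), numpy.linalg.solve(RL, RR)])  # [I  RL^{-1} RR] P^T
print(numpy.linalg.norm(B - B[:, piv[:k]] @ Z) / numpy.linalg.norm(B))  # small relative error
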

The Elemental Library

Elemental is an excellent parallel library for a broad variety of dense and structured linear algebra.

Data assimilation

Consider the noisy linear model

$$ y = A x + \epsilon $$

where $y$ is an observable quantity and the measurement error $\epsilon$ is a zero-mean random variable with SPD covariance $C_m = \mathbb{E}[\epsilon \epsilon^T]$. We wish to estimate the parameters $x$ consistent with observations $\hat y$. With no prior information about $x$, this is equivalent to minimizing the misfit, $$ \hat x = \arg \min_x \frac 1 2 \lVert A x - \hat y \rVert^2_{C_m^{-1}} . $$

In a Bayesian framework, we have $$ \underbrace{\pi(x | \hat y)}_{\text{posterior}} \propto \underbrace{\pi(x)}_{\text{prior}} \pi(\hat y | x) $$ where $$ \pi(\hat y | x) \propto \exp\Big( - \frac 1 2 \lVert A x - \hat y \rVert^2_{C_m^{-1}} \Big) . $$ We may also assume a Gaussian model for the prior, $$ \pi(x) \propto \exp\Big( - \frac 1 2 \lVert x - x_p \rVert_{C_p^{-1}}^2 \Big) . $$

Maximizing the posterior $\pi(x | \hat y)$ is equivalent to minimizing the negative log posterior $- \log \pi(x | \hat y)$, which is a quadratic minimization problem, $$ \hat x = \arg \min_x \frac 1 2 \lVert A x - \hat y \rVert^2_{C_m^{-1}} + \frac 1 2 \lVert x - x_p \rVert_{C_p^{-1}}^2 . $$
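As a small worked example (all data synthetic), the quadratic minimization above can be solved directly via its normal equations, $(A^T C_m^{-1} A + C_p^{-1})\hat x = A^T C_m^{-1} \hat y + C_p^{-1} x_p$:


In [ ]:
# Sketch: MAP estimate for a toy linear model with Gaussian noise and prior.
# The model, noise level, and prior here are synthetic choices.
import numpy

numpy.random.seed(3)
n_param, n_obs = 50, 30
A = numpy.random.randn(n_obs, n_param)
x_true = numpy.random.randn(n_param)
sigma = 0.1
y_hat = A @ x_true + sigma * numpy.random.randn(n_obs)   # noisy observations

Cm_inv = numpy.eye(n_obs) / sigma**2       # i.i.d. measurement noise
Cp_inv = numpy.eye(n_param)                # unit Gaussian prior, x_p = 0
x_p = numpy.zeros(n_param)

H = A.T @ Cm_inv @ A + Cp_inv              # Hessian of the negative log posterior
rhs = A.T @ Cm_inv @ y_hat + Cp_inv @ x_p
x_map = numpy.linalg.solve(H, rhs)
print(numpy.linalg.norm(x_map - x_true) / numpy.linalg.norm(x_true))
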

The Hessian of this minimization problem is $$ \underbrace{A^T C_m^{-1} A}_{\text{misfit}} + \underbrace{C_p^{-1}}_{\text{prior}} $$ which is an SPD matrix since $C_m$ and $C_p$ are SPD. It is common for the misfit Hessian to be low rank (the data is informative about a subspace) and this is often best exposed when preconditioned by the prior, i.e., $$ C_p^{-T/2} \Big( \underbrace{C_p^{T/2} A^T C_m^{-1} A C_p^{1/2}}_{\text{low rank}} + I \Big) C_p^{-1/2} .$$ This structure is key to understanding information content in inference problems where the parameter space is high dimensional.
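The sketch below illustrates this for a toy problem with many parameters but few observations; the model and prior are synthetic choices. At most as many eigenvalues of the prior-preconditioned misfit Hessian as there are observations can be nonzero, and counting how many exceed 1 indicates how many directions the data constrain more strongly than the prior.


In [ ]:
# Spectrum of the prior-preconditioned misfit Hessian
# C_p^{T/2} A^T C_m^{-1} A C_p^{1/2} for a synthetic problem.
import numpy
from matplotlib import pyplot

numpy.random.seed(2)
n_param, n_obs = 200, 20
A = numpy.random.randn(n_obs, n_param)
Cm_inv = numpy.eye(n_obs) / 0.1**2                         # i.i.d. noise, sigma = 0.1
Cp_half = numpy.diag(0.9 ** (numpy.arange(n_param) / 2))   # C_p^{1/2}, decaying prior variance

H_misfit = Cp_half @ A.T @ Cm_inv @ A @ Cp_half
lam = numpy.linalg.eigvalsh(H_misfit)[::-1]                # eigenvalues, descending

print('directions where the data dominate the prior:', numpy.sum(lam > 1))
pyplot.figure()
pyplot.semilogy(numpy.maximum(lam[:40], 1e-16), '.')       # only ~n_obs are nonzero
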

Optimization

Unconstrained optimization problems are written $$ \min_x f(x) $$ where $f : \mathbb R^m \to \mathbb R$ is a scalar-valued function of the parameter vector $x$. For a differentiable function $f$, a necessary condition for a local minimum is the "first order optimality condition" $$ g(x) = \nabla f(x) = 0 . $$

Is solving the first order optimality conditions equivalent to minimization?

Solutions of the first order optimality condition can be a local

  • maximum
  • minimum
  • saddle point

To confirm a (strict local) minimum, we also need the second order condition that the Hessian $\nabla^2 f(x)$ is positive definite.

Newton methods

Given a nonlinear equation $$ g(x) = 0, $$ Newton's method solves $$ J(x_n) \delta x = - g(x_n), \quad x_{n+1} = x_n + \delta x $$ where $J(x) = \nabla g(x)$.

Newton methods are often augmented with a line search that allows a shorter step to be taken; for strongly nonlinear problems this is often required for convergence. A simple backtracking strategy is to find the smallest non-negative integer $k$ such that $$ \lVert g(\underbrace{x_n + 2^{-k} \delta x}_{x_{n+1}}) \rVert < \lVert g(x_n) \rVert .$$
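A minimal sketch of Newton with this backtracking criterion, applied to an illustrative scalar equation (the function, Jacobian, and starting point are not from the notes):


In [ ]:
# Newton's method with backtracking on the residual norm ||g||.
import numpy

def newton_bt(g, J, x, maxit=20, tol=1e-10):
    for _ in range(maxit):
        gx = g(x)
        if numpy.linalg.norm(gx) < tol:
            break
        dx = numpy.linalg.solve(J(x), -gx)
        lam = 1.0                        # halve the step until ||g|| decreases
        while numpy.linalg.norm(g(x + lam*dx)) >= numpy.linalg.norm(gx):
            lam /= 2
            if lam < 1e-8:
                break
        x = x + lam*dx
    return x

# Example: solve cos(x) - x^3 = 0
g = lambda x: numpy.array([numpy.cos(x[0]) - x[0]**3])
J = lambda x: numpy.array([[-numpy.sin(x[0]) - 3*x[0]**2]])
print(newton_bt(g, J, numpy.array([0.5])))
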

  • Why doesn't this work for optimization?

For optimization, we line search on $f$ instead of on $\lVert g \rVert$ and satisfy the Wolfe conditions. This gives stronger guarantees and helps us avoid converging to maxima and saddle points, where $g = 0$ but $f$ is not minimized.
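A sketch of a backtracking search on $f$ enforcing the sufficient decrease (Armijo) condition, the first of the Wolfe conditions; the curvature condition is omitted for brevity, and the quadratic test problem is an illustrative choice:


In [ ]:
# Backtracking on f with the Armijo (sufficient decrease) condition.
import numpy

def armijo_step(f, g, x, dx, alpha=1.0, c1=1e-4):
    fx, gdx = f(x), g(x) @ dx            # gdx < 0 for a descent direction
    while f(x + alpha*dx) > fx + c1*alpha*gdx:
        alpha /= 2
        if alpha < 1e-12:
            break
    return alpha

# Example: one steepest-descent step on a small quadratic
A = numpy.array([[3., 1.], [1., 2.]])
b = numpy.array([1., -1.])
f = lambda x: 0.5 * x @ A @ x - x @ b
g = lambda x: A @ x - b
x = numpy.array([2., 2.])
dx = -g(x)                               # steepest descent direction
print(armijo_step(f, g, x, dx))
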

Example: Quadratic optimization

Suppose $f(x) = x^T A x - x^T b$ where $A$ is SPD. What are the gradient and Hessian?


In [ ]:
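
One possible answer, checked by finite differences (this cell is a sketch, not part of the original exercise): for symmetric $A$, the gradient is $2Ax - b$ and the Hessian is $2A$ (with the common convention $f(x) = \tfrac 1 2 x^T A x - x^T b$, they would be $Ax - b$ and $A$).


In [ ]:
# Check grad f = 2 A x - b by central finite differences (A SPD, synthetic).
import numpy

numpy.random.seed(4)
n = 5
M = numpy.random.randn(n, n)
A = M @ M.T + n * numpy.eye(n)           # SPD test matrix
b = numpy.random.randn(n)
f = lambda x: x @ A @ x - x @ b

x0 = numpy.random.randn(n)
g_analytic = 2 * A @ x0 - b
eps = 1e-6
g_fd = numpy.array([(f(x0 + eps*e) - f(x0 - eps*e)) / (2*eps)
                    for e in numpy.eye(n)])
print(numpy.linalg.norm(g_analytic - g_fd))
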