In [1]:
from notebook.services.config import ConfigManager
from IPython.paths import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'sky',
              'transition': 'zoom',
              'start_slideshow_at': 'selected',
})
%load_ext tikzmagic



Low-rank and time integration

Ivan Oseledets
Skolkovo Institute of Science and Technology
Based on joint work with C. Lubich, B. Vandereycken, D. Kolesnikov

Outline of the talk

I would like to talk about three results:

  • A new method for the low-rank solution of the Lyapunov equation
  • Time integration of tensor trains (dynamical low-rank approximation, projector-splitting scheme)
  • Curvature-independent convergence on low-rank manifolds (we need to rethink our understanding of the topology of low-rank manifolds!)

Lyapunov equation

Consider the continuous-time Lyapunov equation

$$AX + XA^{\top} = -y_0 y^{\top}_0.$$

We want to compute a low-rank approximation to the solution:

$$X \approx U Z U^{\top}, \quad U \in \mathbb{R}^{n \times r}, \quad U^{\top} U = I.$$

Lyapunov equation as a minimization problem

$ \def\vec{\mathop{\mathrm{vec}}\nolimits} \def\trace{\mathop{\mathrm{tr}}\nolimits} \def\rk{\mathop{\mathrm{rank}}\nolimits} \def\Span{\mathop{\mathrm{span}}\nolimits} $ A typical way to define the ``best'' rank-$r$ approximation is to formulate it as a minimization problem

$$R(X) \rightarrow \min,$$

where typically

$$ T_1(X) = \Vert AX + XA^{\top} + y_0 y^{\top}_0 \Vert_F. $$

For $A = A^{\top} > 0$ we can use the energy functional

$$ T_2(X) = \trace X A X + \trace X y_0 y^{\top}_0. $$

For $A$ being a discrete differential operator, these functionals can be ``ill-conditioned''!

Our approach

In Kolesnikov, Oseledets, arXiv:1410.3335

Motivated by the seminal paper by Y. Saad, we propose to use the following functional:

$$F(U) = \int_{0}^{\infty} \Vert y - \widehat{y} \Vert^2 dt,$$

where $y = e^{At} y_0$, $\widehat{y} = U e^{Bt} U^{\top} y_0$, and $B = U^{\top} A U$.

Connection to the Lyapunov equation

This is well known in model order reduction (although for this particular case we are still missing the exact reference). Consider the ODE $$\frac{dy}{dt} = Ay, \quad y(0) = y_0,$$ and look for the optimal subspace approximation $$y(t) \approx \widetilde{y}(t) = U c(t).$$ With the error measure $$F_0(U) = \int^{\infty}_0 \Vert y - \widetilde{y} \Vert^2 dt,$$ the best $U$ is obtained from the $r$ leading eigenvectors of the solution $X$ of $$AX + XA^{\top} = -y_0 y^{\top}_0, \quad F_0(U) = \Vert X - X_r \Vert.$$

Our functional

Given $U$, the optimal $c(t) = U^{\top} y(t)$ is not computable.

We replace it with the Galerkin projection:

$$\frac{dc}{dt} = B c, \quad c(0) = c_0, \quad B = U^{\top} A U, \quad c_0 = U^{\top} y_0.$$

and introduce the final functional

$$F(U) = \int_{0}^{\infty} \Vert y - \widehat{y} \Vert^2 dt,$$

where $y = e^{At} y_0$, $\widehat{y} = U e^{Bt} U^{\top} y_0$, and $B = U^{\top} A U$.

It can be computed!

This functional can be computed

$$ F(U) = \mathrm{tr}(X) - 2 \mathrm{tr}(U^{\top}(P - UZ)) - \mathrm{tr}(Z)$$

where $$B Z + Z B^{\top} = -c_0 c^{\top}_0, \quad A P + P B^{\top} = -y_0 c^{\top}_0, \quad c_0 = U^{\top} y_0.$$

In principle, we could now optimize over the Stiefel manifold, but we do not know the rank in advance.

Solution accuracy

If $F(U)$ is small, we obtain an approximation to the solution ($P$ is the solution of the Sylvester equation above):

$$\Vert X - UP^{\top} - PU^{\top} - UZU^{\top} \Vert_F \leq F(U).$$

``Doubling'' method

To adapt the rank, we can use a ``fixed-point'' iteration:

Solve the Sylvester equation $$A P + P B^{\top} = -y_0 c^{\top}_0$$ for $P$.

Add $P$ to the basis:

$$U := \mathrm{orth}[U, P]$$

(Similar to IRKA, Sorensen, ...)

This is a rational Krylov method in disguise, but it requires too many linear solves ($r$ per step).
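The doubling step is easy to prototype. Below is a minimal dense-matrix sketch (the function name and setup are ours, not from the paper; a practical implementation would use sparse or iterative solvers):

In [ ]:
import numpy as np
from scipy.linalg import solve_sylvester

def doubling_step(A, U, y0):
    # Projected matrix and right-hand side
    B = U.T @ A @ U
    c0 = U.T @ y0
    # Sylvester equation A P + P B^T = -y0 c0^T for the n x r correction P
    P = solve_sylvester(A, B.T, -np.outer(y0, c0))
    # Add P to the basis and re-orthogonalize (the rank roughly doubles)
    U_new, _ = np.linalg.qr(np.hstack([U, P]))
    return U_new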

Useful residual theorems

$$T_1(U) = \Vert A UZU^{\top} + UZU^{\top} A^{\top} + y_0 y^{\top}_0 \Vert,$$

$$T_2(U) = \Vert A UZ + UZ B^{\top} + y_0 c^{\top}_0 \Vert.$$

Theorem
If $UU^{\top} y_0 = y_0$, then

$$T_1(U) = \sqrt{2} T_2(U) = \sqrt{2} \Vert (AU - UB) Z \Vert.$$

Based on this theorem, it is a good idea to add a Krylov and a rational Krylov vector at each step.

Adding Krylov & Rat. Krylov

Theorem
If we add a Krylov vector and a rational Krylov vector at each step to $U$, the residual of the Sylvester equation stays rank-$1$:

$$A P_1 + P_1 B^{\top} = w q^{\top},$$

where $P_1 = P - UZ$.

Alternating low-rank (ALR) method

Assume a rank-$1$ approximation to $P_1$, namely $P_1 \approx v q^{\top}$ with $q = \frac{z}{\Vert z \Vert}$, where $z$ is the last column of $Z$.

It leads to the following equation for $v$:

$$(A + (q^{\top} B^{\top} q) I) v = w.$$

Note the remarkably simple formula for the new shift!

We only need to solve the small Lyapunov equation ($\mathcal{O}(r^3)$ cost) $$BZ + ZB^{\top} = -c_0 c^{\top}_0$$ and take the last column of $Z$ (and normalize it).
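A sketch of the shift computation in Python, with toy data standing in for the actual $A$, $U$, $y_0$ (the assumption is a stable $A$ and an orthonormal $U$; names are ours):

In [ ]:
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
n, r = 200, 10
A = -2.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))  # toy stable matrix
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
y0 = rng.standard_normal(n)

B = U.T @ A @ U
c0 = U.T @ y0
# Small Lyapunov equation B Z + Z B^T = -c0 c0^T: O(r^3) cost
Z = solve_continuous_lyapunov(B, -np.outer(c0, c0))
q = Z[:, -1] / np.linalg.norm(Z[:, -1])  # normalized last column of Z
shift = q @ B.T @ q                      # the new shift q^T B^T q
# One ALR solve would then be (A + shift * I) v = w, appending v to the basis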

Comparison with other methods

  • RKSM/ERKSM (Druskin, Simoncini): adapt the shifts globally
  • KPIK / extended Krylov (Knizhnerman, Druskin, Simoncini): use the $A$ and $A^{-1}$ Krylov spaces

The shifts produced by ALR lie in the range of $B = U^{\top} A U$ and are thus much less spread out than those of RKSM;
we can reuse the AMG hierarchy for the shifted systems.

Numerical example

$$ Lu = u_{xx} + u_{yy} + u_{zz} - 10x u_{x} - 1000y u_{y} - u_{z}$$

with Dirichlet boundary conditions on the unit cube, discretized with the standard 7-point stencil.

$$ \begin{array}{lllllr} \textrm{grid} &\textrm{method} &\textrm{precomp.} &\textrm{time per solve} &\textrm{final time} &\textrm{iterations/rank}\\ &ALR & -- &0.6424 &3.2881 &6/13 \\ &ALR(AMG) &0.2231 &0.0466 &\bf{0.5020} &6/13 \\ &KPIK(LU) &0.6687 &\bf{0.0086} &0.8351 &9/19 \\ &KPIK(AMG) &0.2231 &0.0436 &0.6787 &9/19 \\ 20\times 20 \times 20 &KPIK & -- &0.8533 &7.0838 &9/19 \\ &RKSM &\bf{0.0265} &0.6610 &5.5919 &9/10 \\ &RKSM(AMG) &0.2496 &0.0491 &0.9061 &9/10 \\ &ERKSM &\bf{0.0265} &0.7303 &4.8876 &7/15 \\ &ERKSM(AMG) &0.2496 &0.0541 &0.9108 &6/13 \\ \hline &ALR & -- &9.2135 &53.2061 &7/15 \\ &ALR(AMG) &0.9827 &0.1987 &\bf{2.3924} &7/15 \\ &KPIK(LU) &10.1743 &\bf{0.0497} &11.0392 &9/19 \\ &KPIK(AMG) &0.9827 &0.2055 &3.0282 &9/19 \\ 30\times 30\times 30 &KPIK & -- &12.7750 &105.1641 &9/19 \\ &RKSM &\bf{0.1215} &9.3640 &85.8903 &10/11 \\ &RKSM(AMG) &1.1042 &0.1792 &3.2822 &10/11 \\ &ERKSM &\bf{0.1215} &9.5372 &58.2637 &7/15 \\ &ERKSM(AMG) &1.1042 &0.1609 &3.2703 &8/17 \\ \end{array} $$

Shifts distribution

The shifts are much less spread out!

Summary and part-2

  • We have a new functional for the Lyapunov equation, together with a very simple, parameter-free shift-generating procedure.
  • Now we go from low rank in "space-time" to low rank in the spatial variables (dynamical low-rank approximation).

Dynamical low-rank approximation

Dynamical low-rank approximation is a low-rank approximation of a time-varying matrix:

$$A(t) \approx Y(t) = U(t) S(t) V^{\top}(t), $$

where $U(t) \in \mathbb{R}^{n \times r}, V(t) \in \mathbb{R}^{m \times r}, S(t) \in \mathbb{R}^{r \times r}$

Simple DLR

You can use the pointwise best rank-$r$ approximation (SVD):

$$\min_{\mathrm{rank}(X(t)) = r}\Vert A(t) - X(t) \Vert $$

but this has some drawbacks:

  • The dependence is non-smooth in time
  • The energy is not conserved
  • No history of $A(t)$ is taken into account!

Instead, use Dirac-Frenkel principle

The Dirac-Frenkel principle can be used instead:

$$ \langle \frac{dA}{dt} - \frac{dX}{dt}, Z \rangle_F = 0, $$

for all $Z$ in the tangent space of the manifold $\mathcal{M}_r$ of all matrices of rank $r$,
where tangent vectors are parametrized as

$$ Z = \delta U S V^{\top} + U \delta S V^{\top} + U S \delta V^{\top} $$

Equations of motion in 2D case

(Lubich, Koch, Meyer,...)

$$\begin{split} &\frac{dU}{dt} S = \frac{dA}{dt} V,\\ &\frac{dV}{dt} S^{\top} = \frac{dA^{\top}}{dt} U, \\ &\frac{dS}{dt} = U^{\top} \frac{dA}{dt} V. \end{split}$$

Suppose you start from rank $1$ and evolve into rank $10$: along the way, the matrix $S$ will be singular!

Equations of motion, 2D case (2)

The equations of motion can be rewritten for the full matrix

$$ \frac{dX}{dt} = P_X(\frac{dA}{dt}), $$

where $P_X$ is the projector onto the tangent space.

Projector-splitting

Theorem: $P_X$ can be represented as

$$P_X(Z) = Z - (I - UU^{\top}) Z (I - VV^{\top}) = UU^{\top} Z + Z VV^{\top} - UU^{\top} Z VV^{\top},$$

that is, $$P_X(Z) = P_1(Z) + P_2(Z) - P_3(Z).$$
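A quick numerical sanity check of the two expressions in the theorem (a minimal sketch; function name and toy sizes are ours):

In [ ]:
import numpy as np

def project_tangent(U, V, Z):
    # P_X(Z) = U U^T Z + Z V V^T - U U^T Z V V^T
    PU, PV = U @ U.T, V @ V.T
    return PU @ Z + Z @ PV - PU @ Z @ PV

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
Z = rng.standard_normal((n, m))
# Agreement with P_X(Z) = Z - (I - U U^T) Z (I - V V^T)
lhs = Z - (np.eye(n) - U @ U.T) @ Z @ (np.eye(m) - V @ V.T)
assert np.allclose(lhs, project_tangent(U, V, Z))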

Integration of the substeps

Theorem (O., Lubich, 2014): The equations with $P_1$, $P_2$, $P_3$ can be integrated exactly:

  • Step "L": $$\frac{dX}{dt} = UU^{\top}\frac{dA}{dt} \rightarrow \frac{d(VS^{\top})}{dt} = \frac{dA^{\top}}{dt} U, \frac{dU}{dt}=0,$$

    Thus $L_1 = L_0 + (A_1 - A_0)^{\top} U_0, U_1 = U_0$.

  • Step "K": $$\frac{dX}{dt} = \frac{dA}{dt} VV^{\top} \rightarrow \frac{d(US)}{dt} = \frac{dA}{dt} V, \frac{dV}{dt}=0,$$

    Thus $K_1 = K_0 + (A_1 - A_0) V_0, V_1 = V_0$

  • Step "S": $$\frac{dX}{dt} = UU^{\top}\frac{dA}{dt} VV^{\top} \rightarrow \frac{d(S)}{dt} = U\frac{dA}{dt} V, \frac{dV}{dt}=0, \frac{dU}{dt} = 0,$$

    Thus $S_1 = S_0 \mathbf{-} U^{\top}_0(A_1 - A_0) V_0, V_1 = V_0, U_1 = U_0$
    Here $A_0 = A(t), A_1 = A(t + \tau)$

Splitting integrator

The final splitting steps (in the KSL order) are

  • Step "K": $$\begin{split} K_0 &= U_0 S_0, \\ K_1 &= K_0 + (A_1 - A_0) V_0, \\ [U_1, \widehat{S}_1] &= QR(K_1). \end{split}$$
  • Step "S": $$\widetilde{S}_1 = \widehat{S}_1 - U^{\top}_1 (A_1 - A_0) V_0$$ (note the minus!)
  • Step "L": $$\begin{split} L_0 &= V_0 \widetilde{S}^{\top}_1, \\ L_1 &= L_0 + (A_1 - A_0)^{\top} U_1, \\ [V_1, S^{\top}_1] &= QR(L_1). \end{split}$$

The order of the steps matters.

A note on minus

(O., Lubich, Tübingen 2012)

Exactness result

If $A_0 = U_0 S_0 V^{\top}_0$ and $A_1$ has rank $r$, then the projector-splitting scheme in the KSL order is exact.

Generalizing to dD

  • How to generalize to many dimensions?

  • Then we have to replace "low-rank matrix approximation" by a suitable low-rank tensor format.

  • High-dimensional problems are actually the main application of DLR.

Canonical format: bad idea

The simplest (but naive) way to generalize low rank to many dimensions is to use the sum-of-products definition of the matrix rank:

$$ A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha) $$

This is called the canonical format, and $r$ is called the canonical rank.

The canonical format can be unstable and in general cannot be efficiently computed!

SVD-based formats: good idea

In 2009, active development of stable tensor formats began.

Two "similar" formats were proposed:

  • Tree Tucker (O., Tyrtyshnikov) -> Tensor Train (O.),
    later recognized as Matrix Product States (MPS) from physics
    (White, Schollwöck, Vidal, Verstraete)
  • Hierarchical Tucker (Hackbusch, Grasedyck), later recognized as ML-MCTDH (Meyer, Manthe) from chemistry

Tensor-train format

A tensor is said to be in the TT-format if

$$X(i_1, \ldots, i_d) = C_1(i_1) \ldots C_d(i_d),$$

where $C_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for each fixed $i_k$ and $r_0 = r_d = 1$.
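A TT representation can be computed by sequential truncated SVDs of unfoldings (TT-SVD). Below is a minimal sketch of the decomposition and of entry evaluation (our implementation, not the reference code of the ttpy package):

In [ ]:
import numpy as np

def tt_svd(X, eps=1e-10):
    # Cores C[k] have shape (r_{k-1}, n_k, r_k), with r_0 = r_d = 1
    dims = X.shape
    cores, r_prev = [], 1
    M = np.asarray(X)
    for n in dims[:-1]:
        M = M.reshape(r_prev * n, -1)
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))  # truncation rank
        cores.append(U[:, :r].reshape(r_prev, n, r))
        M = s[:r, None] * Vt[:r]
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))
    return cores

def tt_entry(cores, idx):
    # X(i_1, ..., i_d) = C_1(i_1) ... C_d(i_d)
    v = np.ones((1, 1))
    for C, i in zip(cores, idx):
        v = v @ C[:, i, :]
    return v[0, 0]

X = np.random.default_rng(0).standard_normal((4, 5, 6, 7))
cores = tt_svd(X)
assert np.isclose(tt_entry(cores, (1, 2, 3, 4)), X[1, 2, 3, 4])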

Some notations for the TT-format

$\def\matX {{\boldsymbol{\mathrm{X}}}} \def\matY {{\boldsymbol{\mathrm{Y}}}} \def\matZ {{\boldsymbol{\mathrm{Z}}}} \def\matC {{\boldsymbol{\mathrm{C}}}} \def\matI {{\boldsymbol{\mathrm{I}}}} \def\matP {{\boldsymbol{\mathrm{P}}}} \def\matQ {{\boldsymbol{\mathrm{Q}}}} \def\R {\mathbb{R}} $

Notation is hard; we use the one from (Lubich, Vandereycken, O., 2014).

We can introduce the left partial product of cores $\matX_{\le k} \in \R^{n_1 \times \cdots \times n_k \times r_k}$ as

$$ \matX_{\le k} (i_1,\dots,i_k,:) = \matC_1(i_1) \cdots \matC_k(i_k) $$

and the right partial product $\matX_{\ge k+1} \in \mathbb{R}^{n_{k+1} \times \cdots \times n_d \times r_k}$ as

$$ \matX_{\ge k+1} (i_{k+1},\dots,i_d,:) = \matC_{k+1}(i_{k+1}) \cdots \matC_d(i_d). $$

Some notations (2)

The $k$-th unfolding satisfies

$$\matX^{\langle k \rangle} = \matX_{\leq k} \matX_{\geq k+1}^{\top},$$

and there is a recursive connection

$$\matX_{\ge k} = ( \matX_{\ge k+1} \otimes \matI_{n_k} ) \matC_k^{>},$$ where $\matC_k^{>}$ is an unfolding of the core $\matC_k$.

This forms the basis for the orthogonalization:

given a QR decomposition of $\matX_{\ge k+1}$, it is very simple
to orthogonalize $\matX_{\ge k}$: just orthogonalize the next core, moving from right to left.
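A sketch of such a right-to-left orthogonalization sweep over all cores (our implementation):

In [ ]:
import numpy as np

def right_orthogonalize(cores):
    # After the sweep, every core except the first is right-orthogonal
    cores = [C.copy() for C in cores]
    for k in range(len(cores) - 1, 0, -1):
        r0, n, r1 = cores[k].shape
        # QR of the transposed right unfolding of core k
        Q, R = np.linalg.qr(cores[k].reshape(r0, n * r1).T)
        cores[k] = Q.T.reshape(-1, n, r1)
        # Absorb the triangular factor into the previous core
        cores[k - 1] = np.einsum('anb,bc->anc', cores[k - 1], R.T)
    return cores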

Dynamical low-rank in the TT-format

The DLR problem for the TT-format is formulated in the same way!

Projector-splitting in the dD-case

$$ \frac{dY(t)}{dt} = P_{Y(t)}\left(\frac{dA(t)}{dt}\right), \qquad Y(t_0)=Y_0\in \mathcal{M}, $$

$$ P_X =\sum_{k=1}^{d-1} (P_{\le k-1} P_{\ge k+1} - P_{\le k} P_{\ge k+1}) + P_{\le d-1} P_{\ge d+1}, $$

$$ \begin{split} &P_{\le k}\colon \R^{n_1 \times \cdots \times n_d} \to T_X \mathcal{M}, \quad Z \mapsto \mathrm{Ten}_k ( \matP_{\le k} \matZ^{\langle k \rangle}),\\ &P_{\ge k}\colon \R^{n_1 \times \cdots \times n_d} \to T_X \mathcal{M}, \quad Z \mapsto \mathrm{Ten}_{k-1} ( \matZ^{\langle k-1 \rangle} \matP_{\ge k} ). \end{split} $$

where $ \matP_{\le k} = \matQ_{\le k}\matQ_{\le k}^\top$ and $\matP_{\ge k} = \matQ_{\ge k}\matQ_{\ge k}^\top, $ with $\matQ_{\le k}$ and $\matQ_{\ge k}$ the orthogonal factors of $\matX_{\le k}$ and $\matX_{\ge k}$.

The scheme of the algorithm

The algorithm is organized through sweeps over the cores.

Solving equations

It is very easy to use KSL for solving equations!

$$ \frac{dY}{dt} = F(Y), $$

you just replace $\frac{dA}{dt}$ by $F(Y)$.

In the case of linear $F(Y)$, the local systems for the cores will also be linear,
which is a huge advantage over other methods!
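For intuition, here is the matrix-case version of a KSL step for $\frac{dY}{dt} = F(Y)$, using one explicit Euler substep inside each split equation (our simplification; any inner integrator can be used):

In [ ]:
import numpy as np

def ksl_ode_step(U0, S0, V0, F, tau):
    # Step "K": dK/dt = F(K V^T) V, with V = V0 frozen
    K = U0 @ S0 + tau * F(U0 @ S0 @ V0.T) @ V0
    U1, S_hat = np.linalg.qr(K)
    # Step "S": dS/dt = -U^T F(U S V^T) V (note the minus)
    S = S_hat - tau * U1.T @ F(U1 @ S_hat @ V0.T) @ V0
    # Step "L": dL/dt = F(U L^T)^T U, with U = U1 frozen
    L = V0 @ S.T + tau * F(U1 @ S @ V0.T).T @ U1
    V1, S1t = np.linalg.qr(L)
    return U1, S1t.T, V1

# If F is linear, e.g. F(Y) = M @ Y + Y @ M.T, each substep is a linear ODE
# for the factor being updated.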

Numerical experiments

Now we can go to numerical experiments

Application: Vibrational spectra

We compare the TT-KSL scheme with the MCTDH package on a benchmark problem with the Hénon-Heiles potential \begin{equation} \frac{d\psi}{dt} = i H \psi, \quad \psi(0) = \psi_0, \end{equation} where $H$ has the form \begin{equation} \def\Hlap{-\frac12 \Delta} \def\Hhar{\frac12 \sum_{k=1}^f q^2_k} \def\Hanh{\sum_{k=1}^{f-1}\left(q^2_k q_{k+1} - \frac13 q^3_{k+1} \right)} H = \Hlap + \underbrace{\Hhar + \overbrace{\lambda \Hanh}^{\textrm{anharmonic part}}}_{\textrm{Henon-Heiles potential}~V(q_1,\ldots,q_f)}. \end{equation}

Spectra comparison

Spectra for a 10D problem


Zoomed spectra


Computational time: 54354 sec (MCTDH) vs 4425 sec (TT-KSL)

Conclusions

  • A stable explicit scheme for DLR based on the QR decomposition, with a very cheap step
  • Direct applications (quantum spin systems, chemical master equation)
  • Indirect applications (projection of the update, optimization, ...)

Projector-splitting for optimization

The projector-splitting scheme has high potential for optimization, since it operates directly on discrete increments.

Given an iterative process $x_{k+1} = x_k + F_k$, we create a manifold version

$$x_{k+1} = I(x_k, F_k).$$

What about the convergence?
Typical estimates use the curvature of the manifold; for low-rank matrices the curvature is governed by the condition number, i.e. it can be arbitrarily high. In practice, however, it makes no difference.

Manifold iteration is better

$$x = Qx + f, \quad \Vert Q \Vert \leq \delta.$$

The solution $x^*$ has rank $5$: three large singular values, one at $10^{-3}$ of the largest, and one at $10^{-6}$ of the largest.

Legend: blue: original gradient method; green: manifold-projected gradient; cyan: the normal component of the error.
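A toy reproduction sketch of this experiment (our setup and data, not the authors'; here the manifold step uses a truncated-SVD retraction as a simple stand-in for the projector-splitting step):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, r, delta = 30, 5, 0.5

# Rank-5 solution with strongly graded singular values
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
X_star = U @ np.diag([1.0, 1.0, 1.0, 1e-3, 1e-6]) @ V.T

# Contraction Q with ||Q|| <= delta, acting on vectorized matrices
Q = rng.standard_normal((n * n, n * n))
Q *= delta / np.linalg.norm(Q, 2)
f = X_star.ravel() - Q @ X_star.ravel()   # fixed point is vec(X_star)

def truncate(X, r):
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]

x = np.zeros(n * n)          # original iteration
Xm = np.zeros((n, n))        # manifold-projected iteration
for k in range(25):
    x = Q @ x + f
    Xm = truncate((Q @ Xm.ravel() + f).reshape(n, n), r)
print(np.linalg.norm(x - X_star.ravel()), np.linalg.norm(Xm - X_star))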


In [2]:
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[2]:
