In [4]:
from IPython.html.services.config import ConfigManager
from IPython.paths import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'sky',
              'transition': 'zoom',
              'start_slideshow_at': 'selected',
              'scroll': True,
})


Out[4]:
{u'scroll': True,
 u'start_slideshow_at': 'selected',
 u'theme': 'sky',
 u'transition': 'zoom'}

Tensor trains and tensor networks

Ivan Oseledets, Skolkovo Institute of Science and Technology
oseledets.github.io, i.oseledets@skoltech.ru

In [ ]:
%reload_ext tikzmagic

CP-decomposition

A Canonical Polyadic (CP) decomposition is the decomposition of a general tensor in the form

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) U_2(i_2, \alpha) \ldots U_d(i_d, \alpha).$$
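
For concreteness, a small numpy sketch (the helper name cp_full is mine) that assembles a full tensor from given CP factors:

In [ ]:
import numpy as np

def cp_full(factors):
    # Assemble the full tensor from CP factors U_k of shape (n_k, r):
    # sum over alpha of the outer products of the alpha-th columns.
    r = factors[0].shape[1]
    A = 0
    for alpha in range(r):
        term = factors[0][:, alpha]
        for U in factors[1:]:
            term = np.multiply.outer(term, U[:, alpha])
        A = A + term
    return A

rng = np.random.default_rng(0)
factors = [rng.standard_normal((4, 3)) for _ in range(3)]   # d = 3, n = 4, r = 3
A = cp_full(factors)
print(A.shape)   # (4, 4, 4)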

A lot of work has been done (see the nice review by B. Bader & T. Kolda with all the proper references).

My original motivation

I have been working on low-rank approximation of matrices, where singular value decomposition (SVD) gives the best rank-$r$ approximation, which can be written as

$$A_{ij} \approx \sum_{\alpha=1}^r U_{i \alpha} V_{j \alpha}.$$

The best rank-$r$ approximation always exists, efficient algorithms are available in LAPACK, and efficient sampling via the skeleton decomposition is possible from $\mathcal{O}(dnr)$ entries.
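
A minimal numpy illustration of the matrix case, the truncated SVD as the best rank-$r$ approximation:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))
r = 10

# truncated SVD: the best rank-r approximation in the spectral and Frobenius norms
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ar = (U[:, :r] * s[:r]) @ Vt[:r]

# the spectral-norm error equals the first discarded singular value
print(np.linalg.norm(A - Ar, 2), s[r])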

My motivation

Go to 3 indices:

$$A_{ijk} \approx \sum_{\alpha=1}^r U_{i \alpha} V_{j \alpha} W_{k \alpha}.$$

Everything becomes more complicated (there are exceptions: for instance, if $r$ is equal to one of the mode sizes, the decomposition can be computed via a generalized eigenvalue problem / simultaneous matrix factorizations...)

Example 1:

The first example of a tensor that is bad for the CP-format is the following: $$A(i_1, \ldots, i_d) = i_1 + \ldots + i_d.$$

Define $$P(i_1, \ldots, i_d, t) = (1 + i_1 t)(1 + i_2 t) \ldots (1 + i_d t), $$ then

$$A = P'(0) = \frac{P(h) - P(0)}{h} + \mathcal{O}(h).$$

The CP-rank is $d$, but the tensor can be approximated to any accuracy with rank $2$.
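
The rank-$2$ approximation is just the divided difference of the two rank-$1$ tensors $P(h)$ and $P(0)$; a quick numerical check (grid sizes and names are my choice):

In [ ]:
import numpy as np

d, n = 4, 10
grids = np.indices((n,) * d)                 # index arrays i_1, ..., i_d
A = sum(grids)                               # A(i_1, ..., i_d) = i_1 + ... + i_d

def P(h):
    # rank-1 tensor (1 + i_1 h)(1 + i_2 h) ... (1 + i_d h)
    T = np.ones((n,) * d)
    for g in grids:
        T = T * (1 + g * h)
    return T

for h in [1e-1, 1e-2, 1e-3]:
    A2 = (P(h) - P(0.0)) / h                 # rank-2 tensor: difference of two rank-1 tensors
    print(h, np.abs(A - A2).max())           # the error decreases like O(h)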

Example 2:

The CP-rank can depend on the field, as the following example shows:

$$A(i_1, \ldots, i_d) = \sin(h (i_1 + \ldots + i_d)).$$

Simple trigonometry: $$r = 2^d$$

Advanced trigonometry: $$r = d$$

Complex arithmetic: $$r = 2$$ (write $\sin x = \frac{e^{\mathrm{i}x} - e^{-\mathrm{i}x}}{2\mathrm{i}}$; each exponential factorizes into a rank-$1$ term).

Tensor of matrix-by-matrix multiplication

The tensor of multiplication of two $3 \times 3$ matrices is a $9 \times 9 \times 9$ tensor for which the exact value of the canonical rank is still not known!

Tucker decomposition

The Tucker decomposition is an old-time alternative to the CPD, but it is susceptible to the curse of dimensionality:

$$A(i_1, \ldots, i_d) \approx \sum_{\alpha_1, \ldots, \alpha_d} G(\alpha_1, \ldots, \alpha_d) U_1(i_1, \alpha_1) \ldots U_d(i_d, \alpha_d).$$

The number of parameters is $dnr + r^d$ (perfectly fine for compression purposes and 3D-..., but not for 100D).

Good news: a quasioptimal approximation can be computed by SVD, and the best approximation always exists (the set is closed).
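
A rough numpy sketch of this SVD-based construction (the higher-order SVD); the function names are mine, and the code is a plain illustration rather than an efficient implementation:

In [ ]:
import numpy as np

def hosvd(A, ranks):
    """Quasioptimal Tucker approximation via SVDs of the mode-k unfoldings."""
    U = []
    for k in range(A.ndim):
        Ak = np.moveaxis(A, k, 0).reshape(A.shape[k], -1)   # mode-k unfolding
        u = np.linalg.svd(Ak, full_matrices=False)[0]
        U.append(u[:, :ranks[k]])
    G = A
    for k, Uk in enumerate(U):                               # core: contract each mode with U_k^T
        G = np.moveaxis(np.tensordot(Uk.T, np.moveaxis(G, k, 0), axes=1), 0, k)
    return G, U

def tucker_full(G, U):
    A = G
    for k, Uk in enumerate(U):
        A = np.moveaxis(np.tensordot(Uk, np.moveaxis(A, k, 0), axes=1), 0, k)
    return A

# exact recovery of a random tensor with Tucker ranks (3, 3, 3)
rng = np.random.default_rng(0)
G0 = rng.standard_normal((3, 3, 3))
U0 = [rng.standard_normal((20, 3)) for _ in range(3)]
A = tucker_full(G0, U0)
G, U = hosvd(A, ranks=(3, 3, 3))
print(np.linalg.norm(A - tucker_full(G, U)) / np.linalg.norm(A))   # ~1e-15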

In between

A simple question: is there anything between the CPD and the Tucker format that is free from the inherent curse of dimensionality, but is still a stable tensor format?

The answer is definitely yes, and the simplest example of such a new tensor factorization is the linear tensor network, or tensor train (TT) format.

TT-format

A tensor is said to be in the TT-format if

$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ for a fixed $i_k$ is a matrix of size $r_{k-1} \times r_k$.

Index form: $$A(i_1, \ldots, i_d) = \sum_{\alpha_1, \ldots, \alpha_{d-1}} G_1(i_1, \alpha_1) G_2(\alpha_1, i_2, \alpha_2) \ldots G_d(\alpha_{d-1}, i_d).$$
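
In code, a TT-tensor is just a list of three-dimensional cores, and evaluating one element is a chain of small matrix products; a minimal sketch (naming conventions are mine):

In [ ]:
import numpy as np

def tt_element(cores, idx):
    """A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d);
    cores[k] has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1."""
    v = np.ones((1, 1))
    for G, i in zip(cores, idx):
        v = v @ G[:, i, :]
    return v[0, 0]

# random TT-tensor with d = 5, mode size 4, all TT-ranks equal to 3
rng = np.random.default_rng(0)
ranks = [1, 3, 3, 3, 3, 1]
cores = [rng.standard_normal((ranks[k], 4, ranks[k + 1])) for k in range(5)]
print(tt_element(cores, (0, 1, 2, 3, 0)))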

Other stable manifolds

The concept of tensor networks can be extended to other types of graphs, the H-Tucker approach being the most general;

the TT-format is the simplest algebraically and typically there is no increase in complexity.

In general, the tensor network should not have loops.

Loopy tensor networks are hard

Harder than the CPD: even computing a single element has exponential cost (the storage does not: every node of the network is just a small tensor, e.g. a 5D one).

(In quantum information theory this serves as a justification for the quantum computer.)

TT as an intersection of matrix manifolds

There exists a TT-decomposition with

$$r_k = \mathrm{rank}(A_k),$$

where $A_k$ is the $k$-th unfolding of the tensor, i.e.

$$A_{k} = A(i_1 \ldots i_k; i_{k+1} \ldots i_d).$$
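
Numerically, the ranks $r_k$ can be read off directly from the unfoldings; a small sketch (applied to the tensor from Example 1, whose TT-ranks are all equal to $2$):

In [ ]:
import numpy as np

def unfolding_ranks(A):
    """r_k = rank of the k-th unfolding A(i_1 ... i_k; i_{k+1} ... i_d)."""
    n = A.shape
    return [np.linalg.matrix_rank(A.reshape(int(np.prod(n[:k + 1])), -1))
            for k in range(A.ndim - 1)]

A = sum(np.indices((6, 6, 6, 6))).astype(float)   # A(i_1, ..., i_4) = i_1 + ... + i_4
print(unfolding_ranks(A))                         # [2, 2, 2]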

TT-SVD

The quasioptimal approximation can be computed via sequential SVD of the auxiliary matrices:

$$A(i_1; i_2 \ldots i_d) \approx \sum_{\alpha_1=1}^{r_1} G_1(i_1, \alpha_1) G_2(\alpha_1 i_2; i_3 \ldots i_d),$$

then we separate $\alpha_1 i_2$ by another SVD and so on.

TT-SVD can be coded in $\sim$ 50 lines of MATLAB / Python code.
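
Indeed, here is a compact (and unoptimized) numpy version along these lines; the truncation strategy and naming are mine:

In [ ]:
import numpy as np

def tt_svd(A, eps=1e-10):
    """Sequential SVDs of the unfoldings (d >= 2); returns cores of shape (r_{k-1}, n_k, r_k)."""
    d, n = A.ndim, A.shape
    delta = eps * np.linalg.norm(A) / np.sqrt(d - 1)     # per-step truncation threshold
    cores, r_prev = [], 1
    C = A.reshape(n[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        tail = np.cumsum(s[::-1] ** 2)[::-1]             # tail[j] = sum of s_i^2 for i >= j
        r = len(s)
        while r > 1 and tail[r - 1] <= delta ** 2:       # discard trailing singular values
            r -= 1
        cores.append(U[:, :r].reshape(r_prev, n[k], r))
        C = (s[:r, None] * Vt[:r]).reshape(r * n[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, n[-1], 1))
    return cores

def tt_full(cores):
    A = cores[0]
    for G in cores[1:]:
        A = np.tensordot(A, G, axes=1)
    return A[0, ..., 0]

# the tensor i_1 + ... + i_d is recovered with all TT-ranks equal to 2
A = sum(np.indices((8, 8, 8, 8))).astype(float)
cores = tt_svd(A)
print([G.shape for G in cores])
print(np.linalg.norm(A - tt_full(cores)) / np.linalg.norm(A))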

Other important stuff

  • Always $r_k \leq r$, where $r$ is the canonical rank.
  • Basic linear algebra in the TT-format
  • Fast rounding (recompression of a tensor that is already in the TT-format but with suboptimal ranks; the structured TT-SVD costs $\mathcal{O}(dnr^3)$ operations and is computed exactly)
  • TT-cross approximation
  • Open-source TT-Toolbox (both for MATLAB and Python; a usage sketch follows this list)
  • Optimization over TT-manifolds
  • Dynamical approximation
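
A rough usage sketch of the Python TT-Toolbox (the ttpy package) mentioned above; the exact calls and signatures below are written from memory and should be treated as assumptions:

In [ ]:
import numpy as np
import tt    # ttpy, the Python TT-Toolbox; the API is assumed and may differ between versions

# full tensor -> TT-format with a prescribed relative accuracy
a = tt.tensor(sum(np.indices((8,) * 6)).astype(float), 1e-10)
print(a.r)                        # TT-ranks, expected to be [1 2 2 2 2 2 1]

# basic linear algebra + rounding: addition formally doubles the TT-ranks,
# rounding (TT recompression) brings them back down
b = (a + a).round(1e-10)
print(b.r)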

TT-cross approximation

The tensor $\mathcal{A}$ is given as a black-box subroutine that allows us to compute any prescribed element.

Suppose we have a priori knowledge that $r_k \leq r$.

Then the tensor can be exactly recovered from $\mathcal{O}(dnr^2)$ samples,

and we have a fairly robust deterministic algorithm (TT-cross) to compute those entries.

Reminder: the matrix case

It is a generalization of the fabulous skeleton decomposition of a rank-$r$ matrix:

any rank-$r$ matrix can be recovered from $r$ columns and $r$ rows as

$$A = C \widehat{A}^{-1} R.$$
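
A tiny numpy check of the exact-rank case (random rows and columns suffice here with probability one; the names are mine):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))     # exact rank-r matrix

rows = rng.choice(n, r, replace=False)
cols = rng.choice(n, r, replace=False)
A_hat = A[np.ix_(rows, cols)]                                      # r x r intersection block

A_skel = A[:, cols] @ np.linalg.inv(A_hat) @ A[rows, :]
print(np.linalg.norm(A - A_skel) / np.linalg.norm(A))              # ~1e-14: exact recovery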

Maxvol estimate

Goreinov, Tyrtyshnikov:

If $\widehat{A}$ is of maximal volume (absolute value of the determinant), then

$$\Vert A - A_{skel} \Vert_C \leq (r + 1) \sigma_{r+1}.$$

The maximal-volume principle (together with the maxvol algorithm) gives a simple procedure for the search of good rows and columns.
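
A simplified sketch of a greedy maxvol search and of the resulting skeleton approximation; this is an illustration under my own simplifying assumptions, not the production maxvol code:

In [ ]:
import numpy as np

def maxvol(A, tol=1.01, max_iters=200):
    """Greedy search for r rows of a tall n x r matrix whose r x r submatrix
    has (locally) maximal volume, i.e. maximal |det|."""
    n, r = A.shape
    rows = np.arange(r)                      # crude start; assumes A[:r] is nonsingular
    for _ in range(max_iters):
        B = A @ np.linalg.inv(A[rows])       # the selected rows of B form the identity
        i, j = np.unravel_index(np.argmax(np.abs(B)), B.shape)
        if abs(B[i, j]) < tol:               # no single swap increases |det| noticeably
            break
        rows[j] = i                          # swapping multiplies |det| by |B[i, j]|
    return rows

# empirical check of quasi-optimality on a noisy low-rank matrix
rng = np.random.default_rng(1)
n, r = 300, 8
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n)) \
    + 1e-6 * rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)
rows = maxvol(U[:, :r])                      # good rows from the dominant left subspace
cols = maxvol(Vt[:r].T)                      # good columns from the dominant right subspace
A_skel = A[:, cols] @ np.linalg.inv(A[np.ix_(rows, cols)]) @ A[rows, :]
print(np.abs(A - A_skel).max(), (r + 1) * s[r])   # elementwise error vs. (r + 1) * sigma_{r+1}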

TT-cross & skeleton

The TT-cross result is a direct generalization of the matrix skeleton decomposition (and also comes with several adaptive algorithms).

Experience: randomization is needed to improve the robustness of the algorithms, but a good algorithm cannot rely only on random sampling of entries; it should be adaptive.

Software packages

They implement basic operations, but also quite advanced algorithms, like

solving optimization problems with TT-rank constraint:

$$F(X) \rightarrow \min$$

where $X$ has low TT-ranks.

This includes high-dimensional linear systems, eigenvalue problems, dynamical low-rank approximation.

Optimization over TT-manifolds

An optimization over TT-manifolds is a separate story.

The structure is polylinear, thus ALS (alternating least squares) is the simplest choice, and its convergence is tremendously better than in the CP case.

I.e., it is typical that the method converges in $5$-$20$ iterations to very high accuracy.

This is due to the right parametrization and stable representation (i.e. ALS in 2D converges very fast).
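
To see the 2D statement concretely, here is ALS for a rank-$r$ matrix factorization (a minimal sketch; for exact rank-$r$ data it is generically exact already after the first sweep):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))    # exact rank-r target

# ALS for min ||A - U V^T||_F: each half-step is an ordinary linear least-squares problem
U = rng.standard_normal((n, r))
for it in range(5):
    V = np.linalg.lstsq(U, A, rcond=None)[0].T      # fix U, solve for V
    U = np.linalg.lstsq(V, A.T, rcond=None)[0].T    # fix V, solve for U
    print(it, np.linalg.norm(A - U @ V.T) / np.linalg.norm(A))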

Application areas

  • TensorNet layer (Novikov et al., NIPS 2015) - tremendous compression of a fully-connected layer of a DNN
  • Inference in MRF (Novikov et al., ICML 2014) - very accurate computations
  • Quantum chemistry
  • Multidimensional integrals
  • (multiscale) PDEs on very fine grids
  • Uncertainty quantification
  • $\ldots$

Hierarchical uncertainty quantification

Given a model $$y = f(x)$$ where $x$ are now random variables, represent in a tensor-structured way the distribution of the output variables $$\rho(y_1, \ldots, y_k).$$

That requires algebraic manipulations with multivariate functions, and all of this can be implemented through the TT-formalism (including the non-linear operations).

Quantization idea

The software is robust, so let us find high dimensions where there are no high dimensions.

Consider a 1D signal $$f(x) = \sin x$$ sampled on a uniform grid with $2^d$ points.

Reshape it into $2 \times 2 \times \ldots \times 2$ $d$-dimensional tensor.

That gives a TT-tensor of rank $2$, i.e. $\mathcal{O}(2dr^2) = \mathcal{O}(\log N)$ parameters.
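
A direct numerical check of this claim (the grid and the tolerance are my choice):

In [ ]:
import numpy as np

d = 16
x = np.linspace(0, 2 * np.pi, 2 ** d)
f = np.sin(x).reshape([2] * d)        # "quantize": 2^d samples -> a 2 x 2 x ... x 2 tensor

# TT-ranks are the ranks of the unfoldings; for sin they are all 2,
# since sin(a + b) = sin(a) cos(b) + cos(a) sin(b)
for k in range(1, d):
    s = np.linalg.svd(f.reshape(2 ** k, -1), compute_uv=False)
    print(k, int(np.sum(s > 1e-10 * s[0])))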

PDEs with astronomically large number of "virtual" unknowns

This gives a concept: take the simplest representation but with a huge virtual grid with $2^d$ points.

Kazeev & Schwab (2015) have proved that for a wide class of 2D elliptic PDEs the solution has low-rank QTT structure.

We have recently implemented a solver that makes it possible to "solve" problems with $2^{60}$ unknowns on a laptop in seconds.

Publications and software


In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../custom.css", "r").read()
    return HTML(styles)
css_styling()

