Tensor trains: theory, algorithms, applications

Ivan Oseledets, Skolkovo Institute of Science and Technology
oseledets.github.io, i.oseledets@skoltech.ru

This talk

This talk is about tensors.

A tensor is just a multidimensional array

$$A(i_1, \ldots, i_d), \quad 1 \leq i_k \leq n_k.$$

Suppose that $n_k \approx n$; then the total number of elements to be stored is $n^d$.

The core idea

Numerical tensor methods have many applications in machine learning, PDEs, data mining.

In one sentence, it is a generalization of standard linear algebra (the singular value decomposition) to the multidimensional case.

Why do we need many dimensions?

Do we really need so many dimensions?

Consider image / video / signal processing.

A color image is a 3D array, a video is a 4D array, but we need more.

Consider a $512 \times 512$ grayscale image. Can we make a tensor out of it?

Introduce virtual dimensions!

1D case

Take a signal $f(x)$ sampled on a uniform grid with $2^d$ points and reshape it into a $2 \times \ldots \times 2$ $d$-dimensional tensor

$$V(i_1, \ldots, i_d) = v(i_1 + 2 i_2+ \ldots+2^{d-1} i_d)$$

For example, an exponential signal $f(x) = \exp(\lambda x)$ breaks into a product of vectors of length $2$!
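A minimal NumPy sketch (with hypothetical values of $d$ and $\lambda$) of this tensorization: reshape the sampled exponential into a $2 \times \ldots \times 2$ tensor and check that every unfolding has rank $1$, i.e., the signal separates into a product of length-$2$ vectors.

```python
import numpy as np

d, lam = 10, 0.7
x = np.arange(2**d) / 2**d            # uniform grid on [0, 1)
v = np.exp(lam * x)                   # exponential signal
V = v.reshape([2] * d)                # virtual 2 x 2 x ... x 2 tensor

# Every unfolding of V is an outer product of two vectors, hence rank 1.
for k in range(1, d):
    print(k, np.linalg.matrix_rank(V.reshape(2**k, 2**(d - k))))
```

Note that NumPy's C-order reshape reverses the bit ordering relative to the formula above; for a separable signal this does not change the unfolding ranks.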

2D case

First we tensorize indices, then we permute them to bring $i_k j_k$ together.

$$A(i_1 i_2 \ldots i_d; j_1 \ldots j_d) \rightarrow B(i_1 j_1; i_2 j_2; \ldots; i_d j_d) \quad \mbox{(Hilbert curve ordering)}$$

Why we do it is another question: our goal is to introduce a reasonable structure for representing images via tensors.

Signals -> Tensors

Now we have a way to make high-dimensional tensors. But how do we compress them? How do we use them in image processing tasks?

We need compact representations, i.e. tensor decompositions

Tensor decompositions

The idea of tensor decomposition is to represent a tensor (multidimensional array) in terms of simpler arrays.

This idea comes from matrix factorizations (represent a matrix as a product of simpler ones).

What is the simplest tensor decomposition

The key idea is the idea of separation of variables:

$$A(i_1, \ldots, i_d) = U_1(i_1) \ldots U_d(i_d)$$


In probability theory, it corresponds to the idea of independence of variables.

This class is too narrow to represent useful functions (or probability distributions).

Sum-of-products

The sum-of-products representation is much more interesting:

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$

also named canonical format, or multivariate factor model.

Two questions

  • What is this decomposition for $d=2$?
  • What is good and/or bad about it for $d \geq 3$?

Two-dimensional case

For the two-dimensional case, it boils down to the approximation $$A \approx UV^{\top}, $$ where $U$ is $n \times r$ and $V$ is $m \times r$.

It is the low-rank approximation of matrices, and the best rank-$r$ approximation can be computed using the singular value decomposition.
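A minimal NumPy sketch of this fact on a hypothetical nearly low-rank matrix: the best rank-$r$ approximation is obtained by keeping the $r$ leading singular triplets (Eckart-Young theorem).

```python
import numpy as np

n, m, r = 200, 150, 10
# A low-rank matrix plus small noise, for illustration only.
A = np.random.randn(n, r) @ np.random.randn(r, m) + 1e-6 * np.random.randn(n, m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]                  # keep r leading singular triplets
print(np.linalg.norm(A - A_r) / np.linalg.norm(A))    # small relative error
```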

Multidimensional case

For $d \geq 3$, the canonical decomposition of the form

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha)$$

is very difficult to compute numerically; there is even an example of a $9 \times 9 \times 9$ tensor for which the tensor rank is not known!

But this is not the only tensor decomposition we can use.

Literature on tensor formats

  • T. Kolda and B. Bader, Tensor decompositions and applications, SIREV (2009)
  • W. Hackbusch, Tensor spaces and numerical tensor calculus, 2012
  • L. Grasedyck, D. Kressner, C. Tobler, A literature survey of low-rank tensor approximation techniques, 2013

Software

Sometimes it is better to study with the code and to try your own application.

Basic stuff

Now we will briefly recall the two-dimensional case, which is important for the techniques for the computation of tensor decompositions.

Singular value decomposition

Given an $n \times m$ matrix $A$, its singular value decomposition (SVD) is defined as

$$A = U S V^{\top},$$

where $U^{\top} U = V^{\top} V = I$, and $S$ contains the singular values on the diagonal.

Computation of the SVD

The standard algorithm for computing the SVD requires $\mathcal{O}(N^3)$ operations and can be done for matrices of size $1000$-$5000$ on a laptop.

Large-scale SVD is a topic of ongoing research.

The crucial point that makes it possible is the skeleton decomposition.

Skeleton decomposition

Given a rank-$r$ matrix, how many elements do you need to compute to recover the full matrix?

Skeleton decomposition

It is enough to sample $r$ columns and $r$ rows to exactly recover a rank-$r$ matrix:

$$A = U \widehat{A}^{-1} V,$$

<img src="cross-pic.png">
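A minimal NumPy sketch of this reconstruction on a synthetic exactly rank-$r$ matrix; the rows and columns are chosen at random purely for illustration (in practice they should be chosen by a maxvol-type procedure, as discussed below).

```python
import numpy as np

n, m, r = 100, 80, 5
A = np.random.randn(n, r) @ np.random.randn(r, m)      # exactly rank-r matrix

cols = np.random.choice(m, r, replace=False)
rows = np.random.choice(n, r, replace=False)
C, R = A[:, cols], A[rows, :]                          # r columns and r rows
Ahat = A[np.ix_(rows, cols)]                           # r x r intersection submatrix

A_rec = C @ np.linalg.solve(Ahat, R)                   # = C Ahat^{-1} R
print(np.linalg.norm(A - A_rec) / np.linalg.norm(A))   # ~1e-14 if Ahat is nonsingular
```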

Approximate rank-r

What happens if the matrix is of approximate low rank?

$A = R + E, \quad \mathrm{rank~} R = r, \quad ||E||_C = \varepsilon$

Maximum-volume principle

The answer is given by the maximum-volume principle: select the columns and rows such that the submatrix $\widehat{A}$ has maximal volume (modulus of the determinant) among all $r \times r$ submatrices; then the quasi-optimality estimate holds:

$$\Vert A - A_{skel} \Vert \leq (r+1)^2 \Vert A - A_{best} \Vert.$$

Back to earth: applications of maxvol

The maximal-volume principle appears in many areas; one of the earliest related references I know is D-optimal design (Fedorov, 1972), and it is related to the selection of optimal interpolation points for a given basis set:

$$f(x) \approx \sum_{\alpha=1}^r c_{\alpha} g_{\alpha}(x),$$

instead of doing linear regression, just do interpolation at $r$ points.

Some more ideas

  • Machine learning tasks: given millions of features, select most important features.

Linear algebra showcase

Given an $n \times r$ matrix, find the $r \times r$ submatrix of largest volume in it.

Use a greedy algorithm (now we know that it is D-optimality, dating back to at least 1972); a code sketch follows the list.

  • Take some rows and put them in the first $r$ positions. Compute $B = A \widehat{A}^{-1}$.
  • $B = \begin{pmatrix} I \\ Z \end{pmatrix}$
  • Suppose the maximal (in modulus) element of $Z$ is in position $(i,j)$.
  • Swap the $i$-th row with the $j$-th row.
  • Stop if the maximal element is less than $(1+\delta)$ in modulus.
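A sketch of this greedy iteration in NumPy, assuming the initial $r$ rows form a nonsingular submatrix; the tolerance and loop structure are illustrative, and a practical implementation would update $B$ with rank-one formulas instead of a full solve at every step.

```python
import numpy as np

def maxvol(A, delta=1e-2, maxiter=100):
    """Greedy search for row indices of a large-volume r x r submatrix of A (n x r)."""
    n, r = A.shape
    idx = np.arange(n)                   # current ordering; the first r rows form Ahat
    for _ in range(maxiter):
        # B = A Ahat^{-1}; its first r rows are the identity
        B = np.linalg.solve(A[idx[:r]].T, A[idx].T).T
        i, j = np.unravel_index(np.abs(B).argmax(), B.shape)
        if abs(B[i, j]) <= 1.0 + delta:  # all entries essentially in [-1, 1]: stop
            break
        idx[[i, j]] = idx[[j, i]]        # swap row i into position j of the submatrix
    return np.sort(idx[:r])

# Usage on hypothetical data: rows = maxvol(np.random.randn(1000, 10))
```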

Key idea

Select subsets of the raw data that describe the behaviour sufficiently well (an interpolation problem).

Cross approximation

For the two-dimensional case, the best algorithm to recover the rank-$r$ approximation is
the cross approximation algorithm.

Is it better than random sampling of columns?

Deterministic sampling is always better than random sampling!

Randomized techniques for low-rank approximation have become popular recently (Martinsson, Halko, Rokhlin), but no thorough comparisons have been done.

Multidimensional case

How to generalize the idea of separation of variables to higher dimensions?

Important points

  • SVD is good
  • Best approximation exists
  • Interpolation via skeleton

Straightforward idea: canonical format

$A(i_1,\ldots,i_d) \approx \sum_{\alpha=1}^r U_1(i_1,\alpha)\ldots U_d(i_d,\alpha)$

$r$ is called (approximate) canonical rank, $U_k$ --- canonical factors.

Bad things about the canonical format:

  • Best approximation may not exist
  • Computing the canonical rank is NP-hard (while the matrix rank is computable in polynomial time)
  • No good algorithm

Bad example (1):

$f(x_1,\ldots,x_d) = x_1 + x_2 + \ldots + x_d$,

the canonical rank is $d$, but it can be approximated with rank $2$ to any accuracy!

Bad example (2):

The canonical rank may depend on the field (matrix rank cannot!)

$f(x_1,\ldots,x_d) = \sin(x_1 + \ldots + x_d)$

  • Complex field: 2 (obvious!)
  • Real field: $d$

Another example: Tucker

Another attempt to avoid these difficulties is the Tucker format (Tucker, 1966; De Lathauwer et al., 2000s), first used in statistics, where the data was given as a data cube.

$$A(i,j,k) \approx \sum_{\alpha \beta \gamma} G(\alpha,\beta,\gamma) U(i,\alpha)V(j,\beta)W(k,\gamma)$$

You can compute the Tucker decomposition by means of the SVD (a code sketch follows the steps):

  • Compute the unfoldings: $A_1, A_2, A_3$
  • Compute left SVD factors: $A_i \approx U_i \Phi_i$
  • Compute the core: $G = A \times_1 U^{\top}_1 \times_2 U^{\top}_2 \times_3 U^{\top}_3$.
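A minimal NumPy sketch of these three steps (the HOSVD recipe) for a 3D tensor; the ranks are user-chosen truncation parameters, and the data here is hypothetical.

```python
import numpy as np

def hosvd(A, ranks):
    """Tucker factors and core via SVDs of the unfoldings (HOSVD-style sketch)."""
    U = []
    for k, r in enumerate(ranks):
        Ak = np.moveaxis(A, k, 0).reshape(A.shape[k], -1)      # mode-k unfolding
        Uk = np.linalg.svd(Ak, full_matrices=False)[0][:, :r]  # leading left singular vectors
        U.append(Uk)
    G = A
    for k, Uk in enumerate(U):           # core: contract each mode with U_k^T
        G = np.moveaxis(np.tensordot(Uk.T, np.moveaxis(G, k, 0), axes=1), 0, k)
    return G, U

G, U = hosvd(np.random.randn(20, 30, 40), (5, 6, 7))
print(G.shape)   # (5, 6, 7)
```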

Main problem with Tucker format

The main problem with the Tucker format is that it cannot be used to represent high-dimensional functions:

we reduce the $n^d$ complexity to $r^d$, which makes $d = 10,\ldots, 20$ tractable, but not more.

Are there other ideas?

Main motivation

Matrices are good, so reduce everything to matrices!

Splitting into halves

Split the index set into two halves, get an $n^{d/2} \times n^{d/2}$ matrix, and compute its SVD:

$$A(I, J) = \sum_{\alpha=1}^r U_{I, \alpha} V_{J, \alpha},$$

then we have $(d_1 + 1)$- and $(d_2 + 1)$-dimensional tensors for the factors (with $d_1 + d_2 = d$). Do it recursively!

Another construction

Another construction, the so-called Hierarchical Tucker format, goes as follows:

  1. Compute the Tucker decomposition, get an $r \times r \times \ldots \times r$ array;
  2. Merge the indices in pairs and compute transfer tensors of size $r \times r \times r$;
  3. Go up the tree.

Simplest construction

The simplest construction separates one index at a time and gives the tensor-train (or matrix-product state) decomposition

$$A(i_1, \ldots, i_d) \approx G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for each fixed $i_k$.

Similar to Hidden Markov models

Tensor-train

Tensor-train (in physics known as matrix-product-state) representation has the form

$$ A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ has size $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.

Now we have $d-1$ ranks (we call them TT-ranks). The parametrization is defined by the TT-cores $G_k$.
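A minimal NumPy sketch of how this parametrization is used: each core is stored here as an $r_{k-1} \times n_k \times r_k$ array (a storage convention assumed for illustration), and a tensor entry is the product of the corresponding matrix slices.

```python
import numpy as np

d, n, r = 6, 4, 3
ranks = [1] + [r] * (d - 1) + [1]
# Random TT-cores, for illustration only; core k has shape (r_{k-1}, n, r_k).
cores = [np.random.randn(ranks[k], n, ranks[k + 1]) for k in range(d)]

def tt_entry(cores, index):
    """A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d)."""
    res = np.ones((1, 1))
    for G, i in zip(cores, index):
        res = res @ G[:, i, :]           # multiply by the r_{k-1} x r_k slice
    return res[0, 0]

print(tt_entry(cores, [0, 1, 2, 3, 0, 1]))
```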

Why tensor train?

TT-ranks are matrix ranks

Define unfoldings:

$A_k = A(i_1 \ldots i_k; i_{k+1} \ldots i_d)$, an $n^k \times n^{d-k}$ matrix.

Theorem: there exists a TT-decomposition with TT-ranks

$$r_k = \mathrm{rank}\, A_k.$$

The proof

The proof is simple and gives the TT-SVD algorithm (in quantum information it is known as the Vidal algorithm).
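A minimal NumPy sketch of the TT-SVD idea behind the proof: reshape, take the SVD of the current unfolding, keep the left factor as a core, and carry the rest on. This is a bare-bones illustration, not the production routine from TT-Toolbox.

```python
import numpy as np

def tt_svd(A, eps=1e-10):
    """Sequential SVDs of unfoldings; returns cores of shape (r_{k-1}, n_k, r_k)."""
    d, shape = A.ndim, A.shape
    cores, r_prev = [], 1
    M = A.reshape(r_prev * shape[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))            # numerical rank of the unfolding
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        M = (s[:r, None] * Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(M.reshape(r_prev, shape[-1], 1))
    return cores

# Check the reconstruction on hypothetical data
A = np.random.randn(3, 4, 5, 6)
cores = tt_svd(A)
B = cores[0]
for G in cores[1:]:
    B = np.tensordot(B, G, axes=1)                         # contract the rank index
print(np.linalg.norm(B.reshape(A.shape) - A))              # ~1e-14 without truncation
```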

Stability estimate

No exact ranks in practice -- stability estimate!

If $A_k = R_k + E_k$, where $\mathrm{rank}\, R_k = r_k$ and $\Vert E_k \Vert_F = \varepsilon_k$, then $$\Vert\mathbf{A}-\mathbf{TT}\Vert_F \leq \sqrt{\sum_{k=1}^{d-1} \varepsilon^2_k}.$$

Fast and trivial linear algebra

Addition, Hadamard product, scalar product, convolution

All scale linearly in $d$.

$$C(i_1,\ldots,i_d) = A(i_1,\ldots,i_d) B(i_1,\ldots,i_d)$$

$$C_k(i_k) = A_k(i_k) \otimes B_k(i_k),$$ i.e., the TT-ranks are multiplied.
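A minimal NumPy sketch of this rule, with cores stored as $(r_{k-1}, n_k, r_k)$ arrays (a convention assumed here for illustration): each slice of the product core is the Kronecker product of the corresponding slices, so the TT-ranks multiply.

```python
import numpy as np

def tt_hadamard(cores_a, cores_b):
    """Cores of C = A .* B: C_k(i_k) = A_k(i_k) kron B_k(i_k)."""
    cores_c = []
    for Ga, Gb in zip(cores_a, cores_b):
        n = Ga.shape[1]
        slices = [np.kron(Ga[:, i, :], Gb[:, i, :]) for i in range(n)]
        cores_c.append(np.stack(slices, axis=1))
    return cores_c

# Hypothetical random TT tensors with ranks 2 and 3: the product has ranks 6.
d, n = 5, 4
ra, rb = [1, 2, 2, 2, 2, 1], [1, 3, 3, 3, 3, 1]
A = [np.random.randn(ra[k], n, ra[k + 1]) for k in range(d)]
B = [np.random.randn(rb[k], n, rb[k + 1]) for k in range(d)]
print([G.shape for G in tt_hadamard(A, B)])
```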

Rounding

Rounding procedure solves the following problem: given a TT-representation

$$A(i_1, \ldots, i_d) = G_1(i_1) G_2(i_2) \ldots G_d(i_d)$$

find another one with smaller memory (smaller TT-ranks) that approximates it within a prescribed accuracy $\epsilon$.

It can be done in $\mathcal{O}(dnr^3)$ operations using the linear structure of the format: we can orthogonalize the cores!

Cross approximation in $d$-dimensions

Oseledets and Tyrtyshnikov, TT-cross approximation of multidimensional arrays: it generalizes the matrix result.

Theorem: A rank-$r$ tensor can be recovered from $\mathcal{O}(dnr^2)$ elements.

Quantized tensor-train

To use tensor decompositions, we have to approximate a continuous problem by a discrete one, where the vector of unknowns can be indexed by a $d$-index.

  1. (Standard) Given a function of $d$ variables, introduce a tensor-product basis set; the tensor is then the tensor of coefficients
  2. (Non-standard) Given a function of one variable, sample it on a uniform grid with $2^d$ points and reshape the result into a $2 \times \ldots \times 2$ $d$-dimensional tensor

Quantized tensor train

It is a mapping from 1D functions to $d$-dimensional arrays, using the following representation:

$$v_k = f(x_k), \quad k = 1, \ldots, 2^d.$$

$V(i_1, \ldots, i_d) = v(a + i h)$, $\quad i = i_1 + 2 i_2 + \ldots + 2^{d-1} i_d.$

It gives $\mathcal{O}(2dr^2) = \mathcal{O}(\log N)$ parameters.

Representation of some basic functions

Almost all important functions in 1D have low QTT-ranks (a quick numerical check follows the list):

  • $f(x) = \exp(\lambda x)$ -- rank $1$
  • $f(x) = \sin( w x + \alpha)$ -- rank $2$
  • $f(x)$ -- a polynomial of degree $p$, rank $p+1$
  • Rational functions, even non-smooth functions (the Weierstrass function has bounded QTT-ranks)
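The promised numerical check (NumPy, hypothetical $d$ and $w$): the unfolding ranks of the quantized $\sin(wx)$, which bound the QTT-ranks, should all equal $2$.

```python
import numpy as np

d, w = 12, 7.0
x = np.arange(2**d) / 2**d
V = np.sin(w * x).reshape([2] * d)        # quantized representation of the samples
print([np.linalg.matrix_rank(V.reshape(2**k, -1)) for k in range(1, d)])  # all 2
```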

Wavelets and QTT

For 2D functions, simple structures lead to small QTT-ranks (e.g., the characteristic function of a triangle).

Characteristic function of a triangle: QTT-ranks equal $2$.

Characteristic function of a circle: QTT-ranks are full.

Why is it so?

The meaning of the QTT-rank is that you split the image into $2^k \times 2^k$ patches, look at the linear dependencies between the patches, and compute a basis.

For the circle, due to grid artifacts, the dimension of the linear span is large.

Alternative: interpret this as a filter bank, and transform locally.

This leads to the WTT transform, where the filters vary from level to level and are computed adaptively for the image.

Optimization problem:

The optimization problem, which I do not know how to solve, is

$y \approx W s$, where $W$ is a special orthogonal matrix, and $s$ is sparse.

Another application of low-rank structure

In machine learning problems, the low-rank structure often enters not as an accurate approximation, but

  1. As a regularizer, restricting the set of "good" solutions
  2. As an initial approximation for a general optimization method

Example one: CNN optimizer

http://arxiv.org/abs/1412.6553 Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition

A convolutional neural network layer is just a contraction with a 4D tensor. We approximated this tensor by a canonical decomposition (using the wonderful TensorLab toolbox by De Lathauwer et al.) and then put one more layer into the network!

Example two: regularization

Tensorizing Neural Networks, Novikov et al., NIPS 2015.

Instead of optimizing over the full weight matrix of a fully-connected layer, they optimized over a so-called TT-matrix.

Result: a small accuracy loss with 700x fewer parameters.

What is a multidimensional SVD

A multidimensional SVD gives you a tool: an efficient structure you can work with.

You can do:

  • Tensor completion
  • Solution of linear systems
  • Regularization of regression

By reducing to low-rank constrained optimization: $$F(X) \rightarrow \min,$$

subject to the constraint that $X$ has low TT-rank.
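To make the setting concrete, here is a minimal NumPy sketch of low-rank constrained optimization in the simplest matrix case: projected gradient for a completion-type functional, with a plain truncated SVD as the projection onto rank-$r$ matrices. This only illustrates the problem class; it is not the TT/DMRG/AMEN or projector-splitting machinery discussed below, and the step size and data are hypothetical.

```python
import numpy as np

def truncate(X, r):
    """Projection onto rank-r matrices by a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

n, r = 60, 3
Y = np.random.randn(n, r) @ np.random.randn(r, n)    # hidden low-rank matrix
mask = np.random.rand(n, n) < 0.5                    # observed entries
step = 1.0 / mask.mean()                             # SVP-style step size

X = np.zeros((n, n))
for _ in range(100):
    grad = mask * (X - Y)                            # gradient of 0.5*||mask*(X - Y)||_F^2
    X = truncate(X - step * grad, r)                 # gradient step + rank-r projection
print(np.linalg.norm(X - Y) / np.linalg.norm(Y))     # should be small for this easy instance
```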

Examples of optimization problems:

  • Linear systems: $$F(X) = \langle A(X), X\rangle - 2\langle F, X\rangle, \quad A = A^* > 0.$$
  • Eigenvalue problems: $$F(X) = \langle A(X), X\rangle / \langle X, X\rangle.$$
  • Completion problems: $$F(X) = \Vert W \circ (X - B) \Vert^2,$$ where $W$ is the mask of known entries and $B$ contains the observed data.
  • Dynamical low-rank approximation: $$\left\Vert \frac{dF(t)}{dt} - \frac{dX(t)}{dt} \right\Vert \rightarrow \min.$$

For all of these we have efficient methods, based on DMRG and the more recent AMEN method.

DMRG

The DMRG method, proposed by White, is very specific to the TT format.

Given $$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d)$$ we optimize over a supercore $$W(i_k, i_{k+1}) = G_k(i_k) G_{k+1}(i_{k+1})$$

and then split back by SVD (i.e., automatic rank adaptivity).

AMEN

The AMEN method, proposed by Dolgov & Savostyanov, does not require the solution of an $\mathcal{O}(n^2)$-sized local system.

Instead, it uses the ALS step to update the basis, and then enriches the basis by the gradient projection.

At the moment, it is the most efficient black-box solver available (and we have its implementation in TT-Toolbox!).

Can we do better? I believe we can, and the way lies through the topology of the low-rank space.

Riemannian optimization

I became a big fan of Riemannian optimization techniques.

The set of TT-tensors with fixed TT-ranks is a manifold, thus we can write the projected gradient method

$$X_{k+1} = P_{\mathcal{T}}(X_k - \nabla F(X_k)),$$

and then do a retraction $R$ onto the manifold. Together with C. Lubich & B. Vandereycken we have designed one of the simplest retractions, called projector-splitting, which is a half-step of ALS in disguise.

$$X_{k+1} = I(X_k, \nabla F(X_k)).$$

The big mystery: how such methods converge

The standard theory says that the convergence of such methods deteriorates when the curvature (i.e., the inverse of the minimal singular value) is large. In practice it does not seem to matter!

$$x = Qx + f, \quad \Vert Q \Vert \leq \delta, $$

The solution $x^*$ has rank $5$, with $3$ large singular values, one equal to $10^{-3}$ of the largest, and one equal to $10^{-6}$ of the largest.
(Plot legend: blue, the original gradient method; green, the manifold-projected gradient; cyan, the normal component of the error.)

Thoughts on interconnection

I think the field rests on 3 main directions for low-rank approximation:

  • Nuclear norm-based methods
  • Riemannian optimization
  • Alternating methods (ALS, DMRG, AMEN)

Summary

  • The tensor train is a structure we can work with as efficiently as with the SVD
  • Riemannian optimization is great (nuclear norm? proxy? somebody, compare!)
  • The topology of low-rank matrices is not yet understood (it is linear!)


Questions?

