Tensor methods: A tool for high-dimensional problems

Ivan Oseledets, Skolkovo Institute of Science and Technology

oseledets.github.io, i.oseledets@skoltech.ru

Skoltech

Founded in 2011 near Moscow http://faculty.skoltech.ru
No departments: priority areas "IT, Bio, Space and Nuclear"
At the moment, 160 master students, 30 professors, 40 postdocs, 50 PhD studens
I work at Center for Computational Data-Intensive Science and Engineering http://crei.skoltech.ru/cdise/

CDISE at Skoltech

Ivan Oseledets: tensors, multidimensional approximations, shape & topology optimization
Victor Lempitsky: computer vision, deep learning
Dmitry Vetrov: machine learning, deep learning
Panos Karras: data management, data science
Thanos Polimeridis, Maxwell equation, PDEs
Alexander Shapeev, PDEs, multiscale optimizatio
Dzmitry-Tsetserukou, robotics

Courses

We teach Computational Science & Data Science (NLA, PDEs, Comp Vision, Deep Learning)
We welcome new faculty, postdocs, PhD students, master students
We are open to collaboration with everyone

This talk

This talk is about tensors.

A tensor is just a multidimensional array

$$A(i_1, \ldots, i_d), \quad 1 \leq i_k \leq n_k.$$

Suppose that $n_k \approx n$, the the total amount of memory to be stored is $n^d$.

The core idea

Numerical tensor methods have many applications in machine learning, PDEs, data mining.

In one sentence, it is a generalization of standard linear algebra (singular value decomposition)

to multidimensional case.

Connection to stochastics/risk management/etc...

Computation of multivariate integrals (over Brownian motion) I. Sloan, H. Wozniakowski, ...
PDEs with random coefficients via Karhunen-Loeve decomposition, C. Schwab ..
Deep learning: connection via quantum information, graphical models, a new field emerging.

Statistics & machine learning

Before going into a multidimensional integrals. Now people talk about machine learning, but many things are known (see table).

Machine learning	Statistics
network, graphs model	weights parameters
learning	fitting
generalization	test set performance
supervised learning	regression/classiﬁcation
unsupervised learning	density estimation, clustering
large grant = \$1,000,000	large grant = $50,000

Example

Paskov, Traub, Faster valuation of financial derivatives
They considered a 360-dimensional collateralized mortgage obligation problems;

The geometric Brownian motion is the canonical model for the interest rate, and the value of the security is an expectation.
For mortgage backed securities the integrand is complicated, due to a prepayment option. The integral cannot be computed analytically.
A typical security of length 360 months yields a 360-dimensional integral

Monte-Carlo

A typical way is to use Monte-Carlo sampling: sample $N$ points, compute average

Accuracy is $\frac{1}{\sqrt{N}}$, so if you want $10^{-3}$ (absolute) accuracy you need $10^6$ samples.
Quasi-Monte Carlo is much more efficient, but still $\mathcal{O}(\frac{1}{N})$ in the general case.
Can we get something like $\exp(-c N^{\alpha})$?

Tensor decompositions

The idea of tensor decomposition is to represent a tensor (multidimensional array) in terms of simpler array. This idea comes from matrix factorizations (represent a matrix as a product of simpler ones).

What is the simplest tensor decomposition

The key idea is the idea of separation of variables:

$$A(i_1, \ldots, i_d) = U_1(i_1) \ldots U_d(i_d)$$

In probability theory, it corresponds to the idea of independence of variables.

This class is too narrow to represent useful functions (or probability distributions).

Sum-of-products

The sum-of-products representation is much more interesting:

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$

also named canonical format, or multivariate factor model.

Two questions

What this decomposition is for $d=2$.
What is good and/or bad for $d \geq 3$?

Two-dimensional case

For two-dimensional case, it boils down to the approximation $$A \approx UV^{\top}, $$ where $U$ is $n \times r$ and $V$ is $m \times r$.

It is low-rank approximation of matrices, and best rank-$r$ approximation can be computed

using singular value decomposition

Multidimensional case

For $d > 3$ The canonical decomposition of the form

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$

then it is very difficult to compute numerically; there is even an example of $9 \times 9 \times 9$

tensor for which the tensor rank is not known!

But this is not the only tensor decomposition we can use.

Literature on tensor formats

T. Kolda and B. Bader, Tensor decompositions and applications, SIREV (2009)
W. Hackbusch, Tensor spaces and numerical tensor calculus, 2012
L. Grasedyck, D. Kressner, C. Tobler, A literature survey of low-rank tensor approximation techniques, 2013

Software

Some times, it is better to study with the code and to try your own application

Tensor Toolbox 2.5 (T. Kolda)
TT-Toolbox, MATLAB (http://github.com/oseledets/TT-Toolbox)
TT-Toolbox, Python (http://github.com/oseledets/ttpy)

Basic stuff

Now we will briefly recall the two-dimensional case, which is important for the techniques for the computational of tensor decompositions.

Singular value decomposition

Given an $n \times m$ matrix $A$, its singular value decomposition (SVD) is defined as

$$A = U S V^{\top},$$

where $U^*U = V^* V = I$, and $S$ contains singular values on the diagonal.

Computation of the SVD

Standard algorithm for the computation of the SVD requires $\mathcal{O}(N^3)$ operations,
can be done for matrices $1000-5000$ on a laptop.

Large-scale SVD is a topic of ongoing research.

The crucial point why it is possible is based on the skeleton decomposition.

Skeleton decomposition

Given a rank-$r$ matrix, how many elements you need to compute to recover the full matrix?

Skeleton decomposition

It is enough to sample $r$ columns and $r$ rows to exactly recover a rank-$r$ matrix:

$$A = U \widehat{A}^{-1} V,$$

Approximate rank-r

What happens if the matrix is of approximate low rank?

$A \approx R + E, \quad \mathrm{rank~} R = r, \quad ||E||_C = \varepsilon$

Maximum-volume principle

Answer is given by the maximum-volume principle: select the columns and rows such that

the submatrix $\widehat{A}$ has maximal-volume among all submatrices, then quasi-optimality holds:

$$\Vert A - A_{skel} \Vert \leq (r+1)^2 \Vert A - A_{best} \Vert.$$

Back to earth: applications of maxvol

The maximal-volume principle is present in many areas, one of the earliest related references I know is the Design-optimality (Fedorov, 1972), and is related to the selection of optimal interpolation points for a given basis set:

$$f(x) \approx \sum_{\alpha=1}^r c_{\alpha} g_{\alpha}(x),$$

instead of doing linear regression, just do interpolation at $r$ points.

Some more ideas

Given a set of stock prices (many!) select the most important ones that we need to track (i.e., to get the insider info...) that will determine the others.
Machine learning tasks: given millions of features, select most important features.

Linear algebra showcase

Given an $n \times r$ matrix, find the submatrix of largest volume it in.

Use greedy algorithm (now we know that it is D-optimality from at least 1972)

Take some rows, put them in the first $r$. Compute $B = A \widehat{A}^{-1}$
$B = \begin{pmatrix} I \\ Z \end{pmatrix}$
Suppose maximal element in $Z$ is in position $(i,j)$.
Swap $i$-th row with $j$-th row.
Stop if maximal element is less than $(1+\delta)$.

Key idea

Select subsets of raw data, that describe the behaviour sufficiently well (interpolation problem).

Cross approximation

For the two-dimensional case, the best algorithm to recover the rank-$r$ approximation is
the cross approximation algorithm.

Is it better than random sampling of columns?

Randomized techniques for low-rank approximation became popular recently, but no real comparisons are done (Martinsson, Halko, Rokhlin).

Multidimensional case

How to generalize the idea of separation of variables to higher dimensions?

Important points

SVD is good
Best approximation exists
Interpolation via skeleton

Straightforward idea: canonical format

$A(i_1,\ldots,i_d) \approx \sum_{\alpha=1}^r U_1(i_1,\alpha)\ldots U_d(i_d,\alpha)$

$r$ is called (approximate) canonical rank, $U_k$ --- canonical factors.

Bad things about the canonical format:

Best approximation may not exist
Canonical rank is NP-complete (matrix rank is ...)
No good algorithm

Bad example (1):

$f(x_1,\ldots,x_d) = x_1 + x_2 + \ldots x_d$,

Canonical rank $d$ (no proof is known), can be approximated with rank-$2$ with any accuracy!

Bad example (2):

Canonical rank may depend on the field (matrix rank can not!)

$f(x_1,\ldots,x_d) = \sin(x_1 + \ldots + x_d)$

Complex field: 2 (obvious!)
Real field: $d$

Another example: Tucker

Another attempt to avoid was the Tucker format

(Tucker 1966, Lathauwer, 2000+), used first in statistics, where the data was given as a data cube.

$$A(i,j,k) \approx \sum_{\alpha \beta \gamma} G(\alpha,\beta,\gamma) U_1(i,\alpha)V(j,\alpha)W(k,\alpha)$$

You can compute Tucker by means of the SVD:

Compute unfoldings : $A_1, A_2, A_3$
Compute left SVD factors: $A_i \approx U_i \Phi_i$
Compute the core: $G = A \times_1 U^{\top}_1 \times_2 U^{\top}_2 \times_3 U^{\top}_3$.

Main problem with Tucker format

The main problem with the Tucker format is that it can not be used to represent high-dimensional functions

We reduce $n^d$ complexity to $r^d$, which makes $d = 10,\ldots, 20$ tracktable, but not more.

Are there other ideas?

Main motivation

Matrices are good, so reduce everything to matrices!

Splitting into halves

Slit index set into two, get $n^{d/2} \times n^{d/2}$ matrix, compute its SVD:

$$A(I, J) = \sum_{\alpha=1}^r U_{I, \alpha} V_{J, \alpha},$$

then we have $d_1 +1 $ and $d_2 + 1$ - dimensional tensors. Do it recursively!

Another construction

Another construction, so-called Hierarchical Tucker, forms the construction:

Compute Tucker decomposition, get a $r \times r \times \ldots \times r$ array;
Merge indices by two, and compute transfer matrices of size $r \times r \times r$.
Go up the tree

Simplest construction

The simplest construction separates one index at a time and gives the tensor-train (or matrix-product state) decomposition

$$A(i_1, \ldots, i_d) \approx G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ is $r_{k-1} \times r_k$ matrix for any fixed $r_k$.

Similar to Hidden Markov models

Tensor-train

Tensor-train (in physics known as matrix-product-state) representation has the form

$$ A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ has size $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.

Now we have d-1 ranks (we call them TT-ranks). The parametrization is defined by TT-cores.

Why tensor train?

TT-ranks are matrix ranks

Define unfoldings:

$A_k = A(i_1 \ldots i_k; i_{k+1} \ldots i_d)$, $n^k \times n^{d-k}$ matrix

Theorem: there exists a TT-decomposition with TT-ranks

$$r_k = \mathrm{rank} A_k$$

The proof

Proof is simple and give the TT-SVD algorithm (in quantum information it is known as Vidal algorithm).

Stability estimate

No exact ranks in practice -- stability estimate!

If $A_k = R_k + E_k$, $||E_k|| = \varepsilon_k$ $$||\mathbf{A}-\mathbf{TT}||_F \leq \sqrt{\sum_{k=1}^{d-1} \varepsilon^2_k}.$$

Fast and trivial linear algebra

Addition, Hadamard product, scalar product, convolution

All scale linear in $d$

$$C(i_1,\ldots,i_d) = A(i_1,\ldots,i_d) B(i_1,\ldots,i_d)$$

$$C_k(i_k) = A_k(i_k) \otimes B_k(i_k),$$ ranks are multiplied

Rounding

Rounding procedure solves the following problem: given a TT-representation

$$A(i_1, \ldots, i_d) = G_1(i_1) G_2(i_2) \ldots G_d(i_d)$$

find another one with smaller memory that approximates it with threshold $\epsilon$.

It can be done in $\mathcal{O}(dnr^3)$ operations using the linear structure of the format:

We can do orthogonalization of the cores! (whiteboard again).

Cross approximation in $d$-dimensions

Oseledets, Tyrtyshnikov, TT-cross approximation of multidimensional arrays, which generalizes the matrix result.

Theorem: A rank-$r$ tensor can be recovered from $\mathcal{O}(dnr^2)$ elements.

Quantized tensor-train

To used tensor decompositions, we have to approximate a continious problem by a discrete, where the vector of unknowns can be indexed by a $d$-index.

(Standard) Given a function of $d$ variables, introduce a tensor-product basis set, and the tensor is the tensor of coefficients
(Non-standard) Given a function of one variable, take it on a uniform grid with $2^d$ points, reshape into a $2 \times \ldots \times 2$ $d$-dimensional tensor

Quantized tensor train

It is a mapping from 1D functions to $d$-dimensional arrays, using the following representation:

$$v_k = f(x_k), \quad k = 1, \ldots, 2^d.$$

$V(i_1, \ldots, i_d) = v(a + i h)$, $\quad i = i_1 + 2 i_2 + \ldots 2^{d-1} i_d.$

It gives $\mathcal{O}(2dr^2) = \mathcal{O}(\log N)$ parameters.

Representation of some basic functions

Almost all important functions in 1D are of low QTT-ranks

$f(x) = \exp(\lambda x)$ -- rank $1$
$f(x) = \sin( w x + \alpha)$ -- rank $2$
$f(x)$ -- polynomial, rank $p+1$
Rational functions, even non-smooth functions (Weirstrass function has bounded QTT-ranks)

Hierarchical uncertainty quantification

One of the examples:

We have a (hierarchical) model with random input.
Each output becomes a random variable as well.

http://arxiv.org/pdf/1407.3023.pdf Enabling High-Dimensional Hierarchical Uncertainty Quantification by ANOVA and Tensor-Train Decomposition, Zheng Zhang, Xiu Yang, Ivan V. Oseledets, George Em Karniadakis, and Luca Daniel

For the stochastic circuit simulator it gives over 200x speedup for a 45-parameter system!

Path integrals

Integral over Brownian motion the one-dimensional reaction-diffusion equation with initial distribution $f(x): \mathbb{R} \to \mathbb{R}^{+}$ and a constant diffusion coefficient~$\sigma$

\begin{equation} \left\{ \begin{aligned} & \frac{\partial}{\partial t} u(x,t) = \sigma \frac{\partial^2}{\partial x^2} u(x,t) - V(x,t) u(x,t),\\ & u(x,0)=f(x) \end{aligned} \right. \qquad t \in [0, T], \quad x \in \mathbb{R}. \end{equation}

Feyman-Kac formula

The solution can be expressed by the Feynman-Kac formula \begin{equation} u_{f}(x,T)=\int_{\mathcal C\{x,0; T \}} f(\xi(T)) e^{-\int_{0}^{T}\! V(\xi(\tau),T-\tau) d\tau } \mathcal{D}_{\xi}, \end{equation} where the integration is done over a set of all continuous paths $\xi(T): [0,T]\to \mathbb{R}$ from the Banach space $\Xi([0,T], \mathbb{R})$ starting at $\xi(0)=x$ and stopping at arbitrary endpoints at time $T$. $\mathcal{D}_{\xi}$ is the Wiener measure, and $\xi(t)$ is the Wiener process.

Discretization

A standard way to discretize the path integral is to break the time range $[0,T]$ into $n$ intervals \begin{equation} \tau_k=k \cdot \delta t, \qquad 0\le k\le n, \qquad n\!: \tau_n=T. \end{equation} \begin{equation} u^{(n)}(x,t)=\int_{-\infty}^{\infty} \mathcal{D}^{(n)}_{\xi} \, f\!\left( \xi^{(n)}\right) \! \prod_{i=0}^{n} e^{ -w_{i} V^{(n)}_{i} \delta t }, \end{equation} The integration sign here denotes an $n$-dimensional integration over the particle steps $\xi_k$,

Numerical experiment

For numerical experiment we used diffusion equation in a weird potential which is non-periodic.

Solution (and convergence)

Potential and initial condition

Financial integral

We compute PV (Present Discounted Value) that takes into account that value of money changes in time due to basic economic parameters. \begin{equation} PV=\left( \frac{\alpha}{\pi} \right)^{d/2} \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} v(\xi_1,\ldots, \xi_d) \, e^{-\alpha \xi_{1}^2}\ldots e^{-\alpha \xi_{d}^2} d\xi_d \ldots d\xi_1 \qquad v(\xi_1,\ldots, \xi_d) = \sum_{k=1}^{d} u_k m_k \end{equation} \begin{equation*} u_k=\prod_{j=0}^{k-1} \frac{1}{1+i_j}, \qquad m_k=cr_k(1-w_k +w_k c_k), \qquad r_k=\prod_{j=1}^{k-1}(1-w_j), \end{equation*} \begin{equation} c_k = \sum^{d-k}_{j=0}\frac{1}{(1+i_0)^j} \qquad w_k=K_1 + K_2 \arctan(K_3 i_k + K_4), \qquad i_k = K^{k}_0 e^{\xi_1 + \ldots + \xi_k} i_0 \end{equation}

Results

Accuracy of the cross approximation $\varepsilon = 10^{-10}$. Dimension of the integral is labeled by $n$, It corresponds to the number of monthes in mortgage-backed security model. Its parameters are $(i_0,K_1,K_2,K_3, K_4, \sigma^2)=(0.007,0.01, -0.005, 10, 0.5,0.0004)$. Ranks of the matrix are presented in column labeled by $r$.

n	r	$T_{cross}$ (min.)	$T_{MC}$ (min.)
36	7	0.1	7.7
48	10	0.16	11.6
90	10	0.3	21.4
180	10	0.5	41.5
360	10	0.8	81.6
720	9	1.56	178
1440	9	2.81	354

What I did not talk about

PDEs with random coefficients (use KL-decomposition, obtain a multivariate function, approximate in the TT-format)
Connection to other high-dimensional tools, including machine learning

Summary

Tensors are a natural model for multivariate probability distributions
Can be used to compute expectations, quantiles, solve stochastic PDEs
Tensor decompositions are needed to reduce the curse of dimensionality
Goal: is to beat Monte-Carlo by using low-rank structure
In order to build tensor decompositions, we need to know about matrix methods
Standard tensor decomposition are difficult to be used numerically

Questions?



In [2]:

    
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()









    Out[2]: