Numerical tensor methods in (stochastic) multidimensional problems

Ivan Oseledets, Skolkovo Institute of Science and Technology
oseledets.github.io, i.oseledets@skoltech.ru

Skoltech

  • Founded in 2011 near Moscow http://faculty.skoltech.ru
  • No departments: priority areas "IT, Bio, Space and Nuclear"
  • At the moment, 160 master students, 30 professors, 40 postdocs, 50 PhD students
  • Computational Data-Intensive Science and Engineering

IT at Skoltech

Courses

  • We teach Computational Science & Data Science (NLA, PDEs, Comp Vision, Deep Learning)
  • We welcome new faculty, postdocs, PhD students, master students
  • We are open to collaboration with everyone

This mini-course

This course is about tensors.

A tensor is just a multidimensional array

$$A(i_1, \ldots, i_d), \quad 1 \leq i_k \leq n_k.$$

Suppose that $n_k \approx n$; then the total amount of memory required is $n^d$.

These two lectures

  • Lecture 1: Monday, 17.00 - 19.00. Introduction, motivation, matrix background
  • Lecture 2: Tuesday 17.00 - 19.00. Novel tensor formats, advanced algorithms

The core idea

Numerical tensor methods have many applications in machine learning, PDEs, data mining.

In one sentence, it is a generalization of standard linear algebra (the singular value decomposition) to the multidimensional case.

Connection to stochastics

  1. Computation of multivariate integrals (over Brownian motion), I. Sloan, H. Woźniakowski, ...
  2. PDEs with random coefficients via the Karhunen-Loève decomposition, C. Schwab, ...
  3. Deep learning: connections via quantum information and graphical models; a new field is emerging.

Statistics & machine learning

Before going into multidimensional integrals, a remark: people now talk about machine learning, but many of the concepts have long been known in statistics (see the table below).

| Machine learning | Statistics |
| --- | --- |
| network, graphs | model |
| weights | parameters |
| learning | fitting |
| generalization | test set performance |
| supervised learning | regression/classification |
| unsupervised learning | density estimation, clustering |
| large grant = \$1,000,000 | large grant = \$50,000 |

Example

Paskov, Traub, Faster valuation of financial derivatives.
They considered a 360-dimensional collateralized mortgage obligation problem:

  • The geometric Brownian motion is the canonical model for the interest rate, and the value of the security is an expectation.
  • For mortgage backed securities the integrand is complicated, due to a prepayment option. The integral cannot be computed analytically.
  • A typical security of length 360 months yields a 360-dimensional integral

Monte-Carlo

A typical approach is Monte Carlo sampling: sample $N$ points and compute the average (a small numerical sketch follows the list below).

  • Accuracy is $\frac{1}{\sqrt{N}}$, so if you want $10^{-3}$ (absolute) accuracy you need $10^6$ samples.
  • Quasi-Monte Carlo is much more efficient, but still $\mathcal{O}(\frac{1}{N})$ in the general case.
  • Can we get something like $\exp(-c N^{\alpha})$?
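
As a quick check of the $1/\sqrt{N}$ rate, here is a minimal sketch (the integrand, dimension and sample sizes are illustrative, not from the slides):

```python
import numpy as np

# Estimate E[f(X)] for X uniform on [0,1]^d by plain Monte Carlo
# and watch the error decay roughly like 1/sqrt(N).
rng = np.random.default_rng(0)
d = 10
f = lambda x: np.prod(1.0 + 0.1 * x, axis=1)   # toy integrand, exact mean (1.05)**d
exact = 1.05 ** d

for N in [10**3, 10**4, 10**5, 10**6]:
    x = rng.random((N, d))
    estimate = f(x).mean()
    print(N, abs(estimate - exact))
```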

Tensor decompositions

The idea of tensor decomposition is to represent a tensor (multidimensional array) in terms of simpler arrays. This idea comes from matrix factorizations (representing a matrix as a product of simpler ones).

What is the simplest tensor decomposition

The key idea is the idea of separation of variables:

$$A(i_1, \ldots, i_d) = U_1(i_1) \ldots U_d(i_d).$$

In probability theory, it corresponds to the idea of independence of variables.

This class is too narrow to represent useful functions (or probability distributions).

Sum-of-products

The sum-of-products representation is much more interesting:

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$

also named canonical format, or multivariate factor model.
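
For concreteness, a small numpy sketch (illustrative sizes, $d = 3$) of assembling a tensor from canonical factors and comparing storage:

```python
import numpy as np

n, r, d = 20, 5, 3
rng = np.random.default_rng(0)
U1, U2, U3 = (rng.standard_normal((n, r)) for _ in range(d))

# Full tensor A(i,j,k) = sum_alpha U1(i,alpha) U2(j,alpha) U3(k,alpha)
A = np.einsum('ia,ja,ka->ijk', U1, U2, U3)

# Storage: 3*n*r numbers for the factors vs n**3 for the full array
print(3 * n * r, 'vs', n ** 3)
```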

Two questions

  • What is this decomposition for $d=2$?
  • What is good and/or bad about it for $d \geq 3$?

Two-dimensional case

For the two-dimensional case, it boils down to the approximation $$A \approx UV^{\top}, $$ where $U$ is $n \times r$ and $V$ is $m \times r$.

This is the low-rank approximation of matrices, and the best rank-$r$ approximation can be computed using the singular value decomposition, as sketched below.
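
A minimal numpy sketch of the truncated SVD (illustrative matrix, my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
# A matrix with quickly decaying singular values
A = rng.standard_normal((200, 100)) @ np.diag(0.5 ** np.arange(100)) @ rng.standard_normal((100, 100))

r = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] * s[:r] @ Vt[:r, :]          # best rank-r approximation (Eckart-Young)
print(np.linalg.norm(A - A_r, 2), s[r])     # spectral-norm error equals sigma_{r+1}
```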

Multidimensional case

For $d \geq 3$, the canonical decomposition of the form

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha)$$

is very difficult to compute numerically; there is even an example of a $9 \times 9 \times 9$ tensor for which the tensor rank is not known!

But this is not the only tensor decomposition we can use.

Literature on tensor formats

  • T. Kolda and B. Bader, Tensor decompositions and applications, SIREV (2009)
  • W. Hackbusch, Tensor spaces and numerical tensor calculus, 2012
  • L. Grasedyck, D. Kressner, C. Tobler, A literature survey of low-rank tensor approximation techniques, 2013

Software

Sometimes it is better to study with the code and to try your own application.

Basic stuff

Now we will briefly recall the two-dimensional case, which is important for the techniques used in the computation of tensor decompositions.

Singular value decomposition

Given an $n \times m$ matrix $A$, its singular value decomposition (SVD) is defined as

$$A = U S V^{\top},$$

where $U^{\top}U = V^{\top} V = I$, and $S$ is diagonal with the singular values on the diagonal.

Computation of the SVD

The standard algorithm for the computation of the SVD requires $\mathcal{O}(N^3)$ operations
and can be done for matrices of size $1000$-$5000$ on a laptop.

Large-scale SVD is a topic of ongoing research.

The crucial point that makes it possible is the skeleton decomposition.

Skeleton decomposition

Given a rank-$r$ matrix, how many elements do you need to compute to recover the full matrix?

Skeleton decomposition

It is enough to sample $r$ columns and $r$ rows to exactly recover a rank-$r$ matrix:

$$A = U \widehat{A}^{-1} V,$$

where $U$ are the sampled columns, $V$ the sampled rows, and $\widehat{A}$ is the $r \times r$ submatrix at their intersection.

<img src="cross-pic.png">
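
A small numpy sketch of the exact skeleton recovery for a rank-$r$ matrix (my own illustration; the rows and columns are chosen at random here, which is fine in exact arithmetic as long as $\widehat{A}$ is nonsingular):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 100, 80, 5
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # exact rank r

rows = rng.choice(n, r, replace=False)
cols = rng.choice(m, r, replace=False)
C = A[:, cols]                 # r sampled columns
R = A[rows, :]                 # r sampled rows
A_hat = A[np.ix_(rows, cols)]  # r x r intersection submatrix

A_rec = C @ np.linalg.inv(A_hat) @ R
print(np.linalg.norm(A - A_rec))   # ~ machine precision
```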

Approximate rank-r

What happens if the matrix is of approximate low rank?

$$A = R + E, \quad \mathrm{rank}(R) = r, \quad \Vert E \Vert_C = \varepsilon,$$ where $\Vert \cdot \Vert_C$ denotes the elementwise (Chebyshev) norm.

Maximum-volume principle

The answer is given by the maximum-volume principle: select the columns and rows such that the submatrix $\widehat{A}$ has maximal volume among all $r \times r$ submatrices. Then quasi-optimality holds:

$$\Vert A - A_{skel} \Vert \leq (r+1)^2 \Vert A - A_{best} \Vert.$$

Back to earth: applications of maxvol

The maximal-volume principle appears in many areas; one of the earliest related references I know is D-optimal design (Fedorov, 1972), and it is related to the selection of optimal interpolation points for a given basis set:

$$f(x) \approx \sum_{\alpha=1}^r c_{\alpha} g_{\alpha}(x),$$

instead of doing linear regression, just do interpolation at $r$ points.

Some more ideas

  • Given a set of stock prices (many!) select the most important ones that we need to track (i.e., to get the insider info...) that will determine the others.
  • Machine learning tasks: given millions of features, select most important features.

Linear algebra showcase

Given an $n \times r$ matrix, find the $r \times r$ submatrix of largest volume in it.

Use a greedy algorithm (now we know that it is D-optimality from at least 1972); a numpy sketch follows the list.

  • Take some rows and put them in the first $r$ positions. Compute $B = A \widehat{A}^{-1}$, so that $B = \begin{pmatrix} I \\ Z \end{pmatrix}$.
  • Suppose the maximal element (in absolute value) of $Z$ is in position $(i,j)$.
  • Swap the $i$-th row with the $j$-th row.
  • Stop if the maximal element is less than $(1+\delta)$.
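
A sketch of this greedy row-swapping loop in numpy (my own bare-bones implementation of the idea above, without the optimizations of the reference maxvol algorithm):

```python
import numpy as np

def maxvol(A, delta=1e-2, max_iter=100):
    """Greedy search for an r x r submatrix of quasi-maximal volume in a tall n x r matrix."""
    n, r = A.shape
    rows = np.arange(r)                       # start from the first r rows
    for _ in range(max_iter):
        B = A @ np.linalg.inv(A[rows])        # B[rows] = I
        i, j = np.unravel_index(np.abs(B).argmax(), B.shape)
        if abs(B[i, j]) < 1 + delta:          # no element large enough: stop
            break
        rows[j] = i                           # swap row i into position j of the submatrix
    return rows

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
rows = maxvol(A)
print(np.abs(A @ np.linalg.inv(A[rows])).max())   # <= 1 + delta after convergence
```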

Key idea

Select a subset of the raw data that describes the behaviour sufficiently well (an interpolation problem).

Cross approximation

For the two-dimensional case, the best algorithm to recover the rank-$r$ approximation is
the cross approximation algorithm.

Is it better than random sampling of columns?

Randomized techniques for low-rank approximation have become popular recently (Martinsson, Halko, Rokhlin), but no systematic comparisons have been done.

Multidimensional case

How to generalize the idea of separation of variables to higher dimensions?

Important points

  • SVD is good
  • Best approximation exists
  • Interpolation via skeleton

Straightforward idea: canonical format

$A(i_1,\ldots,i_d) \approx \sum_{\alpha=1}^r U_1(i_1,\alpha)\ldots U_d(i_d,\alpha)$

$r$ is called the (approximate) canonical rank, and the $U_k$ are the canonical factors.

Bad things about the canonical format:

  • The best approximation may not exist
  • Computing the canonical rank is NP-hard (matrix rank is easy to compute)
  • No robust algorithm

Bad example (1):

$f(x_1,\ldots,x_d) = x_1 + x_2 + \ldots + x_d$:

the canonical rank is $d$ (no proof is known), but it can be approximated by a rank-$2$ tensor with any accuracy!
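
A standard way to see the rank-$2$ approximation (a sketch added here for illustration): the sum is a derivative of a rank-$1$ family,

$$x_1 + \ldots + x_d = \lim_{h \to 0} \frac{1}{h}\left(\prod_{k=1}^d (1 + h x_k) - 1\right),$$

so for small $h$ the rank-$2$ expression $\frac{1}{h}\prod_{k=1}^d (1 + h x_k) - \frac{1}{h}$ approximates the sum with error $\mathcal{O}(h)$: the border rank is $2$, even though the exact canonical rank is believed to be $d$.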

Bad example (2):

The canonical rank may depend on the field (matrix rank cannot!)

$f(x_1,\ldots,x_d) = \sin(x_1 + \ldots + x_d)$

  • Complex field: 2 (obvious!)
  • Real field: $d$

Another example: Tucker

Another attempt to avoid these difficulties is the Tucker format

(Tucker, 1966; De Lathauwer, 2000+), used first in statistics, where the data was given as a data cube.

$$A(i,j,k) \approx \sum_{\alpha, \beta, \gamma} G(\alpha,\beta,\gamma)\, U(i,\alpha) V(j,\beta) W(k,\gamma)$$

You can compute the Tucker decomposition by means of the SVD (a numpy sketch follows the list):

  • Compute unfoldings : $A_1, A_2, A_3$
  • Compute left SVD factors: $A_i \approx U_i \Phi_i$
  • Compute the core: $G = A \times_1 U^{\top}_1 \times_2 U^{\top}_2 \times_3 U^{\top}_3$.
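
A compact numpy sketch of these three steps (HOSVD) for a 3D tensor; sizes are illustrative and each factor is truncated to rank $r$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 5
# A tensor of low multilinear rank, built from canonical factors
A = np.einsum('ia,ja,ka->ijk', *(rng.standard_normal((n, r)) for _ in range(3)))

U = []
for mode in range(3):
    A_mode = np.moveaxis(A, mode, 0).reshape(n, -1)      # unfolding A_k
    U_mode, _, _ = np.linalg.svd(A_mode, full_matrices=False)
    U.append(U_mode[:, :r])                              # left SVD factor

# Core: G = A x_1 U1^T x_2 U2^T x_3 U3^T
G = np.einsum('ijk,ia,jb,kc->abc', A, U[0], U[1], U[2])
A_rec = np.einsum('abc,ia,jb,kc->ijk', G, U[0], U[1], U[2])
print(np.linalg.norm(A - A_rec) / np.linalg.norm(A))     # ~ machine precision here
```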

Main problem with Tucker format

The main problem with the Tucker format is that it cannot be used to represent high-dimensional functions:

we reduce the $n^d$ complexity to $r^d$, which makes $d = 10,\ldots, 20$ tractable, but not more.

Are there other ideas?

Main motivation

Matrices are good, so reduce everything to matrices!

Splitting into halves

Split the index set into two halves, get an $n^{d/2} \times n^{d/2}$ matrix, and compute its SVD:

$$A(I, J) = \sum_{\alpha=1}^r U_{I, \alpha} V_{J, \alpha},$$

then $U$ and $V$ are $(d_1 + 1)$- and $(d_2 + 1)$-dimensional tensors, where $d_1$ and $d_2$ are the numbers of indices in the two halves. Do it recursively!
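
A sketch of the first splitting step (my illustration; $d = 4$, $n = 10$ and the low-rank tensor are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 10, 4, 3
A = np.einsum('ia,ja,ka,la->ijkl', *(rng.standard_normal((n, r)) for _ in range(d)))

# Split indices (i1,i2 | i3,i4), reshape to an n^2 x n^2 matrix and compute its SVD
M = A.reshape(n * n, n * n)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
rank = np.sum(s > 1e-10 * s[0])
print(rank)                                 # at most r for this construction

# U[:, :rank] reshaped to n x n x rank is a 3-dimensional tensor: recurse on it
U_tensor = U[:, :rank].reshape(n, n, rank)
```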

Another construction

Another construction, the so-called Hierarchical Tucker format, works as follows:

  1. Compute the Tucker decomposition, get an $r \times r \times \ldots \times r$ core array;
  2. Merge indices in pairs and compute transfer tensors of size $r \times r \times r$;
  3. Go up the tree.

Simplest construction

The simplest construction separates one index at a time and gives the tensor-train (or matrix-product state) decomposition

$$A(i_1, \ldots, i_d) \approx G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for each fixed $i_k$, with $r_0 = r_d = 1$ so that the product is a scalar.

Similar to Hidden Markov models
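
A minimal sketch (made-up cores, illustrative sizes) of how one entry of a tensor train is evaluated as a product of small matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 8, 10, 4
ranks = [1] + [r] * (d - 1) + [1]        # boundary ranks r_0 = r_d = 1
cores = [rng.standard_normal((ranks[k], n, ranks[k + 1])) for k in range(d)]

def tt_entry(cores, index):
    """A(i_1,...,i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d), a product of r_{k-1} x r_k matrices."""
    G = cores[0][:, index[0], :]          # 1 x r_1 matrix
    for k in range(1, len(cores)):
        G = G @ cores[k][:, index[k], :]
    return G[0, 0]

print(tt_entry(cores, [2, 0, 5, 1, 3, 7, 4, 9]))
```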

Summary

  • Tensors are a natural model for multivariate probability distributions
  • They can be used to compute expectations, quantiles, and to solve stochastic PDEs
  • Tensor decompositions are needed to reduce the curse of dimensionality
  • Goal: beat Monte Carlo by using low-rank structure
  • In order to build tensor decompositions, we need to know about matrix methods
  • Standard tensor decompositions (canonical, Tucker) are difficult to use numerically

Next lecture (tomorrow)

  • Basic properties of the tensor trains
  • Basic problems & algorithms
  • Application to graphical models
  • Application to path integrals (integration over Brownian motion)
  • Application to financial integrals
  • Application to machine learning problems

Questions?

