Lecture-2

Ivan Oseledets, Skolkovo Institute of Science and Technology
oseledets.github.io, i.oseledets@skoltech.ru

Previous lecture

  • Tensors are a natural model for multivariate probability distributions
  • Can be used to compute expectations, quantiles, solve stochastic PDEs
  • Tensor decompositions are needed to reduce the curse of dimensionality
  • Goal: to beat Monte Carlo by using low-rank structure
  • In order to build tensor decompositions, we need to know about matrix methods
  • Standard tensor decompositions are difficult to use numerically

Today's lecture

  • Basic properties of the tensor train
  • Basic problems & algorithms
  • Application to path integrals (integration over Brownian motion)
  • Application to financial integrals
  • Graphical models
  • (If time permits) Demo of Jupyter notebooks

Tensor-train

The tensor-train representation (known in physics as the matrix product state) has the form

$$ A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ has size $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.

Now we have $d-1$ ranks (we call them TT-ranks). The parametrization is defined by the TT-cores $G_k$.
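As a quick illustration, here is a minimal numpy sketch of evaluating a single entry from the cores (the $(r_{k-1}, n_k, r_k)$ storage layout here is our assumption for illustration):

import numpy as np

# Evaluate A(i_1, ..., i_d) as a product of core slices G_k(i_k).
# Each core is stored as an array of shape (r_{k-1}, n_k, r_k),
# with r_0 = r_d = 1, so the product is a 1 x 1 matrix.
def tt_entry(cores, idx):
    G = cores[0][:, idx[0], :]            # shape (1, r_1)
    for k in range(1, len(cores)):
        G = G @ cores[k][:, idx[k], :]    # multiply by the (r_{k-1}, r_k) slice
    return G[0, 0]

# A random 4-dimensional TT-tensor with mode size 5 and ranks (1, 3, 3, 3, 1)
cores = [np.random.rand(1, 5, 3), np.random.rand(3, 5, 3),
         np.random.rand(3, 5, 3), np.random.rand(3, 5, 1)]
print(tt_entry(cores, (0, 1, 2, 3)))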

Why tensor train?

TT-ranks are matrix ranks

Define unfoldings:

$A_k = A(i_1 \ldots i_k; i_{k+1} \ldots i_d)$ -- an $n^k \times n^{d-k}$ matrix

Theorem: there exists a TT-decomposition with TT-ranks

$$r_k = \mathrm{rank} A_k$$

The proof

The proof is simple and constructive: it gives the TT-SVD algorithm (in quantum information it is known as the Vidal algorithm).

Let us do it on the whiteboard
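For reference, a minimal numpy sketch of TT-SVD (assuming the full tensor fits in memory; the per-unfolding truncation threshold is $\delta = \varepsilon \|A\|_F / \sqrt{d-1}$):

def tt_svd(A, eps):
    # Sequential reshape + truncated SVD: the k-th SVD compresses the
    # k-th unfolding, so the resulting ranks are exactly the TT-ranks.
    shape = A.shape
    d = len(shape)
    delta = eps * np.linalg.norm(A) / np.sqrt(d - 1)
    cores, r = [], 1
    C = A.reshape(shape[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[j] = ||s[j:]||
        rnew = max(1, int(np.sum(tail > delta)))        # smallest rank with error <= delta
        cores.append(U[:, :rnew].reshape(r, shape[k], rnew))
        C = (s[:rnew, None] * Vt[:rnew]).reshape(rnew * shape[k + 1], -1)
        r = rnew
    cores.append(C.reshape(r, shape[d - 1], 1))
    return cores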

Stability estimate

In practice the ranks are never exact, so we need a stability estimate!

If the unfoldings satisfy $A_k = R_k + E_k$ with $\mathrm{rank}\, R_k = r_k$ and $\|E_k\|_F = \varepsilon_k$, then the TT-SVD computed with ranks $r_k$ satisfies $$\|\mathbf{A}-\mathbf{TT}\|_F \leq \sqrt{\sum_{k=1}^{d-1} \varepsilon^2_k}.$$
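A quick numerical check with the tt_svd sketch above (a random tensor, so truncation is active at every unfolding):

A = np.random.rand(8, 8, 8, 8)
cores = tt_svd(A, eps=1e-2)
# Contract the cores back into a full tensor and measure the error.
B = cores[0]
for G in cores[1:]:
    B = np.tensordot(B, G, axes=(-1, 0))
B = B.reshape(A.shape)                              # boundary ranks r_0 = r_d = 1
print(np.linalg.norm(A - B) / np.linalg.norm(A))    # stays below eps = 1e-2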

Fast and trivial linear algebra

Addition, Hadamard product, scalar product, convolution

All scale linearly in $d$

$$C(i_1,\ldots,i_d) = A(i_1,\ldots,i_d) B(i_1,\ldots,i_d)$$

$$C_k(i_k) = A_k(i_k) \otimes B_k(i_k),$$ so the TT-ranks are multiplied.
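A minimal sketch of the Hadamard product at the level of cores (same $(r_{k-1}, n_k, r_k)$ layout as above); each slice of the new core is a Kronecker product of the old slices:

def tt_hadamard(A_cores, B_cores):
    # C_k(i_k) = A_k(i_k) kron B_k(i_k), so the TT-ranks multiply.
    C_cores = []
    for Ga, Gb in zip(A_cores, B_cores):
        ra0, n, ra1 = Ga.shape
        rb0, _, rb1 = Gb.shape
        Gc = np.empty((ra0 * rb0, n, ra1 * rb1))
        for i in range(n):
            Gc[:, i, :] = np.kron(Ga[:, i, :], Gb[:, i, :])
        C_cores.append(Gc)
    return C_cores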

Rounding

The rounding procedure solves the following problem: given a TT-representation

$$A(i_1, \ldots, i_d) = G_1(i_1) G_2(i_2) \ldots G_d(i_d)$$

find another one with smaller memory that approximates it with threshold $\epsilon$.

It can be done in $\mathcal{O}(dnr^3)$ operations using the linear structure of the format:

We can do orthogonalization of the cores! (whiteboard again).
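For reference, a minimal numpy sketch of rounding (a right-to-left QR sweep makes the cores right-orthogonal, then a left-to-right truncated-SVD sweep compresses the ranks; same core layout as above):

def tt_round(cores, eps):
    d = len(cores)
    cores = [G.copy() for G in cores]
    # Right-to-left sweep: make cores 2, ..., d right-orthogonal.
    for k in range(d - 1, 0, -1):
        r0, n, r1 = cores[k].shape
        Q, R = np.linalg.qr(cores[k].reshape(r0, n * r1).T)
        cores[k] = Q.T.reshape(-1, n, r1)
        cores[k - 1] = np.einsum('anb,bc->anc', cores[k - 1], R.T)
    # Now ||A||_F equals the norm of the first core.
    delta = eps * np.linalg.norm(cores[0]) / np.sqrt(d - 1)
    # Left-to-right sweep: truncated SVD of each core.
    for k in range(d - 1):
        r0, n, r1 = cores[k].shape
        U, s, Vt = np.linalg.svd(cores[k].reshape(r0 * n, r1), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
        rnew = max(1, int(np.sum(tail > delta)))
        cores[k] = U[:, :rnew].reshape(r0, n, rnew)
        cores[k + 1] = np.einsum('ab,bnc->anc', s[:rnew, None] * Vt[:rnew], cores[k + 1])
    return cores

For example, tt_round(tt_hadamard(cores, cores), 1e-8) re-compresses the result of a Hadamard product, whose formal ranks are often larger than necessary.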

Cross approximation in $d$-dimensions

Oseledets & Tyrtyshnikov, TT-cross approximation of multidimensional arrays (2010), generalizes the matrix cross-approximation result to the TT case.

Theorem: a tensor with TT-ranks bounded by $r$ can be recovered exactly from $\mathcal{O}(dnr^2)$ of its elements.

Quantized tensor-train

To use tensor decompositions, we have to approximate a continuous problem by a discrete one, where the vector of unknowns can be indexed by a $d$-index.

  1. (Standard) Given a function of $d$ variables, introduce a tensor-product basis set; the tensor is then the tensor of coefficients
  2. (Non-standard) Given a function of one variable, sample it on a uniform grid with $2^d$ points and reshape the result into a $2 \times \ldots \times 2$ $d$-dimensional tensor

Quantized tensor train

It is a mapping from 1D functions to $d$-dimensional arrays, using the following representation:

$$v_k = f(x_k), \quad k = 1, \ldots, 2^d.$$

$V(i_1, \ldots, i_d) = v(a + i h)$, where $i = i_1 + 2 i_2 + \ldots + 2^{d-1} i_d$, $\quad i_k \in \{0, 1\}$.

It gives $\mathcal{O}(2dr^2) = \mathcal{O}(\log N)$ parameters.
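A small self-contained check of this idea: since TT-ranks are the ranks of unfoldings, we can inspect them directly; $f(x) = e^{x}$ factorizes over the bits of the index, so every unfolding has rank 1:

import numpy as np

d = 10
x = np.linspace(0, 1, 2 ** d)
V = np.exp(x).reshape([2] * d)              # quantize: a 2 x 2 x ... x 2 tensor
for k in range(1, d):
    Ak = V.reshape(2 ** k, 2 ** (d - k))    # k-th unfolding
    print(k, np.linalg.matrix_rank(Ak))     # prints 1 for every k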

Representation of some basic functions

Almost all important functions in 1D have low QTT-ranks:

  • $f(x) = \exp(\lambda x)$ -- rank $1$
  • $f(x) = \sin( w x + \alpha)$ -- rank $2$
  • $f(x)$ -- a polynomial of degree $p$ -- rank $p+1$
  • Rational functions, and even some non-smooth functions (the Weierstrass function has bounded QTT-ranks)

Let us do a demo.


In [17]:
import numpy as np
import tt   # ttpy package

fun = lambda x: np.sin(20 * x)
a = 0
b = 20
d = 70                                # 2^70 virtual grid points
N = 2 ** d
h = (b - a) * 1.0 / (N + 1)           # grid step
t = tt.xfun(2, d) * h                 # QTT-vector of the grid points x_i = i * h
eps = 1e-12
s1 = tt.multifuncrs2([t], fun, eps)   # cross approximation of fun on the grid
s1 = s1.round(eps)                    # TT-rounding
print(s1)


=multifuncrs= sweep 1{2}, max_dy: 1.353e+01, erank: 5.83095
=multifuncrs= sweep 2{1}, max_dy: 1.006e-13, erank: 6.55744
This is a 70-dimensional tensor
n(k)=2 for k=0,...,69
r(0)=...=r(20)=1
r(21)=...=r(69)=2
r(70)=1

More complicated integral

Consider the integral

$$\int^{\infty}_0 \frac{\sin x}{x} dx = \frac{\pi}{2}.$$

Let us compute it with the rectangular rule on a finite interval $[0, b]$.

We take $2^d$ points and use cross approximation to build a QTT-approximation of the integrand.


In [20]:
import math

fun = lambda x: np.sin(x) / x
a = 0
b = 1e6
d = 60                                # 2^60 grid points
N = 2 ** d
h = (b - a) * 1.0 / (N - 1)           # grid step
e = tt.ones(2, d)
t = (tt.xfun(2, d) + e) * h           # grid shifted by one step to avoid x = 0
eps = 1e-12
s1 = tt.multifuncrs2([t], fun, eps)   # cross approximation of the integrand
tt.dot(s1, e) * h - math.pi / 2       # rectangular rule minus the exact value


=multifuncrs= sweep 1{2}, max_dy: 2.613e+01, erank: 5.74456
=multifuncrs= sweep 2{1}, max_dy: 5.046e-03, erank: 8.77496
=multifuncrs= sweep 2{2}, max_dy: 5.046e-03, erank: 10.247
=multifuncrs= sweep 3{1}, max_dy: 5.657e-12, erank: 11.8322
=multifuncrs= sweep 3{2}, max_dy: 5.657e-12, erank: 11.7047
=multifuncrs= sweep 4{1}, max_dy: 9.912e-14, erank: 11.8743
Out[20]:
-9.3675221912725704e-07

Path integrals

Here we focus on the one-dimensional reaction-diffusion equation with initial distribution $f(x): \mathbb{R} \to \mathbb{R}^{+}$ and a constant diffusion coefficient $\sigma$:

\begin{equation} \left\{ \begin{aligned} & \frac{\partial}{\partial t} u(x,t) = \sigma \frac{\partial^2}{\partial x^2} u(x,t) - V(x,t) u(x,t),\\ & u(x,0)=f(x) \end{aligned} \right. \qquad t \in [0, T], \quad x \in \mathbb{R}. \end{equation}

Feynman-Kac formula

The solution can be expressed by the Feynman-Kac formula \begin{equation} u_{f}(x,T)=\int_{\mathcal C\{x,0; T \}} f(\xi(T)) e^{-\int_{0}^{T}\! V(\xi(\tau),T-\tau) d\tau } \mathcal{D}_{\xi}, \end{equation} where the integration is done over the set of all continuous paths $\xi: [0,T]\to \mathbb{R}$ from the Banach space $\Xi([0,T], \mathbb{R})$ starting at $\xi(0)=x$ and stopping at arbitrary endpoints at time $T$. $\mathcal{D}_{\xi}$ is the Wiener measure, and $\xi(t)$ is the Wiener process.

Discretization

A standard way to discretize the path integral is to break the time range $[0,T]$ into $n$ intervals \begin{equation} \tau_k=k \cdot \delta t, \qquad 0\le k\le n, \qquad n\!: \tau_n=T. \end{equation} \begin{equation} u^{(n)}(x,t)=\int_{-\infty}^{\infty} \mathcal{D}^{(n)}_{\xi} \, f\!\left( \xi^{(n)}\right) \! \prod_{i=0}^{n} e^{ -w_{i} V^{(n)}_{i} \delta t }, \end{equation} The integration sign here denotes an $n$-dimensional integration over the particle steps $\xi_k$, and the $w_i$ are quadrature weights.

Convolution

We rewrite the process as the sequence of convolutions \begin{equation} F^{(n)}_k(x)= \sqrt{\lambda/\pi} \int_{-\infty}^{\infty} \Phi^{(n)}_{k+1}(x + \xi) \, e^{-\lambda \xi^2 } d\xi, \qquad x \in \mathbb{R}, \qquad 1\le k \le n, \end{equation} where \begin{equation} \Phi^{(n)}_{k+1}(x)= F^{(n)}_{k+1}(x) \, e^{ -w_k V (x, \tau_{n - k}) \delta t }, \end{equation} and the boundary condition \begin{equation} F^{(n)}_{n+1}(x)=f(x), \end{equation} the solution is expressed as follows \begin{equation} u^{(n)}_{f}(x,T) = F^{(n)}_{1}(x) \, e^{-w_0 V(x, T) \, \delta t}. \end{equation} The iteration starts from $k=n$ and goes down to $k=1$.

Problem

After discretization on a uniform grid, we have $$ f_{k}(x) = \int_{-\infty}^{\infty} f_{k+1}(x + y) e^{-y^2/2} dy. $$ If we know $f_{k+1}$ on the grid $[-L, L]h$ and discretize the convolution on $M$ points, we know the solution on the grid $[-L+M, L-M]h$.
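A toy numpy sketch of one such convolution step on a uniform grid (the sizes here are hypothetical; the kernel is truncated to the support $[-M, M]h$, so the result is valid on a grid shrunk by $M$ points on each side):

import numpy as np

h = 0.01
M = 100
y = np.arange(-M, M + 1) * h                # truncated kernel support [-M, M] h
kernel = np.exp(-y ** 2 / 2) * h            # rectangular-rule weights
L = 2000
x = np.arange(-L, L + 1) * h
f_next = np.exp(-x ** 2)                    # f_{k+1}, known on [-L, L] h
f_cur = np.convolve(f_next, kernel, mode='valid')   # f_k on [-L + M, L - M] h
print(f_cur.size)                           # 2 * (L - M) + 1 points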

Inflated domain

We start from a large (inflated) domain with $L \sim M^2$ points and then, after $M$ steps, obtain the solution on the required domain.

The complexity is then $\mathcal{O}(M^2 \log M)$ (due to the FFT).

To reduce the complexity, we reshape the vector of length $M^2$ into an $M \times M$ matrix and do a low-rank approximation of it!
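A sketch of that compression step on hypothetical smooth data: reshape the long vector into an $M \times M$ matrix and truncate its SVD; for smooth $f$ the numerical rank is tiny:

import numpy as np

M = 256
h = 1e-4
x = (np.arange(M * M) - M * M // 2) * h     # inflated grid, roughly [-3.3, 3.3]
g = np.exp(-x ** 2)                         # a long smooth vector
G = g.reshape(M, M)
s = np.linalg.svd(G, compute_uv=False)
print(int(np.sum(s > 1e-10 * s[0])))        # numerical rank, much smaller than M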

Numerical experiment

For the numerical experiment we used the diffusion equation with a "weird" non-periodic potential.

(Figures: solution and convergence; potential and initial condition.)

Notes

  • We generalized it to the Schrödinger equation
  • Can be generalized to $N$ dimensions (we inflate the domain in all directions)
  • Can be applied to much more general integrals over Brownian motion

Financial integral

We compute the PV (present discounted value), which takes into account that the value of money changes in time due to basic economic parameters. \begin{equation} PV=\left( \frac{\alpha}{\pi} \right)^{d/2} \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} v(\xi_1,\ldots, \xi_d) \, e^{-\alpha \xi_{1}^2}\ldots e^{-\alpha \xi_{d}^2} d\xi_d \ldots d\xi_1 \qquad v(\xi_1,\ldots, \xi_d) = \sum_{k=1}^{d} u_k m_k \end{equation} \begin{equation*} u_k=\prod_{j=0}^{k-1} \frac{1}{1+i_j}, \qquad m_k=cr_k(1-w_k +w_k c_k), \qquad r_k=\prod_{j=1}^{k-1}(1-w_j), \end{equation*} \begin{equation} c_k = \sum^{d-k}_{j=0}\frac{1}{(1+i_0)^j}, \qquad w_k=K_1 + K_2 \arctan(K_3 i_k + K_4), \qquad i_k = K^{k}_0 e^{\xi_1 + \ldots + \xi_k} i_0 \end{equation}

Results

The accuracy of the cross approximation is $\varepsilon = 10^{-10}$. The dimension of the integral is labeled $n$; it corresponds to the number of months in the mortgage-backed security model. Its parameters are $(i_0, K_1, K_2, K_3, K_4, \sigma^2) = (0.007, 0.01, -0.005, 10, 0.5, 0.0004)$. Ranks of the matrix are presented in the column labeled $r$.

  n      r    $T_{cross}$ (min.)   $T_{MC}$ (min.)
  36     7    0.1                  7.7
  48     10   0.16                 11.6
  90     10   0.3                  21.4
  180    10   0.5                  41.5
  360    10   0.8                  81.6
  720    9    1.56                 178
  1440   9    2.81                 354

Solving PDEs on very fine meshes using QTT

The idea of QTT can be used to solve PDEs on very fine grids.

Kazeev & Schwab (2015) have proved that for all second-order elliptic/parabolic PDEs with analytic right-hand side the solution has the QTT-structure.

There are hp-methods capable of optimal complexity, but they are very difficult to use.

The QTT-representation, by contrast, is very simple.

Let us do a demo in another window.

Non-rectangular domains

Recently we have realized how to use QTT for non-rectangular domains.

Idea: Split into quads! (Picture taken from the paper by D. Zorin et al.) Then use QTT for each particular piece.

Summary

  • QTT is a very efficient and convenient tool for PDE discretization on extremely fine meshes

Graphical models

Graphical models are a very interesting class of models for joint probability distributions.

Motivation

Say we want to

  • Denoise an image $X$;
  • Generate a new image of a horse;
  • Fill the gaps (image completion).

To do it, we may build a probabilistic model of our images:

$$ P: 256^{1000 \times 1000} \to [0, 1] $$

Now we can

  • Find the image that is close to $X$ and has high probability;
  • Generate new image from the distribution $P$;
  • Find the most probable values of hidden pixels.

Three dimensions

Random variable $\vec{x} = (x_1, x_2, x_3)$, $x_j \in \{1, 2\}$:

  • Estimate all 8 parameters from the data:

    $$ P(\vec{x}) := \frac{1}{N} \sum_{i=1}^N [\vec{x} = \vec{x}^i] $$

  • Just store it explicitly: $P \in \mathbb{R}^{2 \times 2 \times 2}$.

Million dimensions

How to estimate and store $256^{1\,000\,000}$ parameters (probabilities)?

Graphical models

  • Define an undirected graph $\mathcal{G}$ with nodes corresponding to the variables (pixels of the image).
  • Define some positive functions $\Psi_c(T_c; X, W)$ (called MRF factors) on the cliques of the graph $\mathcal{G}$.
  • The model is then defined as follows: $$ p(T | X, W) = \frac{1}{Z(X, W)} \prod\limits_{c \in \mathcal{C}} \Psi_c(T_c; X, W), $$ where $Z(X, W)$ is the normalization constant (partition function).

Even if the potentials are known, $Z(X, W)$ is difficult to compute (sum of a huge array).

Computing marginals

One example, from the paper Putting MRFs on a Tensor Train by A. Novikov, A. Rodomanov, A. Osokin, D. Vetrov (ICML 2014): $$\widehat{\mathbf{P}}(\vec{x}) = \prod_{i=1}^n \exp \left ( -\frac{1}{T} h_i x_i \right ) \prod_{(i, \, j) \in \mathcal{E}} \exp \left (-\frac{1}{T}c_{ij} x_i x_j \right )$$

Spin glass models, $T = 1$, $c_{ij} \sim U[-f, f]$.
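For intuition, a minimal numpy sketch for a chain instead of a grid: there the unnormalized distribution above is exactly a TT with $2 \times 2$ cores, and the partition function $Z$ is a product of transfer matrices (the sizes and couplings below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, f, T = 20, 1.0, 1.0
h = rng.uniform(-f, f, n)                   # unary fields h_i
c = rng.uniform(-f, f, n - 1)               # couplings c_{i,i+1} ~ U[-f, f]
x = np.array([-1.0, 1.0])                   # spin values
v = np.exp(-h[-1] * x / T)                  # factor of the last node
for i in range(n - 2, -1, -1):
    Mi = np.exp(-np.outer(x, x) * c[i] / T)     # pairwise factor exp(-c x x' / T)
    v = np.exp(-h[i] * x / T) * (Mi @ v)        # absorb node i
Z = v.sum()                                 # exact sum over all 2^n configurations
print(Z)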

Machine learning

The set of tensors with bounded TT-ranks has nice approximation properties, so it is a good idea to use it as a regularizer in machine learning problems.

This leads to the problems of the form

$$F(X) \rightarrow \min$$

where $X$ belongs to the manifold of tensors with bounded TT-ranks.

This can be solved using Riemannian optimization techniques, which seem very difficult at first glance, but are in fact simple and efficient (see Manopt).

In fact, the nonlinear manifold of TT-tensors with bounded ranks has properties similar to those of a linear subspace!

Summary

  • Tensor train decomposition is a very useful tool with many good algorithms
  • Interesting to combine with other techniques
