A signal $f(x)$, sampled on a uniform grid with $2^d$ points, is **reshaped** into a $2 \times \ldots \times 2$ $d$-dimensional tensor

$$V(i_1, \ldots, i_d) = v(i_1 + 2 i_2+ \ldots+2^{d-1} i_d)$$

For example, an exponential signal $f(x) = \exp(\lambda x)$ breaks into a product of vectors of length $2$!
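A minimal numpy sketch of this tensorization (the grid on $[0,1]$, the value $\lambda = 1.7$, and the rank tolerance are only illustrative): sample the exponential, reshape into a $2 \times \ldots \times 2$ tensor, and check that every unfolding has rank $1$.

```python
import numpy as np

d, lam = 10, 1.7                       # illustrative values
x = np.linspace(0, 1, 2 ** d)          # uniform grid with 2^d points
v = np.exp(lam * x)

# reshape into a 2 x 2 x ... x 2 tensor; order='F' makes i_1 the fastest index,
# so that V(i_1, ..., i_d) = v(i_1 + 2 i_2 + ... + 2^{d-1} i_d)
V = v.reshape([2] * d, order='F')

# the exponential separates exactly: every unfolding A_k has rank 1
for k in range(1, d):
    A_k = V.reshape(2 ** k, 2 ** (d - k), order='F')
    print(k, np.linalg.matrix_rank(A_k, tol=1e-8))   # prints 1 for every k
```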

First we tensorize indices, then we permute them to bring $i_k j_k$ together.

$$A(i_1 i_2 \ldots i_d; j_1 \ldots j_d) \rightarrow B(i_1 j_1; i_2 j_2; \ldots; i_d j_d), \quad \mbox{Hilbert curve ordering}$$

Why we do it is another question: our goal is to introduce a **reasonable structure** for image representation via tensors.
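A small numpy sketch of this index permutation (the matrix and the sizes are only illustrative): tensorize both indices of a $2^d \times 2^d$ matrix and transpose the axes so that $i_k$ and $j_k$ become neighbours.

```python
import numpy as np

d = 4
n = 2 ** d
A = np.random.rand(n, n)                        # a 2^d x 2^d matrix, just for illustration

# tensorize: A(i_1, ..., i_d; j_1, ..., j_d), with i_1 and j_1 the fastest bits (order='F')
T = A.reshape([2] * d + [2] * d, order='F')

# permute the axes to interleave the bits: (i_1, j_1, i_2, j_2, ..., i_d, j_d)
perm = [ax for k in range(d) for ax in (k, d + k)]
B = T.transpose(perm)

# group each pair (i_k, j_k) into one index of size 4: B(i_1 j_1; i_2 j_2; ...; i_d j_d)
B = B.reshape([4] * d, order='F')
print(B.shape)                                   # (4, 4, 4, 4)
```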

The key idea is the idea of **separation of variables**:

$$A(i_1, \ldots, i_d) \approx U_1(i_1) U_2(i_2) \ldots U_d(i_d).$$

In probability theory, it corresponds to the idea of **independence** of variables.

This class is too **narrow** to represent useful functions (or probability distributions).

For $d > 3$, the canonical decomposition of the form

$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha)$$

is **very difficult** to compute numerically; there is even an example of a $9 \times 9 \times 9$ tensor for which the tensor rank is not known!
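To make the format itself concrete, here is a small numpy sketch (the sizes, the rank, and the fixed $d = 4$ in the einsum string are only illustrative) that assembles a tensor from given canonical factors; finding good factors for a given tensor is the hard part.

```python
import numpy as np

n, d, r = 5, 4, 3                                # illustrative mode size, dimension, canonical rank
U = [np.random.rand(n, r) for _ in range(d)]     # factors U_k(i_k, alpha)

# A(i_1, ..., i_d) = sum_alpha U_1(i_1, alpha) ... U_d(i_d, alpha), written for d = 4
A = np.einsum('ia,ja,ka,la->ijkl', *U)
print(A.shape)                                   # (5, 5, 5, 5)
```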

**But this is not the only tensor decomposition** we can use.

Sometimes it is better to study with the code and to try your own applications:

- Tensor Toolbox 2.5 (T. Kolda)
- TT-Toolbox, MATLAB (http://github.com/oseledets/TT-Toolbox)
- TT-Toolbox, Python (http://github.com/oseledets/ttpy)

The maximal-volume principle appears in many areas; one of the earliest related references I know is **D-optimality** (Fedorov, 1972), and it is related to the selection of **optimal interpolation points** for a given basis set:

instead of doing linear regression, just do interpolation at $r$ points.

Given an $n \times r$ matrix, find the $r \times r$ submatrix of largest **volume** in it.

Use a greedy algorithm (now we know that it is D-optimality from at least 1972):

- Take some $r$ rows and put them first; let $\widehat{A}$ be this $r \times r$ submatrix. Compute $B = A \widehat{A}^{-1}$
- $B = \begin{pmatrix} I \\ Z \end{pmatrix}$
- Suppose the maximal element (in modulus) of $Z$ is in position $(i,j)$.
- Swap the $i$-th row (outside the submatrix) with the $j$-th row (inside it).
- Stop if the maximal element is less than $(1+\delta)$; a minimal sketch of this procedure follows the list.
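A minimal numpy sketch of this greedy procedure (the function name `maxvol`, the tolerance, and the iteration cap are my own choices for illustration, not the TT-Toolbox implementation):

```python
import numpy as np

def maxvol(A, delta=1e-2, maxiter=100):
    """Greedy search for an r x r submatrix of (quasi-)maximal volume in an n x r matrix."""
    n, r = A.shape
    rows = np.arange(n)                       # current ordering; the first r rows form A_hat
    for _ in range(maxiter):
        A_hat = A[rows[:r]]
        B = A[rows] @ np.linalg.inv(A_hat)    # B = A A_hat^{-1} = [I; Z] in this ordering
        i, j = np.unravel_index(np.argmax(np.abs(B)), B.shape)
        if np.abs(B[i, j]) < 1 + delta:       # no element large enough: stop
            break
        rows[[i, j]] = rows[[j, i]]           # swap the offending row into the submatrix
    return np.sort(rows[:r])                  # indices of the selected rows

# usage: select r "good" rows of a random tall matrix
A = np.random.randn(200, 10)
print(maxvol(A))
```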

Bad things about the canonical format:

- Best approximation may not exist
- Computing the canonical rank is NP-hard (matrix rank can be computed in polynomial time)
- No good algorithm

You can compute the Tucker decomposition by means of the SVD (a numpy sketch is given after the list):

- Compute the unfoldings: $A_1, A_2, A_3$
- Compute left SVD factors: $A_i \approx U_i \Phi_i$
- Compute the core: $G = A \times_1 U^{\top}_1 \times_2 U^{\top}_2 \times_3 U^{\top}_3$.
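A minimal numpy sketch of this computation (the truncation rule, the sizes, and the function names are only illustrative):

```python
import numpy as np

def unfold(A, k):
    """Mode-k unfolding: rows are indexed by i_k, columns by all remaining indices."""
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

def hosvd(A, eps=1e-8):
    """Tucker decomposition via the (truncated) higher-order SVD."""
    d = A.ndim
    U = []
    for k in range(d):
        u, s, _ = np.linalg.svd(unfold(A, k), full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))        # simple relative truncation
        U.append(u[:, :r])
    # core: G = A x_1 U_1^T x_2 U_2^T x_3 U_3^T (mode products with transposed factors)
    G = A
    for k in range(d):
        G = np.moveaxis(np.tensordot(U[k].T, np.moveaxis(G, k, 0), axes=(1, 0)), 0, k)
    return G, U

A = np.random.rand(6, 7, 8)
G, U = hosvd(A)
print(G.shape, [u.shape for u in U])
```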

The simplest construction **separates** one index at a time and gives the **tensor-train** (or **matrix-product state**) decomposition

$$A(i_1, \ldots, i_d) = G_1(i_1) G_2(i_2) \ldots G_d(i_d),$$

where $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for each fixed $i_k$.

**Similar to Hidden Markov models**

The tensor-train (in physics known as the matrix-product state) representation has the form

$$ A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d), $$

where $G_k(i_k)$ has size $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.

Now we have $d-1$ ranks (we call them **TT-ranks**). The parametrization is defined by the **TT-cores**.
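As a quick illustration of the format (the core shapes and values below are made up), an entry of the tensor is just a product of $d$ small matrices:

```python
import numpy as np

# illustrative TT-cores G_k of shape (r_{k-1}, n_k, r_k), with r_0 = r_d = 1
shapes = [(1, 4, 3), (3, 4, 3), (3, 4, 2), (2, 4, 1)]
cores = [np.random.rand(*s) for s in shapes]

def tt_entry(cores, idx):
    """A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d): a product of small matrices."""
    res = np.ones((1, 1))
    for G, i in zip(cores, idx):
        res = res @ G[:, i, :]
    return res[0, 0]

print(tt_entry(cores, (0, 1, 2, 3)))
```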

Why tensor train?

Theorem: there exists a TT-decomposition with TT-ranks

$$r_k = \mathrm{rank}\, A_k,$$

where $A_k = A(i_1 \ldots i_k; i_{k+1} \ldots i_d)$ is the $k$-th unfolding of the tensor.
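The proof is constructive: sequential SVDs of the unfoldings produce the cores. A minimal numpy sketch of this TT-SVD (the tolerance and the test tensor are only illustrative):

```python
import numpy as np

def tt_svd(A, eps=1e-10):
    """Decompose a full tensor into TT-cores by sequential SVDs of the unfoldings."""
    d = A.ndim
    cores, r_prev = [], 1
    C = A.reshape(r_prev * A.shape[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))          # numerical rank of the unfolding
        cores.append(U[:, :r].reshape(r_prev, A.shape[k], r))
        C = (s[:r, None] * Vt[:r]).reshape(r * A.shape[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, A.shape[-1], 1))
    return cores

# check on the 2 x ... x 2 tensorization of the exponential: all TT-ranks equal 1
d = 10
v = np.exp(1.7 * np.linspace(0, 1, 2 ** d))
print([G.shape for G in tt_svd(v.reshape([2] * d, order='F'))])
```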

**Rounding** procedure solves the following problem: given a TT-representation

find **another one** with smaller memory that approximates it with **threshold** $\epsilon$.

It can be done in $\mathcal{O}(dnr^3)$ operations using the **linear structure** of the format:

We can do orthogonalization of the cores!
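A minimal numpy sketch of such a rounding sweep (a right-to-left orthogonalization of the cores followed by a left-to-right SVD truncation; the naming and the truncation rule are my own choices, not the TT-Toolbox implementation):

```python
import numpy as np

def tt_round(cores, eps):
    """Reduce the TT-ranks of a tensor given by its cores, up to relative accuracy eps."""
    d = len(cores)
    cores = [G.copy() for G in cores]
    # right-to-left orthogonalization: make cores 1..d-1 right-orthogonal
    for k in range(d - 1, 0, -1):
        r0, n, r1 = cores[k].shape
        Q, R = np.linalg.qr(cores[k].reshape(r0, n * r1).T)
        cores[k] = Q.T.reshape(-1, n, r1)
        cores[k - 1] = np.einsum('anb,bc->anc', cores[k - 1], R.T)
    # left-to-right SVD truncation with an equidistributed threshold
    nrm = np.linalg.norm(cores[0])
    delta = eps * nrm / np.sqrt(d - 1) if d > 1 else 0.0
    for k in range(d - 1):
        r0, n, r1 = cores[k].shape
        U, s, Vt = np.linalg.svd(cores[k].reshape(r0 * n, r1), full_matrices=False)
        r = max(1, int(np.sum(np.cumsum(s[::-1] ** 2)[::-1] > delta ** 2)))
        cores[k] = U[:, :r].reshape(r0, n, r)
        cores[k + 1] = np.einsum('ab,bnc->anc', s[:r, None] * Vt[:r], cores[k + 1])
    return cores

# usage: a rank-1 tensor stored with padded (redundant) TT-ranks is compressed back to rank 1
d, n, r = 6, 3, 4
cores = []
for k in range(d):
    G = np.zeros((1 if k == 0 else r, n, 1 if k == d - 1 else r))
    G[0, :, 0] = np.random.rand(n)           # only the (0, :, 0) slice is nonzero
    cores.append(G)
print([G.shape for G in tt_round(cores, 1e-10)])
```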

To use tensor decompositions, we have to **approximate** a continuous problem by a discrete one, where the **vector of unknowns** can be indexed by a $d$-index.

- (Standard) Given a function of $d$ variables, introduce a tensor-product basis set, and the tensor is the tensor of coefficients
- (Non-standard) Given a function of one variable, take it on a uniform grid with $2^d$ points, reshape into a $2 \times \ldots \times 2$ $d$-dimensional tensor

It is a mapping from 1D functions to $d$-dimensional arrays, using the following representation:

$$v_k = f(x_k), \quad k = 1, \ldots, 2^d,$$

$$V(i_1, \ldots, i_d) = v(a + i h), \quad i = i_1 + 2 i_2 + \ldots + 2^{d-1} i_d.$$

It gives $\mathcal{O}(2dr^2) = \mathcal{O}(\log N)$ parameters.

The meaning of the QTT-rank is that you split the image into $2^k \times 2^k$ patches, and then you look at the **linear dependence** between patches, and compute the basis.

For the circle, due to grid artifacts, the dimension of the linear span is big.

Alternative: interpret this as a filter bank, and transform locally.

This leads to **WTT** transform, where the filters vary from level to level and are computed adaptively for the image.

http://arxiv.org/abs/1412.6553 Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition

A convolutional neural network layer is just a contraction with a 4D tensor. We approximated this tensor by a canonical decomposition (using the wonderful TensorLab by De Lathauwer et al.) and then put one more layer into the network!
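To make the first claim concrete, here is a small numpy sketch (the shapes and the function name are only illustrative) of a convolution layer written explicitly as a contraction with a 4D kernel tensor $K(t, s, i, j)$:

```python
import numpy as np

def conv_layer(u, K):
    """Valid 2D convolution as a contraction with the 4D kernel tensor K(t, s, i, j).
    u: input feature maps of shape (S, H, W); K: kernel of shape (T, S, d, d)."""
    T, S, ksz, _ = K.shape
    _, H, W = u.shape
    Ho, Wo = H - ksz + 1, W - ksz + 1
    out = np.zeros((T, Ho, Wo))
    for i in range(ksz):
        for j in range(ksz):
            # contract over the input-channel index s for the kernel offset (i, j)
            out += np.einsum('ts,sxy->txy', K[:, :, i, j], u[:, i:i + Ho, j:j + Wo])
    return out

u = np.random.rand(3, 32, 32)         # 3 input channels
K = np.random.rand(8, 3, 5, 5)        # 8 output channels, 5 x 5 spatial window
print(conv_layer(u, K).shape)         # (8, 28, 28)
```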

A multidimensional SVD gives you a tool: an efficient structure you can work with.

You can do:

- Tensor completion
- Solution of linear systems
- Regularization of regression

By reducing to low-rank constrained optimization:

$$F(X) \rightarrow \min,$$

subject to $X$ having low TT-rank.

- Linear systems: $$F(X) = \langle A(X), X \rangle - 2\langle F, X \rangle, \quad A = A^* > 0.$$
- Eigenvalue problems: $$F(X) = \langle A(X), X \rangle / \langle X, X \rangle.$$
- Completion problems: recover a tensor from a known subset of its entries.
- Dynamical low-rank approximation: $$\left\Vert \frac{dF(t)}{dt} - \frac{dX(t)}{dt} \right\Vert \rightarrow \min.$$

For all of these we have efficient methods, based on **DMRG** and the more recent **AMEn** methods.

The AMEn method, proposed by Dolgov & Savostyanov, does not require the solution of an $\mathcal{O}(n^2)$ system.

Instead, it uses the ALS step to update the basis, and then enriches the basis by the **gradient projection.**

At the moment, it is the most efficient black-box solver available (and we have its implementation in the TT-Toolbox!).

I became a big fan of Riemannian optimization techniques.

The set of TT-tensors with bounded TT-ranks is a **manifold**, thus we can write a **projected gradient method** and then do a **retraction** $R$ onto the manifold. Together with C. Lubich & B. Vandereycken we have designed one of the simplest retractions, called **projector-splitting**, which is a half-step of ALS in disguise.

The standard theory says that the convergence of such methods deteriorates when the curvature (i.e., the inverse of the minimal singular value) is large. In practice: it does not matter!
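For intuition, a minimal matrix-case sketch (the sizes, rank, step size, and the toy functional $F(X) = \tfrac12 \Vert X - A \Vert_F^2$ are my own choices): a Euclidean gradient step followed by a truncated-SVD retraction back to the rank-$r$ manifold. Note that this uses plain SVD truncation as the retraction, not the projector-splitting integrator, and skips the tangent-space projection of the gradient for brevity.

```python
import numpy as np

def retract(X, r):
    """Retraction onto the rank-r manifold by truncated SVD (a simple choice of R)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

np.random.seed(0)
A = np.random.rand(50, 40)                        # target of the toy functional
r, step = 5, 0.5
X = retract(np.random.rand(50, 40), r)            # start on the manifold
for it in range(100):
    grad = X - A                                  # Euclidean gradient of F
    X = retract(X - step * grad, r)               # gradient step + retraction
print(np.linalg.norm(X - A) / np.linalg.norm(A))  # relative error of the rank-r approximation
```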

$$x = Qx + f, \quad \Vert Q \Vert \leq \delta.$$

The solution $x^*$ has rank $5$, with $3$ large singular values, one at $10^{-3}$ of the largest, and one at $10^{-6}$ of the largest.

(Plot legend: blue = original gradient method, green = manifold-projected gradient, cyan = normal component of the error.)

- oseledets.github.io
- http://github.com/oseledets/TT-Toolbox

```python
# Notebook styling cell: load custom.css for these slides.
from IPython.core.display import HTML

def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)

css_styling()
```