Take a signal $f(x)$ sampled on a uniform grid with $2^d$ points and reshape the vector of samples into a $2 \times \ldots \times 2$ $d$-dimensional tensor
$$V(i_1, \ldots, i_d) = v(i_1 + 2 i_2+ \ldots+2^{d-1} i_d)$$
For example, an exponential signal $f(x) = \exp(\lambda x)$ factors into a product of vectors of length $2$!
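A minimal numpy check of this claim (the interval, $\lambda$ and $d$ below are arbitrary choices): tensorize $\exp(\lambda x)$ and verify that every unfolding of the resulting $2 \times \ldots \times 2$ tensor has rank $1$.

import numpy as np

d = 10                                   # 2^d grid points
lam = 1.5
x = np.linspace(0, 1, 2**d)              # uniform grid on [0, 1]
v = np.exp(lam * x)                      # exponential signal

# V(i_1, ..., i_d) = v(i_1 + 2 i_2 + ... + 2^{d-1} i_d):
# i_1 is the fastest index, hence Fortran ordering.
V = v.reshape([2] * d, order='F')

# exp(lam*(a + b)) = exp(lam*a) * exp(lam*b), so every unfolding
# V(i_1 ... i_k; i_{k+1} ... i_d) has rank 1.
for k in range(1, d):
    Ak = V.reshape(2**k, 2**(d - k), order='F')
    print(k, np.linalg.matrix_rank(Ak, tol=1e-10))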
First we tensorize the indices, then we permute them to bring $i_k$ and $j_k$ together:
$$A(i_1 i_2 \ldots i_d; j_1 \ldots j_d) \rightarrow B(i_1 j_1; i_2 j_2; \ldots; i_d j_d), \quad \mbox{(Hilbert curve ordering).}$$
Why we do it is another question: our goal is to introduce a reasonable structure for image representation via tensors.
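In numpy this index reshuffling is just a reshape, a transpose, and another reshape; the sketch below uses a random $2^d \times 2^d$ matrix as a stand-in for an image or an operator.

import numpy as np

d = 4
n = 2**d
A = np.random.rand(n, n)                     # any 2^d x 2^d matrix

# Tensorize both indices: A(i_1 ... i_d; j_1 ... j_d), i_1 and j_1 fastest.
T = A.reshape([2] * d + [2] * d, order='F')

# Permute to interleave the bits: B(i_1 j_1; i_2 j_2; ...; i_d j_d).
perm = []
for k in range(d):
    perm += [k, d + k]
B = T.transpose(perm)

# Group each pair (i_k, j_k) into a single index of size 4.
B = B.reshape([4] * d, order='F')
print(B.shape)                               # (4, 4, ..., 4), d times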
The key idea is the idea of separation of variables:
$$A(i_1, \ldots, i_d) = U_1(i_1) \ldots U_d(i_d)$$.
In probability theory, it corresponds to the idea of independence of variables.
This class is too narrow to represent useful functions (or probability distributions).
For $d > 3$, the canonical decomposition of the form
$$A(i_1, \ldots, i_d) = \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha)$$
is very difficult to compute numerically; there is even a $9 \times 9 \times 9$ tensor for which the tensor rank is not known!
But this is not the only tensor decomposition we can use.
Sometimes it is better to study the code and to try your own application.
The maximal-volume principle appears in many areas; one of the earliest related references I know is D-optimal design (Fedorov, 1972). It is related to the selection of optimal interpolation points for a given basis set:
$$f(x) \approx \sum_{\alpha=1}^r c_{\alpha} g_{\alpha}(x),$$instead of doing linear regression, just do interpolation at $r$ points.
Given an $n \times r$ matrix, find the $r \times r$ submatrix of largest volume in it.
Use a greedy algorithm (now we know that it is D-optimality, known since at least 1972).
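A sketch of the standard row-swapping maxvol iteration; the pivoted-QR initialization, the tolerance, and the monomial-basis usage example below are my choices, not prescribed by the text.

import numpy as np
from scipy.linalg import qr

def maxvol(A, tol=1.05, max_iter=100):
    # Greedy search for an r x r submatrix of (quasi-)maximal volume
    # in a tall n x r matrix A; returns the row indices of the submatrix.
    n, r = A.shape
    idx = qr(A.T, pivoting=True, mode='economic')[2][:r].copy()   # robust starting rows
    for _ in range(max_iter):
        B = A @ np.linalg.inv(A[idx])        # B[idx] is the identity
        i, j = np.unravel_index(np.argmax(np.abs(B)), B.shape)
        if np.abs(B[i, j]) <= tol:           # no swap increases the volume noticeably
            break
        idx[j] = i                           # this swap multiplies |det| by |B[i, j]|
    return idx

# Usage: quasi-optimal interpolation points for the basis g_alpha(x) = x^alpha.
r, n = 5, 1000
x = np.linspace(-1, 1, n)
G = np.vander(x, r, increasing=True)         # n x r collocation matrix
print(np.sort(x[maxvol(G)]))                 # endpoints plus well-spread interior points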
Bad things about the canonical format: the best rank-$r$ approximation may fail to exist, and there is no robust algorithm to compute it.
You can compute the Tucker decomposition by means of the SVD (the HOSVD): the factors are the dominant left singular vectors of the mode-$k$ unfoldings.
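A short numpy sketch of the HOSVD: each factor comes from the SVD of the corresponding mode unfolding, and the core is the tensor contracted with the factors. The test tensor with exact Tucker ranks $(3, 4, 2)$ is an arbitrary example.

import numpy as np

def hosvd(A, eps=1e-8):
    # Tucker factors from the SVDs of the mode-k unfoldings (HOSVD).
    factors = []
    for k in range(A.ndim):
        Ak = np.moveaxis(A, k, 0).reshape(A.shape[k], -1)   # mode-k unfolding
        U, s, _ = np.linalg.svd(Ak, full_matrices=False)
        rk = max(1, int(np.sum(s > eps * s[0])))
        factors.append(U[:, :rk])
    core = A
    for U in factors:
        # Contract the current leading mode with its factor; after ndim steps
        # the modes are back in their original order.
        core = np.tensordot(core, U.conj(), axes=([0], [0]))
    return core, factors

# A random tensor with exact Tucker ranks (3, 4, 2) is recovered up to roundoff.
G = np.random.rand(3, 4, 2)
Us = [np.random.rand(20, 3), np.random.rand(30, 4), np.random.rand(10, 2)]
A = np.einsum('abc,ia,jb,kc->ijk', G, *Us)
core, factors = hosvd(A)
err = np.linalg.norm(np.einsum('abc,ia,jb,kc->ijk', core, *factors) - A) / np.linalg.norm(A)
print([U.shape for U in factors], err)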
The simplest construction separates one index at a time and gives the tensor-train (or matrix-product state) decomposition
$$A(i_1, \ldots, i_d) \approx G_1(i_1) \ldots G_d(i_d), $$where $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix for each fixed $i_k$.
Similar to Hidden Markov models
The tensor-train representation (known in physics as the matrix-product state) has the form
$$ A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d), $$where $G_k(i_k)$ has size $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.
Now we have $d-1$ ranks (we call them TT-ranks). The parametrization is defined by the TT-cores.
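A tiny numpy sketch of this parametrization (all sizes below are arbitrary): store the cores, evaluate one entry as a product of matrices, and count the parameters.

import numpy as np

d, n, r = 6, 4, 3
ranks = [1] + [r] * (d - 1) + [1]
# TT-cores G_k of size r_{k-1} x n x r_k with r_0 = r_d = 1.
cores = [np.random.rand(ranks[k], n, ranks[k + 1]) for k in range(d)]

def tt_element(cores, idx):
    # A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d)
    res = np.eye(1)
    for G, i in zip(cores, idx):
        res = res @ G[:, i, :]
    return res[0, 0]

print(tt_element(cores, [0, 1, 2, 3, 0, 1]))
print(sum(G.size for G in cores), "parameters instead of", n**d)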
Why tensor train?
Theorem: there exists a TT-decomposition with TT-ranks
$$r_k = \mathrm{rank}\, A_k,$$
where $A_k = A(i_1 \ldots i_k;\, i_{k+1} \ldots i_d)$ is the $k$-th unfolding matrix of the tensor.
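The proof is constructive: split off one index at a time with a truncated SVD of the current unfolding. A numpy sketch of this sequential-SVD construction, tested on the quantized tensor of a sine (my choice of test function; its QTT-ranks are at most $2$):

import numpy as np

def tt_svd(A, eps=1e-10):
    # Sequential construction of the TT-cores by truncated SVDs of the unfoldings.
    d, ns = A.ndim, A.shape
    cores, C, r_prev = [], A, 1
    for k in range(d - 1):
        C = C.reshape(r_prev * ns[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))
        cores.append(U[:, :r].reshape(r_prev, ns[k], r))
        C = s[:r, None] * Vt[:r]                 # carry the remainder to the right
        r_prev = r
    cores.append(C.reshape(r_prev, ns[-1], 1))
    return cores

d = 10
x = np.linspace(0, 1, 2**d)
A = np.sin(20 * x).reshape([2] * d, order='F')   # quantized (QTT) tensor of a sine
cores = tt_svd(A)
print("TT-ranks:", [G.shape[2] for G in cores[:-1]])

# Reconstruct the full tensor from the cores and compare.
B = cores[0][0]
for G in cores[1:]:
    B = np.tensordot(B, G, axes=([B.ndim - 1], [0]))
B = B[..., 0]
print(np.linalg.norm(B - A) / np.linalg.norm(A))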
The rounding procedure solves the following problem: given a TT-representation
$$A(i_1, \ldots, i_d) = G_1(i_1) G_2(i_2) \ldots G_d(i_d)$$find another one with smaller memory that approximates it with threshold $\epsilon$.
It can be done in $\mathcal{O}(dnr^3)$ operations using the linear structure of the format:
We can do orthogonalization of the cores!
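A numpy sketch of the two sweeps (assuming the same core layout $r_{k-1} \times n_k \times r_k$ as in the TT-SVD sketch above): a right-to-left QR sweep orthogonalizes the cores, and a left-to-right sweep of small SVDs truncates the ranks. The demo exploits the linear structure: adding a TT-tensor to itself formally doubles the ranks, and rounding brings them back.

import numpy as np

def tt_round(cores, eps=1e-10):
    # 1) Right-to-left QR sweep: make every core (except the first) right-orthogonal.
    cores = [G.copy() for G in cores]
    d = len(cores)
    for k in range(d - 1, 0, -1):
        r0, n, r1 = cores[k].shape
        Q, R = np.linalg.qr(cores[k].reshape(r0, n * r1).T)
        cores[k] = Q.T.reshape(-1, n, r1)
        cores[k - 1] = np.einsum('anb,cb->anc', cores[k - 1], R)   # absorb R^T into the previous core
    # 2) Left-to-right sweep of truncated SVDs of the small left unfoldings.
    for k in range(d - 1):
        r0, n, r1 = cores[k].shape
        U, s, Vt = np.linalg.svd(cores[k].reshape(r0 * n, r1), full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))
        cores[k] = U[:, :r].reshape(r0, n, r)
        cores[k + 1] = np.einsum('ab,bnc->anc', s[:r, None] * Vt[:r], cores[k + 1])
    return cores

def tt_add(A, B):
    # The linear structure of the format: a sum of two TT-tensors is again a
    # TT-tensor whose ranks formally add up (block-diagonal middle cores).
    C = []
    for k, (a, b) in enumerate(zip(A, B)):
        if k == 0:
            C.append(np.concatenate([a, b], axis=2))
        elif k == len(A) - 1:
            C.append(np.concatenate([a, b], axis=0))
        else:
            c = np.zeros((a.shape[0] + b.shape[0], a.shape[1], a.shape[2] + b.shape[2]))
            c[:a.shape[0], :, :a.shape[2]] = a
            c[a.shape[0]:, :, a.shape[2]:] = b
            C.append(c)
    return C

d, n, r = 8, 3, 4
ranks = [1] + [r] * (d - 1) + [1]
G = [np.random.rand(ranks[k], n, ranks[k + 1]) for k in range(d)]
S = tt_add(G, G)
print([g.shape[2] for g in S[:-1]])              # formal ranks 2r
print([g.shape[2] for g in tt_round(S)[:-1]])    # rounded back to <= r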
To use tensor decompositions, we have to approximate a continuous problem by a discrete one, where the vector of unknowns can be indexed by a $d$-index.
It is a mapping from 1D functions to $d$-dimensional arrays, using the following representation:
$$v_k = f(x_k), \quad x_k = a + k h, \quad k = 1, \ldots, 2^d,$$and $V(i_1, \ldots, i_d) = f(a + i h)$, $\quad i = i_1 + 2 i_2 + \ldots + 2^{d-1} i_d.$
It gives $\mathcal{O}(2dr^2) = \mathcal{O}(\log N)$ parameters, where $N = 2^d$ is the number of grid points.
The meaning of the QTT-rank is that you split the image into $2^k \times 2^k$ patches, and then you look at the linear dependence between patches, and compute the basis.
For the circle, due to grid artifacts, the dimension of the linear span is big.
Alternative: interpret this as a filter bank, and transform locally.
This leads to WTT transform, where the filters vary from level to level and are computed adaptively for the image.
http://arxiv.org/abs/1412.6553 Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition
A convolutional neural network layer is just a contraction with a 4D tensor. We approximated this tensor by a canonical decomposition (using the wonderful TensorLab toolbox by De Lathauwer et al.)
and then put one more layer into the network!
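The idea in one picture, as a numpy/einsum sketch with toy sizes and a kernel that is exactly of low canonical rank (the real experiments of course decompose a trained kernel approximately and fine-tune afterwards): the full 4D contraction and the factored sequence of cheap contractions give the same layer output.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy sizes: T output channels, S input channels, k x k kernel, H x W image, CP-rank R.
T, S, k, H, W, R = 8, 6, 3, 16, 16, 4

# A kernel that is exactly a rank-R canonical (CP) tensor:
# K[t, s, i, j] = sum_a U[t, a] V[s, a] P[i, a] Q[j, a].
U, V, P, Q = (np.random.rand(T, R), np.random.rand(S, R),
              np.random.rand(k, R), np.random.rand(k, R))
K = np.einsum('ta,sa,ia,ja->tsij', U, V, P, Q)

X = np.random.rand(S, H, W)

# Original layer: one big contraction with the 4D kernel, ~ T*S*k^2 work per pixel.
patches = sliding_window_view(X, (k, k), axis=(1, 2))      # (S, H-k+1, W-k+1, k, k)
Y_full = np.einsum('tsij,sxyij->txy', K, patches)

# CP-factored layer: 1x1 conv, two separable 1D convs, 1x1 conv, ~ (T+S+2k)*R per pixel.
Z = np.einsum('sa,sxy->axy', V, X)                         # input channels -> R
Zp = sliding_window_view(Z, (k, k), axis=(1, 2))           # (R, H-k+1, W-k+1, k, k)
Z = np.einsum('ia,axyij->axyj', P, Zp)                     # contract kernel rows per channel
Z = np.einsum('ja,axyj->axy', Q, Z)                        # contract kernel columns per channel
Y_cp = np.einsum('ta,axy->txy', U, Z)                      # R -> output channels

print(np.allclose(Y_full, Y_cp))                           # True: same layer, much cheaper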
A multidimensional SVD gives you a tool: an efficient structure you can work with.
You can do:
By reducing to the solution of a low-rank constrained optimization problem: $$F(X) \rightarrow \min,$$
subject to $X$ having low TT-rank.
For all of these we have efficient methods, based on DMRG and the more recent AMEN methods.
The AMEN method, proposed by Dolgov & Savostyanov, does not require the solution of an $\mathcal{O}(n^2)$-sized local system.
Instead, it uses the ALS step to update the basis, and then enriches the basis by the gradient projection.
At the moment, it is the most efficient black-box solver available (and we have its implementation in TT-Toolbox!).
Can we do better? I believe we can, and the way lies through the topology of the low-rank space.
I became a big fan of Riemannian optimization techniques.
The set of TT-tensors with bounded TT-ranks is a manifold, thus we can write a projected gradient method
$$X_{k+1} = P_{\mathcal{T}}(X_k + \nabla F(X_k)),$$and then do a retraction $R$ onto the manifold. Together with C. Lubich & B. Vandereycken we have designed the simplest possible
retraction, called projector-splitting, which is a half-step of ALS in disguise.
$$X_{k+1} = I(X_k, \nabla F(X_k)).$$The standard theory says that the convergence of such methods deteriorates when the curvature (i.e., the inverse of the minimal singular value) is large. In practice: it does not matter!
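For intuition, here is a sketch of the first-order projector-splitting (KSL) step in the simplest possible setting: rank-$r$ matrices rather than TT-tensors, with $X$ stored as $U S V^T$ (the TT version sweeps over the cores in the same spirit). Variable names, sizes, and the test functional $F(X) = \frac{1}{2}\Vert X - A\Vert_F^2$ are my choices for illustration.

import numpy as np

def projector_splitting_step(U, S, Vt, D):
    # One first-order projector-splitting (KSL) step on the manifold of rank-r
    # matrices: retract X + D back to rank r, where X = U @ S @ Vt.
    K = U @ S + D @ Vt.T                 # K-step: update the column space
    U1, S1 = np.linalg.qr(K)
    S0 = S1 - U1.T @ D @ Vt.T            # S-step: backward half-step (note the minus)
    L = Vt.T @ S0.T + D.T @ U1           # L-step: update the row space
    V1, S2t = np.linalg.qr(L)
    return U1, S2t.T, V1.T

# Usage: a projected-gradient-like iteration X_{k+1} = I(X_k, -alpha * grad F(X_k))
# for F(X) = 0.5 * ||X - A||_F^2 with an exactly rank-r target A.
m, n, r = 60, 40, 5
A = np.random.rand(m, r) @ np.random.rand(r, n)
U = np.linalg.qr(np.random.rand(m, r))[0]
Vt = np.linalg.qr(np.random.rand(n, r))[0].T
S = np.eye(r)
for it in range(20):
    grad = U @ S @ Vt - A                # gradient of F at the current point
    U, S, Vt = projector_splitting_step(U, S, Vt, -1.0 * grad)
print(np.linalg.norm(U @ S @ Vt - A) / np.linalg.norm(A))   # relative error near machine precision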
$$x = Qx + f, \quad \Vert Q \Vert \leq \delta. $$The solution $x^*$ has rank $5$, with $3$ large singular values, one of size $10^{-3}$ of the largest, and one of size $10^{-6}$ of the largest.
(Figure: convergence history. Blue: the original gradient method; green: the manifold-projected gradient; cyan: the normal component of the error.)