Tensor / matrix decompositions

Ivan Oseledets

Skolkovo Institute of Science and Technology (Skoltech), Russia

Link to this talk: http://bit.ly/1xsnRqg

What is a tensor?

A tensor is a $d$-way array, $A(i_1, \ldots, i_d)$. For $d=2$ we get matrices.

A tensor decomposition is a (compact) representation of a tensor using a small number of parameters.

Review: T. Kolda, B. Bader, Tensor Decompositions and Applications (SIAM Review, 2009)

They appear in many applications:

multivariate function approximation, quantum mechanics, multiparametric problems, hypergraphs, multidimensional data...

General setup

We select a certain low-dimensional manifold $\mathcal{M}$ of tensors, and try to solve:

  1. Optimization problems, $F(X) \rightarrow \min$ with $X \in \mathcal{M}$

  2. (or) Using the a priori knowledge that $X \in \mathcal{M}$, recover $X$ by sampling

Multivariate functions

Many applications involve computation of a complex function of several parameters:

$$E = f(p_1, \ldots, p_M)$$

A single function evaluation can be time-consuming,

so an efficient interpolation scheme is a good idea.

Multivariate function interpolation

There are not too many efficient methods for working with multivariate functions:

  • Radial basis functions
  • Sparse grids
  • Best $N$-term approximation (wavelets?)
  • Different machine learning techniques (decision trees, neural networks)

Many successful methods are based on the idea of separation of variables

Separation of variables

A function is called separable if $$ f(x_1, \ldots, x_d) = f_1(x_1) f_2(x_2) \ldots f_d(x_d).$$

Not many functions can be represented (or even well approximated) in this form:

it essentially means that the variables are independent in some sense.

A more general and efficient class of functions is the sum of separable functions:

$$f(x_1,\ldots, x_d) \approx \sum_{\alpha=1}^r f^{(\alpha)}_1(x_1) \ldots f^{(\alpha)}_d(x_d)$$

Fitting by sums of separable functions

How do we fit a given function $f$ by a sum of separable functions?

First, we can always consider $x_1, \ldots, x_d$ to be discrete, i.e.,

$$A(i_1, \ldots, i_d) = f(x_1(i_1), \ldots, x_d(i_d))$$

Second, we approximate the resulting tensor by sampling it.
Why is this possible at all?

Number of parameters

In the discrete case, the representation $$A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha)$$ is called the canonical format, or canonical polyadic (CP) format.
The number of parameters is $dnr$, so if $r$ is small, we can hope to fit them from a modest number of samples.
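
To make the counting concrete, here is a minimal numpy sketch (all sizes and the helper cp_entry are illustrative) that stores the CP factors and reconstructs a single entry of the tensor:

import numpy as np

d, n, r = 6, 32, 5                                 # dimensions, grid size, canonical rank (illustrative)
U = [np.random.rand(n, r) for _ in range(d)]       # factor matrices U_k(i_k, alpha)

def cp_entry(U, idx):
    # A(i_1, ..., i_d) = sum_alpha U_1(i_1, alpha) * ... * U_d(i_d, alpha)
    prod = np.ones(U[0].shape[1])
    for Uk, ik in zip(U, idx):
        prod *= Uk[ik, :]
    return prod.sum()

print('Entries in the full tensor: ', n ** d)      # about 10^9
print('Parameters in the CP format:', d * n * r)   # 960
print('A(0, 1, 2, 3, 4, 5) =', cp_entry(U, [0, 1, 2, 3, 4, 5]))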

Matrix case

In two dimensions, separation of variables boils down to low-rank approximation: $$A \approx UV^{\top}$$
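
A quick numpy sketch (using the Hilbert matrix as a test case and the truncated SVD, which gives the best rank-$r$ approximation) shows how few parameters such a factorization needs:

import numpy as np
from scipy.linalg import hilbert

a = hilbert(1000)                  # a matrix with rapidly decaying singular values
u, s, vt = np.linalg.svd(a)
r = 10
U = u[:, :r] * s[:r]               # absorb the singular values into the left factor
V = vt[:r, :].T
print('Rank-%d error:' % r, np.linalg.norm(U.dot(V.T) - a))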

Skeleton decomposition

How can we approximate a low-rank matrix without computing all of its elements?
The skeleton decomposition gives an answer: if a matrix has rank $r$, it can be recovered exactly from any $r$ linearly independent columns and $r$ linearly independent rows.
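
A small numpy sketch for a synthetic, exactly rank-$r$ matrix (choosing the first $r$ rows and columns is purely illustrative; it works almost surely for a random matrix):

import numpy as np

n, r = 200, 8
A = np.random.rand(n, r).dot(np.random.rand(r, n))   # an exactly rank-r matrix

rows = np.arange(r)                 # indices of r linearly independent rows
cols = np.arange(r)                 # indices of r linearly independent columns
C = A[:, cols]                      # n x r block of chosen columns
R = A[rows, :]                      # r x n block of chosen rows
Ahat = A[np.ix_(rows, cols)]        # r x r intersection submatrix

A_skel = C.dot(np.linalg.solve(Ahat, R))   # A = C Ahat^{-1} R for an exact rank-r matrix
print('Recovery error:', np.linalg.norm(A_skel - A))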

Cross approximation

Cross approximation is a heuristic sampling technique for the selection of "good" rows and columns in a low-rank matrix.

The goal is to find rows and columns that maximize the volume (the modulus of the determinant) of their intersection submatrix.

The maximal volume principle (Goreinov, Tyrtyshnikov) states that

$$ \Vert A - A_{\mathrm{skel}} \Vert_C \leq (r + 1) \sigma_{r+1},$$

where $\Vert \cdot \Vert_C$ is the Chebyshev norm (the maximal-in-modulus element of the matrix) and $\sigma_{r+1}$ is the $(r+1)$-th singular value.

We can do a short demo (and compare with the SVD)


In [26]:
import numpy as np
from rect_cross2d import rect_cross2d
from scipy.linalg import hilbert

a = hilbert(1000)                          # 1000 x 1000 Hilbert matrix (numerically low-rank)
%timeit U, S, V = rect_cross2d(a, 1e-5)    # cross approximation: samples only a few rows/columns
%timeit U, S, V = np.linalg.svd(a)         # full SVD: touches every element
U, S, V = rect_cross2d(a, 1e-5)
print('True error:', np.linalg.norm(U.dot(np.diag(S).dot(V)) - a))


10 loops, best of 3: 28 ms per loop
1 loops, best of 3: 742 ms per loop
True error: 1.08067407133e-05

Multivariate case: problems

There is a big problem in many dimensions:

the exact interpolation (skeleton) trick does not carry over to the sum-of-products (canonical) format.

The problem is not with the algorithms but with the format itself:

the best approximation may not even exist!

Bad example

$$f(x_1, \ldots, x_d) = x_1 + \ldots + x_d.$$

Consider the auxiliary function

$$g(t) = (1 + x_1 t) (1 + x_2 t) \ldots (1 + x_d t).$$

Then

$$f = g'(0) = \underbrace{\frac{g(h) - g(0)}{h}}_{\mathrm{rank~2}} + \mathcal{O}(h),$$

so $f$ can be approximated by rank-$2$ tensors with arbitrary accuracy, but no exact rank-$2$ representation exists!

New tensor formats

This motivated the development of new tensor formats that are as stable as the singular value decomposition.

The Tensor Train (TT) format is the one I use the most; it is simple and efficient:

$$A(i_1, \ldots, i_d) \approx G_1(i_1) \ldots G_d(i_d),$$

where $G_k(i_k)$ has size $r_{k-1} \times r_k$, $r_0 = r_d = 1$.
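
A minimal numpy sketch (random cores, illustrative sizes; tt_entry is a toy helper) of how a single entry is evaluated as a product of small matrices:

import numpy as np

d, n = 8, 10
ranks = [1, 3, 3, 3, 3, 3, 3, 3, 1]                 # r_0, ..., r_d with r_0 = r_d = 1
cores = [np.random.rand(ranks[k], n, ranks[k + 1])  # core G_k stored as an r_{k-1} x n x r_k array
         for k in range(d)]

def tt_entry(cores, idx):
    # A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d)
    G = cores[0][:, idx[0], :]                      # 1 x r_1 matrix
    for core, ik in zip(cores[1:], idx[1:]):
        G = G.dot(core[:, ik, :])                   # multiply by the r_{k-1} x r_k slice
    return G[0, 0]

print('A(0, 1, ..., 7) =', tt_entry(cores, list(range(d))))
print('Parameters in the TT-format:', sum(c.size for c in cores))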

TT-format in a nutshell

  • If there is a canonical representation with rank $R$, then $r_k \leq R$
  • TT-ranks are computable via SVDs of auxiliary (unfolding) matrices -- see the sketch below
  • Basic linear algebra operations can be performed directly in the TT-format
  • Optimization problems can be solved with the solution kept in the TT-format
  • There is an exact interpolation formula with $\mathcal{O}(dnr^2)$ parameters (the TT-cross method)
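
Below is a rough numpy sketch of the TT-SVD idea (successive SVDs of the unfolding matrices); tt_svd is an illustrative toy, not the implementation from the toolboxes:

import numpy as np

def tt_svd(A, tol=1e-10):
    # compress a full d-dimensional array into TT-cores by successive SVDs of unfoldings
    shape = A.shape
    d = len(shape)
    C = A.copy()
    cores, r_prev = [], 1
    for k in range(d - 1):
        C = C.reshape(r_prev * shape[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > tol * s[0])))      # rank revealed by the singular values
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        C = s[:r, None] * Vt[:r, :]
        r_prev = r
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

# the tensor A(i_1, ..., i_d) = i_1 + ... + i_d has TT-ranks 2
n, d = 10, 6
A = np.indices([n] * d).sum(axis=0).astype(float)
cores = tt_svd(A)
print('TT-ranks:', [c.shape[0] for c in cores] + [1])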

Packages

We have two basic packages: the TT-Toolbox (MATLAB) and ttpy (Python).

Many basic and advanced algorithms are implemented in them and can be used as building blocks.

Demo on the cross approximation of tensors


In [27]:
import numpy as np
import tt
from tt.cross import rect_cross

d = 20                        # mode size
n = 10                        # number of dimensions
x0 = tt.rand(d, n, 3)         # random TT tensor used as the initial guess
h = 1e-2
def myfun1(x):
    # element-wise function of the multi-index: f(i_1, ..., i_n) = i_1 + ... + i_n
    return np.sum(x, axis=1)
    #return np.sqrt((((h * x) ** 2).sum(axis=1)))

x1 = rect_cross(myfun1, x0)   # TT-cross: builds the approximation from adaptively sampled entries
x1 = x1.round(1e-8)           # truncate the TT-ranks
print(x1)


swp: 0/9 er_rel = 2.1e+00 er_abs = 6.4e+08 erank = 7.0 fun_eval: 8400
swp: 1/9 er_rel = 7.6e-16 er_abs = 2.3e-07 erank = 11.7 fun_eval: 36120
This is a 10-dimensional tensor 
r(0)=1, n(0)=20 
r(1)=2, n(1)=20 
r(2)=2, n(2)=20 
r(3)=2, n(3)=20 
r(4)=2, n(4)=20 
r(5)=2, n(5)=20 
r(6)=2, n(6)=20 
r(7)=2, n(7)=20 
r(8)=2, n(8)=20 
r(9)=2, n(9)=20 
r(10)=1 

Advanced methods

General strategy: formulate the problem as a minimization problem

$$ F(X) \rightarrow \min, \quad X \in \mathcal{M}. $$
  • Linear systems: $F(x) = (Ax, x) - 2(f, x)$, where $x$ can be mapped to a tensor
  • Eigenvalue problems: $F(x) = (Ax, x)/(x, x)$
  • Dynamical problems: $F(x) = \Vert \frac{dx}{dt} - \frac{dA}{dt} \Vert$

Minimization techniques

  • Idea 1: Use an ordinary optimization method plus projection: $X_{k+1} = P(X_k + G(X_k))$
  • Idea 2: Use alternating least squares (exploiting the polylinear structure) -- see the matrix sketch below
  • Best: combine 1+2 (the AMEn method). Instead of optimizing $A \approx UV^{\top}$, we enrich the basis with new elements.

All of these are implemented in the Toolboxes (both Python and MATLAB).
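
To illustrate Idea 2 in the simplest (matrix) case, here is a toy numpy sketch of alternating least squares for $A \approx UV^{\top}$; als_lowrank and all the sizes are illustrative:

import numpy as np

def als_lowrank(A, r, n_iter=20):
    # fit A ~ U V^T by alternating least squares: each subproblem is linear
    n, m = A.shape
    V = np.random.rand(m, r)
    for _ in range(n_iter):
        U = np.linalg.lstsq(V, A.T, rcond=None)[0].T   # fix V, solve for U
        V = np.linalg.lstsq(U, A, rcond=None)[0].T     # fix U, solve for V
    return U, V

A = np.random.rand(500, 6).dot(np.random.rand(6, 400))  # an exactly rank-6 matrix
U, V = als_lowrank(A, r=6)
print('Relative ALS error:', np.linalg.norm(U.dot(V.T) - A) / np.linalg.norm(A))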

Linear system demo


In [28]:
import numpy as np
import tt
from tt.amen import amen_solve

d = 8
f = 4
mat = tt.qlaplace_dd([d] * f)              # discrete Laplace operator in the QTT-format
rhs = tt.ones(2, d * f)                    # all-ones right-hand side
sol = amen_solve(mat, rhs, rhs, 1e-6)      # AMEn solver: initial guess rhs, tolerance 1e-6
print('Error:', (tt.matvec(mat, sol) - rhs).norm() / rhs.norm())
print('Efficient rank:', sol.erank)


Error: 5.96702346748e-07
Efficient rank: 35.2790975909

Recent results

  • Rectangular maximum-volume principle (A. Mikhalev): selecting representative columns/rows
  • Dynamical low-rank approximation

Rectangular maximum volume

The maximum-volume principle for square submatrices:

If $A$ is $n \times r$ and $\widehat{A}$ is the $r \times r$ submatrix with the largest volume (modulus of the determinant), then

$$A = C \widehat{A},$$

where $|C_{ij}| \leq 1$.
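
A toy numpy sketch of the greedy maxvol idea (keep swapping in the row with the largest coefficient); maxvol_greedy is an illustrative re-implementation, not the rect_maxvol package:

import numpy as np

def maxvol_greedy(A, tol=1.05, max_iter=100):
    # search for an r x r submatrix of (quasi-)maximal volume in an n x r matrix
    n, r = A.shape
    piv = np.arange(r)                          # start from the first r rows (any nonsingular choice works)
    for _ in range(max_iter):
        C = A.dot(np.linalg.inv(A[piv, :]))     # coefficient matrix, C[piv, :] = identity
        i, j = np.unravel_index(np.argmax(np.abs(C)), C.shape)
        if np.abs(C[i, j]) <= tol:              # all coefficients are <= 1 (up to tol): done
            break
        piv[j] = i                              # the swap increases the volume by |C[i, j]|
    return piv, C

A = np.random.rand(1000, 10)
piv, C = maxvol_greedy(A)
print('Selected rows:', np.sort(piv))
print('max |C_ij| =', np.abs(C).max())          # at most tol, as the principle promises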

Rectangular maximum volume (2)

The maximum-volume principle for rectangular submatrices:

If $A$ is $n \times r$ and $\widehat{A}$ is a $K \times r$ submatrix, $K \geq r$, with the largest $2$-volume, then

$$A = C \widehat{A},$$

where $\Vert C_i \Vert_2 \leq \sqrt{\frac{r}{K + r - 1}}$.

Volume of rectangular matrices

We define the $p$-volume of a $K \times r$ matrix $A$ with rows $a_i$, $i = 1, \ldots, K$, as the volume of the set

$$G = \{ x: x = \sum_i c_i a_i, \quad \Vert c \Vert_p \leq 1 \}$$

The 2-volume:

$$\mathrm{vol}(G) = \gamma \sqrt{\det (A^* A)},$$ where $\gamma$ is the volume of the $r$-dimensional unit ball.

The 2-volume can be efficiently optimized, leading to the rectmaxvol algorithm.

Typically, $K \approx 2r$.

https://bitbucket.org/muxas/rect_maxvol (Alexander Mikhalev)

Applications of rectmaxvol

  • TT-cross approximation (previously one had to select exactly $r$ rows from an $rn \times r$ matrix -- no rank adaptation)
  • Recommender systems
  • Preconditioning of linear least squares
  • Finding maximal elements in low-rank matrices

Solving dynamical problems

Besides solving optimization (rank-constrained) problems, we can also solve dynamical problems, i.e.,

efficiently recompute decompositions when the matrix/tensor changes.

Concept of dynamical low-rank approximation

We have a time-varying matrix $A(t)$ (time can be discrete!) and we want to do low-rank approximation of it:

$$A(t) \approx X(t) = U(t) S(t) V^{\top}(t).$$

The naive method (an SVD projection at every time step) ignores the time dependence.

Dirac-Frenkel variational principle

In quantum physics, the optimal approximation has for many years been defined via the Dirac-Frenkel variational principle:

$$\Vert \frac{dA}{dt} - \frac{dX}{dt} \Vert = \min,$$

or

$$ \langle \frac{dA}{dt} - \frac{dX}{dt}, Z \rangle = 0 $$

for all $Z$ in the tangent space to the manifold $\mathcal{M}$ at $X$.

Equations of motion in 2D case

(Lubich, Koch, H.-D. Meyer,...)

$$\begin{split} &\frac{dU}{dt} S = \frac{dA}{dt} V,\\ &\frac{dV}{dt} S^{\top} = \frac{dA^{\top}}{dt} U, \\ &\frac{dS}{dt} = U^{\top} \frac{dA}{dt} V. \end{split}$$

Suppose you start from rank $1$ and evolve into a rank-$10$ solution: the matrix $S$ will be singular!

Projector-splitting

$$\frac{dX}{dt} = P_X(\frac{dA}{dt}), $$

where $$ P_X(Z) = Z - (I - UU^{\top}) Z (I - VV^{\top}). $$
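
Expanding the product, $P_X(Z) = UU^{\top} Z + Z VV^{\top} - UU^{\top} Z VV^{\top}$; a tiny numpy check with illustrative sizes:

import numpy as np

n, m, r = 40, 30, 5
U, _ = np.linalg.qr(np.random.rand(n, r))      # orthonormal factors of X = U S V^T
V, _ = np.linalg.qr(np.random.rand(m, r))
Z = np.random.rand(n, m)

Pu, Pv = U.dot(U.T), V.dot(V.T)
P1 = Z - (np.eye(n) - Pu).dot(Z).dot(np.eye(m) - Pv)   # the formula above
P2 = Pu.dot(Z) + Z.dot(Pv) - Pu.dot(Z).dot(Pv)         # the expanded form
print('Difference:', np.linalg.norm(P1 - P2))          # should be at machine-precision level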

Splitting integrator

The final splitting steps are

  • Step "L": $$\begin{split}L_0 = V_0 S^{\top}_0 \ L_1 = L_0 + (A_1 - A_0)^{\top} U_0 \
     [V_1, S^{\top}_1] = QR(L_1)\end{split}$$
  • Step "K": $$\begin{split} K_0 = U_0 S^{\top}_0 \ K_1 = K_0 + (A_1 - A_0) V_0 \

     [U_1, S_1] = QR(K_1)\end{split}$$
  • Step "L": $$S_1 = S_0 - U^{\top}_0 (A_1 - A_0) V_0$$ (note the minus!) The order of the steps matters

Can be used for updating!
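
Here is a minimal numpy sketch of one such update, written here with the K-step first (any consistent ordering of the three substeps gives a splitting integrator); ksl_step and the test sizes are illustrative:

import numpy as np

def ksl_step(U0, S0, V0, dA):
    # one projector-splitting update of X = U S V^T after the matrix changes by dA = A1 - A0
    K1 = U0.dot(S0) + dA.dot(V0)                # K-step: update the left factor
    U1, S_hat = np.linalg.qr(K1)
    S_tilde = S_hat - U1.T.dot(dA).dot(V0)      # S-step: note the minus sign
    L1 = V0.dot(S_tilde.T) + dA.T.dot(U1)       # L-step: update the right factor
    V1, S1t = np.linalg.qr(L1)
    return U1, S1t.T, V1

# a rank-r matrix and a rank-r perturbation of it (so the update can be exact)
n, m, r = 300, 200, 5
X, Y = np.random.rand(n, r), np.random.rand(r, m)
A0 = X.dot(Y)
A1 = (X + 0.1 * np.random.rand(n, r)).dot(Y + 0.1 * np.random.rand(r, m))

u, s, vt = np.linalg.svd(A0, full_matrices=False)
U0, S0, V0 = u[:, :r], np.diag(s[:r]), vt[:r, :].T

U1, S1, V1 = ksl_step(U0, S0, V0, A1 - A0)
print('Update error:', np.linalg.norm(U1.dot(S1).dot(V1.T) - A1))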

Application to convolutional neural networks

Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition. Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, Victor Lempitsky, arXiv:1412.6553

We used tensor decompositions to speed up CNNs by up to a factor of 10. A convolutional layer computes \begin{equation} V(x,y,t) = \sum_{i=x-\delta}^{x+\delta} \; \sum_{j=y-\delta}^{y+\delta} \; \sum_{s=1}^S K(i,j,s,t)\, U(i,j,s). \end{equation}

We approximate $K$ by a CP-decomposition and get an excellent initial condition for the fine-tuning!
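
A back-of-the-envelope sketch of the parameter savings (the layer sizes below are made up for illustration and are not the exact ones from the paper):

# parameter count of a convolutional kernel before and after a rank-R CP decomposition
d, S, T, R = 5, 96, 256, 64                     # kernel size, input/output channels, CP rank (illustrative)

full_params = d * d * S * T                     # the full 4D kernel K(i, j, s, t)
cp_params = R * (d + d + S + T)                 # the four CP factor matrices
print('Full kernel parameters:', full_params)
print('CP-format parameters:  ', cp_params)
print('Compression factor:     %.1f' % (full_params / float(cp_params)))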

CNN: results

36-character CNN: 1% accuracy drop and an 8.5x speedup

AlexNet: the second convolution layer is sped up by a factor of 4, with a 1% increase in top-5 error

Conclusions

  • There are several SVD-based tensor formats
  • We can solve rank-constrained optimization/dynamical/interpolation problems

Questions?
