Tensor / matrix decompositions

Ivan Oseledets

Skolkovo Institute of Science and Technology (Skoltech), Russia

Link to this talk: http://bit.ly/1xsnRqg

What is a tensor?

A tensor is a $d$-way array, $A(i_1, \ldots, i_d)$. For $d=2$ we get matrices.

A tensor decomposition is a (compact) representation of a tensor using a small number of parameters.

Review: T. Kolda, B. Bader, Tensor Decompositions and Applications (SIAM Review, 2009)

They appear in many applications:

multivariate function approximation, quantum mechanics, multiparametric problems, hypergraphs, multidimensional data...

General setup

We select a certain low-dimensional manifold $\mathcal{M}$ of tensors, and try to solve:

  1. Optimization problems, $F(X) \rightarrow \min$ with $X \in \mathcal{M}$

  2. (or) Using the a priori knowledge that $X \in \mathcal{M}$, recover $X$ by sampling

Multivariate functions

Many applications involve computation of a complex function of several parameters:

$$E = f(p_1, \ldots, p_M)$$

A single function evaluation can be time-consuming,

so an efficient interpolation scheme is a good idea.

Multivariate function interpolation

There are not too many efficient methods for working with multivariate functions:

  • Radial basis functions
  • Sparse grids
  • Best $N$-term approximation (wavelets?)
  • Different machine learning techniques (decision trees, neural networks)

Many successful methods are based on the idea of separation of variables

Separation of variables

A function is called separable if $$ f(x_1, \ldots, x_d) = f_1(x_1) f_2(x_2) \ldots f_d(x_d).$$

Not many functions can be represented (or even well approximated) in this form:

it essentially means that the variables are independent in some sense.

A more general and efficient class of functions is the sum of separable functions:

$$f(x_1,\ldots, x_d) \approx \sum_{\alpha=1}^r f^{(\alpha)}_1(x_1) \ldots f^{(\alpha)}_d(x_d)$$

Fitting by sums of separable functions

How do we fit a given function $f$ by a sum of separable functions?

First, we can always consider $x_1, \ldots, x_d$ to be discrete, i.e.,

$$A(i_1, \ldots, i_d) = f(x_1(i_1), \ldots, x_d(i_d))$$

Second, we approximate the resulting tensor by sampling it.
Why is this possible at all?

Number of parameters

In the discrete case, the representation $$A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha)$$ is called the canonical format, or canonical polyadic (CP) format.
The number of parameters is $dnr$, so if $r$ is small, we can hope to fit them from a modest number of samples.
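
To make the counting concrete, here is a minimal numpy sketch (all sizes and the helper cp_entry are illustrative) that stores the CP factors and reconstructs a single entry of the tensor:

import numpy as np

d, n, r = 6, 32, 5                                 # dimensions, grid size, canonical rank (illustrative)
U = [np.random.rand(n, r) for _ in range(d)]       # factor matrices U_k(i_k, alpha)

def cp_entry(U, idx):
    # A(i_1, ..., i_d) = sum_alpha U_1(i_1, alpha) * ... * U_d(i_d, alpha)
    prod = np.ones(U[0].shape[1])
    for Uk, ik in zip(U, idx):
        prod *= Uk[ik, :]
    return prod.sum()

print('Entries in the full tensor: ', n ** d)      # about 10^9
print('Parameters in the CP format:', d * n * r)   # 960
print('A(0, 1, 2, 3, 4, 5) =', cp_entry(U, [0, 1, 2, 3, 4, 5]))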

Matrix case

In two dimensions, separation of variables boils down to low-rank approximation: $$A \approx UV^{\top}$$
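
A quick numpy sketch (using the Hilbert matrix as a test case and the truncated SVD, which gives the best rank-$r$ approximation) shows how few parameters such a factorization needs:

import numpy as np
from scipy.linalg import hilbert

a = hilbert(1000)                  # a matrix with rapidly decaying singular values
u, s, vt = np.linalg.svd(a)
r = 10
U = u[:, :r] * s[:r]               # absorb the singular values into the left factor
V = vt[:r, :].T
print('Rank-%d error:' % r, np.linalg.norm(U.dot(V.T) - a))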

Skeleton decomposition

How can we approximate a low-rank matrix without computing all of its elements?
The skeleton decomposition gives an answer: if a matrix has rank $r$, it can be recovered exactly from any $r$ linearly independent columns and $r$ linearly independent rows.
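
A small numpy sketch for a synthetic, exactly rank-$r$ matrix (choosing the first $r$ rows and columns is purely illustrative; it works almost surely for a random matrix):

import numpy as np

n, r = 200, 8
A = np.random.rand(n, r).dot(np.random.rand(r, n))   # an exactly rank-r matrix

rows = np.arange(r)                 # indices of r linearly independent rows
cols = np.arange(r)                 # indices of r linearly independent columns
C = A[:, cols]                      # n x r block of chosen columns
R = A[rows, :]                      # r x n block of chosen rows
Ahat = A[np.ix_(rows, cols)]        # r x r intersection submatrix

A_skel = C.dot(np.linalg.solve(Ahat, R))   # A = C Ahat^{-1} R for an exact rank-r matrix
print('Recovery error:', np.linalg.norm(A_skel - A))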

Cross approximation

Cross approximation is a heuristic sampling technique for the selection of "good" rows and columns in a low-rank matrix.

The goal is to find rows and columns that maximize the volume (the modulus of the determinant) of their intersection submatrix.

The maximal volume principle (Goreinov, Tyrtyshnikov) states that

$$ \Vert A - A_{\mathrm{skel}} \Vert_C \leq (r + 1) \sigma_{r+1},$$

where $\Vert \cdot \Vert_C$ is the Chebyshev norm (the maximal-in-modulus element of the matrix) and $\sigma_{r+1}$ is the $(r+1)$-th singular value.

We can do a short demo (and compare with the SVD)


In [26]:
import numpy as np
from rect_cross2d import rect_cross2d
from scipy.linalg import hilbert

a = hilbert(1000)                          # 1000 x 1000 Hilbert matrix (numerically low-rank)
%timeit U, S, V = rect_cross2d(a, 1e-5)    # cross approximation: samples only a few rows/columns
%timeit U, S, V = np.linalg.svd(a)         # full SVD: touches every element
U, S, V = rect_cross2d(a, 1e-5)
print('True error:', np.linalg.norm(U.dot(np.diag(S).dot(V)) - a))


10 loops, best of 3: 28 ms per loop
1 loops, best of 3: 742 ms per loop
True error: 1.08067407133e-05

Multivariate case: problems

There is a big problem in many dimensions:

the exact interpolation (skeleton) trick does not carry over to the sum-of-products (canonical) format.

The problem is not with the algorithms but with the format itself:

the best approximation may not even exist!

Bad example

$$f(x_1, \ldots, x_d) = x_1 + \ldots + x_d.$$

Consider the auxiliary function

$$g(t) = (1 + x_1 t) (1 + x_2 t) \ldots (1 + x_d t).$$

Then

$$f = g'(0) = \underbrace{\frac{g(h) - g(0)}{h}}_{\mathrm{rank~2}} + \mathcal{O}(h),$$

so $f$ can be approximated by rank-$2$ tensors with arbitrary accuracy, but no exact rank-$2$ representation exists!

New tensor formats

This motivated the development of new tensor formats that are as stable as the singular value decomposition.

The Tensor Train (TT) format is the one I use the most; it is simple and efficient:

$$A(i_1, \ldots, i_d) \approx G_1(i_1) \ldots G_d(i_d),$$

where $G_k(i_k)$ has size $r_{k-1} \times r_k$, $r_0 = r_d = 1$.
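
A minimal numpy sketch (random cores, illustrative sizes; tt_entry is a toy helper) of how a single entry is evaluated as a product of small matrices:

import numpy as np

d, n = 8, 10
ranks = [1, 3, 3, 3, 3, 3, 3, 3, 1]                 # r_0, ..., r_d with r_0 = r_d = 1
cores = [np.random.rand(ranks[k], n, ranks[k + 1])  # core G_k stored as an r_{k-1} x n x r_k array
         for k in range(d)]

def tt_entry(cores, idx):
    # A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d)
    G = cores[0][:, idx[0], :]                      # 1 x r_1 matrix
    for core, ik in zip(cores[1:], idx[1:]):
        G = G.dot(core[:, ik, :])                   # multiply by the r_{k-1} x r_k slice
    return G[0, 0]

print('A(0, 1, ..., 7) =', tt_entry(cores, list(range(d))))
print('Parameters in the TT-format:', sum(c.size for c in cores))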

TT-format in a nutshell

  • If there is a canonical representation with rank $R$, then $r_k \leq R$
  • TT-ranks are computable via SVDs of auxiliary (unfolding) matrices -- see the sketch below
  • Basic linear algebra operations can be performed directly in the TT-format
  • Optimization problems can be solved with the solution kept in the TT-format
  • There is an exact interpolation formula with $\mathcal{O}(dnr^2)$ parameters (the TT-cross method)
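
Below is a rough numpy sketch of the TT-SVD idea (successive SVDs of the unfolding matrices); tt_svd is an illustrative toy, not the implementation from the toolboxes:

import numpy as np

def tt_svd(A, tol=1e-10):
    # compress a full d-dimensional array into TT-cores by successive SVDs of unfoldings
    shape = A.shape
    d = len(shape)
    C = A.copy()
    cores, r_prev = [], 1
    for k in range(d - 1):
        C = C.reshape(r_prev * shape[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > tol * s[0])))      # rank revealed by the singular values
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        C = s[:r, None] * Vt[:r, :]
        r_prev = r
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

# the tensor A(i_1, ..., i_d) = i_1 + ... + i_d has TT-ranks 2
n, d = 10, 6
A = np.indices([n] * d).sum(axis=0).astype(float)
cores = tt_svd(A)
print('TT-ranks:', [c.shape[0] for c in cores] + [1])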

Packages

We have two basic packages: the TT-Toolbox (MATLAB) and ttpy (Python).

Many basic and advanced algorithms are implemented in them and can be used as building blocks.

Demo on the cross approximation of tensors


In [27]:
import numpy as np
import tt
from tt.cross import rect_cross

d = 20                        # mode size
n = 10                        # number of dimensions
x0 = tt.rand(d, n, 3)         # random TT tensor used as the initial guess
h = 1e-2
def myfun1(x):
    # element-wise function of the multi-index: f(i_1, ..., i_n) = i_1 + ... + i_n
    return np.sum(x, axis=1)
    #return np.sqrt((((h * x) ** 2).sum(axis=1)))

x1 = rect_cross(myfun1, x0)   # TT-cross: builds the approximation from adaptively sampled entries
x1 = x1.round(1e-8)           # truncate the TT-ranks
print(x1)


swp: 0/9 er_rel = 2.1e+00 er_abs = 6.4e+08 erank = 7.0 fun_eval: 8400
swp: 1/9 er_rel = 7.6e-16 er_abs = 2.3e-07 erank = 11.7 fun_eval: 36120
This is a 10-dimensional tensor 
r(0)=1, n(0)=20 
r(1)=2, n(1)=20 
r(2)=2, n(2)=20 
r(3)=2, n(3)=20 
r(4)=2, n(4)=20 
r(5)=2, n(5)=20 
r(6)=2, n(6)=20 
r(7)=2, n(7)=20 
r(8)=2, n(8)=20 
r(9)=2, n(9)=20 
r(10)=1 

Advanced methods

General strategy: formulate the problem as a minimization problem

$$ F(X) \rightarrow \min, \quad X \in \mathcal{M}. $$
  • Linear systems: $F(x) = (Ax, x) - 2(f, x)$, where $x$ can be mapped to a tensor
  • Eigenvalue problems: $F(x) = (Ax, x)/(x, x)$
  • Dynamical problems: $F(x) = \Vert \frac{dx}{dt} - \frac{dA}{dt} \Vert$

Minimization techniques

  • Idea 1: Use an ordinary optimization method plus projection: $X_{k+1} = P(X_k + G(X_k))$
  • Idea 2: Use alternating least squares (exploiting the polylinear structure) -- see the matrix sketch below
  • Best: combine 1+2 (the AMEn method). Instead of optimizing $A \approx UV^{\top}$, we enrich the basis with new elements.

All of these are implemented in the Toolboxes (both Python and MATLAB).
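
To illustrate Idea 2 in the simplest (matrix) case, here is a toy numpy sketch of alternating least squares for $A \approx UV^{\top}$; als_lowrank and all the sizes are illustrative:

import numpy as np

def als_lowrank(A, r, n_iter=20):
    # fit A ~ U V^T by alternating least squares: each subproblem is linear
    n, m = A.shape
    V = np.random.rand(m, r)
    for _ in range(n_iter):
        U = np.linalg.lstsq(V, A.T, rcond=None)[0].T   # fix V, solve for U
        V = np.linalg.lstsq(U, A, rcond=None)[0].T     # fix U, solve for V
    return U, V

A = np.random.rand(500, 6).dot(np.random.rand(6, 400))  # an exactly rank-6 matrix
U, V = als_lowrank(A, r=6)
print('Relative ALS error:', np.linalg.norm(U.dot(V.T) - A) / np.linalg.norm(A))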

Linear system demo


In [28]:
import numpy as np
import tt
from tt.amen import amen_solve

d = 8
f = 4
mat = tt.qlaplace_dd([d] * f)              # discrete Laplace operator in the QTT-format
rhs = tt.ones(2, d * f)                    # all-ones right-hand side
sol = amen_solve(mat, rhs, rhs, 1e-6)      # AMEn solver: initial guess rhs, tolerance 1e-6
print('Error:', (tt.matvec(mat, sol) - rhs).norm() / rhs.norm())
print('Efficient rank:', sol.erank)


Error: 5.96702346748e-07
Efficient rank: 35.2790975909

Recent results

  • Rectangular maximum-volume principle (A. Mikhalev): selecting representative columns/rows
  • Dynamical low-rank approximation

Rectangular maximum volume

The maximum-volume principle for square submatrices:

If $A$ is $n \times r$ and $\widehat{A}$ is the $r \times r$ submatrix with the largest volume (modulus of the determinant), then

$$A = C \widehat{A},$$

where $|C_{ij}| \leq 1$.
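
A toy numpy sketch of the greedy maxvol idea (keep swapping in the row with the largest coefficient); maxvol_greedy is an illustrative re-implementation, not the rect_maxvol package:

import numpy as np

def maxvol_greedy(A, tol=1.05, max_iter=100):
    # search for an r x r submatrix of (quasi-)maximal volume in an n x r matrix
    n, r = A.shape
    piv = np.arange(r)                          # start from the first r rows (any nonsingular choice works)
    for _ in range(max_iter):
        C = A.dot(np.linalg.inv(A[piv, :]))     # coefficient matrix, C[piv, :] = identity
        i, j = np.unravel_index(np.argmax(np.abs(C)), C.shape)
        if np.abs(C[i, j]) <= tol:              # all coefficients are <= 1 (up to tol): done
            break
        piv[j] = i                              # the swap increases the volume by |C[i, j]|
    return piv, C

A = np.random.rand(1000, 10)
piv, C = maxvol_greedy(A)
print('Selected rows:', np.sort(piv))
print('max |C_ij| =', np.abs(C).max())          # at most tol, as the principle promises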

Rectangular maximum volume (2)

The maximum-volume principle for rectangular submatrices:

If $A$ is $n \times r$ and $\widehat{A}$ is a $K \times r$ submatrix, $K \geq r$, with the largest $2$-volume, then

$$A = C \widehat{A},$$

where $\Vert C_i \Vert_2 \leq \sqrt{\frac{r}{K + r - 1}}$.

Volume of rectangular matrices

We define the $p$-volume of a $K \times r$ matrix $A$ with rows $a_i$, $i = 1, \ldots, K$, as the volume of the set

$$G = \{ x: x = \sum_i c_i a_i, \quad \Vert c \Vert_p \leq 1 \}$$

The 2-volume:

$$\mathrm{vol}(G) = \gamma \sqrt{\det (A^* A)},$$ where $\gamma$ is the volume of the $r$-dimensional unit ball.

The 2-volume can be efficiently optimized, leading to the rectmaxvol algorithm.

Typically, $K \approx 2r$.

https://bitbucket.org/muxas/rect_maxvol (Alexander Mikhalev)

Applications of rectmaxvol

  • TT-cross approximation (previously one had to select exactly $r$ rows from an $rn \times r$ matrix -- no rank adaptation)
  • Recommender systems
  • Preconditioning of linear least squares
  • Finding maximal elements in low-rank matrices

Solving dynamical problems

Besides solving optimization (rank-constrained) problems, we can also solve dynamical problems, i.e.,

efficiently recompute decompositions when the matrix/tensor changes.

Concept of dynamical low-rank approximation

We have a time-varying matrix $A(t)$ (time can be discrete!) and we want to do low-rank approximation of it:

$$A(t) \approx X(t) = U(t) S(t) V^{\top}(t).$$

The naive method (an SVD projection at every time step) ignores the time dependence.

Dirac-Frenkel variational principle

In quantum physics, the optimal approximation has for many years been defined via the Dirac-Frenkel variational principle:

$$\Vert \frac{dA}{dt} - \frac{dX}{dt} \Vert = \min,$$

or

$$ \langle \frac{dA}{dt} - \frac{dX}{dt}, Z \rangle = 0 $$

for all $Z$ in the tangent space to the manifold $\mathcal{M}$ at $X$.

Equations of motion in 2D case

(Lubich, Koch, H.-D. Meyer,...)

$$\begin{split} &\frac{dU}{dt} S = \frac{dA}{dt} V,\\ &\frac{dV}{dt} S^{\top} = \frac{dA^{\top}}{dt} U, \\ &\frac{dS}{dt} = U^{\top} \frac{dA}{dt} V. \end{split}$$

Suppose you start from rank $1$ and evolve into a rank-$10$ solution: the matrix $S$ will be singular!

Projector-splitting

$$\frac{dX}{dt} = P_X(\frac{dA}{dt}), $$

where $$ P_X(Z) = Z - (I - UU^{\top}) Z (I - VV^{\top}). $$
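
Expanding the product, $P_X(Z) = UU^{\top} Z + Z VV^{\top} - UU^{\top} Z VV^{\top}$; a tiny numpy check with illustrative sizes:

import numpy as np

n, m, r = 40, 30, 5
U, _ = np.linalg.qr(np.random.rand(n, r))      # orthonormal factors of X = U S V^T
V, _ = np.linalg.qr(np.random.rand(m, r))
Z = np.random.rand(n, m)

Pu, Pv = U.dot(U.T), V.dot(V.T)
P1 = Z - (np.eye(n) - Pu).dot(Z).dot(np.eye(m) - Pv)   # the formula above
P2 = Pu.dot(Z) + Z.dot(Pv) - Pu.dot(Z).dot(Pv)         # the expanded form
print('Difference:', np.linalg.norm(P1 - P2))          # should be at machine-precision level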

Splitting integrator

The final splitting steps are

  • Step "L": $$\begin{split}L_0 = V_0 S^{\top}_0 \ L_1 = L_0 + (A_1 - A_0)^{\top} U_0 \
     [V_1, S^{\top}_1] = QR(L_1)\end{split}$$
  • Step "K": $$\begin{split} K_0 = U_0 S^{\top}_0 \ K_1 = K_0 + (A_1 - A_0) V_0 \

     [U_1, S_1] = QR(K_1)\end{split}$$
  • Step "L": $$S_1 = S_0 - U^{\top}_0 (A_1 - A_0) V_0$$ (note the minus!) The order of the steps matters

Can be used for updating!
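
Here is a minimal numpy sketch of one such update, written here with the K-step first (any consistent ordering of the three substeps gives a splitting integrator); ksl_step and the test sizes are illustrative:

import numpy as np

def ksl_step(U0, S0, V0, dA):
    # one projector-splitting update of X = U S V^T after the matrix changes by dA = A1 - A0
    K1 = U0.dot(S0) + dA.dot(V0)                # K-step: update the left factor
    U1, S_hat = np.linalg.qr(K1)
    S_tilde = S_hat - U1.T.dot(dA).dot(V0)      # S-step: note the minus sign
    L1 = V0.dot(S_tilde.T) + dA.T.dot(U1)       # L-step: update the right factor
    V1, S1t = np.linalg.qr(L1)
    return U1, S1t.T, V1

# a rank-r matrix and a rank-r perturbation of it (so the update can be exact)
n, m, r = 300, 200, 5
X, Y = np.random.rand(n, r), np.random.rand(r, m)
A0 = X.dot(Y)
A1 = (X + 0.1 * np.random.rand(n, r)).dot(Y + 0.1 * np.random.rand(r, m))

u, s, vt = np.linalg.svd(A0, full_matrices=False)
U0, S0, V0 = u[:, :r], np.diag(s[:r]), vt[:r, :].T

U1, S1, V1 = ksl_step(U0, S0, V0, A1 - A0)
print('Update error:', np.linalg.norm(U1.dot(S1).dot(V1.T) - A1))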

Application to convolutional neural networks

Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition. Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, Victor Lempitsky, arXiv:1412.6553

We used tensor decompositions to speed up CNNs by up to a factor of 10. A convolutional layer computes \begin{equation} V(x,y,t) = \sum_{i=x-\delta}^{x+\delta} \; \sum_{j=y-\delta}^{y+\delta} \; \sum_{s=1}^S K(i,j,s,t)\, U(i,j,s). \end{equation}

We approximate $K$ by a CP-decomposition and get an excellent initial condition for the fine-tuning!
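
A back-of-the-envelope sketch of the parameter savings (the layer sizes below are made up for illustration and are not the exact ones from the paper):

# parameter count of a convolutional kernel before and after a rank-R CP decomposition
d, S, T, R = 5, 96, 256, 64                     # kernel size, input/output channels, CP rank (illustrative)

full_params = d * d * S * T                     # the full 4D kernel K(i, j, s, t)
cp_params = R * (d + d + S + T)                 # the four CP factor matrices
print('Full kernel parameters:', full_params)
print('CP-format parameters:  ', cp_params)
print('Compression factor:     %.1f' % (full_params / float(cp_params)))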

CNN: results

36-character CNN: 1% accuracy drop and an 8.5x speedup

AlexNet: the second convolution layer is sped up by a factor of 4, with a 1% increase in top-5 error

Conclusions

  • There are several SVD-based tensor formats
  • We can solve rank-constrained optimization/dynamical/interpolation problems

Questions?
