In [1]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'sky',
              'transition': 'zoom',
              'start_slideshow_at': 'selected',
})


Out[1]:
{u'start_slideshow_at': 'selected', u'theme': 'sky', u'transition': 'zoom'}

Low-rank approximation: new application areas?

Ivan Oseledets, Skolkovo Institute of Science and Technology
oseledets.github.io, i.oseledets@skoltech.ru

Talk: http://goo.gl/mTewpH

Skoltech

  • Founded in 2011 as a new Western-style university near Moscow http://faculty.skoltech.ru
  • In collaboration with MIT
  • No departments: priority areas "IT, Bio, Space and Nuclear"
  • At the moment, 160 master students, 30 professors, 40 postdocs, 50 PhD students

MMMA-2015

On August 23-28, 2015 we hold the 4th Moscow Conference on Matrix Methods in Mathematics and Applications.

Confirmed speakers: C. Schwab, P. Benner, J. White, D. Zorin, P.-A. Absil, A. Cichocki, P. Van Dooren.

A good time to visit Moscow (due to the exchange rate drop from 35 to 55 roubles per dollar).

http://matrix.inm.ras.ru/mmma-2015/


In [2]:
%reload_ext tikzmagic

Talk plan

The main goal of this talk is to outline the difficulties that arise when low-rank approximations are used in applications.

  • Why non-quadratic functionals can be important
  • Examples

Low-rank approximation of matrices

Given a matrix $A$, its low-rank approximation can be parametrized by the product $UV^{\top}$:

$$A \approx UV^{\top}, $$

where $U$ is $n \times r$ and $V$ is $m \times r$.

Computing low-rank factorization

The best rank-$r$ approximation problem

$$\min_{\mathrm{rank}(A_r)=r}\Vert A - A_r \Vert_2 = \sigma_{r+1}$$

can be solved by the singular value decomposition (SVD),

or by sampling and cross approximation.
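
As a minimal numpy sketch of the SVD route (a toy example of my own): the truncated SVD gives the best rank-$r$ approximation, and the error equals $\sigma_{r+1}$.

import numpy as np

np.random.seed(0)
n, m, r = 50, 40, 5
A = np.random.randn(n, m)

# truncated SVD: keep the r leading singular triplets
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r].dot(np.diag(s[:r])).dot(Vt[:r, :])

# the approximation error in the 2-norm equals sigma_{r+1}
print(np.linalg.norm(A - A_r, 2), s[r])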

Applications of low-rank approximation

  • Integral equations (off-diagonal blocks are low-rank)
  • Fast direct sparse solvers (see the poster by Daria Sushnikova)
  • Numerous machine learning/data mining tasks (Latent Semantic Analysis, collaborative filtering)

Do we assume there is a good approximation (i.e. that $\sigma_{r+1}$ is small)?

What to do if $\sigma_{r+1}$ is not small?

Function approximation

In function approximation, low-rank approximation is equivalent to separation of variables:

$$f(x_1, \ldots, x_d) \approx \sum_{\alpha=1}^r U_1(x_1, \alpha) \ldots U_d(x_d, \alpha),$$

It is often better to use stable tensor formats (TT-format, HT-format).

The main problem is to select the right variables that can be separated.
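
A tiny numpy illustration (my own example): for $f(x, y) = \sin(x + y) = \sin x \cos y + \cos x \sin y$ the variables separate with rank $2$, and the discretized matrix shows exactly two non-negligible singular values.

import numpy as np

# discretize f(x, y) = sin(x + y) on a uniform grid
x = np.linspace(0, 1, 200)
y = np.linspace(0, 1, 200)
A = np.sin(x[:, None] + y[None, :])

s = np.linalg.svd(A, compute_uv=False)
print(s[:4] / s[0])   # two O(1) values, then machine-precision noise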

Example

Consider a recommender system: the user-product matrix $A$ contains only ones and zeros.

For the low-rank approximation, do we really need a least-squares fit, when all we care about is bought/not bought?

Of course not (we optimize what we can optimize, not what we need to optimize).

Now, the simplest example possible

Identity matrix

$$ A = \begin{bmatrix} 1& 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1& 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} $$
What is the rank of this matrix?

Identity matrix(2)

The rank is $5$ (the identity matrix has full rank). But if we care only about reproducing the $0$-$1$ pattern, we can use the sign rank, defined as follows for a $0$-$1$ matrix $Y$ (Bouchard et al.):

$$\mathrm{rank}_{\pm}(Y) = \min_{\tau \in \mathbb{R},\, X \in \mathbb{R}^{n \times m}} \{\mathrm{rank}(X) : Y_{ij} = [X_{ij} + \tau > 0] \}$$

What is the sign rank of the identity matrix?

It is equal to $2$
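
A quick numerical check (my own construction, not from the slides): the rank-$2$ matrix $X_{ij} = \cos(t_i - t_j) = \cos t_i \cos t_j + \sin t_i \sin t_j$ with distinct angles $t_i$ and a suitable threshold $\tau$ reproduces the identity pattern.

import numpy as np

n = 5
t = np.linspace(0, np.pi / 2, n)          # distinct angles
X = np.cos(t[:, None] - t[None, :])       # rank-2: cos(t_i - t_j)
off_max = X[~np.eye(n, dtype=bool)].max() # largest off-diagonal entry (< 1)
tau = -(1 + off_max) / 2                  # threshold between 1 and off_max

print(np.linalg.matrix_rank(X))           # 2
print((X + tau > 0).astype(int))          # the 5 x 5 identity pattern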

General principle

This means that you have to use a completely different functional (not quadratic, not plain matrix completion) in the minimization problem.

For the sign rank, a good surrogate is the logistic loss

$$l(x, y) = \log(1 + e^{-(2y - 1)x}), \qquad F(X, Y) = \sum_{ij} l(X_{ij}, Y_{ij})$$

Then we minimize $F(X, Y)$ given $Y$ subject to low-rank constraints on $X$.

Non-quadratic functional -- forget about alternating least squares

Riemannian optimization

One of the most important properties of low-rank formats is the structure of the tangent space

Given $A = U S V^{\top}$ with orthonormal $U$ and $V$, the tangent space consists of perturbations of the form

$$Z = U S V^{\top} + \delta U S V^{\top} + U \delta S V^{\top} + U S \delta V^{\top},$$

subject to $$\delta U^{\top} U + U^{\top} \delta U = \delta V^{\top} V + V^{\top} \delta V = 0$$

The projector onto the tangent space is given by

$$P_T(Z) = Z - (I - UU^{\top}) Z (I - VV^{\top})$$

We can write down the projector for TT/HT formats as well!

Given the projector, you can solve optimization problems on the manifold!
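
A small numpy sketch of this projector (toy sizes of my choosing, orthonormal $U$ and $V$): it is idempotent and its output has rank at most $2r$.

import numpy as np

np.random.seed(0)
n, m, r = 8, 6, 2
U = np.linalg.qr(np.random.randn(n, r))[0]
V = np.linalg.qr(np.random.randn(m, r))[0]

def proj_tangent(Z, U, V):
    # P_T(Z) = Z - (I - U U^T) Z (I - V V^T)
    W = Z - U.dot(U.T.dot(Z))        # (I - U U^T) Z
    W = W - W.dot(V).dot(V.T)        # (I - U U^T) Z (I - V V^T)
    return Z - W

Z = np.random.randn(n, m)
P = proj_tangent(Z, U, V)
print(np.linalg.norm(proj_tangent(P, U, V) - P))   # ~ 0: idempotent
print(np.linalg.matrix_rank(P))                    # at most 2r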

Optimization methods on manifolds

For the low-rank manifold (matrix, TT, HT) we can efficiently compute the projection to the tangent space.

The simplest method is to project the gradient onto the tangent space and take a step along it:

$$x_{k+1} = R(x_k - \alpha P_{\mathcal{T}}(\nabla F)), $$

where $\alpha$ is a step size and $R$ is the mapping from the tangent space back to the manifold, called a retraction.

Example for sign-rank

The gradient of

$$F(X) = \sum_{ij} l(X_{ij}, Y_{ij}) $$

is given by

$$\frac{\partial F(X)}{\partial X_{kl}} = \frac{\partial l}{\partial x}(X_{kl}, Y_{kl}) = -\frac{2 Y_{kl} - 1}{1 + e^{(2Y_{kl} - 1) X_{kl}}} $$

In [13]:
import numpy as np

n = 3
Y = np.eye(n)        # the 0-1 pattern we want to reproduce (identity)
Z = 2 * Y - 1        # +1 on the diagonal, -1 off the diagonal

def compute_grad(X):
    return (-Z) / (1 + np.exp(-Z * X))

def compute_func(X):
    # logistic loss F(X) = sum_ij log(1 + exp(-Z_ij * X_ij))
    return np.sum(np.log(1 + np.exp(-Z * X)))

def project_tangent(F, U, V):
    return F - U.dot(U.T.dot(F)) - (F.dot(V)).dot(V.T)

# Initialization: random rank-r factors with orthonormal u and v
r = 2
u = np.random.randn(n, r)
v = np.random.randn(n, r)
u = np.linalg.qr(u)[0]
v = np.linalg.qr(v)[0]
s = np.random.randn(r, r)

In [16]:
# Run the projected-gradient iteration (fixed step size, no line search)
for _ in range(10000):
    alpha = -0.05
    X = u.dot(s).dot(v.T)
    F = compute_grad(X)
    F = project_tangent(F, u, v)
    F0 = compute_func(X)            # objective value before the step
    X = X + alpha * F
    # retraction: truncate back to rank r via the SVD
    u, s, v = np.linalg.svd(X)
    u = u[:, :r]
    s = np.diag(s[:r])
    v = v[:r, :].T
print(compute_func(X), np.linalg.norm(F))


111.573132258 0.0892293076862

In [17]:
np.sign(X)


Out[17]:
array([[-1.,  1.,  1.],
       [ 1., -1.,  1.],
       [ 1.,  1., -1.]])

Lyapunov equation

Another example of a non-quadratic functional comes from the Lyapunov equation (a recent paper by Kolesnikov and Oseledets): $$AX + X A^{\top} = y_0 y^{\top}_0$$

We look for the solution in the form

$$X \approx U Z U^{\top}$$

The functional (motivated by a seminal paper by Y. Saad) is

$$F(U) = \int_{0}^{\infty} \Vert y - \widehat{y} \Vert^2 dt,$$

where $y = e^{At} y_0, \widehat{y} = U e^{Bt} U^{\top} y_0, B = U^{\top} A U$

Lyapunov equation (2)

This functional can be computed explicitly:

$$ F(U) = \mathrm{tr}(X) - 2 \mathrm{tr}(U^{\top}(P - UZ)) - \mathrm{tr}(Z)$$

where $$B Z + Z B^{\top} = c_0 c^{\top}_0, \quad A P + P B^{\top} = y_0 c^{\top}_0, \quad c_0 = U^{\top} y_0$$
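
A small sketch (my own toy setup: a stable symmetric $A$, random $y_0$, random orthonormal $U$) that evaluates $F(U)$ by solving the three equations above with scipy's Sylvester solver.

import numpy as np
from scipy.linalg import solve_sylvester

np.random.seed(0)
n, r = 20, 3
R = np.random.randn(n, n)
A = -(R.dot(R.T) + n * np.eye(n))            # stable (negative definite) test matrix
y0 = np.random.randn(n, 1)
U = np.linalg.qr(np.random.randn(n, r))[0]   # orthonormal basis of the subspace

def F(U):
    B = U.T.dot(A).dot(U)
    c0 = U.T.dot(y0)
    X = solve_sylvester(A, A.T, y0.dot(y0.T))   # A X + X A^T = y0 y0^T
    Z = solve_sylvester(B, B.T, c0.dot(c0.T))   # B Z + Z B^T = c0 c0^T
    P = solve_sylvester(A, B.T, y0.dot(c0.T))   # A P + P B^T = y0 c0^T
    return np.trace(X) - 2 * np.trace(U.T.dot(P - U.dot(Z))) - np.trace(Z)

print(F(U))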

Lyapunov equation (3)

This leads to a rational Krylov-type method with a very simple shift generation procedure.

Summary

  • Use the functional you want to minimize, not the quadratic loss
  • Try Riemannian gradient descent first: it is simple to use and often converges well

Tensors

  • The most challenging problems involve tensors (curse of dimensionality, tricky optimization questions).

  • There are tensor formats that are matrix low-rank approximation manifolds in disguise

  • We can use efficient matrix techniques for working with them.

Tensor-train format

The simplest SVD-based format is the tensor-train format

$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d),$$

i.e. a product of matrices, each depending on only one index; $G_k(i_k)$ is $r_{k-1} \times r_k$ and $r_0 = r_d = 1$.
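
A minimal sketch (my own helper, not the TT-Toolbox API) of how a single entry is evaluated from the cores.

import numpy as np

def tt_entry(cores, index):
    # cores[k] has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1;
    # the entry is the product of the matrices G_k(i_k) = cores[k][:, i_k, :]
    res = cores[0][:, index[0], :]
    for G, i in zip(cores[1:], index[1:]):
        res = res.dot(G[:, i, :])
    return res[0, 0]

# toy cores: mode sizes (2, 2, 2), TT-ranks (1, 2, 2, 1)
cores = [np.random.randn(1, 2, 2), np.random.randn(2, 2, 2), np.random.randn(2, 2, 1)]
print(tt_entry(cores, (0, 1, 1)))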

Canonical format

A popular choice in function approximation is the canonical or sum-of-products format

$$A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$

i.e. sum of separable functions.

Disadvantage: it is not a stable format: the best approximation may not exist, and even when it exists it may be hard to compute!

However, for a particular tensor $A$ it can be very efficient.
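
For comparison, a minimal sketch (again my own helper) of evaluating an entry of a canonical-format tensor.

import numpy as np

def cp_entry(factors, index):
    # factors[k] has shape (n_k, r); the entry is
    # sum_alpha U_1(i_1, alpha) * ... * U_d(i_d, alpha)
    prod = np.ones(factors[0].shape[1])
    for Uk, ik in zip(factors, index):
        prod = prod * Uk[ik, :]
    return prod.sum()

factors = [np.random.randn(4, 3) for _ in range(5)]   # d = 5, canonical rank r = 3
print(cp_entry(factors, (0, 1, 2, 3, 0)))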

Tensor train

The TT-format

$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d),$$

can be characterized by the following condition:

$$\mathrm{rank}(A_k) = r_k,$$

where $A_k = A(i_1 i_2 \ldots i_k; i_{k+1} \ldots i_d)$ is the k-th unfolding of the tensor.

I.e. it is the intersection of low-rank matrix manifolds!
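
A quick numpy check of this characterization (toy sizes of my choosing): build a tensor from random TT cores and verify that the unfolding ranks are bounded by the TT-ranks.

import numpy as np

np.random.seed(0)
# a 3 x 3 x 3 tensor with TT-ranks (1, 2, 2, 1)
cores = [np.random.randn(1, 3, 2), np.random.randn(2, 3, 2), np.random.randn(2, 3, 1)]
A = np.einsum('aib,bjc,ckd->ijk', cores[0], cores[1], cores[2])

for k in (1, 2):
    A_k = A.reshape(3 ** k, 3 ** (3 - k))    # k-th unfolding A(i_1...i_k; i_{k+1}...i_d)
    print(k, np.linalg.matrix_rank(A_k))     # bounded by r_k = 2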

Software

We have the TT-Toolbox, both in MATLAB (http://github.com/oseledets/TT-Toolbox) and in Python (http://github.com/oseledets/ttpy), which provides:

  • Computing the TT representation (i.e. checking whether such a representation exists); a pure-numpy TT-SVD sketch is given after this list
  • Performing basic operations
  • Adaptive sampling algorithms (cross approximation)
  • Optimization algorithms
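
As a sketch of the first bullet, here is a pure-numpy version of the TT-SVD (a simplified sketch of my own; the toolboxes implement this much more carefully).

import numpy as np

def tt_svd(A, eps=1e-10):
    # sequential truncated SVDs of the unfoldings give the TT cores
    shape = A.shape
    d = len(shape)
    cores, r, C = [], 1, A
    for k in range(d - 1):
        C = C.reshape(r * shape[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        rk = max(1, int(np.sum(s > eps * s[0])))      # truncation rank
        cores.append(U[:, :rk].reshape(r, shape[k], rk))
        C = np.diag(s[:rk]).dot(Vt[:rk, :])
        r = rk
    cores.append(C.reshape(r, shape[-1], 1))
    return cores

# example: a 4 x 4 x 4 tensor with exact TT-ranks (1, 2, 2, 1)
x = np.arange(4, dtype=float)
A = np.sin(x[:, None, None] + x[None, :, None] + x[None, None, :])
print([G.shape for G in tt_svd(A)])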

Tensor formats in machine learning

  • Define an undirected graph $\mathcal{G}$ with nodes corresponding to the variables (pixels of the image).
  • Define some positive functions $\Psi_c(T_c; X, W)$ (the factors of a Markov random field, MRF) on the cliques of the graph $\mathcal{G}$.
  • The model is then defined as follows: $$ p(T | X, W) = \frac{1}{Z(X, W)} \prod\limits_{c \in \mathcal{C}} \Psi_c(T_c; X, W), $$ where $Z(X, W)$ is the normalization constant.

Representation as a product of low TT-rank tensors

Then each factor depends only on a few variables, so its TT-ranks are bounded.

Thus, the tensor is a product of low-rank tensors

$$ A = A_1 \circ A_2 \circ \ldots \circ A_p, $$

where $\circ$ is the elementwise (Hadamard) product and $A_k(i_1, \ldots, i_d) = A_{k1}(i_1) \ldots A_{kd}(i_d)$.

Then, we use the standard trick and obtain $$\sum_{i_1, \ldots, i_d} A(i_1, \ldots, i_d) = \Big(\sum_{i_1} A_{11}(i_1) \otimes A_{21}(i_1) \otimes \ldots \otimes A_{p1}(i_1)\Big)\ldots \Big(\sum_{i_d} A_{1d}(i_d) \otimes A_{2d}(i_d) \otimes \ldots \otimes A_{pd}(i_d)\Big),$$ i.e. the sum is now a product of Kronecker-structured matrices of low TT-rank, so we multiply them step by step and compress.

Computing marginals

Putting MRFs on a Tensor Train, A. Novikov, A. Rodomanov, A. Osokin, D. Vetrov, ICML-2014:

$$\widehat{\mathbf{P}}(\vec{x}) = \prod_{i=1}^n \exp \left ( -\frac{1}{T} h_i x_i \right ) \prod_{(i, \, j) \in \mathcal{E}} \exp \left (-\frac{1}{T}c_{ij} x_i x_j \right )$$

Spin glass models: $T = 1$, $c_{ij} \sim U[-f, f]$.
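
To see the structure concretely, a toy check of my own (a 1-D chain instead of a 2-D grid, $d = 8$, spins $x_i \in \{-1, +1\}$): the sum over all $2^d$ configurations collapses into a product of $2 \times 2$ transfer matrices, which is the same trick in its simplest form.

import numpy as np
from itertools import product

np.random.seed(0)
d, T, f = 8, 1.0, 1.0
h = np.random.uniform(-f, f, d)
c = np.random.uniform(-f, f, d - 1)          # couplings on a chain graph
spins = np.array([-1.0, 1.0])

def weight(x):
    e = np.dot(h, x) + np.dot(c, x[:-1] * x[1:])
    return np.exp(-e / T)

# brute force over all 2^d configurations
Z_brute = sum(weight(np.array(x)) for x in product(spins, repeat=d))

# the same sum via sequential contraction with 2 x 2 transfer matrices
msg = np.exp(-h[0] * spins / T)
for k in range(1, d):
    M = np.exp(-(np.outer(spins, spins) * c[k - 1] + h[k] * spins[None, :]) / T)
    msg = msg.dot(M)
print(Z_brute, msg.sum())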

Optimization over TT-manifold

Given $A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d)$ optimize over one core $G_k(i_k)$.

  • Good for quadratic functionals, and also you can parametrize

    $$\mathrm{vec}(A) = Q \mathrm{vec}(G_k),$$
    where $Q$ is orthonormal.

  • Bad for non-quadratic (frequent in machine learning!)

Therefore, Riemannian optimization techniques are needed.

Riemannian gradient descent

$$x_{k+1} = R(x_k - \alpha P_{\mathcal{T}}(\nabla F)), $$

The retraction step is easy, since the projected point has rank at most $2r$, so the retraction can be done by rounding.

The main problem is the computation of $$P_{\mathcal{T}}(\nabla F)$$ without exponential complexity.

If it is possible, this is the way to go.

Example: all-subsets regression

Consider the binary classification problem.

Logistic regression is the simplest choice: given the feature vector $x_i$,

we predict the probability to observe $y_i$ as

$$p = Z \exp(-y_i \langle w, x_i \rangle),$$

where $Z$ is a normalization constant.

I.e. the predictor variable is just the linear combination of the components of the feature vector,

and $w$ is the weight vector.

All-subsets regression

We can use other predictor variables, for example, products of features (subsets) like

$$w_1 x_1 + w_{125} x_1 x_2 x_5 + \ldots $$

We can encode all possible subsets in this form by a vector of length $2^d$, or a tensor of size $2 \times \ldots \times 2$

(i.e. each index tells whether $x_k$ enters the corresponding product or not).

The predictor variable is then $t = \langle W, X \rangle$, where $\langle \cdot, \cdot \rangle$ is the scalar product of tensors.

$W$ is the $2 \times 2 \times \ldots \times 2$ weight tensor.

We impose low-rank constraints on $W$ to avoid overfitting.
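
One common way to build $X_i$ from $x_i$ (an assumption on my part, consistent with the $2 \times \ldots \times 2$ shape of $W$): take the rank-1 tensor with factors $(1, x_k)$, so that $\langle W, X_i \rangle$ is exactly the sum over all subsets. A small brute-force check:

import numpy as np
from itertools import product

np.random.seed(0)
d = 4
x = np.random.randn(d)                      # feature vector
W = np.random.randn(*([2] * d))             # 2 x 2 x ... x 2 weight tensor

# rank-1 encoding: X = (1, x_1) o (1, x_2) o ... o (1, x_d)
X = np.ones(1)
for xk in x:
    X = np.multiply.outer(X, np.array([1.0, xk]))
X = X.reshape([2] * d)

t = np.sum(W * X)                           # <W, X>
# brute force: sum over all subsets S of W_S * prod_{k in S} x_k
t_check = sum(W[idx] * np.prod(x[np.array(idx, dtype=bool)])
              for idx in product((0, 1), repeat=d))
print(t, t_check)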

Optimization problem

The total loss is the sum of individual losses $$F(W) = \sum_{i=1}^K f(y_i, \langle X_i, W \rangle),$$

where $X_i$ is the low-rank tensor obtained from the feature vector $x_i$.

The gradient is easily computable:

$$\nabla F = \sum_{i=1}^K \frac{df}{dz}(y_i, \langle X_i, W \rangle) X_i,$$

and projection onto the tangent space can be computed in a fast way.

Preliminary results

On the Otto problem from Kaggle, the larger the rank, the better the learning.

You have to train fast
(a GPU implementation is necessary, as in deep learning).

Idea for classifier

We can go further and try to use the low-rank regularization as a constraint.

Deep Learning is an extremely popular approach for classification/regression problems.

Given a feature vector $x$, we construct a sequence of linear/non-linear layers

$$x \rightarrow C_1 \rightarrow C_2 \rightarrow \ldots \rightarrow p$$

Tensor Train as classifier

The expression

$$ x W_1(i_1) \ldots W_d(i_d) $$

($x$ is the feature vector) can be viewed as a sequence of "linear" layers.

Classification problem: setup

We are given feature vectors $x$ (high-dimensional, e.g. $256 \times 256$ images) and class labels $y$ that we want to predict.

Logistic regression uses a linear map (matrix) $W$ and computes $p = Wx$; after normalization, $\exp(p)$ gives the probabilities of the corresponding classes.

Multi-layer scheme

A multilayer scheme uses a family of matrices $W(i_1, \ldots, i_d)$ parametrized in the tensor-train format

$$W(i_1, \ldots, i_d) = W_1(i_1) \ldots W_d(i_d).$$

It is equivalent to $n^d$ linear classifiers, but with constraints (otherwise we would overfit).

This is a work in progress, but it seems to work.

Low-rank factorization as initialization

You can use low-rank factorization to initialize other optimization methods.

We have successfully sped up convolutional neural networks by factorizing the 4D convolution kernel tensor into the canonical format and then fine-tuning it.

Thanks to the wonderful TensorLab toolbox by L. De Lathauwer!

Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition, Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, Victor Lempitsky, ICLR 2015.

Publications and software


In [284]:
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()


Out[284]: