In [1]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
    'theme': 'sky',
    'transition': 'zoom',
    'start_slideshow_at': 'selected',
})
Out[1]:
On August 23-28, 2015, we are holding the 4th Moscow conference on Matrix Methods in Mathematics and Applications.
Confirmed speakers: C. Schwab, B. Benner, J. White, D. Zorin, P.-A. Absil, A. Cichocki, P. Van Dooren.
It is a good time to visit Moscow (e.g., the exchange rate has dropped from 35 roubles/dollar to 55 roubles/dollar).
http://matrix.inm.ras.ru/mmma-2015/
In [2]:
%reload_ext tikzmagic
Do we assume that there is a good approximation (i.e., that $\sigma_{r+1}$ is small)?
What do we do if $\sigma_{r+1}$ is not small?
In function approximation, low-rank approximation is equivalent to the separation of variables
$$f(x_1, \ldots, x_d) \approx \sum_{\alpha=1}^r U_1(x_1, \alpha) \ldots U_d(x_d, \alpha).$$
Often it is better to use stable tensor formats (TT-format, HT-format).
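As a simple illustration (my example, not from the slides), the separation is exact with $r = 2$ for
$$\sin(x_1 + x_2) = \sin x_1 \cos x_2 + \cos x_1 \sin x_2.$$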
The main problem is to select the right variables that can be separated.
Consider a recommender system: the user-product matrix $A$ contains only ones and zeros.
In the low-rank approximation, do we need a least-squares approximation (when all we care about is bought/not bought)?
Of course not (we optimize what we can optimize, not what we need to optimize).
Now, the simplest possible example.
The rank is $2$. But if we only care about the $0/1$ pattern, we can use the sign rank, which is defined as follows for a $0$-$1$ matrix $Y$ (see Bouchard et al.):
$$\mathrm{rank}_{\pm}(Y) = \min_{\tau \in \mathbb{R},\, X \in \mathbb{R}^{n \times m}} \{\mathrm{rank}(X):\; Y_{ij} = [X_{ij} + \tau > 0] \}.$$
What is the sign rank of the identity matrix?
It is equal to $2$
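A quick numerical check of this claim (a minimal sketch I added, not from the slides): one possible rank-$2$ construction takes $X_{ij} = \cos(\theta_i - \theta_j) = \cos\theta_i\cos\theta_j + \sin\theta_i\sin\theta_j$ with distinct angles $\theta_i$ and a suitable threshold $\tau$.
```python
import numpy as np

n = 5
theta = np.linspace(0.0, 1.0, n)                 # distinct angles in a small arc
# Rank-2 matrix: X_ij = cos(theta_i - theta_j) = cos t_i cos t_j + sin t_i sin t_j
X = np.cos(theta[:, None] - theta[None, :])
print(np.linalg.matrix_rank(X))                  # 2

# Diagonal entries equal 1, off-diagonal entries are strictly below 1,
# so a threshold between the two recovers the identity sign pattern.
c = np.max(X - 2 * np.eye(n))                    # largest off-diagonal entry
tau = -0.5 * (1.0 + c)                           # midpoint between -1 and -c
Y = (X + tau > 0).astype(int)
print(np.array_equal(Y, np.eye(n, dtype=int)))   # True: sign rank of the identity is at most 2
```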
This means that you have to use a completely different functional (not quadratic, not matrix completion) in the minimization problem.
For the sign rank, a good surrogate is the logistic loss
$$l(x, y) = \log(1 + e^{-(2y - 1)x}), \qquad F(X, Y) = \sum_{ij} l(x_{ij}, y_{ij}).$$
Then we minimize $F(X, Y)$ over $X$ for a given $Y$, subject to low-rank constraints on $X$.
With a non-quadratic functional, forget about alternating least squares.
One of the most important properties of low-rank formats is the structure of the tangent space
Given $A = U S V^{\top}$ with orthonormal $U$ and $V$, the tangent space is defined as
$$Z = U S V^{\top} + \delta U S V^{\top} + U \delta S V^{\top} + U S \delta V^{\top},$$
subject to
$$\delta U^{\top} U + U^{\top} \delta U = \delta V^{\top} V + V^{\top} \delta V = 0.$$
The projector onto the tangent space is given by
$$P_T(Z) = Z - (I - UU^{\top})\, Z\, (I - VV^{\top}).$$
We can write down the projector for TT/HT formats as well!
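A minimal NumPy sanity check (added here, not in the original notebook): this projector is idempotent, and its output always has rank at most $2r$.
```python
import numpy as np

n, r = 10, 3
# Orthonormal factors of a random rank-r point A = U S V^T
U = np.linalg.qr(np.random.randn(n, r))[0]
V = np.linalg.qr(np.random.randn(n, r))[0]

def P_T(Z, U, V):
    # P_T(Z) = Z - (I - U U^T) Z (I - V V^T)
    W = Z - U.dot(U.T.dot(Z))        # (I - U U^T) Z
    W = W - (W.dot(V)).dot(V.T)      # (I - U U^T) Z (I - V V^T)
    return Z - W

Z = np.random.randn(n, n)
PZ = P_T(Z, U, V)
print(np.allclose(P_T(PZ, U, V), PZ))        # idempotent: P_T(P_T(Z)) = P_T(Z)
print(np.linalg.matrix_rank(PZ) <= 2 * r)    # tangent vectors have rank at most 2r
```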
Given the projector, you can solve the problems!
For the low-rank manifold (matrix, TT, HT) we can efficiently compute the projection to the tangent space.
The simplest is to project the gradient onto the tangent space:
$$x_{k+1} = R(x_k + P_{\mathcal{T}}(\nabla F)),$$
where $R$ is the mapping from the tangent space to the manifold, called a retraction.
In [13]:
import numpy as np
n = 3
Y = np.eye(n)        # 0-1 matrix whose sign pattern we want to reproduce
Z = 2 * Y - 1        # +1 on the diagonal, -1 off the diagonal
def compute_grad(X):
    # Elementwise gradient of the logistic loss: dF/dX_ij = -Z_ij / (1 + exp(Z_ij * X_ij))
    return (-Z) / (1 + np.exp(Z * X))
def compute_func(X):
    # F(X) = sum_ij log(1 + exp(-Z_ij * X_ij))
    return np.sum(np.log(1 + np.exp(-Z * X)))
def project_tangent(F, U, V):
    # P_T(F) = F - (I - U U^T) F (I - V V^T)
    W = F - U.dot(U.T.dot(F))
    W = W - (W.dot(V)).dot(V.T)
    return F - W
# Initialization: a random rank-r point X = u s v^T with orthonormal u and v
r = 2
u = np.linalg.qr(np.random.randn(n, r))[0]
v = np.linalg.qr(np.random.randn(n, r))[0]
s = np.random.randn(r, r)
In [16]:
# Run Riemannian gradient descent (no elaborate step-size selection needed)
alpha = -0.05
for _ in range(10000):
    X = u.dot(s).dot(v.T)            # current rank-r iterate
    F = compute_grad(X)              # Euclidean gradient
    F = project_tangent(F, u, v)     # project it onto the tangent space
    X = X + alpha * F                # gradient step (alpha < 0, so this is descent)
    # Retraction: truncate back to rank r via the SVD
    u, s, v = np.linalg.svd(X)
    u = u[:, :r]
    s = np.diag(s[:r])
    v = v[:r, :].T
print(compute_func(X), np.linalg.norm(F))
In [17]:
np.sign(X)
Out[17]:
Another example of a non-quadratic functional comes from the Lyapunov equation (a recent paper by Kolesnikov and Oseledets):
$$AX + X A^{\top} = y_0 y^{\top}_0.$$
We look for the solution in the form
$$X \approx U Z U^{\top}.$$
The functional (motivated by a seminal paper by Y. Saad) is
$$F(U) = \int_{0}^{\infty} \Vert y - \widehat{y} \Vert^2\, dt,$$
where $y = e^{At} y_0$, $\widehat{y} = U e^{Bt} U^{\top} y_0$, $B = U^{\top} A U$.
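For a stable $A$ this integral can be evaluated exactly through Gramians: $\int \Vert y \Vert^2 dt$, $\int \Vert \widehat{y} \Vert^2 dt$ and the cross term each come from a small Lyapunov or Sylvester equation. Below is a minimal sketch of such an evaluation (my own illustration, not the algorithm of the paper; the name `lyapunov_objective` and the test data are made up).
```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_sylvester

def lyapunov_objective(A, U, y0):
    """F(U) = int_0^inf || exp(At) y0 - U exp(Bt) U^T y0 ||^2 dt, with B = U^T A U."""
    n, r = U.shape
    B = U.T.dot(A).dot(U)
    z0 = U.T.dot(y0)
    # G_A = int exp(A^T t) exp(A t) dt solves the Lyapunov equation A^T G + G A = -I
    G_A = solve_continuous_lyapunov(A.T, -np.eye(n))
    # G_B = int exp(B^T t) exp(B t) dt solves B^T G + G B = -I
    G_B = solve_continuous_lyapunov(B.T, -np.eye(r))
    # Cross term: Zc = int exp(A^T t) U exp(B t) dt solves the Sylvester equation A^T Zc + Zc B = -U
    Zc = solve_sylvester(A.T, B, -U)
    return y0.dot(G_A).dot(y0) - 2 * y0.dot(Zc).dot(z0) + z0.dot(G_B).dot(z0)

# Illustrative data: A has a negative-definite symmetric part, so that
# both A and B = U^T A U are stable and all the integrals above converge.
n, r = 20, 4
M, K = np.random.randn(n, n), np.random.randn(n, n)
A = -(M.dot(M.T) + np.eye(n)) + (K - K.T)
y0 = np.random.randn(n)
U = np.linalg.qr(np.random.randn(n, r))[0]
print(lyapunov_objective(A, U, y0) >= 0)   # an integral of a squared norm is nonnegative
```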
A popular choice in function approximation is the canonical or sum-of-products format
$$A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$
i.e., a sum of separable functions.
Disadvantage: it is not a stable format: the best approximation may not exist, and even when it does exist, it may be hard to compute!
However, for a particular tensor $A$ it can be very efficient.
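As a small illustration (a sketch I added, not from the slides), here is how a canonical representation stores and evaluates a tensor in NumPy; the sizes are illustrative.
```python
import numpy as np

d, n, r = 4, 5, 3
# Canonical factors U_k, each of size n x r
factors = [np.random.randn(n, r) for _ in range(d)]

# Full tensor (for d = 4): A(i1,...,i4) = sum_alpha U_1(i1,alpha) * ... * U_4(i4,alpha)
A = np.einsum('ia,ja,ka,la->ijkl', *factors)

# A single entry costs only O(d r) operations, without forming the full tensor
idx = (1, 2, 3, 0)
entry = np.sum(np.prod([U[i, :] for U, i in zip(factors, idx)], axis=0))
print(np.allclose(entry, A[idx]))   # True
```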
The TT-format
$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d)$$
can be characterized by the following condition:
$$\mathrm{rank}(A_k) = r_k,$$
where $A_k = A(i_1 i_2 \ldots i_k;\, i_{k+1} \ldots i_d)$ is the $k$-th unfolding of the tensor.
I.e., it is the intersection of low-rank matrix manifolds!
We have a TT-Toolbox, both in MATLAB http://github.com/oseledets/TT-Toolbox and in Python http://github.com/oseledets/ttpy
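Here is a pure-NumPy sketch (mine, not using the toolboxes) that builds a tensor from random TT cores with illustrative sizes and checks the unfolding-rank characterization above.
```python
import numpy as np

d, n = 4, 3
ranks = [1, 2, 2, 2, 1]                       # r_0 = r_d = 1 by convention
# TT cores G_k of size r_{k-1} x n x r_k
cores = [np.random.randn(ranks[k], n, ranks[k + 1]) for k in range(d)]

# Assemble the full tensor: A(i1,...,id) = G_1(i1) G_2(i2) ... G_d(id)
A = cores[0]
for G in cores[1:]:
    A = np.tensordot(A, G, axes=([-1], [0]))  # contract the shared rank index
A = A.reshape((n,) * d)

# Check: the k-th unfolding A_k = A(i1...ik; i{k+1}...id) has rank r_k
for k in range(1, d):
    A_k = A.reshape(n**k, n**(d - k))
    print(k, np.linalg.matrix_rank(A_k), ranks[k])
```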
Then, we use the standard trick and obtain
$$\sum_{i_1, \ldots, i_d} A(i_1, \ldots, i_d) = \Big(\sum_{i_1} A_{11}(i_1) \otimes A_{21}(i_1) \otimes \ldots \otimes A_{p1}(i_1)\Big) \ldots \Big(\sum_{i_d} A_{1d}(i_d) \otimes A_{2d}(i_d) \otimes \ldots \otimes A_{pd}(i_d)\Big),$$
i.e., it is now a product of small-rank tensors, so we multiply them and compress.
Putting MRFs on a Tensor Train, A. Novikov, A. Rodomanov, A. Osokin, D. Vetrov, ICML-2014
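A small sketch of the same trick for the simplest case of a single TT tensor ($p = 1$, with illustrative random cores): summing over all indices reduces to a product of small matrices $\sum_{i_k} G_k(i_k)$.
```python
import numpy as np

d, n = 4, 3
ranks = [1, 2, 2, 2, 1]
cores = [np.random.randn(ranks[k], n, ranks[k + 1]) for k in range(d)]

# Sum over all n^d entries without ever forming the full tensor:
# sum_{i_1..i_d} G_1(i_1)...G_d(i_d) = (sum_{i_1} G_1(i_1)) ... (sum_{i_d} G_d(i_d))
result = np.eye(1)
for G in cores:
    result = result.dot(G.sum(axis=1))   # each factor is a small r_{k-1} x r_k matrix

# Brute-force check on this small example
A = cores[0]
for G in cores[1:]:
    A = np.tensordot(A, G, axes=([-1], [0]))
print(np.allclose(result.item(), A.sum()))   # True
```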
Given $A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d)$, optimize over one core $G_k(i_k)$ at a time.
This works well for quadratic functionals, and you can also parametrize
$$\mathrm{vec}(A) = Q \mathrm{vec}(G_k),$$
where $Q$ is orthonormal.
For non-quadratic functionals, however, this is no longer the natural choice; therefore, Riemannian optimization techniques are needed.
The retraction step is easy, since the projection always has rank at most $2r$, so it can be done by rounding.
The main problem is the computation of $$P_{\mathcal{T}}(\nabla F)$$ without exponential complexity.
If this computation is possible, Riemannian optimization is the way to go.
Consider the binary classification problem.
Logistic regression is the simplest choice: given the feature vector $x$,
we predict the probability of observing $y_i$:
$$p = Z \exp(-y_i \langle w, x_i \rangle),$$
i.e., the predictor variable is just a linear combination of the components of the feature vector,
and $w$ is the weight vector.
We can use other predictor variables, for example, products of subsets of features, like
$$w_1 x_1 + w_{125} x_1 x_2 x_5 + \ldots$$
All possible subsets of this form can be encoded by a vector of length $2^d$, i.e., a tensor of size $2 \times \ldots \times 2$
(each binary index tells whether the corresponding feature, say $x_1$, enters the term or not).
The predictor variable is then $t = \langle W, X \rangle$, where $\langle \cdot, \cdot \rangle$ is the scalar product of tensors.
Here $W$ is a $2 \times 2 \times \ldots \times 2$ weight tensor.
We impose low-rank constraints on $W$ to avoid overfitting.
The total loss is the sum of the individual losses:
$$F(W) = \sum_{k=1}^K f(y_k, \langle X_k, W \rangle),$$
where $X_k$ is the low-rank tensor obtained from the feature vector $x_k$.
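To make the encoding concrete, one natural choice (my assumption, not spelled out in the slides) is the rank-$1$ tensor $X = [1, x_1] \otimes \ldots \otimes [1, x_d]$, for which $\langle W, X \rangle$ sums $w_S \prod_{k \in S} x_k$ over all $2^d$ subsets $S$. A short check:
```python
import numpy as np
from itertools import product

d = 3
x = np.random.randn(d)
W = np.random.randn(*(2,) * d)                 # weight tensor of size 2 x 2 x ... x 2

# Rank-1 feature tensor X = [1, x_1] o [1, x_2] o ... o [1, x_d]
X = np.array([1.0])
for xk in x:
    X = np.multiply.outer(X, np.array([1.0, xk]))
X = X.reshape((2,) * d)

t = np.sum(W * X)                              # predictor variable t = <W, X>

# Brute force: sum over all subsets S of w_S * prod_{k in S} x_k
t_check = sum(W[idx] * np.prod([x[k] for k in range(d) if idx[k] == 1])
              for idx in product([0, 1], repeat=d))
print(np.allclose(t, t_check))                 # True
```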
The gradient is easily computable:
$$\nabla F = \sum_{k=1}^K \frac{df}{dz}(y_k, \langle X_k, W \rangle)\, X_k,$$
and the projection onto the tangent space can be computed in a fast way.
On the Otto problem from Kaggle, the larger the rank, the better the learning.
You have to train fast
(a GPU implementation is necessary, as in deep learning).
We can go further and try to use the low-rank regularization as a constraint.
Deep Learning is an extremely popular approach for classification/regression problems.
Given a feature vector $x$, we construct a sequence of linear/non-linear layers
$$x \rightarrow C_1 \rightarrow C_2 \rightarrow \ldots \rightarrow p.$$
We are given feature vectors $x$ (high-dimensional, e.g. $256 \times 256$ images) and classes $y$ that we want to predict.
Logistic regression uses a linear filter (matrix) $W$ and computes $p = Wx$, where $\exp(p)$ (up to normalization) gives
the probabilities of observing the corresponding classes.
A multilayer scheme uses a sequence of matrices $W(i_1, \ldots, i_d)$ parametrized by a tensor train:
$$W(i_1, \ldots, i_d) = W_1(i_1) \ldots W_d(i_d).$$
It is equivalent to $n^d$ linear classifiers, but with constraints (otherwise we would have overfitting).
This is a work in progress, but it seems to work.
You can use low-rank factorization to initialize other optimization methods.
We have successfully sped up convolutional neural networks by factorizing a 4D tensor into the canonical format and then fine-tuning it.
Thanks to the wonderful TensorLab toolbox by L. De Lathauwer!
In [284]:
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[284]: