In [4]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
    'theme': 'sky',
    'transition': 'zoom',
    'start_slideshow_at': 'selected',
})
Out[4]:
On August 23-28, 2015, we are holding the 4th Moscow conference on Matrix Methods in Mathematics and Applications.
Confirmed speakers: C. Schwab, P. G. Martinsson, B. Benner, J. White, D. Zorin, P.-A. Absil, A. Cichocki, P. Van Dooren.
A good time to visit Moscow (e.g., due to the drop of the rouble from 35 to 55 roubles per dollar).
http://matrix.inm.ras.ru/mmma-2015/
The main goal of this talk is to show different connections between non-stationary problems and optimization on low-rank manifolds of matrices and tensors
The details for this part of the talk can be found in the paper
From low-rank approximation to an efficient rational Krylov subspace method for the Lyapunov equation, D. A. Kolesnikov, I. V. Oseledets
Consider a linear system of ODEs,
$$\frac{dy}{dt} = Ay, \quad y(0) = y_0.$$

All methods for solving such problems are based on a careful selection of a low-dimensional subspace $U$
such that
$$y(t) \approx U c(t),$$

where $U^* U = I$ and $U$ has $r$ columns.
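As a minimal illustration (my own sketch, not the method from the Kolesnikov-Oseledets paper mentioned above), a Galerkin projection onto a fixed orthonormal basis $U$ reduces the system to $\dot{c} = (U^{\top} A U)\, c$, $c(0) = U^{\top} y_0$; here a plain Krylov basis stands in for a carefully selected subspace, and the test matrix $A$ is an arbitrary choice.
import numpy as np
from scipy.linalg import expm, qr

n, r = 100, 10
A = -np.eye(n) + 0.1 * np.random.randn(n, n)   # an arbitrary test matrix (assumption)
y0 = np.random.randn(n)

# orthonormalize the Krylov vectors [y0, A y0, ..., A^{r-1} y0] as a stand-in subspace
Kry = np.column_stack([np.linalg.matrix_power(A, k).dot(y0) for k in range(r)])
U, _ = qr(Kry, mode='economic')

# reduced system: dc/dt = (U^T A U) c, c(0) = U^T y0
B = U.T.dot(A).dot(U)
c0 = U.T.dot(y0)
t = 1.0
y_approx = U.dot(expm(t * B).dot(c0))
y_exact = expm(t * A).dot(y0)
print(np.linalg.norm(y_approx - y_exact) / np.linalg.norm(y_exact))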
In model order reduction, the exact solution of this subspace-selection problem seems to be well known (however, we did not find exactly the same result in the literature).
$\def\trace{\mathop{\mathrm{tr}}\nolimits}$ The functional can be written as
\begin{equation*}
F(U) = F_1(U) - 2 F_2(U),
\end{equation*}

\begin{equation}
\begin{split}
F_1(U) &= \trace X - \trace Z,\\
F_2(U) &= \trace U^{\top} (P - U Z),
\end{split}
\end{equation}

where $P$ is the solution of the Sylvester equation and $Z$ is the solution of the Lyapunov equation:
\begin{equation}
\begin{split}
A P + P B^{\top} &= -y_0 c_0^{\top},\\
B Z + Z B^{\top} &= -c_0 c_0^{\top}.
\end{split}
\end{equation}

Now we move on to the next part of the talk: the diffusion/Schrödinger equation in unbounded domains, treated by an "inflation" method with low-rank approximation.
This part is based on joint work with my postdoc Mikhail Litsarev.
The idea is described in the paper "Low-rank approach to the computation of path integrals".
The matrix $\mathbf{K}$ is a bi-infinite tridiagonal symmetric Toeplitz matrix with elements
$$K_{kl} = \mathbf{k}_{k-l}, $$
where $\mathbf{k}_0 = -\frac{2}{h^2}$, $\mathbf{k}_1 = \mathbf{k}_{-1} = \frac{1}{h^2}$, and all other elements of $\mathbf{k}$ are equal to $0$. The operator $\mathbf{V}$
corresponds to a bi-infinite diagonal matrix with elements $V(x_k)$ on the diagonal.
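For intuition, here is a finite section of these operators on a grid $x_k = kh$, $|k| \le N$ (my own sketch; the potential $V(x) = x^2$ is just a placeholder):
import numpy as np

N, h = 5, 0.1
x = h * np.arange(-N, N + 1)

# finite section of the bi-infinite tridiagonal Toeplitz matrix K
K = (np.diag(-2.0 * np.ones(2 * N + 1)) +
     np.diag(np.ones(2 * N), 1) +
     np.diag(np.ones(2 * N), -1)) / h ** 2

# the operator V: a diagonal matrix with V(x_k) on the diagonal (placeholder potential)
V = np.diag(x ** 2)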
Now we apply the second-order Marchuk-Strang splitting
$$\psi(t + \tau) = e^{i \mathbf{H} \tau} \psi(t) = e^{i \mathbf{V} \tau/2} e^{i \mathbf{K} \tau} e^{i \mathbf{V} \tau/2} \psi(t) + \mathcal{O}(\tau^3).$$

Since $\mathbf{V}$ is a diagonal matrix, its exponential is also a diagonal matrix. The most interesting part is the computation of the action of the matrix exponential

\begin{equation}
\widehat{\varphi} = e^{i \mathbf{K} \tau} \varphi.
\end{equation}

Since $\mathbf{K}$ is a Toeplitz matrix, its exponential is also a Toeplitz matrix, and the action of the operator is a convolution

$$\widehat{\varphi}_k = \sum_{l=-\infty}^{\infty} \mathbf{b}_{l} \varphi_{l+k}.$$
The coefficients $\mathbf{b}_l$ depend on $\tau$ and $h$.
Theorem: the coefficients $\mathbf{b}_l$ decay exponentially for $|l| > M$, where $M \sim \frac{\tau}{h^2}$.
In [7]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft, ifft
Nx = 200               # number of grid points per unit length
Mx = 3 * Nx + 3        # size of the periodic (circulant) embedding
hx = 1.0/Nx            # grid step h
tau = hx               # time step tau ~ h
print 'Mx = ', Mx
# first column of the circulant matrix built from the tridiagonal stencil (-2, 1, 1)
H = np.zeros(Mx)
H[0] = -2
H[1] = 1
H[Mx - 1] = 1
cf = 0.5 * tau / (hx ** 2)
H = -cf * H
# the FFT diagonalizes a circulant matrix: lam_analytic are its eigenvalues
lam_analytic = fft(H)
# eigenvalues of the matrix exponential, transformed back to its first column,
# i.e. the convolution coefficients b_l
band = ifft(np.exp(-1j * lam_analytic))
abs_band = abs(band)
plt.semilogy(abs_band)
plt.title('Convolution coefficients')
Out[7]:
We can replace the infinite convolution by a windowed one!
$$\widehat{\varphi}_k \approx \sum_{l=-M/2}^{M/2} \mathbf{b}_{l} \varphi_{k+l}.$$

Thus, if we know the solution on $[-L, L]$, we can compute it on $[-L+M/2, L-M/2]$ (a smaller interval).
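A small sketch of the windowed convolution, reusing `band`, `lam_analytic`, `hx` and `Mx` from the cell above (the window half-width `M` and the test wave packet are my own choices):
M = 250                                              # window half-width ~ tau / h^2 plus a margin (assumption)
kernel = np.concatenate((band[-M:], band[:M + 1]))   # b_{-M}, ..., b_0, ..., b_M

x = hx * (np.arange(Mx) - 0.5 * Mx)
phi = np.exp(-50.0 * x ** 2)                         # a sample wave packet (assumption)

# exact (circulant) application via the FFT vs. the windowed convolution;
# since K is symmetric, b_l = b_{-l}, so the windowed sum is an ordinary convolution
phi_full = ifft(np.exp(-1j * lam_analytic) * fft(phi))
phi_win = np.convolve(phi, kernel, mode='same')
print(abs(phi_full - phi_win).max())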
How large should $L$ be?
Since $\tau \sim h$ and $M \sim \frac{1}{h}$, to reach time $T$ we need at least $\sim MT$ steps.
At each step $M$ points disappear, so $L \sim TM^2$ points are needed (the inflated domain).
How can we break the complexity/storage barrier?
Theorem: after one convolution the rank is at most $2r$ (thus we re-approximate).
Initialization and multiplication by $e^{i V \tau}$ are done via cross approximation.
In function approximation, low-rank approximation is equivalent to separation of variables:
$$f(x_1, \ldots, x_d) \approx \sum_{\alpha=1}^r U_1(x_1, \alpha) \ldots U_d(x_d, \alpha).$$

Often it is better to use stable tensor formats (TT-format, HT-format).
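For $d = 2$, separation of variables is just the SVD of the matrix of function values. A minimal NumPy sketch (my own example, with $f(x, y) = e^{-xy}$ and an arbitrary separation rank):
import numpy as np

x = np.linspace(0, 1, 200)
y = np.linspace(0, 1, 200)
F = np.exp(-np.outer(x, y))             # samples of f(x, y) = exp(-x y) on a grid

U, s, Vt = np.linalg.svd(F)
r = 5                                   # separation rank (assumption)
F_r = (U[:, :r] * s[:r]).dot(Vt[:r])    # rank-r separated approximation
print(np.linalg.norm(F - F_r) / np.linalg.norm(F))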
The main problem is to select the right variables that can be separated.
This means you may have to use a completely different functional (not quadratic, not matrix completion) in the minimization problem.
For the sign-rank, a good surrogate is the logistic loss
$$l(x, y) = \log(1 + e^{-(2y - 1)x}),$$

$$F(X, Y) = \sum_{ij} l(x_{ij}, y_{ij}).$$

Then we minimize $F(X, Y)$ for given $Y$ subject to low-rank constraints on $X$.
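A minimal sketch of this surrogate with a low-rank parametrization $X = AB^{\top}$ (my own illustration; `A`, `B`, `Y` and their sizes are placeholders, and the binary labels are assumed to be in $\{0, 1\}$):
import numpy as np

def sign_rank_loss(A, B, Y):
    # logistic surrogate F(X, Y) with X = A B^T and entries of Y in {0, 1}
    X = A.dot(B.T)
    return np.sum(np.log1p(np.exp(-(2 * Y - 1) * X)))

n, m, r = 50, 40, 3
A = np.random.randn(n, r)
B = np.random.randn(m, r)
Y = (np.random.randn(n, m) > 0).astype(float)   # placeholder binary data
print(sign_rank_loss(A, B, Y))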
Non-quadratic functional -- forget about alternating least squares
One of the most important properties of low-rank formats is the structure of the tangent space
Given $A = U S V^{\top}$ with orthonormal $U$ and $V$, the tangent space is defined as
$$Z = U S V^{\top} + \delta U S V^{\top} + U \delta S V^{\top} + U S \delta V^{\top},$$

subject to

$$\delta U^{\top} U + U^{\top} \delta U = \delta V^{\top} V + V^{\top} \delta V = 0.$$
The projector onto the tangent space is given by
$$P_T(Z) = Z - (I - UU^{\top}) Z (I - VV^{\top}).$$

We can write down the projector for TT/HT formats as well!
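A direct NumPy transcription of this projector for the matrix case (a sketch; `U` and `V` are the orthonormal factors of the current point):
import numpy as np

def tangent_project(Z, U, V):
    # P_T(Z) = Z - (I - U U^T) Z (I - V V^T) = U U^T Z + Z V V^T - U U^T Z V V^T
    UUtZ = U.dot(U.T.dot(Z))
    ZVVt = Z.dot(V).dot(V.T)
    return UUtZ + ZVVt - U.dot(U.T.dot(ZVVt))

# quick check on random data: the projector is idempotent
n, m, r = 30, 20, 4
U, _ = np.linalg.qr(np.random.randn(n, r))
V, _ = np.linalg.qr(np.random.randn(m, r))
Z = np.random.randn(n, m)
PZ = tangent_project(Z, U, V)
print(np.linalg.norm(tangent_project(PZ, U, V) - PZ))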
Given the projector, you can solve optimization problems on the manifold!
For the low-rank manifolds (matrix, TT, HT) we can efficiently compute the projection onto the tangent space.
The simplest approach is to project the gradient onto the tangent space:
$$x_{k+1} = R\bigl(x_k - \alpha_k P_{\mathcal{T}}(\nabla F)\bigr),$$

where $\alpha_k$ is a step size and $R$ is the mapping from the tangent space back to the manifold, called a retraction.
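A minimal sketch of one such step for the matrix case, reusing `tangent_project` from the sketch above; the retraction is a truncated SVD back to rank $r$, and the step size `alpha` is a placeholder:
import numpy as np

def retract(X, r):
    # retraction: truncated SVD back to the rank-r manifold
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]).dot(Vt[:r])

def projected_gradient_step(X, grad, r, alpha=0.1):
    # one step x_{k+1} = R(x_k - alpha * P_T(grad F))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    return retract(X - alpha * tangent_project(grad, U, V), r)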
A popular choice in function approximation is the canonical (sum-of-products) format
$$A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$

i.e. a sum of separable functions.
Disadvantage: it is not a stable format: the best approximation may not exist, and even when it does exist, it may be hard to compute!
However, for a particular tensor $A$ it can be very efficient.
The TT-format
$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d),$$

can be characterized by the following condition:
$$\mathrm{rank}(A_k) = r_k,$$

where $A_k = A(i_1 i_2 \ldots i_k; i_{k+1} \ldots i_d)$ is the $k$-th unfolding of the tensor.
I.e. it is the intersection of low-rank matrix manifolds!
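A quick NumPy check of this characterization (my own illustration): contract random TT cores of rank $2$ into a full tensor and verify that every unfolding has rank at most $2$.
import numpy as np

d, n, r = 4, 3, 2
cores = ([np.random.randn(1, n, r)] +
         [np.random.randn(r, n, r) for _ in range(d - 2)] +
         [np.random.randn(r, n, 1)])
A = cores[0]
for G in cores[1:]:
    A = np.tensordot(A, G, axes=(-1, 0))   # contract the shared rank index
A = A.reshape([n] * d)

# every unfolding A_k = A(i_1 ... i_k; i_{k+1} ... i_d) has rank at most r
for k in range(1, d):
    Ak = A.reshape(n ** k, n ** (d - k))
    print(np.linalg.matrix_rank(Ak))      # at most r for every k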
We have a TT-Toolbox, both in MATLAB http://github.com/oseledets/TT-Toolbox and in Python http://github.com/oseledets/ttpy
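For example, compressing a full array into the TT-format with ttpy might look as follows (a sketch based on the toolbox's basic interface around 2015; treat the exact call signatures as an assumption):
import numpy as np
import tt   # ttpy, http://github.com/oseledets/ttpy

a = np.arange(2 ** 10, dtype=float).reshape([2] * 10)   # a simple full test tensor
t = tt.tensor(a, 1e-12)                                 # compress with accuracy 1e-12
print(t)                                                # prints the mode sizes and TT-ranks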
Given $A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d)$, optimize over one core $G_k(i_k)$ at a time.
This works well for quadratic functionals; moreover, you can parametrize
$$\mathrm{vec}(A) = Q \mathrm{vec}(G_k),$$
where $Q$ is orthonormal.
For non-quadratic functionals, however, Riemannian optimization techniques are needed.
The retraction step is easy: the projected point always has rank at most $2r$, so the retraction can be done by rounding.
The main problem is the computation of $$P_{\mathcal{T}}(\nabla F)$$ without exponential complexity.
If it is possible, this is the way to go.
Consider the binary classification problem.
Logistic regression is the simplest choice: given the feature vector $x_i$,
we predict the probability of observing the label $y_i$ as
$$p = Z \exp(-y_i \langle w, x_i \rangle),$$

where $Z$ is a normalization constant. I.e., the predictor variable is just a linear combination of the components of the feature vector,
and $w$ is the weight vector.
We can use other predictor variables, for example, products of features (over subsets), like
$$w_1 x_1 + w_{125} x_1 x_2 x_5 + \ldots$$

We can encode all possible subsets in this form by a vector of length $2^d$, or a tensor of size $2 \times \ldots \times 2$
(the $k$-th index says whether $x_k$ enters the term or not).
The predictor variable is then $t = \langle W, X \rangle$, where $\langle \cdot, \cdot \rangle$ is the scalar product of tensors
and $W$ is the $2 \times 2 \times \ldots \times 2$ weight tensor.
We impose low-rank constraints on $W$ to avoid overfitting.
The total loss is the sum of the individual losses
$$F(W) = \sum_{i=1}^K f(y_i, \langle X_i, W \rangle),$$

where $X_i$ is the low-rank tensor obtained from the feature vector $x_i$.
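One standard way to obtain such an $X_i$ (a sketch of my reading, not necessarily the talk's exact construction): take the rank-1 tensor $X_i = [1, x_{i,1}] \otimes [1, x_{i,2}] \otimes \ldots \otimes [1, x_{i,d}]$, so that $\langle W, X_i \rangle$ sums the weighted products $w_S \prod_{k \in S} x_{i,k}$ over all subsets $S$ of features.
import numpy as np
from functools import reduce

def subset_feature_tensor(x):
    # rank-1 tensor [1, x_1] x [1, x_2] x ... x [1, x_d] (outer products)
    factors = [np.array([1.0, xk]) for xk in x]
    return reduce(np.multiply.outer, factors)

x = np.array([0.5, -1.0, 2.0])
X = subset_feature_tensor(x)   # shape (2, 2, 2)
print(X[1, 0, 1])              # equals x_1 * x_3 = 1.0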
The gradient is easily computable:
$$\nabla F = \sum_{i=1}^K \frac{df}{dz}(y_i, \langle X_i, W \rangle) X_i,$$

and the projection onto the tangent space can be computed in a fast way.
On the Otto problem from Kaggle, the larger the rank, the better the learning.
You have to train fast
(a GPU implementation is necessary, as in deep learning).
You can use low-rank factorization to initialize other optimization methods.
We have successfully sped up convolutional neural networks by factorizing a 4D tensor into the canonical format and then fine-tuning it.
Thanks to the wonderful TensorLab toolbox by L. De Lathauwer!
In [3]:
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[3]: