In [2]:
from IPython.html.services.config import ConfigManager
from IPython.utils.path import locate_profile
# Configure the live-reveal slideshow settings for this notebook.
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
    'theme': 'sky',
    'transition': 'zoom',
    'start_slideshow_at': 'selected',
})
Out[2]:
In [ ]:
%reload_ext tikzmagic
The main goal is to show that tensor networks are used in different communities, which share similar algorithms but apply them to different problems.
Say we want to model $1000 \times 1000$ grayscale images with 256 intensity levels.
To do so, we may build a probabilistic model of our images:
$$ P: 256^{1000 \times 1000} \mapsto [0, 1]. $$ Now we can sample new images or evaluate the probability of given ones.
Random variable $\vec{x} = (x_1, x_2, x_3)$, $x_j \in \{1, 2\}$:
Estimate all 8 parameters from the data:
$$ P(\vec{x}) := \frac{1}{N} \sum_{i=1}^N [\vec{x} = \vec{x}^i] $$
Just store it explicitly: $P \in \mathbb{R}^{2 \times 2 \times 2}$.
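As a minimal sketch of this estimator (with synthetic data, purely for illustration), the $2 \times 2 \times 2$ table can be built directly in NumPy:
In [ ]:
import numpy as np
# Synthetic samples of x = (x_1, x_2, x_3), x_j in {1, 2} (made-up data).
N = 10000
samples = np.random.randint(1, 3, size=(N, 3))
# Empirical estimator P(x) = (1/N) * #{i : x^i = x}, stored explicitly.
P = np.zeros((2, 2, 2))
for x1, x2, x3 in samples:
    P[x1 - 1, x2 - 1, x3 - 1] += 1.0 / N
print(P.sum())  # sums to 1 up to rounding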
A million dimensions.
How do we estimate and store $256^{1\,000\,000}$ parameters (probabilities)?
Many algorithms and methods are known in different communities:
We need a tensor format (or multivariate function representation) that is:
A superposition of simple functions is easy to evaluate! Kolmogorov superposition theorem:
Every continuous function from $[0, 1]^d \rightarrow \mathbb{R}$ can be written as a superposition of lower-dimensional functions:
$$f(x_1, \ldots, x_d) = \sum_{n=0}^{2d} g_n\bigl(\xi(x_1 + na, \ldots, x_d + na)\bigr), \qquad \xi(x_1 + na, \ldots, x_d + na) = \sum_{i=1}^d \lambda_i \psi(x_i + na). $$
The theorem is not constructive, but it is instructive: all we can ever compute are superpositions of simple functions!
The simplest construction combines two operations: weighted sums and products.
Now we will follow the idea of sum-product networks (SPNs) by Hoifung Poon and Pedro Domingos.
We can start from simple univariate functions (or distributions) and then state the following axioms.
In [ ]:
%%tikz -l graphdrawing,graphs,trees,automata,quotes -f svg -s 800,500 --save /Users/ivan/work/talks-online/eth-2015/spn.svg
\tikzset{>=stealth}
\usegdlibrary{layered,trees,circular,force}
\graph[grow=up, layered layout]{
q/"$\times$",
s1/"$+$", s2/"$+$", s3/"$+$", s4/"$+$",
p1/"$\times$", p2/"$\times$", p3/"$\times$", p4/"$\times$", p5/"$\times$", 2/"$x_1$", 3/"$x_1$", 4/"$x_1$",
5/"$x_2$", 6/"$x_2$", 7/"$x_2$", 8/"$x_3$", 9/"$x_3$",
2 -> {p1}, 3 -> {p2}, 4 -> {p3}, 5 -> {p1}, 6 -> {p2}, 7 -> {p3}, 8 -> {p4}, 2 -> {p4}, 9 -> {p5}, 3 -> {p5},
s1 <- {p1, p2}, s2 <- {p2, p3}, s3 <- {p1, p3}, s4 <- {p4, p5}, q <- {s1, s2, s3, s4}
};
If we approximate a probability distribution, there is a natural interpretation of the sum nodes:
it is conditional independence. Suppose we have two variables that are not independent, $P(x, y) \neq P_x(x) P_y(y)$.
Assume that given a certain hidden variable $z$ they are independent:
$$P(x, y \mid z) = P_x(x \mid z) P_y(y \mid z),$$
thus $P(x, y) = \sum_z P(z) P_x(x \mid z) P_y(y \mid z)$ is a sum of products over $z$.
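A tiny numerical illustration of this sum-of-products structure, with a made-up discrete distribution (two hidden states, three states each for $x$ and $y$):
In [ ]:
import numpy as np
Pz = np.array([0.3, 0.7])              # P(z)
Px_z = np.array([[0.2, 0.5, 0.3],      # P(x | z = 0)
                 [0.6, 0.1, 0.3]])     # P(x | z = 1)
Py_z = np.array([[0.4, 0.4, 0.2],
                 [0.1, 0.8, 0.1]])
# P(x, y) = sum_z P(z) P(x | z) P(y | z): a sum of products over z.
Pxy = np.einsum('z,zx,zy->xy', Pz, Px_z, Py_z)
print(Pxy.sum())                       # 1.0: a valid joint distribution
print(np.linalg.matrix_rank(Pxy))      # at most the number of hidden states (2)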
There is a nice algorithm for learning SPNs, which is based on this probabilistic interpretation.
Given a graph, we associate original indices with the vertices, and auxiliary indices with the edges.
In each vertex we store a tensor that depends on the index in the vertex and in all edges.
We multiply these tensors and sum over all edges:
$$A(i_1, \ldots, i_d) = \sum_{s(E)} \prod_{V} G_V(s(V)), $$ where $s(E)$ runs over the values of all edge indices and $s(V)$ collects the indices attached to vertex $V$.
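For example, a made-up tensor network on a triangle graph (three vertices, one free index of size 2 each, one auxiliary index of size 3 per edge) can be contracted directly with numpy.einsum:
In [ ]:
import numpy as np
r = 3
G1 = np.random.rand(2, r, r)   # depends on i1 and the edge indices a, b
G2 = np.random.rand(2, r, r)   # depends on i2 and the edge indices a, c
G3 = np.random.rand(2, r, r)   # depends on i3 and the edge indices b, c
# A(i1, i2, i3) = sum_{a, b, c} G1(i1, a, b) G2(i2, a, c) G3(i3, b, c)
A = np.einsum('iab,jac,kbc->ijk', G1, G2, G3)
print(A.shape)                 # (2, 2, 2)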
In [ ]:
%%tikz -l graphdrawing,graphs,trees,automata,quotes -f svg --save /Users/ivan/work/talks-online/eth-2015/tt.svg
\tikzset{>=stealth}
\usegdlibrary{layered,trees,circular,force}
\graph[grow right=1.5cm, branch down]{
{[nodes={rectangle, draw}] i1/$i_1$ -- ["$\alpha_1$"] i2/$i_2$ -- ["$\alpha_2$"]
i3/$i_3$ -- ["$\alpha_3$"] i4/$i_4$ -- ["$\alpha_4$"] i5/$i_5$},
/ -!- / -!- A/"Tensor train / matrix product state network" -!- i3;
};
A specific case of the tree tensor network is the tensor train (or matrix product state) representation,
which can be written as
$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d),$$
i.e. a product of matrices, each depending on only one index; $G_k(i_k)$ is an $r_{k-1} \times r_k$ matrix and $r_0 = r_d = 1$.
It depends on the ordering of the indices.
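A minimal sketch of evaluating one entry of a TT / MPS representation by multiplying the cores (random cores, chosen here only to show the shapes):
In [ ]:
import numpy as np
d, n = 5, 4
ranks = [1, 3, 3, 3, 3, 1]                         # r_0 = r_d = 1
# Core G_k has shape (r_{k-1}, n, r_k); G_k(i_k) is an r_{k-1} x r_k matrix.
cores = [np.random.rand(ranks[k], n, ranks[k + 1]) for k in range(d)]
def tt_entry(cores, index):
    # A(i_1, ..., i_d) = G_1(i_1) G_2(i_2) ... G_d(i_d)
    res = np.eye(1)
    for G, i in zip(cores, index):
        res = res.dot(G[:, i, :])
    return res[0, 0]
print(tt_entry(cores, (0, 1, 2, 3, 0)))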
A popular choice in function approximation is the canonical (sum-of-products) format
$$A(i_1, \ldots, i_d) \approx \sum_{\alpha=1}^r U_1(i_1, \alpha) \ldots U_d(i_d, \alpha),$$
i.e. a sum of separable functions.
Disadvantage: it is not a stable format: the best approximation may not exist, and it may be hard to compute even when it does exist!
However, for a particular tensor $A$ it can be very efficient.
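For comparison, here is the same kind of entry evaluation for the canonical format, with random factors $U_k$ (again just a sketch of the data layout):
In [ ]:
import numpy as np
d, n, r = 5, 4, 3
U = [np.random.rand(n, r) for _ in range(d)]       # factors U_k(i_k, alpha)
def cp_entry(U, index):
    # A(i_1, ..., i_d) = sum_alpha U_1(i_1, alpha) ... U_d(i_d, alpha)
    prod = np.ones(U[0].shape[1])
    for Uk, i in zip(U, index):
        prod = prod * Uk[i, :]
    return prod.sum()
print(cp_entry(U, (0, 1, 2, 3, 0)))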
In [ ]:
%%tikz -l graphdrawing,graphs,trees,automata,quotes,graphs.standard -f svg --save /Users/ivan/work/talks-online/eth-2015/2d.svg
\tikzset{>=stealth}
\usegdlibrary{layered,trees,circular,force}
\graph[grid placement]{subgraph Grid_n[V={$i_1$, $i_2$, $i_3$, $i_4$, $i_5$, $i_6$, $i_7$, $i_8$, $i_9$}] };
We know a lot about efficient computations with tensor train / tree formats.
We can:
The TT-format
$$A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d),$$
given the ordering of the indices, can be characterized by the following condition:
$$\mathrm{rank}(A_k) = r_k,$$
where $A_k = A(i_1 i_2 \ldots i_k; i_{k+1} \ldots i_d)$ is the $k$-th unfolding of the tensor.
I.e. it is the intersection of low-rank matrix manifolds!
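A quick NumPy check of this characterization on a tensor that is separable by construction, so every unfolding has rank one:
In [ ]:
import numpy as np
n, d = 3, 4
# Rank-1 (separable) tensor: A(i_1, ..., i_d) = a(i_1) b(i_2) c(i_3) e(i_4).
vecs = [np.random.rand(n) for _ in range(d)]
A = vecs[0]
for v in vecs[1:]:
    A = np.multiply.outer(A, v)
for k in range(1, d):
    Ak = A.reshape(n ** k, n ** (d - k))   # k-th unfolding A_k = A(i_1 ... i_k; i_{k+1} ... i_d)
    print(k, np.linalg.matrix_rank(Ak))    # all unfolding ranks are 1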
We have a TT-Toolbox for working with the TT-format, both in MATLAB http://github.com/oseledets/TT-Toolbox and in Python http://github.com/oseledets/ttpy.
In [9]:
# A short illustration of basic TT operations: addition, rounding, cross approximation
import numpy as np
import tt
from tt.cross import rect_cross as cross
d = 20
a = tt.ones(2, d)               # rank-1 TT tensor of all ones, 2 x ... x 2 (d times)
b = a + a                       # addition doubles the TT-ranks ...
# print(b.round(1e-6))          # ... and rounding brings them back down
fun = lambda x: x.sum(axis=1)   # A(i_1, ..., i_d) = i_1 + ... + i_d (0-based indices)
c = cross(fun, b, eps=1e-6)     # TT-cross approximation, with b as the starting tensor
c.round(1e-8)
Out[9]:
We have problems that can be efficiently solved using well-established NLA techniques, so it is a good idea to use them as building blocks.
Reduce these operations to operations with tensor trains, i.e. represent a 2D network as layers of 1D tensor networks!
A Markov random field is another example of a tensor network.
Even if the potentials are known, $Z(X, W)$ is difficult to compute (sum of a huge array).
Then, we use the standard trick and obtain $$\sum_{i_1, \ldots, i_d} A(i_1, \ldots, i_d) = \Big(\sum_{i_1} A_{11}(i_1) \otimes A_{21}(i_1) \otimes \ldots \otimes A_{p1}(i_1)\Big) \ldots \Big(\sum_{i_d} A_{1d}(i_d) \otimes A_{2d}(i_d) \otimes \ldots \otimes A_{pd}(i_d)\Big),$$ i.e. the sum is now a product of matrices built from low-rank cores, so we multiply them one by one and compress.
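The basic identity behind this trick, namely that the sum of all entries of a tensor given by cores is a product of "summed" cores, is easy to check on a small random example:
In [ ]:
import numpy as np
d, n = 4, 3
ranks = [1, 2, 2, 2, 1]
cores = [np.random.rand(ranks[k], n, ranks[k + 1]) for k in range(d)]
# Full tensor for reference: A(i1, i2, i3, i4) = G_1(i1) G_2(i2) G_3(i3) G_4(i4).
A = np.einsum('aib,bjc,ckd,dle->ijkl', *cores)
# Sum over all indices = product of the matrices sum_{i_k} G_k(i_k).
res = np.eye(1)
for G in cores:
    res = res.dot(G.sum(axis=1))
print(res[0, 0], A.sum())        # the two numbers coincide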
One example from the paper Putting MRFs on a Tensor Train, A. Novikov, A. Rodomanov, A. Osokin, D. Vetrov, ICML 2014.
Spin glass models, $T = 1$, $c_{ij} \sim U[-f, f]$.
Even if there is no good low-rank approximation, low-rank tensor formats can be used as regularizations or initial approximations.
Given a certain functional $F(X)$, we restrict $X$ to the set $\mathcal{M}$ of "low-rank" tensors and minimize
$$F(X) \rightarrow \min, \quad X \in \mathcal{M}.$$
In the tensor community, alternating minimization is usually used:
given $A(i_1, \ldots, i_d) = G_1(i_1) \ldots G_d(i_d)$, optimize over one core $G_k(i_k)$ at a time.
This works well for quadratic functionals; moreover, one can parametrize
$$\mathrm{vec}(A) = Q\, \mathrm{vec}(G_k),$$
where $Q$ has orthonormal columns.
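A minimal sketch of the alternating pattern in the simplest possible setting (rank-1 approximation of a matrix), just to show how all factors but one are frozen at each step:
In [ ]:
import numpy as np
# Approximate A by a rank-1 product u v^T via alternating least squares.
A = np.random.rand(30, 20)
u = np.random.rand(30)
v = np.random.rand(20)
for it in range(20):
    u = A.dot(v) / v.dot(v)      # freeze v, solve the least-squares problem for u
    v = A.T.dot(u) / u.dot(u)    # freeze u, solve for v
# The result matches the dominant singular value of A (best rank-1 approximation).
print(np.linalg.norm(np.outer(u, v)), np.linalg.svd(A, compute_uv=False)[0])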
For more general functionals, however, Riemannian optimization techniques are needed.
For the low-rank manifold (matrix, TT, HT) we can efficiently compute the projection to the tangent space.
The simplest method is to project the gradient onto the tangent space:
$$x_{k+1} = R\bigl(x_k - P_{\mathcal{T}}(\nabla F)\bigr), $$
where $R$ is the mapping from the tangent space back to the manifold, called a retraction.
Just to give a feeling, in two dimensions (the matrix case) there is a simple formula: if
$$X = U S V^{\top},$$
then
$$P_X(Z) = Z - (I - UU^{\top}) Z (I - VV^{\top})$$
is the projection onto the tangent space.
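A small NumPy check of this formula (random low-rank $X$, random $Z$; projecting twice changes nothing, as it should for a projector):
In [ ]:
import numpy as np
m, n, r = 8, 6, 2
X = np.random.rand(m, r).dot(np.random.rand(r, n))    # a random rank-r matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]
Z = np.random.rand(m, n)
Pu = np.eye(m) - U.dot(U.T)          # projector onto the complement of the column space
Pv = np.eye(n) - Vt.T.dot(Vt)        # projector onto the complement of the row space
PZ = Z - Pu.dot(Z).dot(Pv)           # P_X(Z): projection onto the tangent space at X
PPZ = PZ - Pu.dot(PZ).dot(Pv)
print(np.linalg.norm(PZ - PPZ))      # ~ 1e-16: P_X is indeed a projector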
For the multidimensional case, see Time integration of tensor trains, C. Lubich, I. Oseledets, B. Vandereycken
Consider the binary classification problem.
Logistic regression is the simplest choice: given the feature vector $x_i$,
we predict the probability of observing $y_i$:
$$p = Z \exp(-y_i \langle w, x_i \rangle),$$
i.e. the predictor variable is just a linear combination of the components of the feature vector.
We can use other predictor variables, for example products of features over subsets, like
$$w_1 x_1 + w_{125}\, x_1 x_2 x_5 + \ldots$$
All possible subsets can be encoded by a vector of length $2^d$, i.e. a tensor of size $2 \times \ldots \times 2$
(each binary index records whether the corresponding feature enters the term or not).
The predictor variable is then $t = \langle W, X \rangle$, where $\langle \cdot, \cdot \rangle$ is the scalar product of tensors.
We impose low-rank constraints on $W$ to avoid overfitting.
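A hedged sketch of this encoding: the rank-1 structure of $X$ below, an outer product of the vectors $(1, x_k)$, is one natural choice consistent with the subset interpretation above, not necessarily the exact construction used in practice.
In [ ]:
import numpy as np
d = 4
x = np.random.rand(d)               # feature vector
W = np.random.rand(*([2] * d))      # weight tensor of size 2 x ... x 2: one weight per subset
# X(i_1, ..., i_d) = prod_k x_k^{i_k}, i_k in {0, 1}: outer product of (1, x_k).
X = np.ones(1)
for xk in x:
    X = np.multiply.outer(X, np.array([1.0, xk]))
X = X.reshape([2] * d)
t = np.sum(W * X)                   # predictor t = <W, X>
# Brute-force check: sum over all subsets S of w_S * prod_{k in S} x_k.
t_check = sum(W[idx] * np.prod([x[k] for k in range(d) if idx[k] == 1])
              for idx in np.ndindex(*([2] * d)))
print(t, t_check)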
The total loss is the sum of individual losses:
$$F(W) = \sum_{i=1}^K f(y_i, \langle X_i, W \rangle),$$
where $X_i$ is the low-rank tensor obtained from the feature vector $x_i$.
The gradient is easily computable:
$$\nabla F = \sum_{i=1}^K \frac{df}{dz}(y_i, \langle X_i, W \rangle)\, X_i,$$
and its projection onto the tangent space can be computed in a fast way.
On the Otto problem from Kaggle, the larger the rank, the better the learning (it is still learning today).
You have to train fast
(a GPU implementation is necessary, as in deep learning).
Introduced by Oseledets for matrices and by Khoromskij for vectors, the idea of quantization is very powerful:
given a univariate function $f(x)$ on a uniform grid with $2^d$ points, reshape the vector of its values into a $2 \times \ldots \times 2$
$d$-dimensional tensor and apply the TT-decomposition to it. This is the QTT decomposition.
It gives logarithmic complexity in the number of samples.
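A minimal NumPy illustration of why quantization can help: for the exponential function the reshaped vector of samples is exactly separable, so every unfolding of the quantized tensor has rank 1.
In [ ]:
import numpy as np
d = 16
n = 2 ** d
f = np.exp(np.linspace(0, 1, n))       # samples of exp(x) on a uniform grid with 2^d points
A = f.reshape([2] * d)                 # quantized 2 x ... x 2 tensor
A8 = A.reshape(2 ** 8, 2 ** 8)         # one of its unfoldings
print(np.linalg.matrix_rank(A8))       # 1: exp(x) is exactly rank-1 in the QTT sense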
In [13]:
%matplotlib inline
import matplotlib.pyplot as plt
import tt
import numpy as np
k = 200                                  # wavenumber of the oscillatory kernel
delta = 1e-2
d = 20
n = 2 ** d                               # 2^d grid points in each variable
t = np.linspace(delta, 1, n)
t = np.reshape(t, [2] * d, "F")          # quantize: reshape into a 2 x ... x 2 tensor
t_ten = tt.tensor(t, eps=1e-10)
e = tt.ones(2, d)
x = tt.kron(t_ten, e)                    # x and y coordinates of the 2D grid in QTT form
y = tt.kron(e, t_ten)
# Oscillatory kernel exp(-ikr)/r with r = sqrt(x^2 + y^2)
my_fun = lambda x: np.exp(-1j * k * np.sqrt(x[:, 0] ** 2 + x[:, 1] ** 2))/(np.sqrt(x[:, 0] ** 2 + x[:, 1] ** 2))
z = tt.multifuncrs([x, y], my_fun, eps=1e-6, verb=False)   # cross approximation of the kernel in QTT
plt.plot(z.r)                            # the QTT ranks stay moderate even for large k
plt.text(0.5, 0.5, 'k=%d, Maximal rank=%d' % (k, max(z.r)), fontsize=14, transform=plt.gca().transAxes)
Out[13]:
Given a function $f(x_1, \ldots, x_d)$, we try to reparametrize the variable $x$ into a new variable $t$ by a linear transformation
$$x = W t.$$
A natural way to go is to align the axes with the directions in which the function varies the most:
this can significantly improve the separability!
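One possible illustration (an active-subspaces-style heuristic, an assumption here rather than the specific construction of the talk): estimate the directions of largest variation from sampled gradients and take them as the new axes.
In [ ]:
import numpy as np
# A toy function that varies mostly along the single direction (1, 2, 0).
f = lambda x: np.sin(3.0 * (x[0] + 2.0 * x[1])) + 0.01 * x[2]
def grad(x, h=1e-5):
    g = np.zeros(x.size)
    for k in range(x.size):
        e = np.zeros(x.size); e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g
G = np.array([grad(np.random.rand(3)) for _ in range(200)])
_, s, Vt = np.linalg.svd(G, full_matrices=False)
W = Vt.T                        # columns: candidate new axes, x = W t
print(s)                        # a sharp drop means few new variables capture the variation
print(W[:, 0])                  # dominant direction, close to (1, 2, 0)/sqrt(5) up to sign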
The main problems are:
We know partial answers in many cases, but not in general.
On August 23-28, 2015, we are holding the 4th Moscow conference on Matrix Methods in Mathematics and Applications.
Confirmed speakers: C. Schwab, B. Benner, J. White, D. Zorin, M. Benzi, P.-A. Absil, A. Cichocki, P. Van Dooren.
It is a good time to visit Moscow (e.g., due to the exchange rate drop from 35 to 50 roubles per dollar).
In [1]:
from IPython.core.display import HTML
def css_styling():
styles = open("custom.css", "r").read()
return HTML(styles)
css_styling()
Out[1]: