The $J$ variables of the dynamical system:
$\dot{\mathbf{x}}(t)=\mathbf{A}\mathbf{x}(t)+\mathbf{c}(t)$ ...Eqn(1)
are encoded in the delta-function spike trains $\mathbf{o}(t)$ of $N$ leaky integrate-and-fire neurons,
which are read out [by another layer?]:
$\dot{\mathbf{\hat{x}}}=-\lambda_d \mathbf{\hat{x}} + \mathbf{\Gamma} \mathbf{o}(t)$ ...Eqn(2).
Each spike in train $i$ contributes a decaying exponential $\exp(-\lambda_d\tau)$, weighted by $\Gamma_{ji}$, to the read-out variable $\hat{x}_j$ ($\tau$ being the time since the spike).
[ Check:
Use an integrating factor to integrate Eqn(2).
$d(\mathbf{\hat{x}}\exp(\lambda_d t))=dt\mathbf{\Gamma}\mathbf{o}(t)\exp(\lambda_d t)$
$\mathbf{\hat{x}}(t)\exp(\lambda_d t)-\mathbf{\hat{x}}(0) = \int_0^t d\tau\mathbf{\Gamma}\mathbf{o}(\tau)\exp(\lambda_d \tau)$
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \exp(-\lambda_d t)\int_0^t d\tau\mathbf{\Gamma}\mathbf{o}(\tau)\exp(\lambda_d \tau)$
If $\mathbf{o}(t)$ has a single delta-function spike at $t'$ for neuron $k$, then:
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \exp(-\lambda_d t)\int_0^t\mathbf{\Gamma} \delta(\tau-t')\mathbf{e}_k\exp(\lambda_d \tau)d\tau$
where $\mathbf{e}_k$ is a vector with a $1$ at position $k$, and zeroes everywhere else.
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \exp(-\lambda_d t) \mathbf{\Gamma}\mathbf{e}_k\exp(\lambda_d t')H(t-t')$
where $H(t-t')$ is the Heaviside unit step function, which is needed since the integral evaluates to a non-zero quantity only if $t'$ lies in the interval $(0,t)$. Thus,
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \exp(-\lambda_d (t-t')) \mathbf{\Gamma}\mathbf{e}_k H(t-t')$.
]
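A quick numerical check of this single-spike solution, comparing forward-Euler integration of Eqn(2) against the closed form (all sizes, rates and the spike time below are arbitrary choices):

```python
import numpy as np

# Minimal numerical check of the single-spike solution of Eqn(2).
rng = np.random.default_rng(0)
J, N = 2, 5
lambda_d = 10.0          # decoder decay rate (1/s)
Gamma = rng.normal(scale=0.1, size=(J, N))
dt, T = 1e-4, 0.5
t = np.arange(0, T, dt)
k, t_spike = 3, 0.1      # neuron k spikes once at t_spike

# Forward-Euler integration of  x_hat' = -lambda_d*x_hat + Gamma*o(t),
# with the delta spike approximated by a pulse of height 1/dt in one time bin.
x_hat = np.zeros((len(t), J))
for i in range(1, len(t)):
    o = np.zeros(N)
    if abs(t[i] - t_spike) < dt / 2:
        o[k] = 1.0 / dt
    x_hat[i] = x_hat[i-1] + dt * (-lambda_d * x_hat[i-1] + Gamma @ o)

# Closed-form solution derived above (with x_hat(0) = 0 here)
closed = np.exp(-lambda_d * (t[:, None] - t_spike)) * (t[:, None] >= t_spike) * Gamma[:, k]

print("max abs difference:", np.max(np.abs(x_hat - closed)))
```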
In a rate formulation: $\hat{\mathbf{x}}=\frac{1}{\lambda_d}\mathbf{\Gamma}\mathbf{r}(t)$, where
$\dot{\mathbf{r}}(t) = -\lambda_d \mathbf{r}(t) + \lambda_d \mathbf{o}(t)$ ...Eqn(3)
[ Check:
Insert $\mathbf{r}(t) = \lambda_d \mathbf{\Gamma}^{-1}\hat{\mathbf{x}}$ into $\dot{\mathbf{r}}(t) = -\lambda_d \mathbf{r}(t) + \lambda_d \mathbf{o}(t)$ to get:
$\lambda_d \mathbf{\Gamma}^{-1}\dot{\mathbf{\hat{x}}} = -\lambda_d^2 \mathbf{\Gamma}^{-1}\mathbf{\hat{x}} + \lambda_d \mathbf{o}(t)$, which on left-multiplying by $\mathbf{\Gamma}/\lambda_d$ is the same as Eqn(2).
I think we need to assume that $\mathbf{\Gamma}$ is invertible for this step; but $\mathbf{\Gamma}$ is $J\times N$ and not square, so be careful here about left- vs right-(pseudo)inverses. See the second sub-section of Materials and Methods.
]
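A sketch checking the rate formulation numerically: integrating Eqn(2) and Eqn(3) side by side with the same (arbitrary, Poisson) spike train, $\hat{\mathbf{x}}$ and $\mathbf{\Gamma}\mathbf{r}/\lambda_d$ should coincide:

```python
import numpy as np

# Check that x_hat from Eqn(2) equals Gamma @ r / lambda_d with r from Eqn(3).
# Both use the same explicit-Euler step, so agreement is exact up to round-off.
rng = np.random.default_rng(1)
J, N = 2, 5
lambda_d = 10.0
Gamma = rng.normal(scale=0.1, size=(J, N))
dt, T = 1e-4, 1.0
steps = int(T / dt)

x_hat = np.zeros(J)   # integrates Eqn(2)
r = np.zeros(N)       # integrates Eqn(3)
max_err = 0.0
for _ in range(steps):
    o = (rng.random(N) < 20.0 * dt) / dt        # ~20 Hz Poisson spikes as 1/dt pulses
    x_hat += dt * (-lambda_d * x_hat + Gamma @ o)
    r += dt * (-lambda_d * r + lambda_d * o)
    max_err = max(max_err, np.max(np.abs(x_hat - Gamma @ r / lambda_d)))

print("max |x_hat - Gamma r / lambda_d|:", max_err)
```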
[ The readout seems to be rate-based? What if we cascade two such predictive coding systems? ]
[ What is the input to the $N$ neurons? ]
Assume the network greedily optimizes the cost function ($||\cdot||_1$ and $||\cdot||_2$ denote the L1 and L2 norms):
$E(t) = \int_0^t du \left(|| \mathbf{x}(u)-\mathbf{\hat{x}}(u)||_2^2 +\nu ||\mathbf{r}(u)||_1 +\mu ||\mathbf{r}(u)||_2^2 \right)~~$ ...Eqn(4)
by varying spike times, while the decoding matrix $\mathbf{\Gamma}$ is kept constant, unlike in liquid-computing approaches.
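A minimal discretized version of Eqn(4), just as a reference for the checks below (the function name `cost_E`, the regular time grid and the plain Riemann sum are my own choices):

```python
import numpy as np

# Discretized cost of Eqn(4): x, x_hat have shape (steps, J), r has shape (steps, N);
# nu, mu are the cost weights and dt the sampling step.  Rates are assumed nonnegative,
# so ||r||_1 is just the sum of components.
def cost_E(x, x_hat, r, nu, mu, dt):
    err = np.sum((x - x_hat) ** 2, axis=1)     # ||x - x_hat||_2^2 per time step
    l1 = np.sum(np.abs(r), axis=1)             # ||r||_1 per time step
    l2 = np.sum(r ** 2, axis=1)                # ||r||_2^2 per time step
    return dt * np.sum(err + nu * l1 + mu * l2)
```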
If the $k$-th neuron spikes at time $t$, then (cf. the 'Check' above)
$\mathbf{\hat{x}}(u)\to \mathbf{\hat{x}}(u) + \exp(-\lambda_d (u-t)) \mathbf{\Gamma}\mathbf{e}_k H(u-t)$
Define $h_d(\tau) = \exp(-\lambda_d \tau)H(\tau)$.
Thus,
$\mathbf{\hat{x}}(u)\to \mathbf{\hat{x}}(u) + \mathbf{\Gamma}\mathbf{e}_k h_d(u-t)$
Now, applying the same reasoning to Eqn(3) (a spike of neuron $k$ enters $\dot{\mathbf{r}}$ as $\lambda_d\mathbf{e}_k\delta(u-t)$; equivalently, formally, $\mathbf{r}(t) = \lambda_d \mathbf{\Gamma}^{-1}\hat{\mathbf{x}}$, with the invertibility caveat above),
$\mathbf{r}(u)\to \mathbf{r}(u) + \lambda_d\mathbf{e}_k h_d(u-t)$
Since optimization is greedy, neuron $k$ will spike only if
$E(t|k ~~\text{spikes})<E(t|k ~~\text{doesn't spike})$ ...Eqn(22).
For a time $t+\epsilon$, where $\epsilon>0$,
$ \int_0^{t+\epsilon} du \left(|| \mathbf{x}(u)-\mathbf{\hat{x}}(u) - \mathbf{\Gamma}\mathbf{e}_k h_d(u-t)||_2^2 +\nu ||\mathbf{r}(u) + \lambda_d\mathbf{e}_k h_d(u-t)||_1 +\mu ||\mathbf{r}(u) + \lambda_d\mathbf{e}_k h_d(u-t)||_2^2 \right) < \int_0^{t+\epsilon} du \left(|| \mathbf{x}(u)-\mathbf{\hat{x}}(u)||_2^2 +\nu ||\mathbf{r}(u)||_1 +\mu ||\mathbf{r}(u)||_2^2 \right) $
We use: $||\mathbf{a}||_2^2 = \mathbf{a}^T \mathbf{a}$; the notation $\mathbf{\Gamma}\mathbf{e}_k = \mathbf{\Gamma}_k$; $\mathbf{e}_k^T\mathbf{e}_k=1$; $\mathbf{\Gamma}_k^T\mathbf{x}=\mathbf{x}^T\mathbf{\Gamma}_k$, since it is a scalar; and that rates are non-negative (so the L1 norm is just the sum of components), to get:
$ \int_0^{t+\epsilon} du \left(\mathbf{\Gamma}_k^T\mathbf{\Gamma}_k h_d^2(u-t) - 2 \mathbf{\Gamma}_k^T(\mathbf{x}(u)-\mathbf{\hat{x}}(u)) h_d(u-t) +\nu \lambda_d h_d(u-t) + \mu \lambda_d^2 h_d^2(u-t) + 2 \mu \mathbf{e}_k^T\mathbf{r}(u)\lambda_d h_d(u-t) \right) < 0 $
All terms carry a Heaviside function switching on at $t$, so the integrals effectively start at $t_-$; moving terms between sides and multiplying by $-1$, we write:
$ \int_{t_-}^{t+\epsilon} du \left( 2 \mathbf{\Gamma}_k^T(\mathbf{x}(u)-\mathbf{\hat{x}}(u)) h_d(u-t) - 2 \mu \mathbf{e}_k^T\mathbf{r}(u)\lambda_d h_d(u-t) \right) > \int_{t_-}^{t+\epsilon} du \left( \mathbf{\Gamma}_k^T\mathbf{\Gamma}_k h_d^2(u-t) + \nu \lambda_d h_d(u-t) + \mu \lambda_d^2 h_d^2(u-t)\right) $
This spike rule requires looking into the future (!!!), but we assume the optimization is greedy, so that $\epsilon$ is small ($\lambda_d\epsilon\ll 1$). Then $h_d(u-t)\approx 1$ over the integration window, the integrands are approximately constant (evaluate them at $u=t$), and dividing through by $2\epsilon$:
$\mathbf{\Gamma}_k^T(\mathbf{x}(t)-\mathbf{\hat{x}}(t)) - \mu \mathbf{e}_k^T\mathbf{r}(t)\lambda_d > \frac{1}{2} \left( \mathbf{\Gamma}_k^T\mathbf{\Gamma}_k + \nu \lambda_d + \mu \lambda_d^2 \right)$ ...Eqn(29)
This is the decision to spike. Take the RHS, which is a constant, as the threshold $T_k$, and the LHS as the membrane potential $V_k(t)$ of the $k$-th neuron.
From above, we have:
$V_k(t) = \mathbf{\Gamma}_k^T(\mathbf{x}(t)-\mathbf{\hat{x}}(t)) - \mu \mathbf{e}_k^T\mathbf{r}(t)\lambda_d$ ...Eqn(6)
$T_k = \frac{1}{2} \left( \mathbf{\Gamma}_k^T\mathbf{\Gamma}_k + \nu \lambda_d + \mu \lambda_d^2 \right)$ ...Eqn(7)
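A minimal sketch of this greedy rule in action, where the signal $x(t)$ is simply handed to the rule (a 1 Hz sine) rather than generated by Eqn(1); all constants are arbitrary and untuned:

```python
import numpy as np

# Each step: evaluate V_k (Eqn 6) against T_k (Eqn 7) and let the most supra-threshold
# neuron spike; the spike instantly adds Gamma_k to x_hat and lambda_d*e_k to r
# (the kernel updates above).
rng = np.random.default_rng(4)
J, N = 1, 20
lambda_d, nu, mu = 10.0, 1e-4, 1e-5
Gamma = rng.choice([-1.0, 1.0], size=(J, N)) * 0.1                           # mixed-sign decoder
T_thr = 0.5 * (np.sum(Gamma**2, axis=0) + nu * lambda_d + mu * lambda_d**2)  # Eqn(7)

dt = 1e-4
t = np.arange(0.0, 2.0, dt)
x = np.sin(2 * np.pi * 1.0 * t)[:, None]                      # signal to encode

x_hat, r = np.zeros(J), np.zeros(N)
x_hat_log = np.zeros((len(t), J))
for i in range(len(t)):
    x_hat += dt * (-lambda_d * x_hat)      # Eqn(2) between spikes
    r += dt * (-lambda_d * r)              # Eqn(3) between spikes
    V = Gamma.T @ (x[i] - x_hat) - mu * lambda_d * r          # Eqn(6)
    k = np.argmax(V - T_thr)
    if V[k] > T_thr[k]:                    # spike only if it lowers the cost, Eqn(29)
        x_hat += Gamma[:, k]
        r[k] += lambda_d
    x_hat_log[i] = x_hat

print("mean squared tracking error:", np.mean((x - x_hat_log) ** 2))
```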
As a vector equation for the network of $N$ neurons:
$\mathbf{V}(t) = \mathbf{\Gamma}^T(\mathbf{x}(t)-\mathbf{\hat{x}}(t)) - \mu \mathbf{r}(t)\lambda_d$ ...Eqn(30)
Assume $N>J$ and that $\mathbf{\Gamma}$ has rank $J$ (i.e., the dynamical variables are not degenerate or linearly dependent on each other). Then the left pseudo-inverse of $\mathbf{\Gamma}^T$ exists: $\mathbf{L}=(\mathbf{\Gamma}\mathbf{\Gamma}^T)^{-1}\mathbf{\Gamma}$.
$\mathbf{x}$ is a $J\times 1$ column vector and $\mathbf{\Gamma}^T$ is $N\times J$, so only a left pseudo-inverse exists, satisfying $\mathbf{L}\mathbf{\Gamma}^T=\mathbf{I}_{J\times J}$.
Left-multiplying Eqn(30),
$\mathbf{x}(t)=\mathbf{L}\mathbf{V}(t)+\mathbf{\hat{x}}(t)+\mu\mathbf{L}\mathbf{r}(t)\lambda_d$ ...Eqn(32)
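A quick numpy check (random placeholders) that $\mathbf{L}$ is a left inverse of $\mathbf{\Gamma}^T$ and that Eqn(32) recovers $\mathbf{x}$ from $\mathbf{V}$, $\hat{\mathbf{x}}$ and $\mathbf{r}$:

```python
import numpy as np

# L is a left inverse of Gamma^T, and Eqn(32) inverts Eqn(30) for x.
rng = np.random.default_rng(5)
J, N = 3, 10
lambda_d, mu = 10.0, 1e-3
Gamma = rng.normal(size=(J, N))                     # full row rank J with probability 1
L = np.linalg.inv(Gamma @ Gamma.T) @ Gamma

print("L @ Gamma.T == I_J ?", np.allclose(L @ Gamma.T, np.eye(J)))

x, x_hat, r = rng.normal(size=J), rng.normal(size=J), rng.random(N)
V = Gamma.T @ (x - x_hat) - mu * lambda_d * r       # Eqn(30)
x_rec = L @ V + x_hat + mu * lambda_d * (L @ r)     # Eqn(32)
print("Eqn(32) recovers x ?", np.allclose(x, x_rec))
```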
Taking time derivative on Eqn(30),
$\dot{\mathbf{V}}(t) = \mathbf{\Gamma}^T(\dot{\mathbf{x}}(t)-\dot{\mathbf{\hat{x}}}(t)) - \mu \dot{\mathbf{r}}(t)\lambda_d$
Replace $\dot{\mathbf{x}}(t)$, $\dot{\mathbf{\hat{x}}}(t)$, and $\dot{\mathbf{r}}(t)$ by Eqns(1-3):
$\dot{\mathbf{V}}(t) = \mathbf{\Gamma}^T(\mathbf{A}\mathbf{x}(t)+\mathbf{c}(t) +\lambda_d \hat{\mathbf{x}}(t)-\mathbf{\Gamma}\mathbf{o}(t)) + \mu\lambda_d^2\mathbf{r}(t) - \mu\lambda_d^2\mathbf{o}(t)$
Replace $\mathbf{x}(t)$ by Eqn(32) and collect terms, using $\mathbf{\hat{x}}(t)=\frac{1}{\lambda_d}\mathbf{\Gamma}\mathbf{r}(t)$:
$\dot{\mathbf{V}}(t) = \mathbf{\Gamma}^T\mathbf{A}\mathbf{L}\mathbf{V}(t) + \left( \frac{1}{\lambda_d}\mathbf{\Gamma}^T\mathbf{A}\mathbf{\Gamma} + \mu\lambda_d\mathbf{\Gamma}^T\mathbf{A}\mathbf{L} +\mathbf{\Gamma}^T\mathbf{\Gamma} + \mu\lambda_d^2\mathbf{I} \right)\mathbf{r}(t) - \left(\mathbf{\Gamma}^T\mathbf{\Gamma}+\mu\lambda_d^2\mathbf{I}\right)\mathbf{o}(t) + \mathbf{\Gamma}^T\mathbf{c}(t)$ ...Eqn(36)
The second-to-last term includes a reset mechanism: each spike of neuron $k$ reduces its own voltage by $\mathbf{\Gamma}_k^T\mathbf{\Gamma}_k+\mu\lambda_d^2$ (and couples to the other neurons through $\mathbf{\Gamma}^T\mathbf{\Gamma}$).
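A numerical check of the substitution algebra behind Eqn(36), using random placeholder matrices and states:

```python
import numpy as np

# With x eliminated via Eqn(32) and x_hat = Gamma r / lambda_d, the RHS of Eqn(36)
# should equal Gamma^T(xdot - xhatdot) - mu*lambda_d*rdot built from Eqns(1)-(3).
rng = np.random.default_rng(6)
J, N = 3, 12
lambda_d, mu = 10.0, 1e-3
Gamma = rng.normal(scale=0.1, size=(J, N))
A = rng.normal(size=(J, J))
L = np.linalg.inv(Gamma @ Gamma.T) @ Gamma            # left pseudo-inverse of Gamma^T

x, c = rng.normal(size=J), rng.normal(size=J)
r, o = rng.random(N), rng.random(N)
x_hat = Gamma @ r / lambda_d                          # rate-formulation readout
V = Gamma.T @ (x - x_hat) - mu * lambda_d * r         # Eqn(30)

# Time derivative of Eqn(30), with xdot, xhatdot, rdot from Eqns(1)-(3)
Vdot_direct = (Gamma.T @ ((A @ x + c) - (-lambda_d * x_hat + Gamma @ o))
               - mu * lambda_d * (-lambda_d * r + lambda_d * o))

# Right-hand side of Eqn(36)
I_N = np.eye(N)
Vdot_36 = (Gamma.T @ A @ L @ V
           + (Gamma.T @ A @ Gamma / lambda_d + mu * lambda_d * Gamma.T @ A @ L
              + Gamma.T @ Gamma + mu * lambda_d**2 * I_N) @ r
           - (Gamma.T @ Gamma + mu * lambda_d**2 * I_N) @ o
           + Gamma.T @ c)

print("Eqn(36) matches the direct derivative:", np.allclose(Vdot_direct, Vdot_36))
```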
Which terms are important in Eqn(36) in the large $N$ limit?
$\mathbf{\hat{x}}=\mathbf{\Gamma}\mathbf{r}/\lambda_d$.
We want to keep individual neural firing rates constant [why? Why not encode as sparsely as possible? And as $N$ increases, I expect $J$ to increase too, encoding multiple/larger dynamical systems].
The individual elements of $\mathbf{r}(t)$ stay roughly constant, but the number of elements grows as $N$. On multiplying by the matrix $\mathbf{\Gamma}$, $N$ elements of $\mathbf{r}$ are summed, so $\mathbf{\Gamma}$ must scale as $1/N$ to keep $\mathbf{\hat{x}}$ constant.
This can be seen more easily for a homogeneous network with constant scalar $x$:
$r_{pop}=\frac{\hat{x}\lambda_d}{\Gamma}=Nr_{ind}$. So $\Gamma$ must scale as $1/N$ to keep $r_{ind}$ constant.
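(A concrete instance with arbitrary numbers: take $\hat{x}=1$ and $\lambda_d=10\,\mathrm{s}^{-1}$. With $\Gamma=1/N$, $r_{pop}=\hat{x}\lambda_d/\Gamma=10N$, so $r_{ind}=10$ Hz regardless of $N$; with $\Gamma=0.1$ fixed instead, $r_{ind}=100/N$ Hz, which shrinks as the network grows.)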
The cost terms $||\mathbf{r}(t)||_1$ and $||\mathbf{r}(t)||_2^2$ both scale as $N$, so $\nu$ and $\mu$ must decay at least as fast as $1/N$ to avoid the cost increasing with size. But most other terms involve $\mathbf{\Gamma}$ through $\mathbf{\Gamma}^T\mathbf{\Gamma}$ and thus go as $1/N^2$; hence we make $\nu$ and $\mu$ go as $1/N^2$ to maintain the relative contributions of the terms.
Since $T_k$ scales as $1/N^2$, so does $V_k(t)$: it is bounded above by $T_k$ and bounded below by the existence of neurons with opposing $\mathbf{\Gamma}$ elements [meaning that the negative $\mathbf{\Gamma}$ elements driving $V_k(t)$ down also give a bound of order $1/N^2$?].
Thus we can add a generic leak term $-\lambda_V \mathbf{V}(t)$ which will be negligible at large $N$, but may be detrimental at small $N$.
Eqn(36) is thus approximated by dropping $1/N^2$ scaling terms, but adding the leak term for biophysical reasons.
Comparing Eqn(36) with Eqns(37)-(39) below, the neglected terms are $\mathbf{\Gamma}^T\mathbf{A}\mathbf{L}\mathbf{V}(t)$, $\mu\lambda_d\mathbf{\Gamma}^T\mathbf{A}\mathbf{L}\mathbf{r}(t)$ and $\mu\lambda_d^2\mathbf{r}(t)$, which scale as $1/N^2$; the $\frac{1}{\lambda_d}\mathbf{\Gamma}^T\mathbf{A}\mathbf{\Gamma}\mathbf{r}(t)$ and $\mathbf{\Gamma}^T\mathbf{\Gamma}\mathbf{r}(t)$ terms are kept inside $\frac{1}{\lambda_d}\mathbf{\Omega}^s\mathbf{r}(t)$. Note that in $\mu\lambda_d^2\mathbf{r}(t)$, $\mathbf{r}(t)$ scales as $1$ (each element doesn't scale) while $\mu$ scales as $1/N^2$. [But, $\mathbf{\Gamma}\mathbf{r}(t)$ scales as $1/N^3$].
Thus,
$\dot{\mathbf{V}}(t) = -\lambda_V\mathbf{V}(t) + \frac{1}{\lambda_d}\mathbf{\Omega}^s\mathbf{r}(t) - \mathbf{\Omega}^f\mathbf{o}(t) + \mathbf{\Gamma}^T\mathbf{c}(t)$ ...Eqn(37)
$\mathbf{\Omega}^s = \mathbf{\Gamma}^T(\mathbf{A}+\lambda_d\mathbf{I})\mathbf{\Gamma}$ ...Eqn(38)
$\mathbf{\Omega}^f = \mathbf{\Gamma}^T\mathbf{\Gamma}+\mu\lambda_d^2\mathbf{I}$ ...Eqn(39)
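A bookkeeping check that Eqn(36) equals Eqn(37) (with $\lambda_V=0$) plus exactly the dropped terms listed above, again with random placeholders:

```python
import numpy as np

# Eqn(36) == Eqn(37) (lambda_V = 0) + the three dropped terms.
rng = np.random.default_rng(2)
J, N = 3, 12
lambda_d, mu = 10.0, 1e-3
Gamma = rng.normal(scale=0.1, size=(J, N))
A = rng.normal(size=(J, J))
L = np.linalg.inv(Gamma @ Gamma.T) @ Gamma            # left pseudo-inverse of Gamma^T
r, o, V, c = rng.random(N), rng.random(N), rng.normal(size=N), rng.normal(size=J)

Omega_s = Gamma.T @ (A + lambda_d * np.eye(J)) @ Gamma          # Eqn(38)
Omega_f = Gamma.T @ Gamma + mu * lambda_d**2 * np.eye(N)        # Eqn(39)

rhs36 = (Gamma.T @ A @ L @ V
         + (Gamma.T @ A @ Gamma / lambda_d + mu * lambda_d * Gamma.T @ A @ L
            + Gamma.T @ Gamma + mu * lambda_d**2 * np.eye(N)) @ r
         - Omega_f @ o + Gamma.T @ c)
rhs37 = Omega_s @ r / lambda_d - Omega_f @ o + Gamma.T @ c       # lambda_V = 0
dropped = (Gamma.T @ A @ L @ V + mu * lambda_d * Gamma.T @ A @ L @ r
           + mu * lambda_d**2 * r)

print("Eqn(36) == Eqn(37) + dropped terms:", np.allclose(rhs36, rhs37 + dropped))
```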
The superscripts $s$ (slow) and $f$ (fast) make sense since $\mathbf{r}(t)$ is a low-pass filtered version of the spike train $\mathbf{o}(t)$, as shown below.
We integrated Eqn(2) above to get:
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \exp(-\lambda_d t)\int_0^t d\tau\mathbf{\Gamma}\mathbf{o}(\tau)\exp(\lambda_d \tau)$
Rewriting:
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \mathbf{\Gamma}\int_0^t d\tau\mathbf{o}(\tau)\exp(-\lambda_d (t-\tau))$
[Aside: the spike train is $\mathbf{o}(t)=\sum_k\sum_{t'\in\{t_k\}} \mathbf{e}_k\delta(t-t')$, where $\{t_k\}$ is the set of spike times of neuron $k$. ]
Rewriting using $h_d^1(\tau) = \exp(-\lambda_d \tau)$:
[My $h_d(\tau) = \exp(-\lambda_d \tau)H(\tau)$ has a Heaviside function, but theirs doesn't, so I call theirs $h_d^1(\tau)$. The Heaviside function appears post-convolution, i.e. after evaluating the integral, since the integral evaluates to $0$ when the spike time $t'$ is not in the interval $(0,t)$.]
$\mathbf{\hat{x}}(t) = \mathbf{\hat{x}}(0)\exp(-\lambda_d t) + \mathbf{\Gamma}(h_d^1(t)*\mathbf{o})(t)$
Note that here the convolution is over the finite interval $[0,t]$:
$(f*g)(t) \equiv \int_0^t f(t-\tau)g(\tau)d\tau = \int_0^t f(\tau)g(t-\tau)d\tau$ (by the substitution $\tau\to t-\tau$), so it is symmetric in $f$ and $g$.
Eqn(3) has the same form as Eqn(2) with $\mathbf{\Gamma}$ replaced by $\lambda_d\mathbf{I}$, so integrating it in the same way gives:
$\mathbf{r}(t)=\mathbf{r}(0)\exp(-\lambda_d t)+\lambda_d (h_d^1(t)*\mathbf{o})(t)$
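A sketch verifying this numerically: the recursive per-bin solution of Eqn(3) (an arbitrary Poisson train, spikes placed at their sample times) should coincide with the discrete convolution $\lambda_d(h_d^1*\mathbf{o})$:

```python
import numpy as np

# r(t) of Eqn(3) is the spike train low-pass filtered by h_d^1(tau) = exp(-lambda_d*tau).
rng = np.random.default_rng(3)
N, lambda_d = 4, 10.0
dt, T = 1e-4, 1.0
t = np.arange(0, T, dt)
o = (rng.random((len(t), N)) < 20.0 * dt) / dt        # ~20 Hz Poisson trains, 1/dt pulses

# Recursive per-bin solution of Eqn(3): decay over dt, then add the spike at the sample time
r = np.zeros((len(t), N))
r[0] = lambda_d * dt * o[0]
for i in range(1, len(t)):
    r[i] = np.exp(-lambda_d * dt) * r[i-1] + lambda_d * dt * o[i]

# Discrete convolution with the kernel h_d^1
h = np.exp(-lambda_d * t)
r_conv = np.stack(
    [lambda_d * dt * np.convolve(o[:, j], h)[:len(t)] for j in range(N)], axis=1)

print("max |recursive - convolution|:", np.max(np.abs(r - r_conv)))
```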
Also, by the sifting property of the delta function:
$(\delta * \mathbf{o})(t)=\int_0^t\delta(t-\tau)\mathbf{o}(\tau)d\tau = \mathbf{o}(t)$.
Using above results, re-write Eqn(37) in a compact notation:
$\dot{\mathbf{V}}(t) = -\lambda_V \mathbf{V}(t)+(\mathbf{W}*\mathbf{o})(t) + \mathbf{\Gamma}^T \mathbf{c}(t)$ ...Eqn(41)
defining $\mathbf{W}(u)=\mathbf{\Omega}^s h_d^1(u) - \mathbf{\Omega}^f \delta(u)$ ...Eqn(40), and neglecting the initial-condition transient $\frac{1}{\lambda_d}\mathbf{\Omega}^s\mathbf{r}(0)\exp(-\lambda_d t)$.
Can add a noise term to get a vectorised version of Eqn(8).
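A minimal end-to-end sketch putting Eqns(37)-(39), the threshold of Eqn(7) and the decoder together. The network size, $\mathbf{\Gamma}$, $\mathbf{A}$, the command $\mathbf{c}(t)$ and all constants are arbitrary and untuned, and at most one neuron spikes per time step, so this only illustrates the moving parts rather than reproducing the paper's results:

```python
import numpy as np

# The true system (Eqn 1) is driven by a command c(t); the network receives only
# Gamma^T c(t) plus its own recurrence (Eqn 37), and the decoded estimate
# x_hat = Gamma r / lambda_d is compared to x.
rng = np.random.default_rng(7)
J, N = 2, 40
lambda_d, lambda_V = 10.0, 1.0
nu, mu = 1e-5, 1e-6
Gamma = rng.normal(scale=0.1, size=(J, N))                       # fixed decoder
A = np.array([[-1.0, -6.0], [6.0, -1.0]])                        # damped rotation
Omega_s = Gamma.T @ (A + lambda_d * np.eye(J)) @ Gamma           # Eqn(38), slow
Omega_f = Gamma.T @ Gamma + mu * lambda_d**2 * np.eye(N)         # Eqn(39), fast/reset
T_thr = 0.5 * (np.sum(Gamma**2, axis=0) + nu * lambda_d + mu * lambda_d**2)   # Eqn(7)

dt, T_end = 1e-4, 2.0
steps = int(T_end / dt)
x = np.zeros(J)                       # true state; x_hat(0) = x(0) = 0
V, r = np.zeros(N), np.zeros(N)
err = np.zeros(steps)
for i in range(steps):
    c = np.array([10.0, 0.0]) if i * dt < 1.0 else np.zeros(J)   # command: 1 s pulse
    x += dt * (A @ x + c)                                        # Eqn(1)
    # Eqn(37) between spikes: leak, slow recurrence, feedforward input
    V += dt * (-lambda_V * V + Omega_s @ r / lambda_d + Gamma.T @ c)
    r += dt * (-lambda_d * r)                                    # Eqn(3) between spikes
    k = np.argmax(V - T_thr)
    if V[k] > T_thr[k]:                # greedy spike rule, Eqn(29)
        V -= Omega_f[:, k]             # fast reset / lateral connections
        r[k] += lambda_d               # spike increments the filtered rate
    err[i] = np.sum((x - Gamma @ r / lambda_d) ** 2)             # ||x - x_hat||_2^2

print("mean squared decoding error:", err.mean())
```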