Use a cost function that is the mod-squared difference of an external quantity vector and its prediction vector (+ spike cost terms). The difference vector pre-multiplied by the readout weight matrix^T (- rate penalty), gives the membrane potentials of a layer of spiking LIF output neurons that thus code predictively / fire spikes to greedy-minimize the cost function.
The prediction spike train from output neurons is decoded by another virtual layer (may or may not be present) as a rate.
The input layer encodes a 'stimulus' which is the input to the output layer with its feedforward weights. Currently only instantaneous response to stimulus seems allowed [Boerlin and Deneve 2011]?
The external world quantity being predicted is not necessarily the stimulus i.e. network is not just an autoencoder. The quantity being predicted can be a log posterior probability of the stimulus [Deneve 2007, Boerlin and Deneve 2011] or dynamical variables that change with the stimulus [Boerlin et al 2013].
The recurrent weights are to be learnt as some function of: (a) constants in the dynamical system, (b) input neuron tuning filters (no time dependence allowed), (b) readout weights, (c) spatial derivatives of readout weights (for log-posterior), and (c) constants in the cost fuction and time constants in neural dynamics. It is claimed that gradient descent on the spiking rule cost function with respect to the recurrent weights is sufficient to learn them. But currently only the fast recurrent connections are learnt for an autoencoder with further constraint on neural decoding time constant [Bourdoukan et al 2012].
Assume an external dynamical system $\dot{\mathbf{x}}(t)=\mathbf{A}\mathbf{x}(t)+\mathbf{c}(t)$. An output readout layer rate codes the $J$ dynamical variables $\mathbf{x}(t)$ as $\dot{\mathbf{\hat{x}}}=-\lambda_d \mathbf{\hat{x}} + \mathbf{\Gamma} \mathbf{o}(t)$. The recurrent network weights are derived from the readout weights $\mathbf{\Gamma}$, the desired dynamics $\mathbf{A}$ and the penalty scalar $\mu$ as
$\mathbf{W}=\mathbf{\Omega}^s h_d^1(u) - \mathbf{\Omega}^f \delta(u)$ ...Eqn(40)
$\mathbf{\Omega}^s = \mathbf{\Gamma}^T(\mathbf{A}+\lambda_d\mathbf{I})\mathbf{\Gamma}$ ...Eqn(38)
$\mathbf{\Omega}^f = \left(\mathbf{\Gamma}^T\mathbf{\Gamma}+\mu\lambda_d^2\right)$ ...Eqn(39)
The recurrent network neurons receive the same input $\mathbf{c}(t)$ as the dynamical system, and spike only 'as much as is necessary' for the readout to track the dynamical variables $\mathbf{x}(t)$.
Cost function: $E(t) = \int_0^t du \left(|| \mathbf{x}(u)-\mathbf{\hat{x}}(u)||_2^2 +\nu ||\mathbf{r}(u)||_1 +\mu ||\mathbf{r}(u)||_2^2 \right)~~$ ...Eqn(4)
$\mathbf{V}(t) = \mathbf{\Gamma}^T(\mathbf{x}(t)-\mathbf{\hat{x}}(t)) - \mu \mathbf{r}(t)\lambda_d$
$T_k = \frac{1}{2} \left( \mathbf{\Gamma}_k^T\mathbf{\Gamma}_k + \nu \lambda_d + \mu \lambda_d^2 \right)$
$\dot{\mathbf{V}}(t) = -\lambda_V\mathbf{V}(t) + \frac{1}{\lambda_d}\mathbf{\Omega}^s\mathbf{r}(t) - \mathbf{\Omega}^f\mathbf{o}(t) + \mathbf{\Gamma}^T\mathbf{c}(t)$ ...Eqn(37)
Noise model:??
Plasticity rule can be derived from the same cost function as the spiking rule.
Autoencoder here; not encoding a dynamical system (see just below its Eqn(4)):
$\dot{\mathbf{x}}(t)=-\mathbf{x}(t)+\mathbf{c}(t)$
Thus $A=-\mathbf{I}$ in the formalism above.
Also they've taken $\nu=0$ in the cost function throughout.
$\dot{\mathbf{V}}(t) = -\mathbf{V}(t) - \mathbf{\Omega}\mathbf{o}(t) + \mathbf{\Gamma}^T\mathbf{c}(t)$ ...Eqn(20,15,6)
Different equations belong to different sections which build from one neuron scalar $x$, to multi-neuron scalar $x$, to multi-neuron vector $\mathbf{x}$.
[ Insert $A=-\mathbf{I}$ in Eqn(37-39) of above Boerlin et al 2013 paper to get:
$\dot{\mathbf{V}}(t) = -\lambda_V\mathbf{V}(t) + (-\frac{1}{\lambda_d}+1)\mathbf{\Gamma}^T\mathbf{\Gamma} \mathbf{r}(t) - \left(\mathbf{\Gamma}^T\mathbf{\Gamma}+\mu\lambda_d^2\mathbf{I}\right)\mathbf{o}(t) + \mathbf{\Gamma}^T\mathbf{c}(t)$
Output read-out layer in this case has $\lambda_d=1$, so the second term goes to zero. Only the fast inhibition, here $\mathbf{\Omega}=\left(\mathbf{\Gamma}^T\mathbf{\Gamma}+\mu\mathbf{I}\right)$, remains. But I expect the slow inhibition to be important for learning?? ]
Here the readout weights $\mathbf\Gamma$ are constant, and $\mathbf{\Omega}$ will be learnt to be equal to its above final value.
Assume $\mathbf{\Omega}$ changes much more slowly than $\mathbf{V}$, and integrate and get the average membrane potentials:
$\mathbf{V}(t)-\mathbf{V}(0)\exp(-t) = \mathbf{\Gamma}^T\mathbf{x}-\mathbf{\Omega}(h*\mathbf{o})(t)$. Ignoring initial value which goes away at long time, we can write:
$\mathbf{V}(t) = \mathbf{\Gamma}^T\mathbf{x}-\mathbf{\Omega}\bar{\mathbf{o}}(t)$ ...Eqn(6)
$\bar{\mathbf{o}} \equiv (h*\mathbf{o})(t) \equiv [h* o_i]$ where $h(\tau)=\theta(\tau)\exp (-\tau)$. Note that this means $\dot{\bar{\mathbf{o}}}=-\bar{\mathbf{o}}+\mathbf{o}$.
Also, the readout $\dot{\hat{\mathbf{x}}}=-\hat{\mathbf{x}}+\mathbf{\Gamma} \mathbf{o}(t)$. Integrating, $\hat{\mathbf{x}}(t)-\hat{\mathbf{x}}(0)\exp(-t) = \mathbf{\Gamma}(h*\mathbf{o})(t) = \mathbf{\Gamma}\bar{\mathbf{o}}(t)$. Ignoring decaying initial value, we have $\hat{\mathbf{x}}(t) = \mathbf{\Gamma}\bar{\mathbf{o}}(t)$.
From Eqn(20),
$\mathbf{x}(t)=\left(\mathbf{\Gamma}^T\right)^{-1}\left(\mathbf{V}(t)+\mathbf{\Omega}\bar{\mathbf{o}}\right)$
Use the same cost function as for spiking [since we use greedy minimization, it is done at each $\Delta t$, so I guess the integration is not needed.]:
$L = ||\mathbf{x}(t)-\mathbf{\hat{x}}(t)||_2^2 +\mu ||\mathbf{r}(t)||_2^2$
Using above equations:
$L = ||\left(\mathbf{\Gamma}^T\right)^{-1}\left(\mathbf{V}(t)+\mathbf{\Omega}\bar{\mathbf{o}}(t)\right)-\mathbf{\Gamma}\bar{\mathbf{o}}(t)||_2^2 +\mu ||\mathbf{r}(t)||_2^2 = ||\left(\mathbf{\Gamma}^T\right)^{-1}\left(\mathbf{V}(t)+\mathbf{\Omega}\bar{\mathbf{o}}(t)-\mathbf{\Gamma}^T\mathbf{\Gamma}\bar{\mathbf{o}}(t)\right)||_2^2 +\mu ||\mathbf{r}(t)||_2^2$
[ Will the minimum be reached when $\mathbf{\Omega}=\left(\mathbf{\Gamma}^T\mathbf{\Gamma}+\mu\mathbf{I}\right)$? Is this required by consistency? But using this $\mathbf{\Omega}$ in above $L$ does not make the $\bar{\mathbf{o}}$ terms go away except for the case when $\mu=0$.
Anyway, they've replaced the above cost function by $L_V = ||\mathbf{V}||^2/2$ which they claim has the same minimum as $L$. Does it in the case when $\mu\ne 0$? ]
Calculate gradient:
$\frac{\partial L_V}{\partial \Omega_{ij}}=\sum_k V_k\frac{\partial V_k}{\partial \Omega_{ij}} = -\sum_{kl}V_k \frac{\partial\Omega_{kl}}{\partial\Omega_{ij}}\bar{o}_l - \sum_{kl}V_k\Omega_{kl}\frac{\partial\bar{o}_l}{\partial\Omega_{ij}} = -V_i\bar{o}_j - \sum_{kl}V_k\Omega_{kl}\frac{\partial\bar{o}_l}{\partial\Omega_{ij}}$
The second summation can be neglected as it is averaged out under realistic conditions, and introducing a time constant of learning the $\mathbf{\Omega}$ via gradient descent:
$\tau\dot{\Omega}_{ij}=-\frac{\partial L_V}{\partial \Omega_{ij}}=V_i\bar{o}_j$
This is a Hebbian plasticity rule, whereby connectivity changes in proportion to presynaptic firing rate $\bar{o}_j$ and post-synaptic membrane potential $V_i$ [note these are recurrent connections in the same population.].
Assume stimulus evolves as as drift-diffusion process $dx_t=\delta dt+ \sigma dW_t$, where $W$ is a Wiener process.
Response likelihood of the input neurons is assumed to belong to an exponential family with sufficient linear statistics:
$p(\mathbf{S}^n_{t}|x_t)=\Phi^n(\mathbf{S}^n_{t})\Psi^n(x_t)\exp\left(\sum_j H^n_j(x_t)S^n_{t,j}\right)$
where $\Phi^n(\mathbf{S}^n_{t})$ and $\Psi^n(x_t)$ are arbitrary functions, and
$\mathbf{H}^n(x_t)'=\Sigma^{-1}(x_t)\mathbf{f}^n(x_t)' = \mathbf{\Gamma}'(x_t)$ ...Eqn(14 & 30) [Weiji Ma, et al, Nat Neurosci, 2006],
where $\mathbf{f}^n(x_t)$ are the neurons' tuning curves, and $\Sigma(x_t)$ is the spike count covariance matrix.
Figure 1 from Boerlin Deneve 2011, PLOS Comp Biol.
The output neurons above are online decoded by a set of decoder/readout neurons (not shown -- explicit neural layer may not be required for perceptual or motor tasks), each of which codes for the log posterior $L_i(t)\equiv l(x_t,t)|_{x_t=x_i}$ of its preferred value $x_i$ in the discretized stimulus space $(x_1,x_2,...,x_N)$.
$\Gamma_{ij}=\Gamma_j(x_i)$.
The decoder can combine spike trains from multiple sensory areas generating a sum of log-posteriors, in effect performing a product of posterior probabilities.
The decoder neurons approximate $L_i(t)$ by $G_i(t)$, and follow the dynamics:
$\dot{G_i}(t)=-\lambda G_i +\sum_{ij}\Gamma_{ij}O_j(t)$ ...Eqn(20),
where $O_j(t)$ is the spike train of the output layer.
The dynamics of $L_i(t)$ are calculated assuming an ideal observer and assuming responses of input neurons are independent of each other and depend only on current stimulus location [But the input may be an integral of locations, There is often delayed lateral inhibition too. The input is a filtered version of the stimulus over all time, as in figure above?!] as Eqn (19).
An approximation of the dynamics assuming $L_i(t)\approx G_i(t)$ is:
$\dot{L_i}(t)=-\lambda L_i(t)+Y_i(t)+Z_i^2(t)+I_i(t)$.
where $I_i(t)$ is the input current to decoder neuron $i$ at time $t$. $Y_i(t)$ and $Z_i(t)$ defined below.
$Y_i(t)\equiv \lambda G_i(t) -\delta \partial_x G_i(t) +\frac{\sigma^2}{2} \partial_{xx} G_i(t)$
$Z_i(t) \equiv \frac{\sigma}{\sqrt{2}}\partial_x G_i(t)$
Using Eqn(20), we can get the dynamical equations for $Y_i(t)$ and $Z_i(t)$ involving the stimulus dynamics constants, and the output weights $\Gamma_{ij}$ and their spatial derivatives. Integrating these dynamical equations, one should be able to get $Y_i(t)$ and $Z_i(t)$ as functions of above terms.
Cost function is $\sum_j(L_j(t)-G_j(t))^2$. Only those output neurons fire whose spike causes a kernel that brings $\mathbf{L}$ closer to $\mathbf{G}$.
This results in the firing condition:
$\sum_j \Gamma_{ji} (L_j(t)-G_j(t)) > \sum_j \Gamma_{ji}^2/2$
where the LHS can be construed as the membrane potential of an LIF neuron and the RHS as the threshold.
The membrane potential of the output neurons follows the dynamics:
$\dot{V_i}(t) = -\lambda V_i(t) +\sum_n\sum_j\{[\mathbf{\Gamma}^T \mathbf{H}^n]_{ij} S^n_j(t)-\Gamma_{ij}^T\log\Psi^n_i\}-\sum_{j\ne i}[\mathbf{\Gamma}^T\mathbf{\Gamma}]_{ij}O_j(t)+U_i(\mathbf{O},t)$
where slow currents $U_i(\mathbf{O},t) = Y_i(t)+\sum_j \Gamma^T_{ij}Z_j(t)^2$.
[ The second term involving dual summation is the input to the output LIF neurons (see figure also). Third and fourth terms are recurrent connections. The current $I_i(t)$ doesn't appear here, only in the decoder layer.
What are the learned quantities? $\Gamma$ is learnt from the input tunings (Eqn(30)). $U_i$ had $Y_i$ which involves $\delta$ and $\sigma$. Do they need to be learnt? ]
In [ ]: