In the previous post, we looked at Hessian-free optimization, a powerful optimization technique for training deep neural networks. In the next several posts, I'm going to look into the implementation details of deep convolutional networks.
I'm going to begin by reviewing simple fully connected neural networks, re-deriving the backpropagation algorithm for computing the error gradient, and using a clever method (Pearlmutter, 1994) to derive an algorithm for computing the product of the Hessian of the neural network error with an arbitrary vector.
Recall that a standard fully-connected neural network of $L$ layers has three types of layers: an input layer, whose activations are set externally; some number of hidden layers; and an output layer.
The neural network learns by adjusting a set of weights, $w_{ij}^\ell$, where $w_{ij}^\ell$ is the weight on the connection from the output of unit $u_i^\ell$ to unit $u_j^{\ell+1}$ in the next layer.
The output of the neural network is the output of the last layer, $u^L$. We use the following terminology: $x_i^\ell$ is the input to unit $i$ in layer $\ell$, $y_i^\ell$ is its activation (output), $I_i^\ell$ is an externally supplied input (used to set the input layer), and $\sigma$ is the unit nonlinearity.
In order to compute a unit's output from its input, we simply apply some nonlinearity $\sigma(x)$ to the input. The input to the first layer, as we will see, is simply set externally, while the inputs to subsequent layers are computed as weighted sums of the outputs of the previous layer.
This gives us the following forward propagation algorithm to compute the output of a neural network:
Forward Propagation:
Compute activations for layers with known inputs: $$y_i^\ell = \sigma(x_i^\ell) + I_i^\ell$$
Compute inputs for the next layer from these activations: $$x_i^\ell = \sum_j w_{ji}^{\ell - 1} y_j^{\ell - 1}$$
Repeat steps 1 and 2 until you reach the output layer and know the values of $y^L$.
As we defined earlier, $x_i^\ell$ is the input to a given unit, whereas $y_i^\ell$ is the activation, or output, of that unit. In these equations, when a unit has no inputs, the $\sigma(x)$ term in $y_i^\ell$ becomes a constant, allowing the unit to be set completely externally via the value $I_i^\ell$. This corresponds to the input layer in the neural network. Thus, we start with some set activations at the input layer, compute inputs to the neurons at the next layer, use the nonlinearity to get their activations, and continue propagating values until we reach the output layer. At that point, we can compute the final activations $y^L$.
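To make this concrete, here is a minimal NumPy sketch of the forward pass. The helper names (`sigma`, `forward`), the choice of the logistic sigmoid as the nonlinearity, and the convention of setting the input-layer activations directly from the external inputs are all illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def sigma(x):
    # Illustrative choice of nonlinearity: the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, inputs):
    """Forward propagation for a fully connected network.

    weights[l][i, j] plays the role of w_{ij}^l, the weight from unit i in
    layer l to unit j in layer l + 1.  `inputs` is the externally set
    activation vector of the input layer (the I^0 term).
    Returns the per-layer inputs x^l and activations y^l.
    """
    xs = [None]                               # the input layer has no weighted input
    ys = [np.asarray(inputs, dtype=float)]    # y^0 is set externally
    for W in weights:
        x = ys[-1] @ W                        # x_i^{l+1} = sum_j w_{ji}^l y_j^l
        xs.append(x)
        ys.append(sigma(x))                   # y_i^{l+1} = sigma(x_i^{l+1})
    return xs, ys
```

For a network with one hidden layer, for instance, `forward([W0, W1], x0)` returns the inputs and activations at the hidden and output layers, with `ys[-1]` playing the role of $y^L$.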
In order to tell how well the neural network is doing, we compute the error $E(y^L)$. This error function can be a number of different things, such as binary cross-entropy or sum of squared residuals. However, we require that the derivative $\frac{dE(y^L)}{dy_i^L}$ depends only on $y_i^L$. This is the case with the functions listed previously, and effectively means that our error must be computed per-output and summed. This rules out a lot of more complex error functions that would be harder to learn with.
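For example, with targets $t_i$, the sum of squared residuals satisfies this requirement: $$E(y^L) = \frac{1}{2}\sum_i \left(y_i^L - t_i\right)^2, \qquad \frac{dE(y^L)}{dy_i^L} = y_i^L - t_i,$$ so each component of the derivative depends only on its own output $y_i^L$ (and a fixed target).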
The purpose of being able to compute the error, of course, is to be able to optimize the weights to minimize the error; that is, the process of learning. We learn via an algorithm known as backpropagation, which we can derive in a similar manner to forward propagation. In order to use gradient descent (or another algorithm) to train our network, we need to compute the derivative of the error with respect to each weight. Using the chain rule, we get that $$\frac{\partial E}{\partial w_{ij}^\ell} = \frac{\partial E}{\partial x_j^{\ell + 1}}\frac{\partial x_j^{\ell + 1}}{\partial w_{ij}^\ell}$$ Note that we only get a contribution from $x_j^{\ell +1}$ since that weight appears nowhere else. Looking at the equation for forward propagation $\left(x_i^\ell = \sum_j w_{ji}^{\ell - 1} y_j^{\ell - 1}\right)$, we see that the partial with respect to any given weight is just the activation from its origin neuron. Thus, the chain rule above becomes $$\frac{\partial E}{\partial w_{ij}^\ell} = y_i^\ell \frac{\partial E}{\partial x_j^{\ell+1}}$$ We already know all the values of $y$, so we just need to compute the partial with respect to the input $x_j$. However, we know that $y_i^\ell = \sigma(x_i^\ell) + I_i^\ell$, so we can once more use the chain rule to write $$\frac{\partial E}{\partial x_j^\ell} = \frac{\partial E}{\partial y_j^\ell}\frac{\partial y_j^\ell}{\partial x_j^\ell}= \frac{\partial E}{\partial y_j^\ell}\frac{\partial}{\partial x_j^\ell}\left(\sigma(x_j^\ell) + I_j^\ell\right) = \frac{\partial E}{\partial y_j^\ell} \sigma'(x_j^\ell)$$
Again, $y_j^\ell$ is the only expression in which we ever see an $x_j^\ell$ term, so it is the only contribution to the chain rule. The only bit we have yet to derive is the derivative with respect to the activation $y_i^\ell$. If $\ell = L$ (that is, we're looking at the output layer), then we know that the partial is just the derivative of the error function, which is directly a function of those activations: $$\frac{\partial E}{\partial y_i^L} = \frac{d}{d y_i^L} E(y^L).$$ As we discussed earlier, we require that this derivative is just a function of $y_i^L$ and none of the other activations in the output layer.
Finally, if we are not looking at the output layer, then we simply use the chain rule once more: $$\frac{\partial E}{\partial y_i^\ell} = \sum_j \frac{\partial E}{\partial x_j^{\ell + 1}} \frac{\partial x_j^{\ell + 1}}{\partial y_i^\ell} = \sum_j \frac{\partial E}{\partial x_j^{\ell + 1}} w_{ij}^\ell.$$ Unlike the previous two applications, we see that $y_i^\ell$ appears in many expressions throughout the neural network (namely, the input to every unit in the next layer). Applying the chain rule, we sum over all these contributions, and find that what we get is the derivatives of the inputs to the next layer, weighted by how much $y_i^\ell$ matters to each input. Intuitively speaking, this means that the error at a particular node in layer $\ell$ is the combination of errors at the nodes in the next layer (layer $\ell + 1$), weighted by the size of the contribution of the node in layer $\ell$ to each of those nodes in layer $\ell +1$.
These equations are complete, and allow us to compute the gradient of the error (the partial with respect to all of the weights). The full algorithm follows.
Backpropagation:
Compute errors at the output layer $L$: $$\frac{\partial E}{\partial y_i^L} = \frac{d}{d y_i^L} E(y^L)$$
Compute the partial derivative of the error with respect to each neuron's input (sometimes known as the "deltas") at the current layer $\ell$ (starting with $\ell = L$), whose errors $\frac{\partial E}{\partial y_j^\ell}$ are known: $$\frac{\partial E}{\partial x_j^\ell} = \sigma'(x_j^\ell)\frac{\partial E}{\partial y_j^\ell}$$
Compute errors at the previous layer (backpropagate the errors): $$\frac{\partial E}{\partial y_i^\ell} =\sum_j w_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}}$$
Repeat steps 2 and 3 until deltas are known at all but the input layer.
Compute the gradient of the error (derivative with respect to weights): $$\frac{\partial E}{\partial w_{ij}^\ell} = y_i^\ell \frac{\partial E}{\partial x_j^{\ell+1}}$$ Note that in order to compute derivatives with respect to weights in a given layer, we use the activations in that layer and the deltas for the next layer. Thus, we never need to compute deltas for the input layer.
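Continuing the NumPy sketch from the forward pass, backpropagation translates almost line-for-line into code. `sigma_prime` and `backprop` are illustrative helpers; the caller supplies the output-layer errors $\frac{\partial E}{\partial y_i^L}$ (for the squared-error example above, $y^L - t$).

```python
def sigma_prime(x):
    # Derivative of the logistic sigmoid used in the forward-pass sketch.
    s = sigma(x)
    return s * (1.0 - s)

def backprop(weights, xs, ys, dE_dyL):
    """Backpropagation, following the equations above.

    xs, ys come from forward(); dE_dyL is the vector of output-layer errors
    dE/dy_i^L.  Returns the deltas dE/dx^l for each non-input layer and the
    gradient dE/dw_{ij}^l for every weight matrix.
    """
    L = len(weights)                               # index of the output layer
    deltas = [None] * (L + 1)
    dE_dy = dE_dyL
    for l in range(L, 0, -1):
        deltas[l] = sigma_prime(xs[l]) * dE_dy     # dE/dx_j^l = sigma'(x_j^l) dE/dy_j^l
        dE_dy = weights[l - 1] @ deltas[l]         # dE/dy_i^{l-1} = sum_j w_{ij}^{l-1} dE/dx_j^l
    grads = [np.outer(ys[l], deltas[l + 1])        # dE/dw_{ij}^l = y_i^l dE/dx_j^{l+1}
             for l in range(L)]
    return deltas, grads
```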
In the previous post, we saw an optimization method that needed to be able to compute the Hessian (matrix of second derivatives) times some vector. Computing the full Hessian is incredibly computationally intensive, and using finite differences (as discussed there) can be numerically unstable. Thus, in order to use such methods to train neural networks, we want a better way to compute the Hessian-vector product $Hv$.
It turns out that given a forward and backward propagation algorithm for computing a gradient, we can use these algorithms to derive a Hessian-vector product algorithm. The method for deriving this algorithm is known as the $\mathcal{R}\{\cdot\}$ method.
Let $H$ be the Hessian of our error function with respect to the weights. As we showed earlier, the Hessian-vector product is actually just a directional derivative of the gradient (here $x$ stands for the full weight vector): $$Hv = \lim_{\varepsilon\to 0} \frac{\nabla E(x + \varepsilon v) - \nabla E(x)}{\varepsilon}.$$ By the definition of the derivative, we find that another way to write this without limits is $$Hv = \frac{\partial}{\partial \varepsilon} \nabla E(x + \varepsilon v){\huge\mid}_{\varepsilon = 0}$$
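Before deriving the exact algorithm, note that the limit above already suggests a (numerically fragile) finite-difference approximation that reuses the forward and backprop sketches from earlier; it is handy later as a correctness check. The helper name `hessian_vector_fd`, the squared-error assumption, and `targets` are all illustrative.

```python
def hessian_vector_fd(weights, v, inputs, targets, eps=1e-5):
    """Approximate Hv as (grad E(w + eps*v) - grad E(w)) / eps.

    Assumes squared error E = 1/2 * sum((y^L - t)^2), so dE/dy^L = y^L - t.
    v is a list of matrices with the same shapes as `weights`.
    """
    def gradient(ws):
        xs, ys = forward(ws, inputs)
        _, grads = backprop(ws, xs, ys, ys[-1] - targets)
        return grads

    shifted = [W + eps * V for W, V in zip(weights, v)]
    g0, g1 = gradient(weights), gradient(shifted)
    return [(a - b) / eps for a, b in zip(g1, g0)]
```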
Let us then define an operator $\mathcal{R}_v$ which converts a gradient computation into a Hessian-vector product: $$\newcommand\Rv[1]{\mathcal{R}_v\left\{#1\right\}}$$ $$\Rv{f(x)} = \frac{\partial}{\partial \varepsilon} f(x+\varepsilon v){\huge\mid}_{\varepsilon = 0}$$ Substituting the gradient for $f$, we can now write $Hv = \Rv{\nabla E(x)}$.
Before using this operator to derive Hessian-vector product algorithms, let's take a look at some of its properties. As we might expect, it is a linear operator:
$$\begin{align*} \Rv{a f(x) + b g(x)} = a \Rv{f(x)} + b \Rv{g(x)} \end{align*}$$This can be easily verified by simply plugging in the definition of $\Rv{\cdot}$ and doing some algebra.
More interestingly, since it is an evaluation of a derivative, it obeys the laws of differential operators. For instance, the product rule is obeyed:
$$\begin{align*} \Rv{f(x)g(x)} &= \frac{\partial}{\partial \varepsilon} f(x+\varepsilon v)g(x+\varepsilon v){\huge\mid}_{\varepsilon = 0} \\ &= \left(g(x+\varepsilon v)\frac{\partial}{\partial \varepsilon} f(x+\varepsilon v)+f(x+\varepsilon v)\frac{\partial}{\partial \varepsilon} g(x+\varepsilon v)\right){\huge\mid}_{\varepsilon = 0}\\ &= f(x)\Rv{g(x)} + g(x)\Rv{f(x)} \end{align*}$$In the same way, we get a chain rule:
$$\Rv{f(g(x))} = f'(g(x))\Rv{g(x)},$$and we find that it commutes with other derivatives:
$$\Rv{\frac{d}{dt}f(x)} = \frac{d}{dt}\Rv{f(x)}.$$Finally, applying $\Rv{\cdot}$ to $f(x) = x$ just gives us $v$:
$$\Rv{x} = \frac{\partial}{\partial \varepsilon} (x+\varepsilon v){\huge\mid}_{\varepsilon = 0} = v{\huge\mid}_{\varepsilon = 0} = v.$$Now that we've taken a look at this operator, how do we use it? We defined it precisely so that it converts a gradient into the Hessian-vector product. So, in order to compute the Hessian-vector product for a neural network, we simply apply this operator to the equations for computing the gradient; the resulting set of equations will dictate the algorithm we must use in order to compute the Hessian-vector product.
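As a quick sanity check of that claim, consider a quadratic error $E(x) = \frac{1}{2}x^\top A x$ with symmetric $A$, whose gradient is $Ax$ and whose Hessian is $A$. Applying the operator to the gradient gives $$\Rv{\nabla E(x)} = \Rv{Ax} = \frac{\partial}{\partial \varepsilon} A(x + \varepsilon v){\huge\mid}_{\varepsilon = 0} = Av = Hv,$$ which is exactly the Hessian-vector product.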
Our gradient computation comprises two steps. The first step, forward propagation, is dictated by the following set of equations:
$$\begin{align*} y_i^\ell &= \sigma(x_i^\ell) + I_i^\ell\\ x_i^\ell &= \sum_j w_{ji}^{\ell - 1} y_j^{\ell - 1} \end{align*}$$Let's go step-by-step (equation-by-equation) and apply $\Rv{\cdot}$ to these equations. By applying the chain rule to $\sigma(x)$ (and noting that $\Rv{I_i^\ell} = 0$, since the external inputs do not depend on the weights), the first one becomes
$$\begin{align*} \Rv{y_i^\ell} &= \Rv{\sigma(x_i^\ell) + I_i^\ell}\\ &= \sigma'(x_i^\ell)\Rv{x_i^\ell} \end{align*}$$After moving the operator inside the sum and applying the product rule, the second forward propagation equation becomes
$$\begin{align*} \Rv{x_i^\ell} &= \Rv{\sum_j w_{ji}^{\ell - 1} y_j^{\ell - 1}} \\ &= \sum_j \Rv{w_{ji}^{\ell - 1} y_j^{\ell - 1}} \\ &= \sum_j \left(w_{ji}^{\ell - 1} \Rv{y_j^{\ell - 1}} + \Rv{w_{ji}^{\ell - 1}}y_j^{\ell - 1}\right) \end{align*}$$Note that since $v$ is a vector in the weight space, $w_{ji}^{\ell - 1}$ as a function just picks out a particular component, and thus
$$\Rv{w_{ji}^{\ell - 1}} = v_{ji}^{\ell - 1}.$$Thus, the final set of forward propagation equations is
$$\begin{align*} \Rv{y_i^\ell} &= \sigma'(x_i^\ell)\Rv{x_i^\ell}\\ \Rv{x_i^\ell} &= \sum_j \left(w_{ji}^{\ell - 1} \Rv{y_j^{\ell - 1}} + v_{ji}^{\ell - 1}y_j^{\ell - 1}\right) \end{align*}$$Next, we can derive the backward propagation equations for the Hessian-vector product. If you understood the previous derivation, it may make sense to skip to the end of this section and just take a look at the results. For thoroughness, however, we will derive the new backpropagation equations. The original backpropagation equations are:
$$\begin{align*} \frac{\partial E}{\partial y_i^L} &= \frac{d}{d y_i^L} E(y^L) = e_i(y_i^L)\\ \frac{\partial E}{\partial y_i^\ell} &=\sum_j w_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}}\\ \frac{\partial E}{\partial x_j^\ell} &= \sigma'(x_j^\ell)\frac{\partial E}{\partial y_j^\ell}\\ \frac{\partial E}{\partial w_{ij}^\ell} &= y_i^\ell \frac{\partial E}{\partial x_j^{\ell+1}} \end{align*}$$Note that in the first one, we wrote the error explicitly as a function of $y_i^L$ only, since that comes from the requirement we originally imposed on the error function $E$.
Once more, we can apply the $\Rv{\cdot}$ operator to each of these equations. Applying the chain rule for our operator, the first one becomes
$$\begin{align*} \Rv{\frac{\partial E}{\partial y_i^L}} &= \Rv{e_i(y_i^L)}\\ &= e_i'(y_i^L) \Rv{y_i^L}\\ \end{align*}$$The second equation, after applying linearity and the product rule (as we did before to $w_{ij}^\ell$) becomes
$$\begin{align*} \Rv{\frac{\partial E}{\partial y_i^\ell}} &=\Rv{\sum_j w_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}}}\\ &=\sum_j \Rv{w_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}}}\\ &=\sum_j \left(\Rv{w_{ij}^\ell} \frac{\partial E}{\partial x_j^{\ell + 1}} + w_{ij}^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell + 1}}}\right)\\ &=\sum_j \left(v_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}} + w_{ij}^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell + 1}}}\right)\\ \end{align*}$$The equation for backpropagating deltas (the third equation) becomes
$$\begin{align*} \Rv{\frac{\partial E}{\partial x_j^\ell}} &= \Rv{\sigma'(x_j^\ell)\frac{\partial E}{\partial y_j^\ell}}\\ &= \Rv{\sigma'(x_j^\ell)}\frac{\partial E}{\partial y_j^\ell}+\sigma'(x_j^\ell)\Rv{\frac{\partial E}{\partial y_j^\ell}}\\ &= \sigma''(x_j^\ell)\Rv{x_j^\ell}\frac{\partial E}{\partial y_j^\ell}+\sigma'(x_j^\ell)\Rv{\frac{\partial E}{\partial y_j^\ell}}\\ \end{align*}$$Finally, the equation that used to be the gradient computation (fourth equation) becomes (after a product rule)
$$\begin{align*} \Rv{\frac{\partial E}{\partial w_{ij}^\ell}} &= \Rv{y_i^\ell \frac{\partial E}{\partial x_j^{\ell+1}}} \\ &= \Rv{y_i^\ell} \frac{\partial E}{\partial x_j^{\ell+1}}+y_i^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell+1}}} \\ \end{align*}$$This gives us the full set of Hessian-vector backpropagation equations:
$$\begin{align*} \Rv{\frac{\partial E}{\partial y_i^L}} &= e_i'(y_i^L) \Rv{y_i^L}\\ \Rv{\frac{\partial E}{\partial y_i^\ell}} &=\sum_j \left(v_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}} + w_{ij}^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell + 1}}}\right)\\ \Rv{\frac{\partial E}{\partial x_j^\ell}} &= \sigma''(x_j^\ell)\Rv{x_j^\ell}\frac{\partial E}{\partial y_j^\ell}+\sigma'(x_j^\ell)\Rv{\frac{\partial E}{\partial y_j^\ell}}\\ \Rv{\frac{\partial E}{\partial w_{ij}^\ell}} &= \Rv{y_i^\ell} \frac{\partial E}{\partial x_j^{\ell+1}}+y_i^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell+1}}} \end{align*}$$These two sets of equations (for forward and backward propagation) give us, via their structure, a forward and backward propagation algorithm to compute the Hessian-vector product. The Hessian-vector product forward propagation algorithm is as follows:
Initialize $\Rv{x_i^0} = 0$ at the input layer. Since the input-layer values are constants that do not depend on the weights, their $\Rv{\cdot}$ quantities are zero.
Compute $\Rv{y_i^\ell}$ for the current layer via $$\Rv{y_i^\ell} = \sigma'(x_i^\ell)\Rv{x_i^\ell}$$
Compute the $\Rv{x_i^\ell}$ at the next layer via $$\Rv{x_i^\ell} = \sum_j \left(w_{ji}^{\ell - 1} \Rv{y_j^{\ell - 1}} + v_{ji}^{\ell - 1}y_j^{\ell - 1}\right).$$
Repeat steps 2 and 3 until the last (output) layer is reached.
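Continuing the NumPy sketch, the $\Rv{\cdot}$ forward pass mirrors the original forward pass. `r_forward` is an illustrative helper; `v` is a list of matrices with the same shapes as the weight matrices.

```python
def r_forward(weights, v, xs, ys):
    """R{.}-forward pass: computes R{x^l} and R{y^l} for a direction v.

    xs, ys come from the ordinary forward pass.  The input-layer values are
    constants, so their R{.} quantities are zero.
    """
    L = len(weights)
    Rxs = [None]                                  # input layer has no weighted input
    Rys = [np.zeros_like(ys[0])]                  # R{y^0} = 0
    for l in range(L):
        Rx = Rys[l] @ weights[l] + ys[l] @ v[l]   # R{x^{l+1}} = sum_j (w_{ji}^l R{y_j^l} + v_{ji}^l y_j^l)
        Rxs.append(Rx)
        Rys.append(sigma_prime(xs[l + 1]) * Rx)   # R{y^{l+1}} = sigma'(x^{l+1}) R{x^{l+1}}
    return Rxs, Rys
```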
After running forward propagation, we must run Hessian-vector product backward propagation:
Initialize $\Rv{\frac{\partial E}{\partial y_i^L}}$ at the output layer via $$\Rv{\frac{\partial E}{\partial y_i^L}} = e_i'(y_i^L) \Rv{y_i^L}.$$
Compute $\Rv{\frac{\partial E}{\partial x_j^\ell}}$ at the current layer via $$\Rv{\frac{\partial E}{\partial x_j^\ell}} = \sigma''(x_j^\ell)\Rv{x_j^\ell}\frac{\partial E}{\partial y_j^\ell}+\sigma'(x_j^\ell)\Rv{\frac{\partial E}{\partial y_j^\ell}}.$$
Compute $\Rv{\frac{\partial E}{\partial y_i^\ell}}$ at the previous layer via $$\Rv{\frac{\partial E}{\partial y_i^\ell}} = \sum_j \left(v_{ij}^\ell \frac{\partial E}{\partial x_j^{\ell + 1}} + w_{ij}^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell + 1}}}\right)$$
Repeat steps 2 and 3 until the $\Rv{\cdot}$ deltas are known at all but the input layer.
Compute the components of the Hessian-vector product via $$\Rv{\frac{\partial E}{\partial w_{ij}^\ell}} = \Rv{y_i^\ell} \frac{\partial E}{\partial x_j^{\ell+1}}+y_i^\ell \Rv{\frac{\partial E}{\partial x_j^{\ell+1}}}$$
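Finally, here is an illustrative sketch of the $\Rv{\cdot}$ backward pass, building on the helpers above. It recomputes the ordinary deltas as it goes; `sigma_double_prime` is the second derivative of the logistic sigmoid, and for the squared-error example `e_prime_yL` (that is, $e_i'(y_i^L)$) is a vector of ones.

```python
def sigma_double_prime(x):
    # Second derivative of the logistic sigmoid.
    s = sigma(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def r_backprop(weights, v, xs, ys, Rxs, Rys, dE_dyL, e_prime_yL):
    """R{.}-backward pass: returns the blocks R{dE/dw^l}, i.e. Hv."""
    L = len(weights)
    Hv = [None] * L
    dE_dy = dE_dyL                                   # dE/dy^L
    R_dE_dy = e_prime_yL * Rys[L]                    # R{dE/dy^L} = e_i'(y_i^L) R{y_i^L}
    for l in range(L, 0, -1):
        delta = sigma_prime(xs[l]) * dE_dy           # dE/dx^l (ordinary delta)
        R_delta = (sigma_double_prime(xs[l]) * Rxs[l] * dE_dy
                   + sigma_prime(xs[l]) * R_dE_dy)   # R{dE/dx^l}
        # R{dE/dw^{l-1}} = R{y^{l-1}} dE/dx^l + y^{l-1} R{dE/dx^l}
        Hv[l - 1] = np.outer(Rys[l - 1], delta) + np.outer(ys[l - 1], R_delta)
        # Backpropagate to the previous layer.
        dE_dy = weights[l - 1] @ delta                          # dE/dy^{l-1}
        R_dE_dy = v[l - 1] @ delta + weights[l - 1] @ R_delta   # R{dE/dy^{l-1}}
    return Hv
```

Running `forward`, `r_forward`, and `r_backprop` in sequence therefore yields $Hv$ in one forward and one backward pass, and the result can be compared against `hessian_vector_fd` from earlier as a quick check on an implementation.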
We started with a basic description of fully connected feed-forward neural networks, and used it to derive the forward propagation algorithm and the backward propagation algorithm for computing gradients. We then used the method developed by Pearlmutter to develop an adjoint algorithm pair that, in a forward and a backward pass, computes the Hessian-vector product $Hv$ for any vector $v$ in the weight space. This allows us to create second order optimization algorithms that use the Hessian without ever computing the Hessian itself, permitting more performant neural network training algorithms.