Hessian-free optimization, which we've discussed before, uses the quadratic conjugate gradient algorithm at every iteration of the Newton-style method (which is why it is also known as the Newton-CG method). However, when discussing Hessian-free optimization, we glossed over a few crucial details, and in this post, I'll seek to correct one of these and justify our correction. Note that I rely heavily on the material presented by Schraudolph (2002) and James Martens (2010), so you may want to consult the original research for more detail on the material presented here.
When we looked at conjugate gradient, we quickly glossed over some fairly important requirements. Recall that we had some quadratic minimization objective $$f(x) = c + b^T x + \frac{1}{2} x^T A x.$$ Of course, this only makes sense if $f(x)$ actually has a minimum, which only happens if it is bounded from below.
In order to bound $f(x)$ from below, we require that the matrix $A$ not have any negative eigenvalues. If $A$ had a negative eigenvalue, we could set $x$ to be the corresponding eigenvector. Then, when computing $x^T A x$, multiplication by $A$ would pick up the negative eigenvalue, and multiplying by $x^T$ would give us a negative value (the eigenvalue times the squared norm of $x$). Since we could scale $x$ to have any norm we want, we could make $f(x)$ arbitrarily negative. Thus, to prevent this, we enforce the previously mentioned condition that $A$ be positive semi-definite (all eigenvalues of $A$ are greater than or equal to zero).
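To make the unboundedness concrete, here is a minimal numerical sketch (assuming numpy; the matrix $A$ and the values of $b$ and $c$ are my own toy choices) that scales $x$ along an eigenvector with a negative eigenvalue and watches $f(x)$ drop without bound:

```python
import numpy as np

# A is symmetric but has a negative eigenvalue (-1), so it is not PSD.
A = np.array([[2.0, 0.0],
              [0.0, -1.0]])
b = np.zeros(2)
c = 0.0

def f(x):
    return c + b @ x + 0.5 * x @ A @ x

# Pick the eigenvector corresponding to the most negative eigenvalue.
eigvals, eigvecs = np.linalg.eigh(A)
v = eigvecs[:, np.argmin(eigvals)]

for scale in [1, 10, 100, 1000]:
    print(scale, f(scale * v))   # -0.5, -50, -5000, -500000: unbounded below
```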
When performing Hessian-free optimization, we create the quadratic objective $$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T H_f(x) \Delta x,$$ where $H_f(x)$ is the Hessian of $f(x)$. By using quadratic conjugate gradient, we implicitly assume that the Hessian is positive semi-definite. As it turns out, that assumption does not generally hold: the Hessian can fail to be positive semi-definite, and we had no reason to make that assumption in the first place! (In fact, the definiteness of the Hessian can be used to determine whether a critical point is a saddle point, minimum, or maximum.)
In order to fix our quadratic objective for Hessian-free optimization, we need to come up with some approximation to the Hessian that we know will be positive semi-definite.
The approximation to the Hessian that we use is known as the Gauss-Newton matrix. First, let $E(x)$ be our error function, which we wish to minimize using its gradient $$\newcommand\p[2]{\frac{\partial #1}{\partial #2}}(\nabla E)_i = \p{E(x)}{x_i}.$$
Thus, an element of the Hessian can be written as $$H_{ij} = \p{}{x_j}\p{E(x)}{x_i} = \p{}{x_j} (\nabla E)_i$$
Next, let's write $E(x)$ as the composition of two functions: a function $f(x)$ which computes the pre-nonlinearity output of the neural network's final layer, and a function $\sigma$ which computes the non-linearity itself and the error at the final layer: $$E(x) = \sigma(f(x)).$$ Here $f(x)$ is a vector of outputs (the entire output layer), and $\sigma$ may depend on multiple elements of that vector. For example, a softmax layer at the end of a neural network often takes the form $$S(x)_i = \frac{\exp(x_i)}{\sum_{k = 0}^n \exp(x_k)}.$$ Recall also from a previous post that we require $E(x)$ to be a sum of small per-output errors, and that $\sigma$ outputs a single value (not a vector) that is the error.
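For concreteness, the softmax above can be computed as follows (a small numpy sketch; subtracting the maximum is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)      # shift by the max for numerical stability; output is unchanged
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, 3.0])))   # components sum to 1
```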
Substituting this function composition in, the Hessian becomes $$H_{ij} = \p{}{x_j}\p{\sigma(f(x))}{x_i}.$$ We can first apply the chain rule to get $$H_{ij} = \p{}{x_j}\left(\sum_{k=0}^n \p{\sigma}{f_k(x)}(f(x)) \p{f_k(x)}{x_i}\right) = \sum_{k=0}^n \p{}{x_j}\left(\p{\sigma}{f_k(x)}(f(x)) \p{f_k(x)}{x_i}\right).$$ Then, we apply the product rule and get $$H_{ij} = \sum_{k=0}^n \p{}{x_j}\left(\p{\sigma}{f_k(x)}(f(x))\right) \p{f_k(x)}{x_i} + \sum_{k=0}^n\p{\sigma}{f_k(x)}(f(x)) \p{^2 f_k(x)}{x_j \partial x_i}.$$ Finally, we can apply the chain rule to the first of those terms and get our final expanded expression for an element of the Hessian: $$H_{ij} = \sum_{k=0}^n \left(\sum_{\ell=0}^n \p{^2\sigma}{f_\ell(x)\,\partial f_k(x)}(f(x)) \p{f_\ell(x)}{x_j}\right) \p{f_k(x)}{x_i} + \sum_{k=0}^n\p{\sigma}{f_k(x)}(f(x)) \p{^2 f_k(x)}{x_j \partial x_i}$$ Bringing the summation to the outside gives us $$H_{ij} = \sum_{k=0}^n \sum_{\ell=0}^n \p{^2\sigma}{f_\ell(x)\,\partial f_k(x)}(f(x)) \p{f_\ell(x)}{x_j} \p{f_k(x)}{x_i} + \sum_{k=0}^n\p{\sigma}{f_k(x)}(f(x)) \p{^2 f_k(x)}{x_j \partial x_i}$$ The two terms of this expression both have meaningful interpretations. The first term is the component of the Hessian due to the curvature of $\sigma$; since we are differentiating with respect to $x$ rather than $f$, it picks up a change of basis through the Jacobian of $f$. The second term is the component due to the curvature of $f_k$ itself, which is why it contains the Hessian of each $f_k$, weighted by the first derivatives of $\sigma$.
Recall that the Hessian is ultimately used inside a quadratic term $\Delta x^T H \Delta x$ that we minimize. Near a minimum of $\sigma$, its first derivatives are close to zero, so the second of those summations is negligible due to its $\p{\sigma}{f_k(x)}$ factor. Thus, near the minimum we can approximate the Hessian as $$H_{ij} \approx \sum_{k=0}^n \sum_{\ell=0}^n \p{^2\sigma}{f_\ell(x)\,\partial f_k(x)}(f(x)) \p{f_\ell(x)}{x_j} \p{f_k(x)}{x_i} = G_{ij}.$$ This approximation is the Gauss-Newton matrix, and is commonly denoted as $G$. We can rewrite elements of $G$ in a slightly different way: $$G_{ij} = \sum_{\ell=0}^n \p{f_\ell(x)}{x_j} \sum_{k=0}^n \p{f_k(x)}{x_i} \p{^2\sigma}{f_\ell(x)\,\partial f_k(x)}(f(x)).$$ All we did was change the order of summation. However, once we have it in this form, we can see that it can be rewritten as a matrix product, with the first-order derivatives coming from a Jacobian and the second derivatives coming from a Hessian! Rewriting this in matrix notation yields a much simpler formulation, $$G = {J_f}^T H_\sigma J_f,$$ where $J_f$ is the Jacobian of $f$ and $H_\sigma$ is the Hessian of $\sigma$.
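As a sanity check, here is a small numpy sketch (the toy $f$, the squared-error $\sigma$, and the evaluation point are my own choices, with derivatives taken by finite differences rather than backpropagation) that builds $G = {J_f}^T H_\sigma J_f$ and compares its eigenvalues with those of the full Hessian of $E = \sigma(f(x))$: the Gauss-Newton eigenvalues stay non-negative, while the true Hessian picks up a negative one at this point.

```python
import numpy as np

def f(x):                                   # toy "pre-nonlinearity outputs"
    return np.array([np.tanh(x[0] + x[1]), x[0] * x[1]])

def sigma(z):                               # squared error against zero targets
    return np.sum(z ** 2)

def E(x):
    return sigma(f(x))

def jacobian(func, x, eps=1e-5):            # central-difference Jacobian
    fx = np.atleast_1d(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - np.atleast_1d(func(x - dx))) / (2 * eps)
    return J

def grad(func, x, eps=1e-5):                # gradient of a scalar function
    return jacobian(func, x, eps)[0]

def hessian(func, x, eps=1e-4):             # Jacobian of the gradient
    return jacobian(lambda y: grad(func, y, eps), x, eps)

x = np.array([0.3, -1.2])
J = jacobian(f, x)
H_sigma = 2.0 * np.eye(2)                   # Hessian of the sum of squares
G = J.T @ H_sigma @ J

print(np.linalg.eigvalsh(G))                # all >= 0 (up to round-off)
print(np.linalg.eigvalsh(hessian(E, x)))    # contains a negative eigenvalue here
```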
(The traditional Gauss-Newton matrix used in the Gauss-Newton method for solving nonlinear least squares problems has simply a factor of two in place of $H_\sigma$, because the nonlinearity is just squaring, whose Hessian is twice the identity matrix.)
Note that the Gauss-Newton matrix is, unlike the Hessian, guaranteed to be positive semi-definite as long as $H_\sigma$ is. Since $G = J^T H_\sigma J$, we can check for positive semidefiniteness: $$x^T (J^T H_\sigma J) x = (x^T J^T)H_\sigma(Jx) = (J x)^T H_\sigma (J x)$$ Since we require that $H_\sigma$ be positive semi-definite, $G$ must be as well, and thus, the Gauss-Newton matrix is indeed usable for our conjugate gradient algorithm.
As before, we do not actually need to compute all of $G$. Instead, we simply need to be able to compute the Gauss-Newton matrix times some vector, $Gv = J^TH_\sigma Jv$. In order to do this, we'll split this up into two steps: first, we'll compute the vector $Jv$, and then we'll compute the vector $J^Tv'$ with $v' = H_\sigma Jv$.
Let's handle the first step (computing $Jv$) first. Recall the operator $\newcommand\Rv[1]{\mathcal{R}_v\left\{#1\right\}}\Rv{\cdot}$ we defined previously: $$\Rv{f(x)} = \frac{\partial}{\partial \varepsilon} f(x+\varepsilon v){\huge\mid}_{\varepsilon = 0} = \lim_{\varepsilon\to 0}\frac{f(x + \varepsilon v)-f(x)}{\varepsilon}$$
Consider applying this operator to some vector-valued function $\:\newcommand\R{\mathbb{R}}f:\R^n\to\R^m$. In that case, the resulting function is the same as if we had applied $\Rv{\cdot}$ to each component separately. Thus, the $i$th component can be written as $$\Rv{f(x)}_i= \lim_{\varepsilon\to 0}\frac{f(x + \varepsilon v)_i-f(x)_i}{\varepsilon},$$ which is just the directional derivative of $f_i$ in the direction $v$! Since this is just the directional derivative, we can rewrite this as $$\Rv{f(x)}_i= \nabla f(x)_i \cdot v = \sum_{j=1}^{n} \p{f_i}{x_j} v_j.$$ Recall that this is just the $i$th component. By writing it in sum form, we can see that this looks exactly like a single row of the Jacobian, as the Jacobian is the matrix in which each row consists of the derivatives of one component with respect to all the variables. Thus, we can collect all the components by simply multiplying the Jacobian by a vector: $$Jv = \Rv{f(x)},$$ where $J$ is the Jacobian of $f(x)$ evaluated at $x$.
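Here is a quick numerical check of that identity (numpy; the vector-valued $f$, the point $x$, and the direction $v$ are arbitrary choices of mine): a forward difference along $v$ reproduces $Jv$.

```python
import numpy as np

def f(x):   # an arbitrary vector-valued function R^3 -> R^3
    return np.array([np.sin(x[0]) * x[1],
                     x[0] ** 2 + x[2],
                     np.exp(x[1] - x[2])])

x = np.array([0.5, -0.3, 1.1])
v = np.array([1.0, 2.0, -1.0])
eps = 1e-6

Rv_f = (f(x + eps * v) - f(x)) / eps      # finite-difference version of R_v{f(x)}

# The Jacobian written out by hand; row i is the gradient of f_i.
J = np.array([[np.cos(x[0]) * x[1], np.sin(x[0]),          0.0],
              [2 * x[0],            0.0,                    1.0],
              [0.0,                 np.exp(x[1] - x[2]),   -np.exp(x[1] - x[2])]])

print(Rv_f)      # matches J @ v up to O(eps)
print(J @ v)
```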
Thus, we can compute $Jv$ for any vector $v$ by performing the forward pass of the Hessian-vector product algorithm we discussed earlier (and derived for fully connected networks).
We have $Jv$ using the forward pass of the $\Rv{\cdot}$ operator, but before we can compute $J^Tv'$, we first need to compute $v' = H_\sigma Jv$. Luckily, we already have an algorithm for multiplication by the Hessian: the backwards $\Rv{\cdot}$ algorithm!
Unlike before, when we ran the backward pass all the way through the network in order to obtain the full Hessian-vector product, we only need to run it back to the pre-nonlinearity output of the last layer. The $\Rv{\cdot}$ algorithm requires a $v$, which we can just let equal our computed $Jv$. As long as we stop our backpropagation at the right place, we will have multiplied $Jv$ by $H_\sigma$!
For most simple loss functions, we can actually just compute $H_\sigma$ manually and apply it ourselves instead of doing this backpropagation.
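For example, for a sum-of-squared-errors loss $\sigma(z) = \sum_k (z_k - t_k)^2$, the Hessian $H_\sigma$ is just $2I$, so this middle step collapses to a single scalar multiplication; for a softmax output with cross-entropy error, $H_\sigma$ with respect to the logits is $\operatorname{diag}(p) - pp^T$, where $p$ is the softmax output. A tiny sketch of the squared-error case (numpy, with a made-up $Jv$):

```python
import numpy as np

def apply_H_sigma_squared_error(Jv):
    # For sigma(z) = sum_k (z_k - t_k)^2, the Hessian is 2*I, independent of z.
    return 2.0 * Jv

Jv = np.array([0.1, -0.4, 0.25])
print(apply_H_sigma_squared_error(Jv))    # just 2 * Jv
```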
Now that we've computed $v' = H_\sigma Jv$, we need to compute $J^Tv'$.
Let's take a careful look at what backpropagation actually does. As we know, backpropagation computes the gradient of our error function, where the $i$th component of the gradient is $$(\nabla E)_i = \p{E}{x_i}.$$ We can rewrite this using the chain rule through the outputs at the last layer: $$(\nabla E)_i = \sum_{k=0}^n \p{E}{f_k}\p{f_k}{x_i},$$ where $f_k$ is the value of the $k$th output unit. If we have a simple sum-of-squared-errors error function, then the derivative of $E$ with respect to each output $f_k$ is linear in $f_k$: $$E = \sum_{k=0}^n {f_k}^2 \implies (\nabla E)_i = 2\sum_{k=0}^n f_k \p{f_k}{x_i} = 2 f(x) \cdot \p{f(x)}{x_i}.$$ As you see above, we can rewrite the $i$th component of the gradient as a dot product of the function with its own derivative with respect to $x_i$. However, this is exactly what multiplication by the transpose of the Jacobian does:
$$ (J^T f(x))_i = \left(\begin{bmatrix} \p{f_1}{x_1} & \p{f_2}{x_1} & \cdots & \p{f_n}{x_1} \\ \p{f_1}{x_2} & \p{f_2}{x_2} & \cdots & \p{f_n}{x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \p{f_1}{x_m} & \p{f_2}{x_m} & \cdots & \p{f_n}{x_m} \\ \end{bmatrix} \begin{bmatrix}f_1 \\ f_2 \\ \vdots \\ f_n\end{bmatrix}\right)_i = \sum_{k=0}^n f_k \p{f_k}{x_i} $$ Thus, backpropagation (for this simple error function) computes a multiplication by the transpose of the Jacobian: $$\nabla E = 2J^T f(x).$$ More generally, though, if we give backpropagation any vector $v$ at the output units, it simply multiplies $v$ by $J^T$.
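This identity is easy to check numerically (a numpy sketch with a toy $f$ of my own): backpropagating a vector $v$ from the outputs is the same as taking the gradient of the scalar $v \cdot f(x)$, which is exactly $J^T v$.

```python
import numpy as np

def f(x):
    return np.array([np.sin(x[0]) * x[1],
                     x[0] ** 2 + x[2],
                     np.exp(x[1] - x[2])])

def grad_fd(func, x, eps=1e-6):           # finite-difference gradient of a scalar function
    g = np.zeros_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        g[j] = (func(x + dx) - func(x - dx)) / (2 * eps)
    return g

def jacobian_fd(func, x, eps=1e-6):       # finite-difference Jacobian, row i = gradient of f_i
    return np.array([grad_fd(lambda y: func(y)[i], x, eps) for i in range(func(x).size)])

x = np.array([0.5, -0.3, 1.1])
v = np.array([0.2, -1.0, 0.7])

print(grad_fd(lambda y: v @ f(y), x))     # gradient of v . f(x): what backprop with "errors" v gives
print(jacobian_fd(f, x).T @ v)            # explicit J^T v: should match
```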
Thus, we can compute $J^Tv$ for any vector $v$ by performing the standard neural network backpropagation while using the components of $v$ as the "errors" at the last layer. In order to complete our Gauss-Newton matrix-vector computation, we backpropagate $H_\sigma J v$ through to the input layer using the normal neural network backpropagation algorithm.
In this post, we looked at an assumption we made in our derivation of Hessian-free optimization. Using conjugate gradient meant we needed a positive semi-definite matrix. Since conjugate gradient heads towards the extremum of the quadratic model, a matrix that is not positive semi-definite would mean that the model has no minimum at all, so the algorithm could fail to decrease the objective and could even take steps uphill, away from the minimum.
This assumption turned out to be problematic, because the Hessian is not guaranteed to be positive semi-definite. We then derived an approximation to the Hessian known as the Gauss-Newton matrix. The Gauss-Newton matrix is a good substitute for two reasons: first, near a minimum (where the first derivatives of $\sigma$ vanish) the quadratic objective built from the Gauss-Newton matrix closely matches the one built from the Hessian, and second, the Gauss-Newton matrix is provably positive semi-definite.
Finally, we derived the neural network matrix-vector product for the Gauss-Newton matrix. It turns out that we can compute $Gv$ for any vector $v$ by running $\Rv{\cdot}$ forward propagation to get the vector $a = Jv$; then running one step of $\Rv{\cdot}$ backward propagation in order to multiply it by $H_\sigma$; and finally running normal neural network backpropagation to finish computing $Gv$.
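Putting the pieces together, here is a sketch of the whole $Gv$ product for the earlier toy setup (numpy, squared-error $\sigma$, with finite differences standing in for the forward $\mathcal{R}_v$ pass and for backpropagation; in a real network those two steps would be the forward and backward passes themselves).

```python
import numpy as np

def f(x):                                   # stand-in for the network's final-layer outputs
    return np.array([np.tanh(x[0] + x[1]), x[0] * x[1]])

def Gv(x, v, eps=1e-6):
    # Step 1: forward R_v pass -- here a directional finite difference -- gives a = J v.
    a = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
    # Step 2: multiply by H_sigma; for the squared error this is just 2*I.
    a = 2.0 * a
    # Step 3: "backpropagate" a, i.e. take the gradient of a . f(x), giving J^T a.
    g = np.zeros_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x); dx[j] = eps
        g[j] = (a @ f(x + dx) - a @ f(x - dx)) / (2 * eps)
    return g

x = np.array([0.3, -1.2])
v = np.array([1.0, 0.5])
print(Gv(x, v))                             # the Gauss-Newton matrix-vector product G v
```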
These improvements to our Hessian-free optimization algorithm should make it more robust in the presence of difficult optimization objectives. In his 2010 paper, Martens observes that this modification is consistently superior to using the Hessian alone, or to other, more complex fixes for the fact that the Hessian is not positive semi-definite. In addition, the Gauss-Newton matrix-vector product turns out to be significantly faster to compute than the full Hessian-vector product!