Output layer activation (softmax):
$$f(\mathbf{x})_y = \frac{\exp\left(a^{(L+1)}(\mathbf{x})_y\right)}{\sum_d \exp\left(a^{(L+1)}(\mathbf{x})_d\right)}$$
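A minimal NumPy sketch of this softmax, assuming `a` holds the pre-activations $a^{(L+1)}(\mathbf{x})$ (the function name and example values are illustrative):

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis; subtracting the max avoids overflow in exp()."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exp_a = np.exp(shifted)
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)

# Pre-activations for C = 3 classes
a = np.array([2.0, 1.0, 0.1])
f = softmax(a)
print(f, f.sum())  # class probabilities, summing to 1
```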
Previously, we defined the loss: $$L = \frac{1}{N} \sum_{i=1}^{N} l\left(f(\mathbf{x}_i,\theta); y_i\right)$$
The optimal parameters are $$\theta^* = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} l\left(f(\mathbf{x}_i,\theta); y_i\right)$$
Solution: use gradient descent to find the optimal parameters: $\theta^{t+1} = \theta^t - \alpha \nabla_\theta\, l\left(f(\mathbf{x}_i,\theta); y_i\right)$
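A minimal sketch of this update rule, using a toy quadratic loss $l(\theta) = \theta^2$ in place of the network loss (the learning rate and number of iterations are arbitrary):

```python
import numpy as np

def gd_step(theta, grad, alpha=0.1):
    """One gradient-descent update: theta_{t+1} = theta_t - alpha * grad."""
    return theta - alpha * grad

# Toy example: l(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([3.0])
for t in range(50):
    grad = 2.0 * theta
    theta = gd_step(theta, grad, alpha=0.1)
print(theta)  # close to the minimizer theta* = 0
```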
Big picture: we want to find the derivative of the loss w.r.t. the weights.
Strategy: apply the chain rule. For a composition $f(g(x))$,
$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \ \times\ \frac{\partial g(x)}{\partial x}$$
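A concrete instance of this rule, with $f(u) = \log u$ and $g(x) = x^2$ (an illustrative choice, not from the notes):

$$\frac{\partial}{\partial x} \log\left(x^2\right) = \frac{1}{x^2} \times 2x = \frac{2}{x}$$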
First, we derive $\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c} \left(-\log f(\mathbf{x})_y\right)$ for all classes $c = 0, \dots, C-1$. Collected over $c$, this is the gradient $\nabla_{a^{(L+1)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right)$.
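As a sanity check on this gradient, a finite-difference sketch (the `softmax` helper, class index, and numbers are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def nll(a, y):
    """Loss -log f(x)_y as a function of the pre-activations a = a^{(L+1)}(x)."""
    return -np.log(softmax(a)[y])

a = np.array([2.0, 1.0, 0.1])  # pre-activations for C = 3 classes
y = 0                          # true class
eps = 1e-5

# Central differences approximate d(-log f(x)_y)/da_c for each class c.
grad = np.zeros_like(a)
for c in range(len(a)):
    d = np.zeros_like(a)
    d[c] = eps
    grad[c] = (nll(a + d, y) - nll(a - d, y)) / (2 * eps)
print(grad)  # numerical gradient; Steps 1-2 below derive its closed form
```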
Step 1: differentiate w.r.t. $f(\mathbf{x})$
Step 2: differentiate $f(\mathbf{x})$ w.r.t. $a^{(L+1)}(\mathbf{x})$ and apply the chain rule
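Spelled out, the two steps give (a standard derivation; $e(y)$ denotes the one-hot vector for class $y$):

Step 1:
$$\frac{\partial}{\partial f(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right) = -\frac{1_{(c=y)}}{f(\mathbf{x})_y}$$

Step 2, using the softmax derivative $\frac{\partial f(\mathbf{x})_y}{\partial a^{(L+1)}(\mathbf{x})_c} = f(\mathbf{x})_y\left(1_{(c=y)} - f(\mathbf{x})_c\right)$ and the chain rule:
$$\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right) = -\frac{f(\mathbf{x})_y\left(1_{(c=y)} - f(\mathbf{x})_c\right)}{f(\mathbf{x})_y} = f(\mathbf{x})_c - 1_{(c=y)}$$

so that $\nabla_{a^{(L+1)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right) = f(\mathbf{x}) - e(y)$.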