Output layer activation (softmax):
$$f(\mathbf{x})_y = \frac{\exp\left(a^{(L+1)}(\mathbf{x})_y\right)}{\sum_d \exp\left(a^{(L+1)}(\mathbf{x})_d\right)}$$
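A minimal NumPy sketch of this softmax, assuming `a` holds the pre-activations $a^{(L+1)}(\mathbf{x})$ (the function name and example values are illustrative):

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis; subtracting the max avoids overflow in exp()."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exp_a = np.exp(shifted)
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)

# Pre-activations for C = 3 classes
a = np.array([2.0, 1.0, 0.1])
f = softmax(a)
print(f, f.sum())  # class probabilities, summing to 1
```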
Previously, we defined the loss: $$L = \frac{1}{N} \sum_{i=1}^{N} l\left(f(\mathbf{x}_i,\theta); y_i\right)$$
The optimal parameters are $$\theta^* = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} l\left(f(\mathbf{x}_i,\theta); y_i\right)$$
Solution: use gradient descent to find the optimal parameters: $\theta^{t+1} = \theta^t - \alpha \nabla_\theta\, l\left(f(\mathbf{x}_i,\theta); y_i\right)$
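A minimal sketch of this update rule, using a toy quadratic loss $l(\theta) = \theta^2$ in place of the network loss (the learning rate and number of iterations are arbitrary):

```python
import numpy as np

def gd_step(theta, grad, alpha=0.1):
    """One gradient-descent update: theta_{t+1} = theta_t - alpha * grad."""
    return theta - alpha * grad

# Toy example: l(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([3.0])
for t in range(50):
    grad = 2.0 * theta
    theta = gd_step(theta, grad, alpha=0.1)
print(theta)  # close to the minimizer theta* = 0
```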
Big picture: we want to find the derivative of the loss w.r.t. the weights.
Strategy: apply the chain rule. For a composition $f(g(x))$,
$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \ \times\ \frac{\partial g(x)}{\partial x}$$
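A concrete instance of this rule, with $f(u) = \log u$ and $g(x) = x^2$ (an illustrative choice, not from the notes):

$$\frac{\partial}{\partial x} \log\left(x^2\right) = \frac{1}{x^2} \times 2x = \frac{2}{x}$$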
First, we derive $\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c} \left(-\log f(\mathbf{x})_y\right)$ for all classes $c = 0, \dots, C-1$. Collected over $c$, this is the gradient $\nabla_{a^{(L+1)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right)$.
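As a sanity check on this gradient, a finite-difference sketch (the `softmax` helper, class index, and numbers are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def nll(a, y):
    """Loss -log f(x)_y as a function of the pre-activations a = a^{(L+1)}(x)."""
    return -np.log(softmax(a)[y])

a = np.array([2.0, 1.0, 0.1])  # pre-activations for C = 3 classes
y = 0                          # true class
eps = 1e-5

# Central differences approximate d(-log f(x)_y)/da_c for each class c.
grad = np.zeros_like(a)
for c in range(len(a)):
    d = np.zeros_like(a)
    d[c] = eps
    grad[c] = (nll(a + d, y) - nll(a - d, y)) / (2 * eps)
print(grad)  # numerical gradient; Steps 1-2 below derive its closed form
```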
Step 1: differentiate w.r.t. $f(\mathbf{x})$
Step 2: differentiate $f(\mathbf{x})$ w.r.t. $a^{(L+1)}(\mathbf{x})$ and apply the chain rule
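Spelled out, the two steps give (a standard derivation; $e(y)$ denotes the one-hot vector for class $y$):

Step 1:
$$\frac{\partial}{\partial f(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right) = -\frac{1_{(c=y)}}{f(\mathbf{x})_y}$$

Step 2, using the softmax derivative $\frac{\partial f(\mathbf{x})_y}{\partial a^{(L+1)}(\mathbf{x})_c} = f(\mathbf{x})_y\left(1_{(c=y)} - f(\mathbf{x})_c\right)$ and the chain rule:
$$\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right) = -\frac{f(\mathbf{x})_y\left(1_{(c=y)} - f(\mathbf{x})_c\right)}{f(\mathbf{x})_y} = f(\mathbf{x})_c - 1_{(c=y)}$$

so that $\nabla_{a^{(L+1)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right) = f(\mathbf{x}) - e(y)$.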