2017-04-28 jkang
Ref:
Data
- $X$ is an input matrix; size = $(n\_inputs)\ \times \ (n\_features)$
Inputs are stacked row-wise in $X$
- $Y$ is an output matrix; size = $(n\_outputs)\ \times \ (n\_classes)$
Network
- $W^k$ is a weight matrix which maps the $(k-1)$th layer to the $k$th layer
- $b^k$ is a bias vector at the $k$th layer
Processes
- $C$ is a cost function; typical choices are cross-entropy, MSE, etc.
- $\sigma(X)$ applies the sigmoid (logistic) function to $X$ element-wise
cf. $\sigma'(X)$ is the derivative of $\sigma(X)$
- $z^l$ is the weighted sum of inputs to the $l$th layer (before the activation function is applied); size = $(n\_inputs)\ \times \ (n\_hidden\_units)$
cf. $z^L$ is the weighted sum at the final layer
- $a^l$ is $z^l$ transformed by the activation function; size( $z^l$ ) = size( $a^l$ )
- $\delta^L$ is the final output error at $z^L$; i.e. $\frac{\partial(C)}{\partial(z^L)}$; size( $\delta^L$ ) = size( $Y$ )
- $\delta^l$ is the $l$th error at $z^l$; size( $\delta^l$ ) = size( $a^l$ )
Why are they called 'errors'? Short answer: the 'error' tells us how sensitive the cost is to each layer's weighted input. This sensitivity is important because it tells us how much we can change the weights and biases to reduce the cost. The bottom line is that the 'error' appears when deriving $\frac{\partial C}{\partial w}$, so it is convenient to compute it explicitly and reuse it in later calculations. See Nielsen.
Understand how much the network parameters ($W$ and $b$) affect $C$, and calculate the proper amounts for the parameter updates (i.e. the derivatives of $C$ with respect to the parameters)
Calculate: $$\frac{\partial C}{\partial W}\ \text{and}\ \frac{\partial C}{\partial b}$$
Update $k$th weights and biases: $$W^k = W^k - \eta\frac{\partial C}{\partial W^k}$$ $$b^k = b^k - \eta\frac{\partial C}{\partial b^k}$$
BP rules (=Back-Propagation)
BP 1: How much does a little change in $z^L$ (in the last layer) affect $C$?
$$\delta^L = \frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \odot \sigma '(z^L)$$
BP 2: What's the relationship between $\delta^l$ and $\delta^{l+1}$?
$$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) $$
BP 3: How much does the bias $b^l$ affect $C$?
$$\frac{\partial C}{\partial b^l} = \delta^l$$
BP 4: How much does the weight $W^{l+1}$ affect $C$?
$$\frac{\partial C}{\partial W^{l+1}} = \delta^{l+1} \cdot (a^l)^T$$
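As a minimal NumPy sketch of BP 1-4, assuming the column-vector convention of this part of the notes (one example at a time, activations and deltas as column vectors) and a sigmoid activation; all function names here are mine, not from any particular source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Column-vector convention: activations and deltas have shape (n_units, 1);
# W_next maps layer l to layer l+1 and has shape (n_units_{l+1}, n_units_l).

def output_delta(dC_da_L, z_L):
    """BP1: delta^L = dC/da^L (element-wise) sigma'(z^L)."""
    return dC_da_L * sigmoid_prime(z_L)

def hidden_delta(W_next, delta_next, z_l):
    """BP2: delta^l = ((W^{l+1})^T delta^{l+1}) (element-wise) sigma'(z^l)."""
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)

def bias_grad(delta_l):
    """BP3: dC/db^l = delta^l."""
    return delta_l

def weight_grad(delta_next, a_l):
    """BP4: dC/dW^{l+1} = delta^{l+1} (a^l)^T."""
    return delta_next @ a_l.T
```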
One hidden layer Neural Network architecture from cs224d (1st assignment):
Dimensions
$X \in \mathbb{R}^{n \times m}$, n: number of examples, m: number of input dimensions
$Y \in \mathbb{R}^{n \times c}$, c: number of output units (labels)
NB. Here $X$ and $Y$ are used as an explicit matrix notation rather than a vector $x$ or $y$ (or $\hat{y}$).
$H \in \mathbb{R}^{n \times h}$, h: number of hidden units
$W^1 \in \mathbb{R}^{m \times h}$
$W^2 \in \mathbb{R}^{h \times c}$
$b^1 \in \mathbb{R}^{1 \times h}$, broadcast (copied to every row) to form an $n \times h$ matrix in the calculation
$b^2 \in \mathbb{R}^{1 \times c}$, broadcast (copied to every row) to form an $n \times c$ matrix in the calculation (see the shape check below)
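In NumPy the $1 \times h$ and $1 \times c$ biases do not need to be copied explicitly; broadcasting adds them to every row. A toy shape check under the dimensions above (all sizes and values are placeholders):

```python
import numpy as np

n, m, h, c = 4, 3, 5, 2           # toy sizes: n examples, m inputs, h hidden, c classes
X  = np.random.randn(n, m)
W1 = np.random.randn(m, h)
b1 = np.zeros((1, h))             # (1, h) row vector
W2 = np.random.randn(h, c)
b2 = np.zeros((1, c))             # (1, c) row vector

H_pre = X @ W1 + b1               # b1 is broadcast to every row -> shape (n, h)
print(H_pre.shape)                # (4, 5)
print((H_pre @ W2 + b2).shape)    # (4, 2): b2 is broadcast the same way
```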
Activation function at the hidden layer (element-wise sigmoid):
$$\begin{split}
H &= sigmoid(XW^1 + b^1) \\
&= \sigma(XW^1 + b^1)
\end{split}$$
Output function at the output layer and the cross-entropy cost: $$\begin{split} \hat{Y} &= softmax(HW^2 + b^2) \\ CE(Y, \hat{Y}) &= -\sum_{i=1}^n\sum_{j=1}^c Y_{ij} \cdot log(\hat{Y_{ij}}) \end{split}$$
Writing the full forward pass and defining $Z$ and $f$ for the backward pass:
$$\begin{split} H &= \sigma(XW^1 + b^1) \\ Z &= HW^2 + b^2 \\ \hat{Y} &= softmax(Z) \\ &= f(Z) \end{split}$$
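A sketch of this forward pass in NumPy, assuming the shapes above; the helper names (`forward`, `softmax`, `cross_entropy`) are mine, not from the assignment, and the softmax subtracts the row-wise max only for numerical stability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # shift by the row-wise max for numerical stability (does not change the result)
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)   # (n, h) hidden activations
    Z = H @ W2 + b2            # (n, c) pre-softmax scores
    Y_hat = softmax(Z)         # (n, c) predicted class probabilities
    return H, Z, Y_hat

def cross_entropy(Y, Y_hat):
    # CE(Y, Y_hat) = -sum_{i,j} Y_ij * log(Y_hat_ij)
    return -np.sum(Y * np.log(Y_hat))
```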
Get output error (BP1)
$$\begin{split} \delta^2 &= \frac{\partial CE(Y, \hat{Y})}{\partial Z} \\ &= \frac{\partial CE(Y, \hat{Y})}{\partial f(Z)} \cdot \frac{\partial f(Z)}{\partial Z} \\ &= \hat{Y} - Y \\ \end{split}$$
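A quick way to convince yourself that $\delta^2 = \hat{Y} - Y$ is a finite-difference check on a tiny random example. This sketch is self-contained and perturbs each entry of $Z$ directly; all names and sizes are illustrative:

```python
import numpy as np

np.random.seed(0)
n, c = 3, 4
Z = np.random.randn(n, c)
Y = np.eye(c)[np.random.randint(c, size=n)]        # one-hot labels, shape (n, c)

def ce_of_z(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))    # softmax
    Y_hat = e / e.sum(axis=1, keepdims=True)
    return -np.sum(Y * np.log(Y_hat)), Y_hat

loss, Y_hat = ce_of_z(Z)
analytic = Y_hat - Y                                # the claimed delta^2

# central-difference estimate of dCE/dZ, entry by entry
eps = 1e-5
numeric = np.zeros_like(Z)
for i in range(n):
    for j in range(c):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        numeric[i, j] = (ce_of_z(Zp)[0] - ce_of_z(Zm)[0]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))           # should be ~1e-9 or smaller
```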
For more, see Peter's notes.
Backpropagate the error (BP2, BP3, BP4)
- Since our goal is to calculate $$\begin{split} \frac{\partial C}{\partial W^1} &= X^T \cdot \delta^1 \quad (BP4) \\ \frac{\partial C}{\partial b^1} &= \delta^1 \quad (BP3) \\ \frac{\partial C}{\partial W^2} &= H^T \cdot \delta^2 \quad (BP4) \\ \frac{\partial C}{\partial b^2} &= \delta^2 \quad (BP3) \\ \end{split}$$
we first need $\delta^1$, which can be calculated using BP2. (Since the examples are stacked row-wise, the bias gradients are the corresponding $\delta$ summed over the $n$ rows, giving $1 \times h$ and $1 \times c$ vectors; see the NumPy sketch below.)
$$\delta^1 = (\delta^2 (W^2)^T) \odot \sigma'(XW^1 + b^1) = (\delta^2 (W^2)^T) \odot H \odot (1 - H)$$
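Putting BP1-BP4 together for this network, with the data stacked row-wise (so the transposes flip relative to the column-vector form earlier). A sketch assuming the `forward` helper above, with the bias gradients summed over the $n$ examples:

```python
import numpy as np

def backward(X, Y, H, Y_hat, W2):
    delta2 = Y_hat - Y                        # (n, c)  output error (BP1)
    dW2 = H.T @ delta2                        # (h, c)  BP4
    db2 = delta2.sum(axis=0, keepdims=True)   # (1, c)  BP3, summed over the n examples
    delta1 = (delta2 @ W2.T) * H * (1 - H)    # (n, h)  BP2 with sigma'(z) = H * (1 - H)
    dW1 = X.T @ delta1                        # (m, h)  BP4
    db1 = delta1.sum(axis=0, keepdims=True)   # (1, h)  BP3
    return dW1, db1, dW2, db2
```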
Update parameters
- $\eta$ is a learning rate
$$\begin{split} W^{1\ new} &= W^1 - \eta \frac{\partial C}{\partial W^1} \\ b^{1\ new} &= b^1 - \eta \frac{\partial C}{\partial b^1} \\ W^{2\ new} &= W^2 - \eta \frac{\partial C}{\partial W^2} \\ b^{2\ new} &= b^2 - \eta \frac{\partial C}{\partial b^2} \\ \end{split}$$
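A sketch of one full training step tying the pieces together; it relies on the `forward`, `backward`, and `cross_entropy` sketches above, and the learning rate `eta = 0.5` is an arbitrary placeholder:

```python
def train_step(X, Y, params, eta=0.5):
    W1, b1, W2, b2 = params
    H, Z, Y_hat = forward(X, W1, b1, W2, b2)           # forward pass
    dW1, db1, dW2, db2 = backward(X, Y, H, Y_hat, W2)  # backward pass
    # gradient descent: theta <- theta - eta * dC/dtheta
    W1 = W1 - eta * dW1
    b1 = b1 - eta * db1
    W2 = W2 - eta * dW2
    b2 = b2 - eta * db2
    return (W1, b1, W2, b2), cross_entropy(Y, Y_hat)
```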