2017-04-28 jkang
Ref:
Data
- $X$ is an input matrix; size = $(n\_inputs)\ \times \ (n\_features)$
Inputs are stacked row-wise in $X$
- $Y$ is an output matrix; size = $(n\_outputs)\ \times \ (n\_classes)$
Network
- $W^k$ is a weight matrix which maps the $(k-1)$th layer to the $k$th layer
- $b^k$ is a bias vector at the $k$th layer
Processes
- $C$ is a cost function; typical choices are cross-entropy, MSE, etc.
- $\sigma(X)$ applies the sigmoid (logistic) function to $X$ element-wise
cf. $\sigma'(X)$ is the derivative of $\sigma(X)$
- $z^l$ is the weighted sum of inputs to the $l$th layer (before the activation function is applied); size = $(n\_inputs)\ \times \ (n\_hidden\_units)$
cf. $z^L$ is the weighted sum at the final layer
- $a^l$ is $z^l$ transformed by the activation function; size( $z^l$ ) = size( $a^l$ )
- $\delta^L$ is the final output error at $z^L$; i.e. $\frac{\partial(C)}{\partial(z^L)}$; size( $\delta^L$ ) = size( $Y$ )
- $\delta^l$ is the $l$th error at $z^l$; size( $\delta^l$ ) = size( $a^l$ )
Why are they called 'errors'? Short answer: the 'error' tells us how sensitive the cost is to each layer's weighted input. This sensitivity is important because it tells us how much we can change the weights and biases to reduce the cost. The bottom line is that the 'error' appears when deriving $\frac{\partial C}{\partial w}$, so it is convenient to compute it explicitly and reuse it in later calculations. See Nielsen.
Understand how much the network parameters ($W$ and $b$) affect $C$, and calculate the proper amounts for the parameter updates (i.e. the derivatives of $C$ with respect to the parameters)
Calculate: $$\frac{\partial C}{\partial W}\ \text{and}\ \frac{\partial C}{\partial b}$$
Update $k$th weights and biases: $$W^k = W^k - \eta\frac{\partial C}{\partial W^k}$$ $$b^k = b^k - \eta\frac{\partial C}{\partial b^k}$$
BP rules (=Back-Propagation)
BP 1: How much does a little change in $z^L$ (in the last layer) affect $C$?
$$\delta^L = \frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \odot \sigma '(z^L)$$
BP 2: What's the relationship between $\delta^l$ and $\delta^{l+1}$?
$$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) $$
BP 3: How much does the bias $b^l$ affect $C$?
$$\frac{\partial C}{\partial b^l} = \delta^l$$
BP 4: How much does the weight $W^{l+1}$ affect $C$?
$$\frac{\partial C}{\partial W^{l+1}} = \delta^{l+1} \cdot (a^l)^T$$
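As a minimal NumPy sketch of BP 1-4, assuming the column-vector convention of this part of the notes (one example at a time, activations and deltas as column vectors) and a sigmoid activation; all function names here are mine, not from any particular source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Column-vector convention: activations and deltas have shape (n_units, 1);
# W_next maps layer l to layer l+1 and has shape (n_units_{l+1}, n_units_l).

def output_delta(dC_da_L, z_L):
    """BP1: delta^L = dC/da^L (element-wise) sigma'(z^L)."""
    return dC_da_L * sigmoid_prime(z_L)

def hidden_delta(W_next, delta_next, z_l):
    """BP2: delta^l = ((W^{l+1})^T delta^{l+1}) (element-wise) sigma'(z^l)."""
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)

def bias_grad(delta_l):
    """BP3: dC/db^l = delta^l."""
    return delta_l

def weight_grad(delta_next, a_l):
    """BP4: dC/dW^{l+1} = delta^{l+1} (a^l)^T."""
    return delta_next @ a_l.T
```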
One hidden layer Neural Network architecture from cs224d (1st assignment):
Dimensions
$X \in \mathbb{R}^{n \times m}$, n: number of examples, m: number of input dimensions
$Y \in \mathbb{R}^{n \times c}$, c: number of output units (labels)
NB. Here $X$ and $Y$ are used as an explicit matrix notation rather than a vector $x$ or $y$ (or $\hat{y}$).
$H \in \mathbb{R}^{n \times h}$, h: number of hidden units
$W^1 \in \mathbb{R}^{m \times h}$
$W^2 \in \mathbb{R}^{h \times c}$
$b^1 \in \mathbb{R}^{1 \times h}$, broadcast (copied to every row) to form an $n \times h$ matrix in the calculation
$b^2 \in \mathbb{R}^{1 \times c}$, broadcast (copied to every row) to form an $n \times c$ matrix in the calculation (see the shape check below)
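In NumPy the $1 \times h$ and $1 \times c$ biases do not need to be copied explicitly; broadcasting adds them to every row. A toy shape check under the dimensions above (all sizes and values are placeholders):

```python
import numpy as np

n, m, h, c = 4, 3, 5, 2           # toy sizes: n examples, m inputs, h hidden, c classes
X  = np.random.randn(n, m)
W1 = np.random.randn(m, h)
b1 = np.zeros((1, h))             # (1, h) row vector
W2 = np.random.randn(h, c)
b2 = np.zeros((1, c))             # (1, c) row vector

H_pre = X @ W1 + b1               # b1 is broadcast to every row -> shape (n, h)
print(H_pre.shape)                # (4, 5)
print((H_pre @ W2 + b2).shape)    # (4, 2): b2 is broadcast the same way
```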
Activation function at the hidden layer (element-wise sigmoid):
$$\begin{split}
H &= sigmoid(XW^1 + b^1) \\
&= \sigma(XW^1 + b^1)
\end{split}$$
Output function at the output layer and the cross-entropy cost: $$\begin{split} \hat{Y} &= softmax(HW^2 + b^2) \\ CE(Y, \hat{Y}) &= -\sum_{i=1}^n\sum_{j=1}^c Y_{ij} \cdot log(\hat{Y_{ij}}) \end{split}$$
Writing the full forward pass and defining $Z$ and $f$ for the backward pass:
$$\begin{split} H &= \sigma(XW^1 + b^1) \\ Z &= HW^2 + b^2 \\ \hat{Y} &= softmax(Z) \\ &= f(Z) \end{split}$$
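A sketch of this forward pass in NumPy, assuming the shapes above; the helper names (`forward`, `softmax`, `cross_entropy`) are mine, not from the assignment, and the softmax subtracts the row-wise max only for numerical stability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # shift by the row-wise max for numerical stability (does not change the result)
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    H = sigmoid(X @ W1 + b1)   # (n, h) hidden activations
    Z = H @ W2 + b2            # (n, c) pre-softmax scores
    Y_hat = softmax(Z)         # (n, c) predicted class probabilities
    return H, Z, Y_hat

def cross_entropy(Y, Y_hat):
    # CE(Y, Y_hat) = -sum_{i,j} Y_ij * log(Y_hat_ij)
    return -np.sum(Y * np.log(Y_hat))
```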
Get output error (BP1)
$$\begin{split} \delta^2 &= \frac{\partial CE(Y, \hat{Y})}{\partial Z} \\ &= \frac{\partial CE(Y, \hat{Y})}{\partial f(Z)} \cdot \frac{\partial f(Z)}{\partial Z} \\ &= \hat{Y} - Y \\ \end{split}$$
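A quick way to convince yourself that $\delta^2 = \hat{Y} - Y$ is a finite-difference check on a tiny random example. This sketch is self-contained and perturbs each entry of $Z$ directly; all names and sizes are illustrative:

```python
import numpy as np

np.random.seed(0)
n, c = 3, 4
Z = np.random.randn(n, c)
Y = np.eye(c)[np.random.randint(c, size=n)]        # one-hot labels, shape (n, c)

def ce_of_z(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))    # softmax
    Y_hat = e / e.sum(axis=1, keepdims=True)
    return -np.sum(Y * np.log(Y_hat)), Y_hat

loss, Y_hat = ce_of_z(Z)
analytic = Y_hat - Y                                # the claimed delta^2

# central-difference estimate of dCE/dZ, entry by entry
eps = 1e-5
numeric = np.zeros_like(Z)
for i in range(n):
    for j in range(c):
        Zp, Zm = Z.copy(), Z.copy()
        Zp[i, j] += eps
        Zm[i, j] -= eps
        numeric[i, j] = (ce_of_z(Zp)[0] - ce_of_z(Zm)[0]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))           # should be ~1e-9 or smaller
```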
For more, see Peter's notes.
Backpropagate the error (BP2, BP3, BP4)
- Since our goal is to calculate $$\begin{split} \frac{\partial C}{\partial W^1} &= X^T \cdot \delta^1 \quad (BP4) \\ \frac{\partial C}{\partial b^1} &= \delta^1 \quad (BP3) \\ \frac{\partial C}{\partial W^2} &= H^T \cdot \delta^2 \quad (BP4) \\ \frac{\partial C}{\partial b^2} &= \delta^2 \quad (BP3) \\ \end{split}$$
we first need $\delta^1$, which can be calculated using BP2. (Since the examples are stacked row-wise, the bias gradients are the corresponding $\delta$ summed over the $n$ rows, giving $1 \times h$ and $1 \times c$ vectors; see the NumPy sketch below.)
$$\delta^1 = (\delta^2 (W^2)^T) \odot \sigma'(XW^1 + b^1) = (\delta^2 (W^2)^T) \odot H \odot (1 - H)$$
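Putting BP1-BP4 together for this network, with the data stacked row-wise (so the transposes flip relative to the column-vector form earlier). A sketch assuming the `forward` helper above, with the bias gradients summed over the $n$ examples:

```python
import numpy as np

def backward(X, Y, H, Y_hat, W2):
    delta2 = Y_hat - Y                        # (n, c)  output error (BP1)
    dW2 = H.T @ delta2                        # (h, c)  BP4
    db2 = delta2.sum(axis=0, keepdims=True)   # (1, c)  BP3, summed over the n examples
    delta1 = (delta2 @ W2.T) * H * (1 - H)    # (n, h)  BP2 with sigma'(z) = H * (1 - H)
    dW1 = X.T @ delta1                        # (m, h)  BP4
    db1 = delta1.sum(axis=0, keepdims=True)   # (1, h)  BP3
    return dW1, db1, dW2, db2
```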
Update parameters
- $\eta$ is a learning rate
$$\begin{split} W^{1\ new} &= W^1 - \eta \frac{\partial C}{\partial W^1} \\ b^{1\ new} &= b^1 - \eta \frac{\partial C}{\partial b^1} \\ W^{2\ new} &= W^2 - \eta \frac{\partial C}{\partial W^2} \\ b^{2\ new} &= b^2 - \eta \frac{\partial C}{\partial b^2} \\ \end{split}$$
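A sketch of one full training step tying the pieces together; it relies on the `forward`, `backward`, and `cross_entropy` sketches above, and the learning rate `eta = 0.5` is an arbitrary placeholder:

```python
def train_step(X, Y, params, eta=0.5):
    W1, b1, W2, b2 = params
    H, Z, Y_hat = forward(X, W1, b1, W2, b2)           # forward pass
    dW1, db1, dW2, db2 = backward(X, Y, H, Y_hat, W2)  # backward pass
    # gradient descent: theta <- theta - eta * dC/dtheta
    W1 = W1 - eta * dW1
    b1 = b1 - eta * db1
    W2 = W2 - eta * dW2
    b2 = b2 - eta * db2
    return (W1, b1, W2, b2), cross_entropy(Y, Y_hat)
```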