Derive the gradients of cross entropy loss w.r.t. inputs

2017-04-14 jkang
Python 3.5

Softmax function and derivatives

The softmax function maps an input $\hat{x}$ to an output probability vector $\hat{s}$ (both $\hat{x}$ and $\hat{s}$ are in $\mathbb{R}^{C\times1}$, where $C$ is the number of classes): $$ \hat{s} = \mathbf{softmax}(\hat{x}) $$

which can be written for each class as: $$\begin{split} s_i &= \mathbf{softmax}_i(\hat{x}) \\ &= \frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}} \end{split}$$
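As a quick illustration, here is a minimal NumPy sketch of the per-class formula above (not part of the derivation itself; subtracting $\max(\hat{x})$ is an assumed numerical-stability trick that cancels in the ratio and does not change the result):

```python
import numpy as np

def softmax(x):
    """Map an input vector x of shape (C,) to probabilities s of shape (C,)."""
    e = np.exp(x - np.max(x))   # shift by max(x) for numerical stability
    return e / np.sum(e)        # s_i = e^{x_i} / sum_c e^{x_c}

x = np.array([1.0, 2.0, 3.0])
s = softmax(x)
print(s, s.sum())               # the outputs are positive and sum to 1
```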

We are interested in how much the output $\hat{s}$ changes with the input $\hat{x}$. Because of the shared denominator, every component of $\hat{x}$ affects each $s_i$, so to describe the relationship between $s_i$ and $\hat{x}$ we need to treat the case $i = j$ (the derivative of $s_i$ w.r.t. $x_i$) separately from the case $i \ne j$ (the derivative of $s_i$ w.r.t. $x_j$).

Before calculating the derivative of the softmax function, it helps to recall the quotient rule, which gives the derivative of the quotient of two functions: $$ \left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2} $$
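The rule can be sanity-checked symbolically (a minimal sketch, assuming SymPy is available; the denominator below is just an arbitrary stand-in for the softmax sum):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x)                       # numerator, like e^{x_i}
g = 1 + sp.exp(x) + sp.exp(2*x)     # stand-in for the softmax denominator
quotient_rule = (sp.diff(f, x)*g - f*sp.diff(g, x)) / g**2
direct = sp.diff(f/g, x)
print(sp.simplify(quotient_rule - direct))   # prints 0: both derivatives agree
```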

Here is the calculation of the derivatives of the softmax function, applying the quotient rule with $f = e^{x_i}$ and $g = \sum_{c=1}^C e^{x_c}$. $$\begin{split} When\ i = j,\ & \frac{\partial s_i}{\partial x_i} = \frac{\partial}{\partial x_i}\left(\frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}}\right) = \frac{e^{x_i}\sum_{c=1}^C e^{x_c} - e^{x_i}\cdot e^{x_i}}{\left(\sum_{c=1}^C e^{x_c}\right)^2} \\ &= \frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}} \cdot \frac{\sum_{c=1}^C e^{x_c} - e^{x_i}}{\sum_{c=1}^C e^{x_c}} \\ &= \frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}} \left( 1 - \frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}} \right) \\ &= s_i (1 - s_i) \end{split}$$
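The $i = j$ result can be checked numerically with a central finite difference on the softmax sketched earlier (a minimal sketch, assuming NumPy; the test vector $\hat{x}$ is arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([0.5, -1.2, 2.0])
i, eps = 0, 1e-6
s = softmax(x)

# central finite difference of s_i with respect to x_i
x_plus, x_minus = x.copy(), x.copy()
x_plus[i] += eps
x_minus[i] -= eps
numeric = (softmax(x_plus)[i] - softmax(x_minus)[i]) / (2 * eps)

analytic = s[i] * (1 - s[i])        # s_i (1 - s_i)
print(numeric, analytic)            # the two values agree to high precision
```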

$$\begin{split} When\ i \ne{j},\ & \frac{\partial s_i}{\partial x_j} = \frac{\partial}{\partial x_j}\left(\frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}}\right) = \frac{0 \cdot \sum_{c=1}^C e^{x_c} - e^{x_i}\cdot e^{x_j}}{\left(\sum_{c=1}^C e^{x_c}\right)^2} \\ &= -\frac{e^{x_i}e^{x_j}}{\left(\sum_{c=1}^C e^{x_c}\right)^2} = -\frac{e^{x_i}}{\sum_{c=1}^C e^{x_c}} \cdot \frac{e^{x_j}}{\sum_{c=1}^C e^{x_c}} = -s_i s_j \end{split}$$
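Putting both cases together, the Jacobian of softmax is $\mathrm{diag}(\hat{s}) - \hat{s}\hat{s}^T$: it has $s_i(1 - s_i)$ on the diagonal ($i = j$) and $-s_i s_j$ off the diagonal ($i \ne j$). A minimal sketch of a finite-difference check (assuming NumPy; the test vector is arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_jacobian(x):
    # diag(s) - s s^T: s_i(1 - s_i) on the diagonal, -s_i s_j elsewhere
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([0.5, -1.2, 2.0])
eps = 1e-6
numeric = np.zeros((len(x), len(x)))
for j in range(len(x)):
    # central finite difference of every s_i with respect to x_j
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[j] += eps
    x_minus[j] -= eps
    numeric[:, j] = (softmax(x_plus) - softmax(x_minus)) / (2 * eps)

print(np.allclose(softmax_jacobian(x), numeric, atol=1e-8))   # True
```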