In mathematics, in particular probability theory and related fields, the softmax function, or normalized exponential, is a generalization of the logistic function that "squashes" a K-dimensional vector $\mathbf {z} $ of arbitrary real values to a K-dimensional vector $\sigma (\mathbf {z} )$ of real values in the range (0, 1) that add up to 1.
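By definition, for an $n$-dimensional vector $\vec x$, each element $x_i$ of $\vec x$ is mapped to

$$ P(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$

The softmax output can be interpreted as a probability, hence the notation $P(x_i)$. Now let $\vec x' = \vec x + 3$. Then for each element $x_i'$ we have

$$ \begin{aligned} P(x_i') &= \frac{e^{x_i'}}{\sum_{j=1}^{n} e^{x_j'}} \\ &= \frac{e^{x_i + 3}}{\sum_{j=1}^{n} e^{x_j + 3}} \\ &= \frac{e^{x_i} \cdot e^{3}}{e^{3} \cdot \sum_{j=1}^{n} e^{x_j}} \\ &= \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \end{aligned} $$

The same cancellation works for any constant $c$ in place of 3, so softmax is invariant to adding a constant to every element.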
Note: In practice, we make use of this property and choose $c = -\max_i x_i$ when computing softmax probabilities for numerical stability (i.e., we subtract the maximum element of $x$ from every element of $x$).
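To make the note above concrete, here is a minimal NumPy sketch of a softmax that subtracts the maximum before exponentiating. The function name `softmax` and the sample inputs are illustrative assumptions, not taken from the assignment code.

```python
import numpy as np

def softmax(x):
    """Softmax over the last axis, shifted by c = -max_i x_i for stability."""
    x = np.asarray(x, dtype=float)
    # Subtracting the maximum leaves the result unchanged (shift invariance)
    # but keeps every exponent <= 0, so np.exp cannot overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

x = np.array([1.0, 2.0, 3.0])
p = softmax(x)
p_shifted = softmax(x + 3.0)                            # same vector shifted by 3
p_large = softmax(np.array([1001.0, 1002.0, 1003.0]))   # would overflow without the shift
```

All three results should agree up to floating-point error, since each input differs from `x` only by a constant offset.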
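Here I treat $\vec x$, $\vec y$, and $\vec h$ as column vectors.

$$ \frac{\partial J}{\partial x} = W^{(1)} \delta^{(2)} = W^{(1)} \left( \left( W^{(2)} (\hat{\vec y} - \vec y) \right) \circ \left( \vec h \circ (1 - \vec h) \right) \right) $$

Checking the dimensions:

$$ W^{(1)} \in \mathbb{R}^{D_x \times H}, \ W^{(2)} \in \mathbb{R}^{H \times D_y}, \ \vec y \in \mathbb{R}^{D_y \times 1}, \ \vec h \in \mathbb{R}^{H \times 1}, \ \frac{\partial J}{\partial x} \in \mathbb{R}^{D_x \times 1} $$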
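The column-vector formula can be checked numerically. The sketch below is an assumption-laden illustration: it omits the bias terms, uses small random shapes, and the name `loss_and_grad_x` is hypothetical. It computes $\partial J / \partial x$ with the formula above so it can be compared against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad_x(x, y, W1, W2):
    """Cross-entropy loss and dJ/dx with column vectors (biases omitted, an assumption).

    Shapes: x (Dx, 1), y (Dy, 1), W1 (Dx, H), W2 (H, Dy).
    """
    h = sigmoid(W1.T @ x)                        # (H, 1)
    theta = W2.T @ h                             # (Dy, 1)
    theta = theta - theta.max()                  # stability shift
    y_hat = np.exp(theta) / np.exp(theta).sum()  # softmax, (Dy, 1)
    J = -float(np.sum(y * np.log(y_hat)))
    delta2 = (W2 @ (y_hat - y)) * h * (1.0 - h)  # (H, 1)
    grad_x = W1 @ delta2                         # (Dx, 1)
    return J, grad_x

rng = np.random.default_rng(0)
Dx, H, Dy = 5, 4, 3
x = rng.normal(size=(Dx, 1))
y = np.zeros((Dy, 1))
y[1] = 1.0                                       # one-hot label
W1 = rng.normal(size=(Dx, H))
W2 = rng.normal(size=(H, Dy))
J, grad_x = loss_and_grad_x(x, y, W1, W2)
```

The shape of `grad_x` matches the dimension check, $D_x \times 1$.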
======= While writing the code, I reworked the derivation to follow the conventions in the problem statement.
forward

$$ \begin{aligned} \mathbf z^{(1)} &= \mathbf x \\ \mathbf a^{(1)} &= \mathbf x \\ \mathbf z^{(2)} &= \mathbf a^{(1)} \bullet W^{(1)} + \mathbf b^{(1)} \\ \mathbf a^{(2)} &= \operatorname{sigmoid}(\mathbf z^{(2)}) = \mathbf h \\ \mathbf z^{(3)} &= \mathbf a^{(2)} \bullet W^{(2)} + \mathbf b^{(2)} = \boldsymbol{\theta} \\ \mathbf a^{(3)} &= \operatorname{softmax}(\mathbf z^{(3)}) = \hat{\mathbf y} \end{aligned} $$

backward

$$ \begin{aligned} \delta^{(l)} &= \frac{\partial J}{\partial z^{(l)}} \\ \delta^{(3)} &= \mathbf a^{(3)} - \mathbf y \\ \frac{\partial J}{\partial W^{(2)}} &= \mathbf a^{(2)T} \bullet \delta^{(3)} \end{aligned} $$

The remaining gradients are easiest to follow from the code, so they are not written out here. Make sure you understand the two formulas summarized in Lecture 5.
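The forward and backward passes can be sketched in NumPy as follows. This is a minimal illustration under the row-vector convention used above (each row of `x` is one example); the function name `forward_backward`, the batch shapes, and the random data are assumptions rather than the assignment's actual code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)   # stability shift per row
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_backward(x, y, W1, b1, W2, b2):
    """Return the cross-entropy cost J and a dict of gradients."""
    # forward
    z2 = x @ W1 + b1
    h = sigmoid(z2)                        # a^(2)
    z3 = h @ W2 + b2                       # theta
    y_hat = softmax_rows(z3)               # a^(3)
    J = -np.sum(y * np.log(y_hat))
    # backward
    delta3 = y_hat - y                     # dJ/dz^(3)
    delta2 = (delta3 @ W2.T) * h * (1.0 - h)   # dJ/dz^(2)
    grads = {
        "W2": h.T @ delta3,
        "b2": delta3.sum(axis=0, keepdims=True),
        "W1": x.T @ delta2,
        "b1": delta2.sum(axis=0, keepdims=True),
        "x": delta2 @ W1.T,
    }
    return J, grads

rng = np.random.default_rng(0)
N, Dx, H, Dy = 4, 5, 3, 2
x = rng.normal(size=(N, Dx))
y = np.eye(Dy)[rng.integers(0, Dy, size=N)]    # one-hot labels, one per row
W1, b1 = rng.normal(size=(Dx, H)), rng.normal(size=(1, H))
W2, b2 = rng.normal(size=(H, Dy)), rng.normal(size=(1, Dy))
J, grads = forward_backward(x, y, W1, b1, W2, b2)
```

A finite-difference check on a few entries of `W1` and `W2` is a cheap way to confirm that the analytic gradients are consistent with the cost.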
$J = -\sum p \log q$, where $p$ is a one-hot vector that is 1 only at the correct word and $q$ is the probability produced by softmax. With $i$ denoting the correct word, we therefore have

$$ \begin{aligned} \frac{\partial J}{\partial \hat{\vec r}} &= \frac{\partial \left( -\log \frac{\exp(\vec w_i^T \bullet \hat{\vec r})}{\sum_{j=1}^{|V|} \exp(\vec w_j^T \bullet \hat{\vec r})} \right)}{\partial \hat{\vec r}} \\ &= -\vec w_i + \frac{\partial \log \left( \sum_{j=1}^{|V|} \exp(\vec w_j^T \bullet \hat{\vec r}) \right)}{\partial \hat{\vec r}} \\ &= -\vec w_i + \frac{1}{\sum_{j=1}^{|V|} \exp(\vec w_j^T \bullet \hat{\vec r})} \sum_{x=1}^{|V|} \exp(\vec w_x^T \bullet \hat{\vec r}) \, \vec w_x \\ &= -\vec w_i + \sum_{x=1}^{|V|} \Pr(word_x \mid \vec w, \hat{\vec r}) \, \vec w_x \end{aligned} $$
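The final expression is easy to verify numerically. In the sketch below, the rows of `W` play the role of the output vectors $\vec w_j$ and `i` is the index of the correct word; the name `loss_and_grad_r` and the random data are illustrative assumptions.

```python
import numpy as np

def loss_and_grad_r(r_hat, W, i):
    """Softmax cross-entropy loss and its gradient with respect to r_hat.

    W has shape (|V|, d); row j is the output vector w_j.
    """
    scores = W @ r_hat
    scores = scores - scores.max()                 # stability shift
    probs = np.exp(scores) / np.exp(scores).sum()  # Pr(word_x | w, r_hat)
    loss = -np.log(probs[i])
    grad = -W[i] + probs @ W                       # -w_i + sum_x Pr(word_x) * w_x
    return loss, grad

rng = np.random.default_rng(0)
V, d, i = 6, 4, 2
W = rng.normal(size=(V, d))
r_hat = rng.normal(size=d)
loss, grad = loss_and_grad_r(r_hat, W, i)

# Central-difference estimate of the same gradient, one coordinate at a time.
eps = 1e-5
numeric = np.zeros(d)
for k in range(d):
    step = np.zeros(d)
    step[k] = eps
    numeric[k] = (loss_and_grad_r(r_hat + step, W, i)[0]
                  - loss_and_grad_r(r_hat - step, W, i)[0]) / (2 * eps)
```

`grad` and `numeric` should agree closely if the derivation is right.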
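The derivation is almost identical to the previous one. Note that the sum in the numerator of the original expression reduces to a single term, so the sum in the final result also reduces to a single term. In addition, the leading $-\hat{\vec r}$ term survives only when $w_i$ is the expected word.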
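Read as the gradient with respect to each output vector $\vec w_k$, the remark above gives $(\Pr(word_k) - 1)\,\hat{\vec r}$ for the expected word and $\Pr(word_k)\,\hat{\vec r}$ for every other word. The sketch below encodes that reading; the interpretation, the name `loss_and_grad_W`, and the random data are assumptions.

```python
import numpy as np

def loss_and_grad_W(r_hat, W, i):
    """Softmax cross-entropy loss and its gradient with respect to every row of W."""
    scores = W @ r_hat
    scores = scores - scores.max()                 # stability shift
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[i])
    coeff = probs.copy()
    coeff[i] -= 1.0                                # the -r_hat term appears only for the expected word
    grad_W = np.outer(coeff, r_hat)                # row k is dJ/dw_k
    return loss, grad_W

rng = np.random.default_rng(1)
V, d, i = 5, 3, 0
W = rng.normal(size=(V, d))
r_hat = rng.normal(size=d)
loss, grad_W = loss_and_grad_W(r_hat, W, i)
```

Each row of `grad_W` is a scalar multiple of `r_hat`, which reflects the single surviving term in the sum.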