Complementary Problem Set

原始文档链接：http://cs224d.stanford.edu/assignment1/assignment1.pdf

1. Softmax

softmax的准确定义，Wikipedia上写的很清楚：

In mathematics, in particular probability theory and related fields, the softmax function, or normalized exponential, is a generalization of the logistic function that "squashes" a K-dimensional vector $\mathbf {z} $ of arbitrary real values to a K-dimensional vector $\sigma (\mathbf {z} )$ of real values in the range (0, 1) that add up to 1.

那么，根据定义，对于一个$\ n\ $维向量$\ \vec x\ $，向量$\ \vec x\ $中的每一个元素$\ x_i$，将会变成 $$ P(x_i)= \frac{e ^ {x_i}}{\sum_{j=1}^{n} e^{x_j}} $$ softmax的结果可以解释为概率，因此这里用$P(x_i)$表示。那么，若$\ \vec x_i' = \vec x + 3\ $，那么对于每一个元素$x_i'$，有 $$P(x_i') = \frac{e^{x{_i}'} }{\sum_{j=1}^{n}e^{x_j'}} \\ = \frac{e^{x{_i} + 3} }{\sum_{j=1}^{n}e^{x_j + 3}} \\ = \frac{e^{x{_i}} * e ^ 3}{e^3 * \sum_{j=1}^{n}e^{x_j + 3}} \\ = \frac{e ^ {x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

给的材料里写了：

Note: In practice, we make use of this property and choose $c = − max_i\{x_i\}$ when computing softmax probabilities for numerical stability (i.e. subtracting its maximum element from all elements of x).

2. Neural Network Basics

(a)

$$\sigma'(z) = (\frac{1}{1 + e ^{-z}})' \\ = - (\frac{1}{1 + e ^ {-z}}) ^ 2 * (-e^{-z}) \\ = \frac{1}{1 + e ^ {-z}} * \frac{e ^ {-z}}{1 + e ^ {-z}} \\ = \sigma(z) * (1-\sigma(z))$$

(b)

$$ \frac{\partial CE(\vec y , \hat{\vec y})}{\partial {\vec \theta}} = \frac{-\sum_i^m y_i * log(\hat y_i)}{\partial {\vec \theta}} = ①$$$$ \text{assume} \ y_i = \begin{cases} 1, & \text{if $i$ = k} \\[2ex] 0, & \text{if $i \ != k$} \end{cases} $$$$ ① = \frac{-log(y_k)}{\partial {\vec \theta}} = \frac{-log(\frac{e^{\theta_k}}{\sum_{x=1}^m e ^ {\theta_x}})}{\partial {\vec \theta}} \\ = \frac{-\theta_k + log(\sum_{x=1}^m e ^ {\theta_x})}{\partial {\vec \theta}} = ② $$$$\text{if i != k, for each $\theta_i$, we have } ② = \frac{\partial(-\theta_k)}{\partial \theta_i} + \frac{\partial log(\sum_{x=1}^m e ^ {\theta_x})}{\partial \theta _i} = \frac{\partial(-\theta_k)}{\partial \theta_i} + \frac{e^{\theta_i}}{\sum_{x=1}^m e ^ {\theta_x}} = 0 + \hat y_i =\hat y_i \text{, } $$$$ \text{if i != k, apparently we have ② } = \hat y_i - 1 \text{, } $$$$ \text{thus, }\frac{\partial CE(\vec y , \hat{\vec y})}{\partial {\vec \theta}} = \hat {\vec y} - \vec y $$

(c)

我是按$\vec x, \vec y, \vec h$都是列向量来做的。 $$ \frac{\partial J}{\partial x} = W^{(1)} * \delta^{(2)} \ = W^{(1)} * ((W^{(2)} * (\hat{\vec y} - \vec y)) \circ (\vec h \circ (1 - \hat h))) $$ 验证维度 $$W^{(1)} \in D_x * H, \ W^{(2)} \in H * D_y, \ \vec y \in D_y * 1, \ \vec h \in H * 1, \frac{\partial J}{\partial x} \in D_x * 1$$

======= 编程的时候按题意重新梳理了一遍。

forward $$ \mathbf z^{(1)} = \mathbf {x} \\ \mathbf a^{(1)} = \mathbf {x} \\ \mathbf z^{(2)} = \mathbf a^{(1)} \bullet W ^ {(1)} + \mathbf b^{(1)} \\ \mathbf a^{(2)} = sigmoid(\mathbf z^{(2)}) = \mathbf h \\ \mathbf z^{(3)} = \mathbf a^{(2)} \bullet W ^{(2)} + \mathbf b^{(2)} = \mathbf \theta\\ \mathbf a ^{(3)} = softmax(\mathbf z^{(3)})= \hat{\mathbf y} $$ backward $$ \delta^{(l)} = \frac{\partial J}{\partial z^{(l)}} \\ \delta^{(3)} = \mathbf a^{(3)} - \mathbf y \\ \frac{\partial J}{\partial W^{(2)}} = \mathbf a^{(2)T} \bullet \delta^{(3)}\\ \text{其他的看代码理解吧，懒得打了} $$ 一定要理解第五节课中总结的两个式子

(d)

$$(D_x * H + H) + (H * D_y + D_y) $$

3. word2vec

(a)

$ J = - \sum p log q $, 其中p是one hot vector，只有正确词是1，q是softmax求得的probability，因此，有 $$ \frac{\partial J}{\partial \hat{\vec r}} = \frac{\partial (-1 * log(\frac{exp(\vec w_i^T, \hat {\vec r} )}{\sum_{j=1}^{|v|} exp(\vec w_j^T \bullet \hat{\vec r})}))}{\partial \hat{\vec r}} \\ = - \vec w_i + \frac{\partial (log(\sum_{j=1}^{|v|} exp(\vec w_j^T \bullet \hat{\vec r})))}{\partial \hat{\vec r}} \\ = - \vec w_i + \frac{1}{\sum_{j=1}^{|V|} exp(\vec w_j^T \bullet \hat {\vec r} )} * \sum_{x=1}^{|V|} exp(\vec w_x^T, \hat{\vec r}) * \vec w_x \\ = - \vec w_i + \sum_{x=1}^{|V|}Pr(word_x|\ \vec w, \hat {\vec r}) * \vec w_x $$

(b)

和前面的推导几乎一样，注意原式分子的sum只会剩下一项，因此，最终结果中的sum也只剩下一项；另外，$w_i$当i是expected word时，才会留下前面的$-\hat {\vec r}$

(c)

$$ \require{cancel} \frac{\partial J}{\partial \hat {\vec r}} = -\frac{1}{\cancel{\sigma ()}} * \cancel{\sigma()} * (1-\sigma()) * \vec w_i - \sum_{k=1}^K \frac{1}{\sigma()} * \sigma() * (1-\sigma) * (-\vec {w_k}) \\ = (\sigma(w_i^T \bullet \hat{\vec r}) - 1)* \vec w_i + \sum_{k=1}^K (1-\sigma(-w_k^T \bullet \hat{\vec r})) \vec w_k $$$$ j = \text{expected word i, we have} \frac{\partial J}{\partial w_i} = (\sigma(w_i^T \bullet \hat{\vec r}) - 1) * \hat{\vec r} \\ j != i \text{and j in K}, \frac{\partial J}{\partial w_j} = (1-\sigma(-w_j^T \bullet \hat{\vec r})) * \hat{\vec r} \\ \text{j != i and j not in K}, \frac{\partial J}{\partial w_j} =0 $$

(d)

$$ \frac{\partial J}{\partial \mathbf {v_{w_i}}} = \sum _{-c \le j \le C, j \ne c} \frac{\partial F(\mathbf v'_{w_{i+j}}, \mathbf v_{w_i})}{\partial \mathbf {v_{w_i}}} $$$$ \frac{\partial J}{\partial \mathbf {v'_{w_{i+j}}}} = \frac{\partial F(\mathbf v'_{w_{i+j}}, \mathbf v_{w_i})}{\partial \mathbf {v'_{w_{i+j}}}} \text{for $ -c \le j \le c$ and $ j \ne 0$} $$