Homework 1: Differentiation

Since it is easy to google every task, please, please, please try to understand what's going on. A "just answer" will not be counted, so make sure to present the derivation of your solution. It is absolutely OK if you found an answer on the web; in that case, just exercise your $\LaTeX$ by copying it in here. A good way to derive solutions for these tasks is to work them out for single elements and then generalize to the resulting matrix/vector.

Useful links: 1 2 3

ex. 1

Scalar w.r.t. vector: $$ y = c^Tx, \quad x \in \mathbb{R}^N $$

$$ \frac{dy}{dx} = $$

Assuming that $x$ is a column vector of size $N$ and $c$ is likewise a column vector of size $N$, we can expand $y = c^T x$:

$$ y = c_1 x_1 + c_2 x_2 + ... + c_N x_N $$

Since $\frac{dy}{dx}$ is a (row) vector of size $N$:

$$ \frac{dy}{dx} = \begin{bmatrix} \frac{\partial{y}}{\partial{x_1}} & \dots & \frac{\partial{y}}{\partial{x_N}} \end{bmatrix} $$

Substituting the expanded expression for $y$ and differentiating term by term gives the final answer:

$$ \frac{dy}{dx} = \begin{bmatrix} c_1 & \dots & c_N \end{bmatrix} = \boxed{c^T} $$
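As a quick sanity check (not part of the required derivation), the boxed result can be verified numerically with NumPy: a central finite-difference estimate of $\frac{dy}{dx}$ for $y = c^T x$ should match $c$ for random inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
c = rng.normal(size=N)
x = rng.normal(size=N)

y = lambda x: c @ x  # y = c^T x

# Central finite differences: dy/dx_i ≈ (y(x + eps*e_i) - y(x - eps*e_i)) / (2*eps)
eps = 1e-6
grad = np.zeros(N)
for i in range(N):
    e = np.zeros(N)
    e[i] = eps
    grad[i] = (y(x + e) - y(x - e)) / (2 * eps)

print(np.allclose(grad, c))  # True: dy/dx = c^T (written as a row vector)
```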

ex. 2

Vector w.r.t. vector: $$ y = \sum_{j=1}^{N} cx^T, \quad c \in \mathbb{R}^{M},\; x \in \mathbb{R}^{N},\; cx^T \in \mathbb{R}^{M \times N} $$

$$ \frac{dy}{dx} = $$

Assuming $c$ to be a column vector of size $M$ and $x$ to be a column vector of size $N$, and reading the sum as running over the $N$ columns of $cx^T$ (so that $y \in \mathbb{R}^{M}$):

$$cx^T = \begin{bmatrix} c_1 x_1 & \dots & c_1 x_N \\ \vdots & \ddots & \vdots \\ c_M x_1 & \dots & c_M x_N \end{bmatrix} \implies y = \begin{bmatrix} c_1 \sum_{i=1}^N x_i \\ \vdots \\ c_M \sum_{i=1}^N x_i \\ \end{bmatrix} $$

Finally:

$$ \frac{dy}{dx} = \begin{bmatrix} \frac{\partial{y_1}}{\partial{x_1}} & \dots & \frac{\partial{y_1}}{\partial{x_N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial{y_M}}{\partial{x_1}} & \dots & \frac{\partial{y_M}}{\partial{x_N}} \end{bmatrix} = \begin{bmatrix} c_1 & \dots & c_1 \\ \vdots & \ddots & \vdots \\ c_M & \dots & c_M \end{bmatrix} = \boxed{ c \boldsymbol{1}^T } $$
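A short numerical sketch of the same result (assuming the column-sum reading of the task used above): the finite-difference Jacobian of $y(x) = \left(\sum_i x_i\right) c$ should equal $c\boldsymbol{1}^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4
c = rng.normal(size=M)
x = rng.normal(size=N)

y = lambda x: np.outer(c, x).sum(axis=1)  # sum of the columns of c x^T, shape (M,)

# Build the Jacobian column by column with central finite differences
eps = 1e-6
J = np.zeros((M, N))
for j in range(N):
    e = np.zeros(N)
    e[j] = eps
    J[:, j] = (y(x + e) - y(x - e)) / (2 * eps)

print(np.allclose(J, np.outer(c, np.ones(N))))  # True: dy/dx = c 1^T
```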

ex. 3

Vector w.r.t. vector: $$ y = x x^T x, \quad x \in \mathbb{R}^{N} $$

$$ \frac{dy}{dx} = $$

Assuming that $x$ is a column vector of size $N$, let's break it down step by step:

$$ xx^T = \begin{bmatrix} x_1 x_1 & \dots & x_1 x_N \\ \vdots & \ddots & \vdots \\ x_N x_1 & \dots & x_N x_N \end{bmatrix} $$

Multiplying it by $x$ from the right:

$$ xx^Tx = \begin{bmatrix} x_1 \sum_{i=1}^N x_i^2 \\ \vdots \\ x_N \sum_{i=1}^N x_i^2 \end{bmatrix} $$

Since $y_k = x_k \sum_{i=1}^N x_i^2$, each entry of the Jacobian is $\frac{\partial{y_k}}{\partial{x_j}} = \delta_{kj} \sum_{i=1}^N x_i^2 + 2 x_k x_j$. Let's write out the $\frac{d{y}}{d{x}}$: $$ \frac{d{y}}{d{x}} = \begin{bmatrix} \frac{\partial{y_1}}{\partial{x_1}} & \dots & \frac{\partial{y_1}}{\partial{x_N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial{y_N}}{\partial{x_1}} & \dots & \frac{\partial{y_N}}{\partial{x_N}} \end{bmatrix} = \begin{bmatrix} 2x_1^2 + \sum_{i=1}^N x_i^2 & \dots & 2x_1 x_N \\ \vdots & \ddots & \vdots \\ 2x_N x_1 & \dots & 2x_N^2 + \sum_{i=1}^N x_i^2 \end{bmatrix} = \boxed{x^T x \, I + 2xx^T} $$
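Again, a minimal numerical sketch (not required by the task) confirming the boxed Jacobian with finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
x = rng.normal(size=N)

y = lambda x: x * (x @ x)  # x x^T x = (x^T x) x, shape (N,)

# Central finite-difference Jacobian, one column per input coordinate
eps = 1e-6
J = np.zeros((N, N))
for j in range(N):
    e = np.zeros(N)
    e[j] = eps
    J[:, j] = (y(x + e) - y(x - e)) / (2 * eps)

expected = (x @ x) * np.eye(N) + 2 * np.outer(x, x)
print(np.allclose(J, expected))  # True: dy/dx = x^T x I + 2 x x^T
```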

ex. 4

Derivatives for the parameters of the Dense layer:

Given: $$ Y = XW, \quad Y \in \mathbb{R}^{N \times OUT}, X \in \mathbb{R}^{N \times IN}, W \in \mathbb{R}^{IN \times OUT} $$

The derivative of a hypothetical loss function w.r.t. $Y$ is known: $\Delta Y \in \mathbb{R}^{N \times OUT}$

Task: please derive the gradient of the loss w.r.t. the weight matrix $W$: $\Delta W \in \mathbb{R}^{IN \times OUT}$. Use the chain rule. First derive each element of $\Delta W$, then generalize to the matrix form.

Useful link: http://cs231n.stanford.edu/vecDerivs.pdf

As advised in the notes linked above, let's first suppose $X$ and $Y$ are row vectors of size $1\times IN$ and $1\times OUT$ respectively:

$$ y_i = \sum_{j=1}^{IN} x_j W_{ji} $$

Here we get $$ \frac{\partial{y_i}}{\partial{W_{ji}}} = x_j $$
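This per-element statement can be spot-checked numerically (a small sketch, not part of the required derivation): perturbing a single entry $W_{ji}$ changes $y_i$ at a rate equal to $x_j$.

```python
import numpy as np

rng = np.random.default_rng(3)
IN, OUT = 4, 3
x = rng.normal(size=(1, IN))   # a single sample as a row vector
W = rng.normal(size=(IN, OUT))

j, i = 2, 1                    # check dy_i / dW_ji for one arbitrary index pair
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[j, i] += eps
Wm[j, i] -= eps

dyi_dWji = ((x @ Wp) - (x @ Wm))[0, i] / (2 * eps)
print(np.allclose(dyi_dWji, x[0, j]))  # True: dy_i/dW_ji = x_j
```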

Now let's stack $N$ samples in $X$, where each sample is a row of size $1\times IN$:

$$ y_{ik} = \sum_{j=1}^{IN} x_{ij} W_{jk} $$

Then: $$ \frac{\partial{y_{ik}}}{\partial{W_{jk}}} = x_{ij} $$ i.e., the relevant partial derivatives of $Y$ w.r.t. $W$ are given by the entries of $X$.

Applying the chain rule and summing over the samples (every $y_{ik}$, $i = 1, \dots, N$, depends on $W_{jk}$):

$$ \Delta W_{jk} = \frac{\partial{L}}{\partial{W_{jk}}} = \sum_{i=1}^{N} \frac{\partial{L}}{\partial{y_{ik}}} \frac{\partial{y_{ik}}}{\partial{W_{jk}}} = \sum_{i=1}^{N} \Delta Y_{ik} \, x_{ij} = (X^T \Delta Y)_{jk} $$

Finally:

$$\boxed{\Delta W = X^T \Delta Y}$$
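The boxed formula can also be checked numerically against finite differences of a concrete loss. Below, $L = \frac{1}{2}\sum_{i,k} Y_{ik}^2$ is just a stand-in loss chosen for the check (an assumption, not part of the task), which gives $\Delta Y = XW$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, IN, OUT = 6, 4, 3
X = rng.normal(size=(N, IN))
W = rng.normal(size=(IN, OUT))

loss = lambda W: 0.5 * np.sum((X @ W) ** 2)  # stand-in loss, so dL/dY = XW
dY = X @ W
dW_formula = X.T @ dY                        # the boxed result: ΔW = X^T ΔY

# Finite-difference gradient of the loss w.r.t. every entry of W
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(IN):
    for k in range(OUT):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        dW_numeric[j, k] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(dW_formula, dW_numeric))  # True
```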