Exercise: Compute the partial derivative $$ (\nabla_{\vec{w}}E)_j = \frac{\partial E}{\partial w_j} $$ where $$ E(\vec{w}) = \frac{1}{2} \sum_{n=1}^N \left( \sum_{i=1}^{M} w_i \phi_i(\vec{x}_n) - t_n \right)^2 $$
In the inner summation over $i$, only the $i = j$ term depends on $w_j$, so differentiating the inner sum with respect to $w_j$ gives $\phi_j(\vec{x}_n)$. By the chain rule, $$ \frac{\partial E}{\partial w_j} = \frac{1}{2} \sum_{n=1}^N \frac{\partial}{\partial w_j} \left( \sum_{i=1}^{M} w_i \phi_i(\vec{x}_n) - t_n \right)^2 = \sum_{n=1}^{N} \left( \sum_{i=1}^{M} w_i \phi_i(\vec{x}_n) - t_n \right) \phi_j(\vec{x}_n) $$
Tip: If you find the subscript notation confusing, just plug in $j = 1$, differentiate, and get $\sum_{n=1}^{N} \left( \sum_{i=1}^{M} w_i \phi_i(\vec{x}_n) - t_n \right) \phi_1(\vec{x}_n)$.
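As a quick sanity check, here is a minimal NumPy sketch comparing this partial-derivative formula against a central finite difference. The matrix `Phi`, targets `t`, and weights `w` are randomly generated toy data, not part of the notes.

```python
import numpy as np

# Check dE/dw_j = sum_n (sum_i w_i phi_i(x_n) - t_n) * phi_j(x_n)
# against a central finite difference on random toy data.
rng = np.random.default_rng(0)
N, M = 10, 4
Phi = rng.normal(size=(N, M))   # Phi[n, j] plays the role of phi_{j+1}(x_n)
t = rng.normal(size=N)          # targets t_n
w = rng.normal(size=M)          # weights w_i

def E(w):
    return 0.5 * np.sum((Phi @ w - t) ** 2)

j = 1                           # NumPy columns are 0-indexed; this is the second feature
analytic = np.sum((Phi @ w - t) * Phi[:, j])

eps = 1e-6                      # central finite difference
w_plus, w_minus = w.copy(), w.copy()
w_plus[j] += eps
w_minus[j] -= eps
numeric = (E(w_plus) - E(w_minus)) / (2 * eps)

print(analytic, numeric)        # the two values should agree to ~1e-6
```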
The matrix $\Phi \in \mathbb{R}^{N \times M}$ is called the design matrix. Each row represents one sample; each column represents one feature: $$\Phi = \begin{bmatrix} \phi(\vec{x}_1)^\top\\ \phi(\vec{x}_2)^\top\\ \vdots\\ \phi(\vec{x}_N)^\top \end{bmatrix} = \begin{bmatrix} \phi_1(\vec{x}_1) & \phi_2(\vec{x}_1) & \cdots & \phi_{M}(\vec{x}_1) \\ \phi_1(\vec{x}_2) & \phi_2(\vec{x}_2) & \cdots & \phi_{M}(\vec{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(\vec{x}_N) & \phi_2(\vec{x}_N) & \cdots & \phi_{M}(\vec{x}_N) \\ \end{bmatrix} $$
The target vector is $\vec{t} \in \mathbb{R}^N$ (one entry per sample).
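To make the shapes concrete, here is a small NumPy sketch building $\Phi$ and $\vec{t}$ for a polynomial basis $\phi_i(x) = x^{i-1}$; the basis, the toy inputs, and the noisy sine targets are illustrative choices, not prescribed by the notes.

```python
import numpy as np

# Toy design matrix for a polynomial basis phi_i(x) = x**(i-1), i = 1..M
# (phi_1 is the constant/bias feature). Basis and data are illustrative choices.
rng = np.random.default_rng(0)
N, M = 6, 4
x = rng.uniform(-1, 1, size=N)                                 # N scalar inputs x_1..x_N
Phi = np.column_stack([x**(i - 1) for i in range((1), M + 1)]) # Phi[n, i-1] = phi_i(x_n), shape (N, M)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)           # toy targets, shape (N,)
print(Phi.shape, t.shape)                                      # (6, 4) (6,)
```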
Write the objective function in matrix-vector form: $$ \begin{align*} E(\vec{w}) &= \frac{1}{2} \sum_{n=1}^N \left( \sum_{i=1}^{M} w_i \phi_i(\vec{x}_n) - t_n \right)^2 \\ &= \frac{1}{2} \sum_{n=1}^N \left( \phi(\vec{x}_n)^\top \vec{w} - t_n \right)^2 = \frac{1}{2} \|\Phi \vec{w} - \vec{t}\|_2^2 \end{align*} $$
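A two-line numerical check, on made-up random `Phi`, `t`, and `w`, that the sum form and the norm form agree:

```python
import numpy as np

# Verify that the sum over samples equals 0.5 * ||Phi w - t||^2 (random toy data).
rng = np.random.default_rng(1)
N, M = 8, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
w = rng.normal(size=M)

E_loops = 0.5 * sum((Phi[n] @ w - t[n]) ** 2 for n in range(N))
E_matrix = 0.5 * np.linalg.norm(Phi @ w - t) ** 2
print(np.isclose(E_loops, E_matrix))   # True
```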
Rewrite $E$ as a sum of three matrix-vector products. Hint: treat $\Phi \vec{w}$ as a single vector.
Treating $\Phi \vec{w}$ as a single vector and expanding the squared norm with the distributive law gives $$ E(\vec{w}) = \frac{1}{2} \|\Phi \vec{w} - \vec{t}\|_2^2 = \frac{1}{2} \left(\vec{w}^\top \Phi^\top \Phi \vec{w} - \vec{w}^\top \Phi^\top \vec{t} - \vec{t}^\top \Phi \vec{w} + \vec{t}^\top \vec{t}\right). $$
Note that $\vec{w}^\top (\Phi^\top \vec{t}) = (\Phi^\top\vec{t})^\top \vec{w} = \vec{t}^\top \Phi \vec{w}$. So, the above simplifies to $$ \frac{1}{2} \left(\vec{w}^\top \Phi^\top \Phi \vec{w} - 2\vec{w}^\top \Phi^\top \vec{t} + \vec{t}^\top \vec{t}\right).$$
In matrix-vector form, the objective function is therefore $$ E(\vec{w}) = \frac{1}{2} \left(\vec{w}^\top \Phi^\top \Phi \vec{w} - 2\vec{w}^\top \Phi^\top \vec{t} + \vec{t}^\top \vec{t}\right). $$
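Likewise, a quick check on random toy data that the expanded three-term form matches $\frac{1}{2}\|\Phi \vec{w} - \vec{t}\|_2^2$:

```python
import numpy as np

# Check the expanded quadratic form against 0.5 * ||Phi w - t||^2 (toy data).
rng = np.random.default_rng(2)
N, M = 8, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
w = rng.normal(size=M)

E_norm = 0.5 * np.linalg.norm(Phi @ w - t) ** 2
E_expanded = 0.5 * (w @ Phi.T @ Phi @ w - 2 * w @ Phi.T @ t + t @ t)
print(np.isclose(E_norm, E_expanded))   # True
```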
Compute the gradient $\nabla_{\vec{w}} E(\vec{w})$ with matrix calculus. Hints: $\nabla_{\vec{w}} \left( \vec{w}^\top A \vec{w} \right) = (A + A^\top)\vec{w}$, which equals $2A\vec{w}$ when $A$ is symmetric (as $\Phi^\top\Phi$ is), and $\nabla_{\vec{w}} \left( \vec{b}^\top \vec{w} \right) = \vec{b}$, here with $\vec{b} = \Phi^\top\vec{t}$.
So, $\nabla_{\vec{w}} E(\vec{w}) = (\Phi^\top\Phi)\vec{w} - \Phi^\top \vec{t}$.
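To connect this back to the componentwise result from the first exercise, here is a short NumPy check on random toy data that $\Phi^\top\Phi\vec{w} - \Phi^\top\vec{t}$ stacks exactly those partial derivatives:

```python
import numpy as np

# Compare the matrix-form gradient with the per-component partials (toy data).
rng = np.random.default_rng(3)
N, M = 10, 4
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
w = rng.normal(size=M)

grad_matrix = Phi.T @ Phi @ w - Phi.T @ t
grad_components = np.array([np.sum((Phi @ w - t) * Phi[:, j]) for j in range(M)])
print(np.allclose(grad_matrix, grad_components))   # True
```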
Exercise: How do you implement the update rule for minibatch gradient descent (with a batch size of, say, 5% of the whole dataset)?
Answer: Randomly choose 5% of the indices between $1$ and $N$, take the corresponding rows of $\Phi$ and entries of $\vec{t}$, compute the gradient on this subset of the data, and take a step along the negative gradient, as sketched below.
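A minimal NumPy sketch of that update rule. The synthetic dataset, learning rate, and step count are arbitrary choices for illustration, not values prescribed by the notes.

```python
import numpy as np

# Minibatch gradient descent for E(w) = 0.5 * ||Phi w - t||^2 on synthetic data.
rng = np.random.default_rng(4)
N, M = 1000, 5
Phi = rng.normal(size=(N, M))
w_true = rng.normal(size=M)
t = Phi @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(M)
batch_size = max(1, int(0.05 * N))   # 5% of the dataset
lr = 0.01                            # arbitrary learning rate

for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)  # random 5% of the row indices
    Phi_b, t_b = Phi[idx], t[idx]                        # corresponding rows of Phi and t
    grad = Phi_b.T @ (Phi_b @ w - t_b)                   # gradient on the minibatch
    w -= lr * grad                                       # step along the negative gradient

print(np.linalg.norm(w - w_true))   # should be small if the run has converged
```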
Answer: It is sufficient to find a local minimum because $E(\vec{w})$ is convex, so every local minimum is a global minimum. Setting the gradient to zero gives the closed-form solution $\vec{w} = (\Phi^\top\Phi)^{-1}\Phi^\top \vec{t}$ (assuming $\Phi^\top\Phi$ is invertible).
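A sketch of computing this closed-form solution in NumPy on toy data; solving the normal equations (or calling `np.linalg.lstsq`) is generally preferred over forming the inverse explicitly.

```python
import numpy as np

# Closed-form least-squares solution on toy data.
rng = np.random.default_rng(5)
N, M = 200, 5
Phi = rng.normal(size=(N, M))
t = Phi @ rng.normal(size=M) + 0.1 * rng.normal(size=N)

w_normal_eq = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # solves (Phi^T Phi) w = Phi^T t
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # same solution, more numerically stable
print(np.allclose(w_normal_eq, w_lstsq))                # True
```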
Exercise: Show that $\Phi^\top \Phi$ is invertible if $\Phi$ has linearly independent columns. Interpret the implications for our features.
Answer: If $\Phi^\top\Phi \vec{v} = \vec{0}$, then $\vec{v}^\top\Phi^\top\Phi\vec{v} = \|\Phi\vec{v}\|_2^2 = 0$, so $\Phi\vec{v} = \vec{0}$, and linear independence of the columns forces $\vec{v} = \vec{0}$; hence the square matrix $\Phi^\top\Phi$ is invertible. The implication: the closed-form solution is well defined exactly when the features (the columns of $\Phi$) are linearly independent on the data. If some feature is a linear combination of the others, $\Phi^\top\Phi$ is singular and the normal equations have no unique solution.
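A small illustration on toy data, with one feature deliberately chosen as a multiple of another: the redundant feature makes $\Phi^\top\Phi$ rank-deficient.

```python
import numpy as np

# Linearly dependent columns make Phi^T Phi singular (toy data).
rng = np.random.default_rng(6)
N = 20
x = rng.normal(size=N)
Phi_indep = np.column_stack([np.ones(N), x, x**2])   # linearly independent columns
Phi_dep = np.column_stack([np.ones(N), x, 2 * x])    # third column = 2 * second column

print(np.linalg.matrix_rank(Phi_indep.T @ Phi_indep))  # 3: full rank, invertible
print(np.linalg.matrix_rank(Phi_dep.T @ Phi_dep))      # 2: singular, no unique solution
```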
Challenge: Similarly, we can show $\Phi\Phi^\top$ is invertible if $\Phi$ has linearly independent rows. Why do we care/not care about this case?
Challenge: Show that $\vec{b}$ is in the column space of $A$ if and only if there exists a vector $\vec{x}$ such that $A\vec{x} = \vec{b}$.
Exercise: One property of the pseudoinverse is that $A A^\dagger A = A$. Show that $A^\dagger = (A^{\top} A)^{-1}A^\top$ satisfies this property (assuming $A$ has linearly independent columns).
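One way to verify it, grouping the rightmost factors so that $(A^\top A)^{-1}$ cancels against $A^\top A$: $$ A\left[(A^\top A)^{-1}A^\top\right]A = A\,(A^\top A)^{-1}\left(A^\top A\right) = A. $$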
Challenge: Show that $$\hat{\vec{w}} = (\Phi^\top\Phi)^\dagger \Phi^\top \vec{t} = \Phi^\dagger \vec{t}$$ satisfies $\nabla_\vec{w} E(\vec{w}) = \Phi^\top\Phi \vec{w} - \Phi^\top \vec{t} = 0$.
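A numerical spot-check of this claim with `np.linalg.pinv` on made-up random data (it does not replace the algebraic argument):

```python
import numpy as np

# Check that w_hat = pinv(Phi) @ t makes the gradient Phi^T Phi w - Phi^T t vanish.
rng = np.random.default_rng(7)
N, M = 50, 4
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

w_hat = np.linalg.pinv(Phi) @ t
grad = Phi.T @ Phi @ w_hat - Phi.T @ t
print(np.allclose(grad, 0))   # True, up to floating-point error
```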
Discuss: What are the advantages and disadvantages of each method we learned today (stochastic gradient descent, batch gradient descent, and the closed-form solution)?
Answer: There is no single right answer, but you can say things like: matrix inversion is a cubic-time operation in practice (asymptotically $O(n^{2.37\ldots})$ in theory), so the closed-form solution becomes expensive for many features, while gradient descent scales to large datasets but requires choosing a learning rate. Also, performing better on your training data $\Phi$ doesn't necessarily mean performing better on unseen test data.