In [ ]:
%matplotlib inline
from Lec04 import *
In this lecture, we will cover linear regression. First, some basic concepts and examples of linear regression will be introduced. Then we will show that linear regression can be solved as a least squares problem. To solve the least squares problem, two methods will be given: the gradient descent method and the closed-form solution. Finally, the concept of the Moore-Penrose pseudoinverse is introduced, which is closely related to least squares.
In [5]:
regression_example_draw(degree1=0,degree2=1,degree3=3, ifprint=True)
Remark
- In the above image, we try to predict the true sinusoidal curve hidden in the data.
- $0\text{th}$ order means $\phi(x_n)=1$
- $1\text{st}$ order means $\phi(x_n)=[1, x_n]^T$
- $3\text{rd}$ order means $\phi(x_n)=[1, x_n, x_n^2, x_n^3]^T$
- As the degree $M$ increases, the learned curve matches the observed data better (a minimal sketch of these fits follows this list).
- In the real world, the model for a data label is usually $$t = z + \epsilon$$ The true state/label is $z$, but it is corrupted by some noise $\epsilon$, so what we observe is actually $z + \epsilon$. Many machine learning tasks are essentially to infer $z$ for new observations based on the $z + \epsilon$ of the training dataset.
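To make this concrete, here is a minimal sketch (independent of the `Lec04` helpers, with a made-up sample size and noise level) of fitting polynomial models of degree $0$, $1$, and $3$ to noisy sinusoidal data by least squares:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)

# Noisy sinusoidal data: t = z + eps with z = sin(2*pi*x)
N = 10
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

def design_matrix(x, M):
    """Polynomial basis: phi(x) = [1, x, x^2, ..., x^M]."""
    return np.vander(x, M + 1, increasing=True)

for M in (0, 1, 3):
    Phi = design_matrix(x, M)
    # Least squares fit: minimize ||Phi w - t||^2 over w
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    print(f"degree {M}: w = {np.round(w, 3)}")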
In [6]:
basis_function_plot()
Main Idea: Instead of computing the batch gradient (over the entire training set), compute the gradient for an individual training sample and update after each sample.
Remark
The derivative is computed as follows: $$ \begin{aligned} \nabla_{\vec{w}}E(\vec{w} | \vec{x}_n) &= \frac{\partial}{\partial \vec{w}} \left[ \frac12 \left( \vec{w}^T \phi(\vec{x}_n) - t_n \right)^2 \right] \\ &= \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n) \end{aligned} $$
This derivative w.r.t. an individual sample is just one summand of the gradient in batch gradient descent.
- Sometimes, stochastic gradient descent has faster convergence than batch gradient descent.
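Here is a minimal sketch of stochastic gradient descent for least squares built from the per-sample gradient above; the function name, learning rate `eta`, and number of epochs are illustrative choices, not part of the lecture code:
In [ ]:
import numpy as np

def sgd_least_squares(Phi, t, eta=0.01, epochs=100, seed=0):
    """Stochastic gradient descent on E(w) = 1/2 * sum_n (w^T phi_n - t_n)^2.

    Phi: (N, M) design matrix; t: (N,) targets.
    """
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for n in rng.permutation(N):           # visit samples in random order
            phi_n = Phi[n]
            grad = (w @ phi_n - t[n]) * phi_n  # per-sample gradient
            w -= eta * grad                    # update using a single sample
    return w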
Main Idea: Compute the gradient, set it to zero, and solve in closed form.
Remark
- Note that this derivative can simply be obtained from the derivative we just derived for the gradient descent method, which is $$ \begin{aligned} \nabla_{\vec{w}} E(\vec{w}) &= \sum_{n=1}^N \left( \vec{w}^T\phi(\vec{x}_n) - t_n \right)\phi(\vec{x}_n) & \text{non-matrix/vector form} \\ &= \Phi^T\Phi \vec{w} - \Phi^T \vec{t} & \text{matrix/vector form} \end{aligned} $$
- For gradient descent and the closed-form solution we obtained the derivative from two different perspectives: the non-matrix/vector form and the matrix/vector form. You can use whichever you prefer; we suggest the matrix/vector form, which is neater.
- From $\vec{e}=\Phi \vec{w}-\vec{t}$, we know that $\Phi \vec{w}$ is the prediction on the training samples and $\vec{e}$ is just the residual error. What we are doing in least squares is making the predictions on the training samples as close to the training labels as possible, i.e. $\Phi \vec{w} \approx \vec{t}$ (a sketch of the closed-form computation follows below).
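As a rough sketch (the function name and test data are made up, and $\Phi$ is assumed to have full column rank), the closed-form solution can be computed by solving the normal equations instead of forming the inverse explicitly:
In [ ]:
import numpy as np

def least_squares_closed_form(Phi, t):
    """Closed-form least squares: solve the normal equations
    Phi^T Phi w = Phi^T t (assumes Phi has full column rank).

    Mathematically this equals (Phi^T Phi)^{-1} Phi^T t, but solving the
    linear system avoids computing the inverse explicitly.
    """
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Quick check against numpy's built-in least squares routine on random data.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 4))
t = rng.standard_normal(20)
w_closed = least_squares_closed_form(Phi, t)
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_closed, w_lstsq))  # True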
Remark
- A common mistake is writing $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}=\Phi^{-1}(\Phi^T)^{-1}\Phi^T \vec{t}=\Phi^{-1} \vec{t}$. This is wrong because $\Phi$ is not necessarily square and invertible.
- When $\Phi$ is square and invertible, you are free to write that.
Remark
- The solution to $A\vec{x}=\vec{b}$ is $\hat{\vec{x}}=A^{\dagger}\vec{b}$ in the sense that it minimizes $\left \| A\vec{x} - \vec{b} \right \|_2^2$. Meanwhile, among all the minimizers that achieve the same minimal value of $\left \| A\vec{x} - \vec{b} \right \|_2^2$, $\hat{\vec{x}}$ has the smallest norm $\left \| \hat{\vec{x}} \right \|_2$ (a numerical illustration follows this list).
- $A\hat{\vec{x}}$ doesn't necessarily equal $\vec{b}$:
- When $\vec{b}$ is in the column space of $A$, $A\hat{\vec{x}} = \vec{b}$.
- When $\vec{b}$ is not in the column space of $A$, $A\hat{\vec{x}} \neq \vec{b}$.
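Below is a small illustration of the minimum-norm property using NumPy's `np.linalg.pinv` on a made-up underdetermined system (here $A$ has full row rank, so $\vec{b}$ lies in its column space and $A\hat{\vec{x}} = \vec{b}$ exactly):
In [ ]:
import numpy as np

# A wide (underdetermined) system: infinitely many x satisfy A x = b exactly,
# and the pseudoinverse picks the solution with the smallest ||x||_2.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([1.0, 2.0])

x_hat = np.linalg.pinv(A) @ b      # Moore-Penrose pseudoinverse solution
print(np.allclose(A @ x_hat, b))   # True: b is in the column space of A
print(np.linalg.norm(x_hat))       # smallest norm among all exact solutions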
Remark
- In the above derivation, we get $\hat{\vec{w}} = \Phi^\dagger \vec{t}$ by letting the derivative $\nabla_{\vec{w}} E(\vec{w})=0$. Actually, $\hat{\vec{w}}=\Phi^\dagger \vec{t}$ is just the solution to $\Phi \vec{w} = \vec{t}$ in the sense that $\left \| \Phi\vec{w} - \vec{t} \right \|_2^2$ is minimized. Meanwhile, among all the minimizers that achieve the same minimal value of $\left \| \Phi\vec{w} - \vec{t} \right \|_2^2$, $\hat{\vec{w}}$ has the smallest norm $\left \| \vec{w} \right \|_2$.
- Now that we have a neat, closed-form solution to least squares, why do we introduce the gradient descent method?
- $\hat{\vec{w}}=(\Phi^T \Phi)^{-1} \Phi^T \vec{t}$ involves matrix inversion, which can be quite computationally expensive when the dimension $M$ is high. On the other hand, gradient descent involves nothing but matrix-vector multiplications, which can be computed much faster.
Suppose that $f : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$, that is, the function $f$ takes as input a matrix $A$ of size $m \times n$ and returns a real value.
Then, the gradient of $f$ with respect to $A$ is the $m \times n$ matrix of partial derivatives: $$ \nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \frac{\partial f}{\partial A_{m2}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}, \quad \text{i.e.} \quad \left( \nabla_A f(A) \right)_{ij} = \frac{\partial f}{\partial A_{ij}} $$
In particular, if $A$ is a vector $\vec{x} \in \mathbb{R}^n$, $$ \nabla_{\vec{x}} f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$
The gradient maps a point $\vec{x} \in \mathbb{R}^n$ to the vector $\nabla_{\vec{x}} f(\vec{x}) \in \mathbb{R}^n$, and taking gradients is a linear operation: $\nabla_{\vec{x}}\left( a f(\vec{x}) + b g(\vec{x}) \right) = a \nabla_{\vec{x}} f(\vec{x}) + b \nabla_{\vec{x}} g(\vec{x})$ for scalars $a, b$. A small numerical check of a gradient computed with this definition follows below.
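As a concrete instance of this definition, the sketch below (with a made-up $A$ and $\vec{b}$) compares the analytic gradient of $f(\vec{x}) = \frac12 \left \| A\vec{x} - \vec{b} \right \|_2^2$, namely $A^T(A\vec{x} - \vec{b})$, against central finite differences:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)  # f(x) = 1/2 ||Ax - b||^2
grad_f = lambda x: A.T @ (A @ x - b)          # analytic gradient

x0 = rng.standard_normal(3)
eps = 1e-6
numeric = np.array([
    (f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps)  # central difference
    for e in np.eye(3)
])
print(np.allclose(numeric, grad_f(x0), atol=1e-5))   # True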