$$ \LaTeX \text{ command declarations here.} \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} $$

EECS 545: Machine Learning

Lecture 04: Linear Regression I

  • Instructors: Jacob Abernethy, Benjamin Bray, Jia Deng, and Chansoo Lee
  • Date: 9/21/2016

Notation

  • In this lecture, we will use the following notation:
    • Let the vector $\vec{x}_n \in \R^D$ denote the $n\text{th}$ data point. $D$ denotes the number of attributes in the dataset.
    • Let the vector $\phi(\vec{x}_n) \in \R^M$ denote the features for data point $\vec{x}_n$. $\phi_j(\vec{x}_n)$ denotes the $j\text{th}$ feature of $\vec{x}_n$.
    • The feature vector $\phi(\vec{x}_n)$ is constructed in a preprocessing step and is usually some combination of transformations of $\vec{x}_n$. For example, $\phi(\vec{x}_n)$ could be the vector $[\vec{x}_n^\top, \cos(\vec{x}_n)^\top, \exp(\vec{x}_n)^\top]^\top$, with $\cos$ and $\exp$ applied element-wise. If we do nothing to $\vec{x}_n$, then $\phi(\vec{x}_n)=\vec{x}_n$.
    • Continuous-valued target vector $\vec{t} \in \R^N$ (target values). $t_n \in \R$ denotes the target value for the $n\text{th}$ data point.

Linear Regression

Linear Regression (General Case)

  • The function $y(\vec{x}_n, \vec{w})$ is linear in parameters $\vec{w}$.
    • Goal: Find the best value for the weights $\vec{w}$.
    • For simplicity, add a bias term $\phi_0(\vec{x}_n) = 1$. $$ \begin{align} y(\vec{x}_n, \vec{w}) &= w_0 \phi_0(\vec{x}_n)+w_1 \phi_1(\vec{x}_n)+ w_2 \phi_2(\vec{x}_n)+\dots +w_{M-1} \phi_{M-1}(\vec{x}_n) \\ &= \sum_{j=0}^{M-1} w_j \phi_j(\vec{x}_n) \\ &= \vec{w}^\top \phi(\vec{x}_n) \end{align} $$ where $\phi(\vec{x}_n) = [\phi_0(\vec{x}_n),\phi_1(\vec{x}_n),\phi_2(\vec{x}_n), \dots, \phi_{M-1}(\vec{x}_n)]^\top$ (see the code sketch below).
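As a quick illustration (a minimal sketch of our own, not from the lecture), here is the prediction $y(\vec{x}_n, \vec{w}) = \vec{w}^\top \phi(\vec{x}_n)$ with a simple feature map consisting of the bias term plus the raw attributes:

```python
import numpy as np

# Minimal sketch: evaluate y(x, w) = w^T phi(x) with an illustrative feature map.
def phi(x):
    """Map a data point x in R^D to features [1, x_1, ..., x_D] (phi_0 = 1 is the bias)."""
    return np.concatenate(([1.0], x))

def y(x, w):
    """Prediction y(x, w) = w^T phi(x)."""
    return w @ phi(x)

x_n = np.array([0.5, -1.2, 3.0])      # one data point with D = 3 attributes
w = np.array([0.1, 2.0, -0.3, 0.7])   # M = D + 1 weights (including the bias weight w_0)
print(y(x_n, w))
```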

Method I: Batch Gradient Descent

  • To minimize the objective function, take the derivative w.r.t. the coefficient vector $\vec{w}$ and descend: initialize $\vec{w}^0$ to be any vector, and at each step $s$, for some step size $\eta > 0$, $$ \vec{w}^{s+1} \gets \vec{w}^{s} - \eta\, \nabla_{\vec{w}}E(\vec{w}^s) $$ (a sketch of this loop follows below).
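A minimal sketch of this descent loop (our own, not from the lecture); `grad_E` is a stand-in for any function that returns $\nabla_{\vec{w}} E(\vec{w})$, and `eta` is the step size:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.01, num_steps=1000):
    """Repeatedly step opposite the gradient: w^{s+1} <- w^s - eta * grad E(w^s)."""
    w = np.array(w0, dtype=float)   # initialize w^0 to any vector
    for _ in range(num_steps):
        w = w - eta * grad_E(w)
    return w
```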

Exercise: Compute the partial derivative $$ (\nabla_{\vec{w}}E)_j = \frac{\partial E}{\partial w_j} $$ where $$ E(\vec{w}) = \frac{1}{2} \sum_{n=1}^N \left( \sum_{i=0}^{M-1} w_i \phi_i(\vec{x}_n) - t_n \right)^2 $$

Solution

In the inner summation over $i$, only the $i = j$ term depends on $w_j$. So, by the chain rule, $$ \frac{\partial E}{\partial w_j} = \frac{1}{2} \sum_{n=1}^N 2\left( \sum_{i=0}^{M-1} w_i \phi_i(\vec{x}_n) - t_n \right) \phi_j(\vec{x}_n) = \sum_{n=1}^{N} \left( \vec{w}^\top \phi(\vec{x}_n) - t_n \right) \phi_j(\vec{x}_n) $$

Tip: If you find the subscript notation confusing, just plug in $j = 1$, differentiate, and get $\sum_{n=1}^{N} \left( \vec{w}^\top \phi(\vec{x}_n) - t_n \right) \phi_1(\vec{x}_n)$.
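To sanity-check the formula, one can compare it against a centered finite-difference approximation. The sketch below is our own, using randomly generated synthetic data; `feats[n]` plays the role of $\phi(\vec{x}_n)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4
feats = rng.normal(size=(N, M))     # feats[n] stands in for phi(x_n)
t = rng.normal(size=N)
w = rng.normal(size=M)

def E(w):
    return 0.5 * np.sum((feats @ w - t) ** 2)

j, eps = 1, 1e-6
analytic = np.sum((feats @ w - t) * feats[:, j])   # sum_n (w^T phi(x_n) - t_n) * phi_j(x_n)
e_j = np.eye(M)[j]                                 # unit vector along coordinate j
numeric = (E(w + eps * e_j) - E(w - eps * e_j)) / (2 * eps)
print(analytic, numeric)                           # the two values should agree closely
```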

Linear Regression: Matrix Notations

The matrix $\Phi \in \R^{N \times M}$ is called the design matrix. Each row represents one sample, and each column represents one feature: $$\Phi = \begin{bmatrix} \phi(\vec{x}_1)^\top\\ \phi(\vec{x}_2)^\top\\ \vdots\\ \phi(\vec{x}_N)^\top \end{bmatrix} = \begin{bmatrix} \phi_0(\vec{x}_1) & \phi_1(\vec{x}_1) & \cdots & \phi_{M-1}(\vec{x}_1) \\ \phi_0(\vec{x}_2) & \phi_1(\vec{x}_2) & \cdots & \phi_{M-1}(\vec{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\vec{x}_N) & \phi_1(\vec{x}_N) & \cdots & \phi_{M-1}(\vec{x}_N) \\ \end{bmatrix} $$
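One way to build $\Phi$ in numpy (a sketch of our own, reusing the illustrative bias-plus-raw-attributes feature map from earlier; `X` holds one data point per row):

```python
import numpy as np

X = np.array([[0.5, -1.2],
              [1.0,  0.3],
              [2.5,  0.8]])              # N = 3 data points, D = 2 attributes

def phi(x):
    return np.concatenate(([1.0], x))    # phi_0 = 1, then the raw attributes

Phi = np.stack([phi(x) for x in X])      # stack phi(x_n)^T as rows; shape (N, M) with M = D + 1
print(Phi)
```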

The target value vector is $\vec{t} \in \mathbb{R}^N$, with one entry $t_n$ per training sample.

$$ E(\vec{w}) = \frac{1}{2} \sum_{n=1}^N (y(\vec{x}_n, \vec{w}) - t_n)^2 = \frac{1}{2} \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}_n) - t_n \right)^2 = \frac{1}{2} \sum_{n=1}^N \left( \vec{w}^\top \phi(\vec{x}_n) - t_n \right)^2 $$

Batch Gradient Descent with Matrix Calculus

Write the objective function in matrix-vector form: $$ \begin{align*} E(\vec{w}) &= \frac{1}{2} \sum_{n=1}^N \left( \sum_{i=0}^{M-1} w_i \phi_i(\vec{x}_n) - t_n \right)^2 \\ &= \frac{1}{2} \sum_{n=1}^N \left( \phi(\vec{x}_n)^\top \vec{w} - t_n \right)^2 = \frac{1}{2} \|\Phi \vec{w} - \vec{t}\|_2^2 \end{align*} $$

Rewrite $E$ as a sum of three matrix-vector products. Hints:

  • $ \vec{x}^\top \vec{x} = (x_1,\ldots,x_M)^\top (x_1,\ldots,x_M) = x_1^2 + \cdots + x_M^2 = \left(\sqrt{x_1^2 + \cdots + x_M^2}\right)^2 = \|\vec{x}\|_2^2$
  • Distributive law: $(\vec{a} + \vec{b})^\top(\vec{c} + \vec{d}) = \vec{a}^\top \vec{c} + \vec{a}^\top \vec{d} + \vec{b}^\top \vec{c} + \vec{b}^\top\vec{d}$
  • Transpose of a product: $(AB)^\top = B^\top A^\top$ for matrix-vector multiplication.

Batch Gradient Descent with Matrix Calculus

Treating $\Phi \vec{w}$ as a single vector, the distributive law gives $$ E(\vec{w}) = \frac{1}{2} \|\Phi \vec{w} - \vec{t}\|_2^2 = \frac{1}{2} \left(\vec{w}^\top \Phi^\top \Phi \vec{w} - \vec{w}^\top \Phi^\top \vec{t} - \vec{t}^\top \Phi \vec{w} + \vec{t}^\top \vec{t}\right). $$

Note that $\vec{w}^\top (\Phi^\top \vec{t}) = (\Phi^\top\vec{t})^\top \vec{w} = \vec{t}^\top \Phi \vec{w}$. So, the above simplifies to $$ \frac{1}{2} \left(\vec{w}^\top \Phi^\top \Phi \vec{w} - 2\vec{w}^\top \Phi^\top \vec{t} + \vec{t}^\top \vec{t}\right).$$
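A quick numeric check of this expansion on random data (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(10, 3))
t = rng.normal(size=10)
w = rng.normal(size=3)

lhs = 0.5 * np.linalg.norm(Phi @ w - t) ** 2
rhs = 0.5 * (w @ Phi.T @ Phi @ w - 2 * w @ Phi.T @ t + t @ t)
print(np.isclose(lhs, rhs))             # True: the two forms agree
```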

Batch Gradient Descent with Matrix Calculus

Write the objective function in matrix-vector form: $$ E(\vec{w}) = \frac{1}{2} \left(\vec{w}^\top \Phi^\top \Phi \vec{w} - 2\vec{w}^\top \Phi^\top \vec{t} + \vec{t}^\top \vec{t}\right). $$

Compute the gradient $\nabla_\vec{w} E(\vec{w})$ with matrix calculus. Hints:

  • $\nabla_\vec{x} (\vec{x}^\top A \vec{x}) = (A + A^\top) \vec{x}$ (Challenge: prove this!)
  • $\nabla_\vec{x} (\vec{x}^\top \vec{y}) = \nabla_\vec{x} (\vec{y}^\top\vec{x}) = \vec{y}$
  • $\Phi^\top \Phi$ has a special property.

Batch Gradient Descent with Matrix Calculus

  • Since $\Phi^\top\Phi$ is symmetric, $\nabla_{\vec{w}}\, \vec{w}^\top(\Phi^\top\Phi)\vec{w} = 2(\Phi^\top\Phi)\vec{w}$.
  • Treating $\Phi^\top \vec{t}$ as a vector, $\nabla_{\vec{w}}\, \vec{w}^\top(\Phi^\top \vec{t}) = \Phi^\top \vec{t}$.
  • Finally, $\vec{t}^\top \vec{t}$ is constant with respect to $\vec{w}$.

So, combining these with the overall factor of $\frac{1}{2}$ and the coefficient $-2$ on the middle term, $\nabla_{\vec{w}} E(\vec{w}) = (\Phi^\top\Phi)\vec{w} - \Phi^\top \vec{t}$.
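Putting the pieces together, here is a minimal batch gradient descent sketch for linear regression using this gradient. The step size `eta`, iteration count, and synthetic data are our own illustrative choices; `eta` must be small enough for the iteration to converge.

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.005, num_steps=5000):
    """Minimize E(w) = 0.5 * ||Phi w - t||^2 using grad E = Phi^T Phi w - Phi^T t."""
    w = np.zeros(Phi.shape[1])
    for _ in range(num_steps):
        grad = Phi.T @ (Phi @ w - t)    # = Phi^T Phi w - Phi^T t
        w = w - eta * grad
    return w

# Example on synthetic data: recovers weights close to [1.0, -2.0, 0.5].
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(batch_gradient_descent(Phi, t))
```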

Method I-2: Gradient Descent—Stochastic Gradient Descent

Main Idea: Instead of computing the batch gradient (over the entire training set), just compute the gradient for an individual training sample (or a small subset of samples) and update.

Exercise: How would you implement the update rule for minibatch gradient descent (with a minibatch size of, say, 5% of the whole dataset)?

You randomly choose 5% of the indices between $1$ and $N$. Take the corresponding rows of $\Phi$ and $\vec{t}$. Compute the gradient on this subset of the data and descend along it.
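A sketch of one possible implementation (our own; the step size, iteration count, and seed are illustrative):

```python
import numpy as np

def minibatch_sgd(Phi, t, eta=0.005, num_steps=2000, batch_frac=0.05, seed=0):
    """At each step, descend along the gradient computed on a random 5% of the rows."""
    N, M = Phi.shape
    batch_size = max(1, int(batch_frac * N))
    rng = np.random.default_rng(seed)
    w = np.zeros(M)
    for _ in range(num_steps):
        idx = rng.choice(N, size=batch_size, replace=False)   # random subset of row indices
        Phi_b, t_b = Phi[idx], t[idx]                         # corresponding rows of Phi and t
        w = w - eta * (Phi_b.T @ (Phi_b @ w - t_b))           # gradient on the minibatch
    return w
```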

Method II: Closed-Form solution, invertible case

Main Idea, also Exercise: Solve $\nabla_\vec{w} E(\vec{w}) = 0$, assuming $\Phi^\top\Phi$ is invertible. Discuss why it is sufficient to solve this equation to find the optimal $\vec{w}$.

Answer: It is sufficient to find a point where the gradient vanishes because $E(\vec{w})$ is convex, so any local minimum is a global minimum. Setting $(\Phi^\top\Phi)\vec{w} - \Phi^\top\vec{t} = 0$ gives $\vec{w} = (\Phi^\top\Phi)^{-1}\Phi^\top \vec{t}$.
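In code, one would typically solve the linear system $\Phi^\top\Phi\,\vec{w} = \Phi^\top\vec{t}$ rather than form the inverse explicitly; a minimal sketch (assuming $\Phi^\top\Phi$ is invertible):

```python
import numpy as np

def closed_form(Phi, t):
    """Solve the normal equations Phi^T Phi w = Phi^T t (assumes Phi^T Phi is invertible)."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
```

For rank-deficient $\Phi$, `numpy.linalg.lstsq` handles the general least-squares problem.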

Exercise: Show that $\Phi^\top \Phi$ is invertible if $\Phi$ has linearly independent columns. Interpret the implications for our features.

Answer: Conversely, if $\Phi^\top\Phi$ is not invertible, then the columns of $\Phi$ are linearly dependent, i.e., some feature is a linear combination of the other features (a redundant feature).

Challenge: Similarly, we can show $\Phi\Phi^\top$ is invertible if $\Phi$ has linearly independent rows. Why do we care/not care about this case?

Challenge: Show that $\vec{b}$ is in the column space of $A$ if and only if there exists a vector $\vec{x}$ such that $A\vec{x} = \vec{b}$.

Digression: Moore-Penrose Pseudoinverse

  • When we have a matrix $A$ that is non-invertible, or not even square, we may still want something that behaves like an inverse
  • For these situations we use $A^\dagger$, the Moore-Penrose Pseudoinverse of $A$
  • In general, we can compute $A^\dagger$ via the SVD: writing $A \in \R^{m \times n}$ as $A = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^\top$, we have $A^\dagger = V \Sigma^\dagger U^\top \in \R^{n \times m}$, where $\Sigma^\dagger \in \R^{n \times m}$ is obtained by taking the reciprocals of the non-zero entries of $\Sigma^\top$.
  • In particular, when $A$ has linearly independent columns, $A^\dagger = (A^\top A)^{-1} A^\top$, and when $A$ is invertible, $A^\dagger = A^{-1}$ (see the sketch below).
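Numerically, `numpy.linalg.pinv` computes $A^\dagger$ via the SVD; the sketch below (our own) checks that it matches $(A^\top A)^{-1}A^\top$ for a random tall matrix, whose columns are linearly independent with probability one.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))                  # tall matrix with (almost surely) independent columns

A_dag = np.linalg.pinv(A)                    # Moore-Penrose pseudoinverse via the SVD
left_inv = np.linalg.inv(A.T @ A) @ A.T      # (A^T A)^{-1} A^T

print(np.allclose(A_dag, left_inv))          # True
```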

Exercise: One property of the pseudoinverse is that $A A^\dagger A = A$. Show that $$(A^{\top} A)^{-1}A^\top$$ satisfies this property (assuming $A$ has linearly independent columns).

Challenge: Show that $$\hat{\vec{w}} = (\Phi^\top\Phi)^\dagger \Phi^\top \vec{t} = \Phi^\dagger \vec{t}$$ satisfies $\nabla_\vec{w} E(\vec{w}) = \Phi^\top\Phi \vec{w} - \Phi^\top \vec{t} = 0$.
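A quick numeric check of this claim (our own sketch, on random data):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(20, 4))
t = rng.normal(size=20)

w_hat = np.linalg.pinv(Phi) @ t              # w_hat = Phi^+ t
residual = Phi.T @ Phi @ w_hat - Phi.T @ t   # grad E(w_hat); should vanish (up to round-off)
print(np.allclose(residual, 0))              # True
```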

Discuss: What are the advantages and disadvantages of each method we learned today (stochastic gradient descent, batch gradient descent, and the closed-form solution)?

Answer: There is no single right answer, but you can say things like: matrix inversion is a cubic-time operation (technically $O(n^{2.37...})$). Performing better on your training data $\Phi$ doesn't necessarily mean performing better on unseen test data.