$$ \LaTeX \text{ command declarations here.} \newcommand{\N}{\mathcal{N}} \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\norm}[1]{\|#1\|_2} \newcommand{\d}{\mathop{}\!\mathrm{d}} \newcommand{\qed}{\qquad \mathbf{Q.E.D.}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\vt}{\mathbf{t}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vw}{\mathbf{w}} $$

EECS 445: Machine Learning

Hands On 05: Linear Regression II

  • Instructors: Zhao Fu, Valli, Jacob Abernethy, and Jia Deng
  • Date: September 26, 2016

Problem 1a: MAP estimation for Linear Regression with unusual Prior

Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, where $$ \begin{gather} t_i = \vec{w}^T \vec{x}_i + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ and $\epsilon$ is the "noise term".

(a) Quick quiz: what is the likelihood given $\vec{w}$? That is, what's $p(t_i | \vec{x}_i, \vec{w})$?
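
For reference, since $t_i$ is $\vec{w}^T\vec{x}_i$ plus zero-mean Gaussian noise with precision $\beta$, the likelihood is a Gaussian density centered at $\vec{w}^T\vec{x}_i$: $$ p(t_i | \vec{x}_i, \vec{w}) = \mathcal{N}(t_i | \vec{w}^T\vec{x}_i, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2}\left(t_i - \vec{w}^T\vec{x}_i\right)^2 \right) $$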

Problem 1: MAP estimation for Linear Regression with unusual Prior

Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, sampled IID. We will also put a prior on $\vec{w}$, using a PSD covariance matrix $\Sigma$. $$ \begin{gather} t_i = \vec{w}^T \vec{x}_i + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \\ \vec{w} \sim \mathcal{N}(0, \Sigma) \end{gather} $$ Note: the difference here is that our prior is a multivariate Gaussian with non-identity covariance! We also let $\mathcal{X} = \{\vec{x}_1, \cdots, \vec{x}_n\}$.

(a) Compute the log posterior function, $\log p(\vec{w}|\vec{t}, \mathcal{X},\beta)$

Hint: use Bayes' Rule
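
One way to write it (a sketch; by Bayes' rule, keeping only terms that depend on $\vec{w}$ and absorbing everything else into a constant): $$ \log p(\vec{w}|\vec{t}, \mathcal{X},\beta) = \log p(\vec{t}|\mathcal{X}, \vec{w},\beta) + \log p(\vec{w}) + \text{const} = -\frac{\beta}{2}\sum\nolimits_{i=1}^n \left(t_i - \vec{w}^T\vec{x}_i\right)^2 - \frac{1}{2}\vec{w}^T\Sigma^{-1}\vec{w} + \text{const} $$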

(b) Compute the MAP estimate of $\vec{w}$ for this model

Hint: the solution is very similar to the MAP estimate for a Gaussian prior with identity covariance
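
For reference, setting the gradient of the log posterior above to zero gives the following closed form (writing $\Phi$ for the $n \times d$ design matrix whose rows are the $\vec{x}_i^T$; this notation is introduced here just for compactness): $$ \vec{w}_{\mathrm{MAP}} = \left( \Phi^T\Phi + \beta^{-1}\Sigma^{-1} \right)^{-1} \Phi^T \vec{t} $$ With $\Sigma = \alpha^{-1} I$ this reduces to the familiar ridge-regression-style estimate $\left(\Phi^T\Phi + \frac{\alpha}{\beta} I\right)^{-1}\Phi^T\vec{t}$.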

Problem 2: Handwritten digit classification with logistic regression

Download the file mnist_49_3000.mat from Canvas. This is a subset of the MNIST handwritten digit database, a well-known benchmark for classification algorithms. The subset contains examples of the digits 4 and 9. The data file contains variables x and y, where x holds the input patterns and y the labels. The images are stored as column vectors.

Exercise:

  • Load the data and visualize a few examples
  • Add a bias term to the features: $\phi(\vx)^T = [1, \vx^T]$
  • Split the dataset into a training set (the first 2000 examples) and a test set (the last 1000 examples)

In [ ]:
# Enable plot here inside Jupyter Notebook
%matplotlib inline

# all the packages you need
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
from numpy.linalg import inv

# load data from .mat
mat = scipy.io.loadmat('mnist_49_3000.mat')
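
The cell above only loads the .mat file. Below is a rough sketch of the remaining setup steps (visualization, bias feature, train/test split). It assumes x is a d × 3000 array with one image per column, y is a 1 × 3000 array of labels, and that the two label values should be mapped to {0, 1} for the logistic loss used later; check these layout assumptions against the actual file.

In [ ]:
# Rough sketch of the setup steps (not the official solution).
# Assumption: x is d x 3000 (one image per column), y is 1 x 3000.
x = mat['x']
y = mat['y'].ravel()

# Visualize one example; the orientation may need a transpose depending on
# how the images were flattened into columns.
side = int(np.sqrt(x.shape[0]))
plt.imshow(x[:, 0].reshape(side, side), cmap='gray')
plt.title('first example')
plt.show()

# Bias feature phi(x)^T = [1, x^T]; Phi is N x (d+1), one example per row.
Phi = np.hstack([np.ones((x.shape[1], 1)), x.T])

# Map the two label values to {0, 1} for the logistic loss below.
t = (y == y.max()).astype(float)

# First 2000 examples for training, last 1000 for testing.
Phi_train, t_train = Phi[:2000], t[:2000]
Phi_test, t_test = Phi[2000:], t[2000:]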

Implement Newton’s method to find a minimizer of the regularized negative log likelihood. Try setting $\lambda$ = 10. Use the first 2000 examples as training data, and the last 1000 as test data.

Exercise

  • Implement the loss function with the following formula: $$ \begin{align} E(\vw) &= -\ln P(\vy = \vt| \mathcal{X}, \vw) \\ &= \boxed{\sum \nolimits_{n=1}^N \left[ t_n \ln (1+\exp(-\vw^T\phi(\vx_n))) + (1-t_n) \ln(1+\exp(\vw^T\phi(\vx_n))) \right] + \frac{\lambda}{2} \vw^T\vw} \end{align} $$
  • Implement the gradient of the loss: $$\nabla_\vw E(\vw) = \boxed{ \Phi^T \left( \sigma(\Phi \vw) - \vt \right) + \lambda \vw}$$ where $\sigma(a) = \frac{\exp(a)}{1+\exp(a)}$ (a code sketch implementing both follows below)
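
A possible implementation of these two formulas, as a sketch that reuses Phi_train and t_train from the earlier cell (np.logaddexp(0, z) is used to evaluate $\ln(1+\exp(z))$ without overflow):

In [ ]:
def sigmoid(a):
    # Logistic function sigma(a) = exp(a) / (1 + exp(a)).
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, Phi, t, lam):
    # Regularized negative log likelihood E(w) from the formula above.
    a = Phi @ w
    nll = np.sum(t * np.logaddexp(0, -a) + (1 - t) * np.logaddexp(0, a))
    return nll + 0.5 * lam * w @ w

def gradient(w, Phi, t, lam):
    # grad E(w) = Phi^T (sigma(Phi w) - t) + lambda * w
    return Phi.T @ (sigmoid(Phi @ w) - t) + lam * w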

Recall: Newton's Method

$$ \vx_{n+1}= \vx_n - \left(\nabla^2 f(\vx_n)\right)^{-1} \nabla_\vx f(\vx_n) $$

where $\nabla^2 f(\vx_n)$ is the Hessian matrix of second-order partial derivatives: $$ \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n} \end{bmatrix} $$

$$ \begin{align} \nabla^2 E(\vw) &= \nabla_\vw \nabla_\vw E(\vw) \\ &= \sum \nolimits_{n=1}^N \phi(\vx_n) r_n(\vw) \phi(\vx_n)^T + \lambda I \end{align} $$

where $r_n(\vw) = \sigma(\vw^T \phi(\vx_n)) \cdot ( 1 - \sigma(\vw^T \phi(\vx_n)) )$

Exercise

  • Implement $r_n(\vw)$
  • Implement $\nabla^2 E(\vw)$
  • Implement the update function (see the code sketch after this list)
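
One possible sketch, building on the functions above; it uses the fact that $\sum_{n=1}^N \phi(\vx_n) r_n(\vw) \phi(\vx_n)^T$ equals $\Phi^T \mathrm{diag}(r) \Phi$, so the Hessian can be formed with a single matrix product:

In [ ]:
def hessian(w, Phi, lam):
    # r_n(w) = sigma(w^T phi(x_n)) * (1 - sigma(w^T phi(x_n))), one value per example.
    s = sigmoid(Phi @ w)
    r = s * (1 - s)
    # sum_n phi(x_n) r_n phi(x_n)^T + lambda I  ==  Phi^T diag(r) Phi + lambda I
    return Phi.T @ (r[:, None] * Phi) + lam * np.eye(Phi.shape[1])

def newton_step(w, Phi, t, lam):
    # w_new = w - H^{-1} grad E(w); solve the linear system rather than forming the inverse.
    return w - np.linalg.solve(hessian(w, Phi, lam), gradient(w, Phi, t, lam))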

Exercise

  • Implement the training process (when should we stop iterating?)
  • Implement the test function
  • Compute the test error
  • Compute the value of the objective function at the optimum (a sketch covering these steps follows)
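
A possible end-to-end sketch under the same assumptions as the earlier cells. One simple stopping rule is to iterate until the Newton update barely changes $\vw$; prediction uses the rule $\sigma(\vw^T\phi(\vx)) > 0.5$, i.e. $\vw^T\phi(\vx) > 0$:

In [ ]:
lam = 10.0
w = np.zeros(Phi_train.shape[1])

# Newton iterations; stop when the iterates stop moving (or after a fixed budget).
for it in range(100):
    w_new = newton_step(w, Phi_train, t_train, lam)
    if np.linalg.norm(w_new - w) < 1e-6:
        w = w_new
        break
    w = w_new

# Test error: predict class 1 when w^T phi(x) > 0.
pred = (Phi_test @ w > 0).astype(float)
test_error = np.mean(pred != t_test)
print('test error:', test_error)
print('objective at optimum:', loss(w, Phi_train, t_train, lam))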