In [1]:
%matplotlib inline
Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, where $$ \begin{gather} t_i = \vec{w}^T \vec{x_i} + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\epsilon$ is the "noise term".
(a) Quick quiz: what is the likelihood given $\vec{w}$? That is, what's $p(t_i | \vec{x}_i, \vec{w})$?
Answer: $p(t_i | \vec{x}_i, \vec{w}) = \mathcal{N}(t_i|\vec{w}^\top \vec{x_i}, \beta^{-1}) = \frac{1}{(2\pi \beta^{-1})^\frac{1}{2}} \exp{(-\frac{\beta}{2}(t_i - \vec{w}^\top \vec{x_i})^2)}$
Assume we have $n$ vectors $\vec{x}_1, \cdots, \vec{x}_n$. We also assume that for each $\vec{x}_i$ we have observed a target value $t_i$, sampled IID. We will also place a prior on $\vec{w}$ with a positive semidefinite (PSD) covariance matrix $\Sigma$. $$ \begin{gather} t_i = \vec{w}^T \vec{x_i} + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \\ \vec{w} \sim \mathcal{N}(0, \Sigma) \end{gather} $$ Note: the difference here is that our prior is a multivariate Gaussian with non-identity covariance! We also let $\mathcal{X} = \{\vec{x}_1, \cdots, \vec{x}_n\}$.
(a) Compute the log posterior function, $\log p(\vec{w}|\vec{t}, \mathcal{X},\beta)$
Hint: use Bayes' Rule
(b) Compute the MAP estimate of $\vec{w}$ for this model
Hint: the solution is very similar to the MAP estimate for a Gaussian prior with identity covariance
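One way to sketch the solution (assuming $\Sigma$ is invertible, and writing $X$ for the design matrix whose rows are $\vec{x}_i^\top$ and $\vec{t} = (t_1, \cdots, t_n)^\top$): by Bayes' rule and the IID assumption, $$ \begin{align} \log p(\vec{w}|\vec{t}, \mathcal{X}, \beta) &= \sum_{i=1}^n \log \mathcal{N}(t_i|\vec{w}^\top \vec{x}_i, \beta^{-1}) + \log \mathcal{N}(\vec{w}|0, \Sigma) + \text{const} \\ &= -\frac{\beta}{2} \sum_{i=1}^n (t_i - \vec{w}^\top \vec{x}_i)^2 - \frac{1}{2} \vec{w}^\top \Sigma^{-1} \vec{w} + \text{const} \end{align} $$ Setting the gradient with respect to $\vec{w}$ to zero gives $$ \vec{w}_{\text{MAP}} = \left( \beta X^\top X + \Sigma^{-1} \right)^{-1} \beta X^\top \vec{t}, $$ which reduces to the familiar ridge-regression solution when $\Sigma = \alpha^{-1} I$.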
Download the file mnist_49_3000.mat from Canvas. This is a subset of the MNIST handwritten digit database, which is a well-known benchmark database for classification algorithms. This subset contains examples of the digits 4 and 9. The data file contains variables x and y, with the former containing patterns and the latter labels. The images are stored as column vectors.
Exercise:
In [2]:
# all the packages you need
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
from numpy.linalg import inv
In [3]:
# load data from .mat
mat = scipy.io.loadmat('mnist_49_3000.mat')
print (mat.keys())
In [4]:
x = mat['x'].T
y = mat['y'].T
print (x.shape, y.shape)
In [5]:
# show example image
plt.imshow (x[4, :].reshape(28, 28))
Out[5]:
In [6]:
# add bias term
x = np.hstack([np.ones((3000, 1)), x])
# convert label -1 to 0
y[y == -1] = 0
print(y[y == 0].size, y[y == 1].size)
# split into train set and test set
x_train = x[: 2000, :]
y_train = y[: 2000, :]
x_test = x[2000 : , :]
y_test = y[2000 : , :]
Implement Newton’s method to find a minimizer of the regularized negative log likelihood. Try setting $\lambda$ = 10. Use the first 2000 examples as training data, and the last 1000 as test data.
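For reference, one common way to write the objective minimized below (with the regularizer scaled by $\lambda/2$ so that its gradient contribution is simply $\lambda \vec{w}$, matching the code that follows; here $\vec{x}_n$ includes the bias term and $y_n \in \{0, 1\}$): $$ \begin{align} E(\vec{w}) &= -\sum_{n=1}^{N} \left[ y_n \log \sigma(\vec{w}^\top \vec{x}_n) + (1 - y_n) \log\left(1 - \sigma(\vec{w}^\top \vec{x}_n)\right) \right] + \frac{\lambda}{2} \vec{w}^\top \vec{w} \\ &= \sum_{n=1}^{N} \left[ y_n \log\left(1 + e^{-\vec{w}^\top \vec{x}_n}\right) + (1 - y_n) \log\left(1 + e^{\vec{w}^\top \vec{x}_n}\right) \right] + \frac{\lambda}{2} \vec{w}^\top \vec{w} \end{align} $$ with gradient $\nabla E(\vec{w}) = \sum_{n=1}^{N} \left( \sigma(\vec{w}^\top \vec{x}_n) - y_n \right) \vec{x}_n + \lambda \vec{w}$.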
Exercise
In [7]:
# Initialization of parameters
w = np.zeros((785, 1))
lmd = 10
In [8]:
def computeE(w, x, y, lmd) :
    # regularized negative log likelihood; np.logaddexp(0, a) computes log(1 + exp(a)) without overflow
    a = np.dot(x, w)
    # penalty written as (lmd / 2) * w^T w so that its gradient is lmd * w (see computeGradientE below)
    E = np.dot(y.T, np.logaddexp(0, -a)) + np.dot(1 - y.T, np.logaddexp(0, a)) \
        + lmd / 2 * np.dot(w.T, w)
    return E[0][0]
print (computeE(w, x, y, lmd))
In [9]:
def sigmoid(a) :
    # logistic sigmoid; clipping the argument avoids overflow in np.exp
    a = np.clip(a, -30, 30)
    return 1 / (1 + np.exp(-a))
def computeGradientE(w, x, y, lmd) :
    # gradient of E(w): X^T (sigmoid(Xw) - y) + lmd * w
    return np.dot(x.T, sigmoid(np.dot(x, w)) - y) + lmd * w
print (computeGradientE(w, x, y, lmd).shape)
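Optionally, the analytic gradient can be sanity-checked against a central finite-difference approximation of computeE at a few coordinates (a sketch, not part of the assignment; it relies on the $\lambda/2$ convention used in computeE so the two quantities agree):
# finite-difference check of computeGradientE against computeE at a few random coordinates
eps = 1e-5
rng = np.random.RandomState(0)
w_check = 0.01 * rng.randn(785, 1)   # hypothetical small random test point
g_analytic = computeGradientE(w_check, x_train, y_train, lmd)
for i in rng.choice(785, size=5, replace=False):
    e = np.zeros((785, 1))
    e[i] = eps
    g_numeric = (computeE(w_check + e, x_train, y_train, lmd)
                 - computeE(w_check - e, x_train, y_train, lmd)) / (2 * eps)
    print (i, g_analytic[i, 0], g_numeric)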
In Newton's method each step updates the current iterate $\vec{x}_n$ via $$ \vec{x}_{n+1} = \vec{x}_n - \left[ \nabla^2 f(\vec{x}_n) \right]^{-1} \nabla f(\vec{x}_n) $$ where $\nabla^2 f(\vec{x}_n)$ is the Hessian matrix, i.e. the matrix of second-order partial derivatives $$ \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n\partial x_n} \end{bmatrix} $$
$$ \begin{align} \nabla^2 E(\vec{w}) &= \nabla_{\vec{w}} \nabla_{\vec{w}} E(\vec{w}) \\ &= \sum \nolimits_{n=1}^N \phi(\vec{x}_n) r_n(\vec{w}) \phi(\vec{x}_n)^T + \lambda I \end{align} $$ where $r_n(\vec{w}) = \sigma(\vec{w}^T \phi(\vec{x}_n)) \cdot ( 1 - \sigma(\vec{w}^T \phi(\vec{x}_n)) )$
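In matrix form, with $\Phi$ the design matrix whose rows are $\phi(\vec{x}_n)^T$ (here simply the pixel vectors with the bias column) and $R = \mathrm{diag}(r_1(\vec{w}), \cdots, r_N(\vec{w}))$, this is $\nabla^2 E(\vec{w}) = \Phi^T R \Phi + \lambda I$; the code below scales the columns of $\Phi^T$ by the $r_n$ instead of forming the $N \times N$ matrix $R$ explicitly.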
Exercise
In [10]:
def computeR(w, x, y) :
    # r_n = sigmoid(w^T x_n) * (1 - sigmoid(w^T x_n)), one value per example
    return sigmoid(np.dot(x, w)) * (1 - sigmoid(np.dot(x, w)))
# print (computeR(w, x, y).T)
def computeHessian(w, x, y, lmd) :
    # Hessian: X^T R X + lmd * I, computed by scaling the columns of X^T with the r_n
    return np.dot(x.T * computeR(w, x, y).T, x) + lmd * np.eye(w.shape[0])
# print (computeHessian(w, x, y, lmd))
def update(w, x, y, lmd) :
    # one Newton step: w <- w - H^{-1} grad
    hessian = computeHessian(w, x, y, lmd)
    gradient = computeGradientE(w, x, y, lmd)
    # print (np.sum(hessian))
    return w - np.dot(inv(hessian), gradient)
print (update(w, x, y, lmd).shape)
Exercise
In [11]:
def train(w, x, y, lmd) :
    # repeat Newton updates until w stops changing (L1 change below 1e-6), then return the latest iterate
    w_new = update(w, x, y, lmd)
    diff = np.sum(np.abs(w_new - w))
    while diff > 1e-6:
        w = w_new
        w_new = update(w, x, y, lmd)
        diff = np.sum(np.abs(w_new - w))
    return w_new
w_train = train(w, x_train, y_train, lmd)
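Newton's method typically converges here in only a handful of iterations; the following sketch re-runs the same updates purely to print the objective after each step (it reuses update and computeE from above):
# optional: watch the regularized objective decrease over Newton iterations
w_it = np.zeros((785, 1))
for it in range(10):
    w_it = update(w_it, x_train, y_train, lmd)
    print (it, computeE(w_it, x_train, y_train, lmd))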
In [12]:
def test(w, x, y) :
    # predict 1 when w^T x > 0 (i.e. sigmoid(w^T x) > 0.5) and report the error rate
    tmp = np.dot(x, w)
    y_pred = np.zeros(y.shape)
    y_pred[tmp > 0] = 1
    error = np.mean(np.abs(y_pred - y))
    return error
# baseline error of the untrained (all-zero) w, error of the trained w, and the final training objective
print (test(w, x_test, y_test))
print (test(w_train, x_test, y_test))
print (computeE(w_train, x_train, y_train, lmd))
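Optionally, a few of the misclassified test digits can be displayed to see where the classifier fails (a sketch using the variables defined above; the bias entry added earlier is dropped before reshaping, as in the imshow call near the start):
# show up to three test images the trained classifier gets wrong
scores = np.dot(x_test, w_train)
y_pred = (scores > 0).astype(float)
wrong = np.where(y_pred != y_test)[0]
for i in wrong[:3]:
    plt.figure()
    plt.title('true label %d, predicted %d' % (y_test[i, 0], y_pred[i, 0]))
    plt.imshow(x_test[i, 1:].reshape(28, 28))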