Save this file as studentid1_studentid2_lab#.ipynb

(Your student-id is the number shown on your student card.)

E.g. if you work with 3 people, the notebook should be named: 12301230_3434343_1238938934_lab1.ipynb.

This will be parsed by a regexp, so please double check your filename.

Before you turn this problem in, please make sure everything runs correctly. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your names and email addresses below.


In [1]:
NAME = "Michelle Appel"
NAME2 = "Verna Dankers"
NAME3 = "Yves van Montfort"
EMAIL = "michelle.appel@student.uva.nl"
EMAIL2 = "verna.dankers@student.uva.nl"
EMAIL3 = "yves.vanmontfort@student.uva.nl"

Lab 2: Classification

Machine Learning 1, September 2017

Notes on implementation:

  • You should write your code and answers in this IPython Notebook: http://ipython.org/notebook.html. If you have problems, please contact your teaching assistant.
  • Please write your answers right below the questions.
  • Among the first lines of your notebook should be "%pylab inline". This imports all required modules, and your plots will appear inline.
  • Use the provided test cells to check if your answers are correct
  • Make sure your output and plots are correct before handing in your assignment with Kernel -> Restart & Run All

$\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\ba}{\mathbf{a}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bq}{\mathbf{q}}$ $\newcommand{\bp}{\mathbf{p}}$ $\newcommand{\bh}{\mathbf{h}}$ $\newcommand{\bI}{\mathbf{I}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\bT}{\mathbf{T}}$ $\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bV}{\mathbf{V}}$


In [2]:
%pylab inline
plt.rcParams["figure.figsize"] = [9,5]


Populating the interactive namespace from numpy and matplotlib

Part 1. Multiclass logistic regression

Scenario: you have a friend with one big problem: she's completely blind. You decide to help her: she has a special smartphone for blind people, and you are going to develop a mobile phone app that can do machine vision using the phone's camera, converting a picture from the camera into the meaning of the image. You decide to start with an app that can read handwritten digits, i.e. convert an image of handwritten digits to text (e.g. it would enable her to read precious handwritten phone numbers).

A key building block for such an app would be a function predict_digit(x) that returns the digit class of an image patch $\bx$. Since hand-coding this function is highly non-trivial, you decide to solve this problem using machine learning, such that the internal parameters of this function are automatically learned using machine learning techniques.

The dataset you're going to use for this is the MNIST handwritten digits dataset (http://yann.lecun.com/exdb/mnist/). You can download the data with scikit learn, and load it as follows:


In [3]:
from sklearn.datasets import fetch_mldata
# Fetch the data
mnist = fetch_mldata('MNIST original')
data, target = mnist.data, mnist.target.astype('int')
# Shuffle
indices = np.arange(len(data))
np.random.seed(123)
np.random.shuffle(indices)
data, target = data[indices].astype('float32'), target[indices]

# Normalize the data between 0.0 and 1.0:
data /= 255. 

# Split
x_train, x_valid, x_test = data[:50000], data[50000:60000], data[60000: 70000]
t_train, t_valid, t_test = target[:50000], target[50000:60000], target[60000: 70000]

MNIST consists of small 28 by 28 pixel images of written digits (0-9). We split the dataset into training, validation and test sets. The variables x_train, x_valid and x_test are $N \times M$ matrices, where $N$ is the number of datapoints in the respective set, and $M = 28^2 = 784$ is the dimensionality of the data. The second set of variables t_train, t_valid and t_test contain the corresponding $N$-dimensional vectors of integers with the true class labels.
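As a quick sanity check of these shapes (a minimal snippet, assuming the loading cell above has been run):

print(x_train.shape, t_train.shape)   # (50000, 784) (50000,)
print(x_valid.shape, x_test.shape)    # (10000, 784) (10000, 784)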

Here's a visualisation of 8 digits from the training set:


In [4]:
def plot_digits(data, num_cols, targets=None, shape=(28,28)):
    num_digits = data.shape[0]
    num_rows = int(num_digits/num_cols)
    for i in range(num_digits):
        plt.subplot(num_rows, num_cols, i+1)
        plt.imshow(data[i].reshape(shape), interpolation='none', cmap='Greys')
        if targets is not None:
            plt.title(int(targets[i]))
        plt.colorbar()
        plt.axis('off')
    plt.tight_layout()
    plt.show()
    
plot_digits(x_train[0:40000:5000], num_cols=4, targets=t_train[0:40000:5000])


In multiclass logistic regression, the conditional probability of class label $j$ given the image $\bx$ for some datapoint is given by:

$ \log p(t = j \;|\; \bx, \bb, \bW) = \log q_j - \log Z$

where $\log q_j = \bw_j^T \bx + b_j$ (the log of the unnormalized probability of the class $j$), and $Z = \sum_k q_k$ is the normalizing factor. $\bw_j$ is the $j$-th column of $\bW$ (a matrix of size $784 \times 10$) corresponding to the class label, $b_j$ is the $j$-th element of $\bb$.

Given an input image, the multiclass logistic regression model first computes the intermediate vector $\log \bq$ (of size $10 \times 1$), using $\log q_j = \bw_j^T \bx + b_j$, containing the unnormalized log-probabilities per class.

The unnormalized probabilities are then normalized by $Z$ such that $\sum_j p_j = \sum_j \exp(\log p_j) = 1$. This is done by $\log p_j = \log q_j - \log Z$ where $Z = \sum_i \exp(\log q_i)$. This is known as the softmax transformation, and is also used as the last layer of many classification neural network models, to ensure that the output of the network is a normalized distribution, regardless of the values of the second-to-last layer ($\log \bq$).

Warning: when computing $\log Z$, you are likely to encounter numerical problems. Save yourself countless hours of debugging and learn the log-sum-exp trick.
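A minimal numpy sketch of the trick (the helper name log_sum_exp is ours, purely for illustration): subtracting the maximum before exponentiating prevents overflow in np.exp, and adding it back afterwards leaves $\log Z$ unchanged.

def log_sum_exp(logq):
    # logq: vector of unnormalized log-probabilities log q_j
    a = np.max(logq)
    return a + np.log(np.sum(np.exp(logq - a)))

# log p_j = log q_j - log Z:
# logp = logq - log_sum_exp(logq)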

The network's output $\log \bp$ of size $10 \times 1$ then contains the conditional log-probabilities $\log p(t = j \;|\; \bx, \bb, \bW)$ for each digit class $j$. In summary, the computations are done in this order:

$\bx \rightarrow \log \bq \rightarrow Z \rightarrow \log \bp$

Given some dataset with $N$ independent, identically distributed datapoints, the log-likelihood is given by:

$ \mathcal{L}(\bb, \bW) = \sum_{n=1}^N \mathcal{L}^{(n)}$

where we use $\mathcal{L}^{(n)}$ to denote the partial log-likelihood evaluated over a single datapoint. It is important to see that the log-probability of the class label $t^{(n)}$ given the image, is given by the $t^{(n)}$-th element of the network's output $\log \bp$, denoted by $\log p_{t^{(n)}}$:

$\mathcal{L}^{(n)} = \log p(t = t^{(n)} \;|\; \bx = \bx^{(n)}, \bb, \bW) = \log p_{t^{(n)}} = \log q_{t^{(n)}} - \log Z^{(n)}$

where $\bx^{(n)}$ and $t^{(n)}$ are the input (image) and class label (integer) of the $n$-th datapoint, and $Z^{(n)}$ is the normalizing constant for the distribution over $t^{(n)}$.

1.1 Gradient-based stochastic optimization

1.1.1 Derive gradient equations (20 points)

Derive the equations for computing the (first) partial derivatives of the log-likelihood w.r.t. all the parameters, evaluated at a single datapoint $n$.

You should start deriving the equations for $\frac{\partial \mathcal{L}^{(n)}}{\partial \log q_j}$ for each $j$. For clarity, we'll use the shorthand $\delta^q_j = \frac{\partial \mathcal{L}^{(n)}}{\partial \log q_j}$.

For $j = t^{(n)}$: $ \delta^q_j = \frac{\partial \mathcal{L}^{(n)}}{\partial \log p_j} \frac{\partial \log p_j}{\partial \log q_j} + \frac{\partial \mathcal{L}^{(n)}}{\partial \log Z} \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} = 1 \cdot 1 - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} = 1 - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} $

For $j \neq t^{(n)}$: $ \delta^q_j = \frac{\partial \mathcal{L}^{(n)}}{\partial \log Z} \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} = - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} $

Complete the above derivations for $\delta^q_j$ by further developing $\frac{\partial \log Z}{\partial Z}$ and $\frac{\partial Z}{\partial \log q_j}$. Both are quite simple. For these it doesn't matter whether $j = t^{(n)}$ or not.

OUR ANSWER

For $j = t^{(n)}$: \begin{align} \delta^q_j &= 1 - \frac{\partial \log Z}{\partial Z}\frac{\partial Z}{\partial \log q_j} = 1 - \frac{1}{Z} \cdot e^{\log q_j} \end{align} For $j \neq t^{(n)}$: \begin{align} \delta^q_j &= - \frac{\partial \log Z}{\partial Z}\frac{\partial Z}{\partial \log q_j} = - \frac{1}{Z} \cdot e^{\log q_j} \end{align} Since $\frac{e^{\log q_j}}{Z} = \frac{q_j}{Z} = p_j$, these reduce to $\delta^q_j = 1 - p_j$ and $\delta^q_j = -p_j$, respectively.

Given your equations for computing the gradients $\delta^q_j$ it should be quite straightforward to derive the equations for the gradients of the parameters of the model, $\frac{\partial \mathcal{L}^{(n)}}{\partial W_{ij}}$ and $\frac{\partial \mathcal{L}^{(n)}}{\partial b_j}$. The gradients for the biases $\bb$ are given by:

$ \frac{\partial \mathcal{L}^{(n)}}{\partial b_j} = \frac{\partial \mathcal{L}^{(n)}}{\partial \log q_j} \frac{\partial \log q_j}{\partial b_j} = \delta^q_j \cdot 1 = \delta^q_j $

The equation above gives the derivative of $\mathcal{L}^{(n)}$ w.r.t. a single element of $\bb$, so the vector $\nabla_\bb \mathcal{L}^{(n)}$ with all derivatives of $\mathcal{L}^{(n)}$ w.r.t. the bias parameters $\bb$ is:

$ \nabla_\bb \mathcal{L}^{(n)} = \mathbf{\delta}^q $

where $\mathbf{\delta}^q$ denotes the vector of size $10 \times 1$ with elements $\mathbf{\delta}_j^q$.

The (not fully developed) equation for computing the derivative of $\mathcal{L}^{(n)}$ w.r.t. a single element $W_{ij}$ of $\bW$ is:

$ \frac{\partial \mathcal{L}^{(n)}}{\partial W_{ij}} = \frac{\partial \mathcal{L}^{(n)}}{\partial \log q_j} \frac{\partial \log q_j}{\partial W_{ij}} = \mathbf{\delta}_j^q \frac{\partial \log q_j}{\partial W_{ij}} $

What is $\frac{\partial \log q_j}{\partial W_{ij}}$? Complete the equation above.

If you want, you can give the resulting equation in vector format ($\nabla_{\bw_j} \mathcal{L}^{(n)} = ...$), like we did for $\nabla_\bb \mathcal{L}^{(n)}$.

OUR ANSWER

$ \frac{\partial \mathcal{L}^{(n)}}{\partial W_{ij}} = \mathbf{\delta}_j^q \frac{\partial \log q_j}{\partial W_{ij}} = \mathbf{\delta}_j^q \frac{\partial}{\partial W_{ij}} (\bw_j^T \bx + b_j) = \mathbf{\delta}_j^q \frac{\partial}{\partial W_{ij}} \left( \sum\limits_i W_{ij} x_i + b_j \right) = \mathbf{\delta}_j^q x_i $
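Equivalently, in vector form, so that the full parameter gradient is a single outer product:

$ \nabla_{\bw_j} \mathcal{L}^{(n)} = \mathbf{\delta}^q_j \, \bx, \qquad \nabla_{\bW} \mathcal{L}^{(n)} = \bx \, (\mathbf{\delta}^q)^T $

which is a $784 \times 10$ matrix, matching the shape of $\bW$.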

1.1.2 Implement gradient computations (10 points)

Implement the gradient calculations you derived in the previous question. Write a function logreg_gradient(x, t, w, b) that returns the gradients $\nabla_{\bw_j} \mathcal{L}^{(n)}$ (for each $j$) and $\nabla_{\bb} \mathcal{L}^{(n)}$, i.e. the first partial derivatives of the log-likelihood w.r.t. the parameters $\bW$ and $\bb$, evaluated at a single datapoint (x, t). The computation will contain roughly the following intermediate variables:

$ \log \bq \rightarrow Z \rightarrow \log \bp\,,\, \mathbf{\delta}^q $

followed by computation of the gradient vectors $\nabla_{\bw_j} \mathcal{L}^{(n)}$ (contained in a $784 \times 10$ matrix) and $\nabla_{\bb} \mathcal{L}^{(n)}$ (a $10 \times 1$ vector).

For maximum points, ensure the function is numerically stable.


In [5]:
import numpy as np

# 1.1.2 Compute gradient of log p(t|x;w,b) wrt w and b
def logreg_gradient(x, t, w, b):
    """
    Return the log likelihood of the data point's target,
    together with the gradient with respect to w and b.
    """
    logq = np.array(np.matmul(x, w) + b)

    if len(logq.shape) == 1:
        logq = np.expand_dims(logq, 0)
    
    if len(x.shape) == 1:
        x = np.expand_dims(x, 0)

    # Calculate log z with trick mentioned above
    a = np.max(logq)
    logz = a + np.log(np.sum(np.exp(logq - a)))
    logp = logq - logz
    
    zeros = np.zeros(logq.shape)
    zeros[:, t[0]] = 1
    dq = zeros - np.power(np.e, logq - logz)
    
    # Final derivatives of the log-likelihood
    dL_dw = np.transpose(np.matmul(dq.T, x))
    dL_db = dq
    return logp[:,t].squeeze(), dL_dw, dL_db.squeeze()

In [6]:
np.random.seed(123)
# w: 784 x 10 weight matrix, b: 10-dimensional bias vector
w = np.random.normal(size=(28*28,10), scale=0.001)
# w = np.zeros((784,10))
b = np.zeros((10,))

# test gradients, train on 1 sample
logpt, grad_w, grad_b = logreg_gradient(x_train[0:1,:], t_train[0:1], w, b)

print("Test gradient on one point")
print("Likelihood:\t", logpt)
print("\nGrad_W_ij\t",grad_w.shape,"matrix")
print("Grad_W_ij[0,152:158]=\t", grad_w[152:158,0])
print("\nGrad_B_i shape\t",grad_b.shape,"vector")
print("Grad_B_i=\t", grad_b.T)
print("i in {0,...,9}; j in M")

assert logpt.shape == (), logpt.shape
assert grad_w.shape == (784, 10), grad_w.shape
assert grad_b.shape == (10,), grad_b.shape


Test gradient on one point
Likelihood:	 -2.2959726720744777

Grad_W_ij	 (784, 10) matrix
Grad_W_ij[0,152:158]=	 [-0.04518971 -0.06758809 -0.07819784 -0.09077237 -0.07584012 -0.06365855]

Grad_B_i shape	 (10,) vector
Grad_B_i=	 [-0.10020327 -0.09977827 -0.1003198   0.89933657 -0.10037941 -0.10072863
 -0.09982729 -0.09928672 -0.09949324 -0.09931994]
i in {0,...,9}; j in M

In [7]:
# It's always good to check your gradient implementations with finite difference checking:
# Scipy provides the check_grad function, which requires flat input variables.
# So we write two helper functions that compute the output and the gradient with 'flat' weights:
from scipy.optimize import check_grad

np.random.seed(123)
# w: 784 x 10 weight matrix, b: 10-dimensional bias vector
w = np.random.normal(size=(28*28,10), scale=0.001)
# w = np.zeros((784,10))
b = np.zeros((10,))

def func(w):
    logpt, grad_w, grad_b = logreg_gradient(x_train[0:1,:], t_train[0:1], w.reshape(784,10), b)
    return logpt
def grad(w):
    logpt, grad_w, grad_b = logreg_gradient(x_train[0:1,:], t_train[0:1], w.reshape(784,10), b)
    return grad_w.flatten()
finite_diff_error = check_grad(func, grad, w.flatten())
print('Finite difference error grad_w:', finite_diff_error)
assert finite_diff_error < 1e-3, 'Your gradient computation for w seems off'

def func(b):
    logpt, grad_w, grad_b = logreg_gradient(x_train[0:1,:], t_train[0:1], w, b)
    return logpt
def grad(b):
    logpt, grad_w, grad_b = logreg_gradient(x_train[0:1,:], t_train[0:1], w, b)
    return grad_b.flatten()
finite_diff_error = check_grad(func, grad, b)
print('Finite difference error grad_b:', finite_diff_error)
assert finite_diff_error < 1e-3, 'Your gradient computation for b seems off'


Finite difference error grad_w: 6.36129469427e-07
Finite difference error grad_b: 5.23511749186e-08

In [ ]:

1.1.3 Stochastic gradient descent (10 points)

Write a function sgd_iter(x_train, t_train, w, b) that performs one iteration of stochastic gradient descent (SGD), and returns the new weights. It should go through the training set once in randomized order, call logreg_gradient(x, t, w, b) for each datapoint to get the gradients, and update the parameters using a small learning rate of 1E-6. Note that in this case we're maximizing the likelihood function, so we are actually performing gradient ascent... For more information about SGD, see Bishop 5.2.4 or an online source (e.g. https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
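Concretely, since we are maximizing the log-likelihood, each per-datapoint update is a gradient ascent step with learning rate $\eta = 10^{-6}$:

$ \bW \leftarrow \bW + \eta \, \nabla_{\bW} \mathcal{L}^{(n)}, \qquad \bb \leftarrow \bb + \eta \, \nabla_{\bb} \mathcal{L}^{(n)} $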


In [8]:
import random

def predict(x, w, b):
    """
    Return the vector of log-probabilities over all classes for a datapoint.
    """
    logq = np.array(np.matmul(x, w) + b)

    if len(logq.shape) == 1:
        logq = np.expand_dims(logq, 0)
    
    if len(x.shape) == 1:
        x = np.expand_dims(x, 0)

    # Calculate log Z with the log-sum-exp trick mentioned above
    a = np.max(logq)
    logz = a + np.log(np.sum(np.exp(logq - a)))
    logp = logq - logz
    return logp

def sgd_iter(x_train, t_train, W, b): 
    """
    Go over all datapoints randomly and adapt the weights and bias terms
    according to the stochastic gradient descent algorithm.
    """
    indices = np.arange(0, len(x_train))
    random.shuffle(indices)
    eta = 1e-6

    for i in indices:
        logp_train, dL_dw, dL_db = logreg_gradient(x_train[i], [t_train[i]], W, b)
        # Gradient ascent on the log-likelihood
        W = W + eta * dL_dw
        b = b + eta * dL_db

    predictions = []
    for i in indices:
        logp_train = predict(x_train[i], W, b)[:, t_train[i]].squeeze()
        predictions.append(np.asscalar(logp_train))

    return predictions, W, b

In [9]:
# Sanity check:
np.random.seed(1243)
w = np.zeros((28*28, 10))
b = np.zeros(10)
    
logp_train, W, b = sgd_iter(x_train[:5], t_train[:5], w, b)

1.2. Train

1.2.1 Train (10 points)

Perform 10 SGD iterations through the training set. Plot (in one graph) the conditional log-probability of the training set and validation set after each iteration.

OUR ANSWER

We created a graph of the average log-probability, to be able to compare the results for the training and validation set. Because the learning rate is so small (1e-6), the average log-probability of both sets is still very similar after 10 iterations.

If you increase the learning rate to 1e-4 or 1e-3, you do see differences between the two sets, with the average log-probability being higher for the training set (overfitting on the training set).


In [11]:
from matplotlib import pyplot as plt

def test_sgd(x_train, t_train, w, b):
    y_train = []
    y_valid = []
    k = 10
    
    for i in range(k):
        predictions_train, w, b = sgd_iter(x_train, t_train, w, b)
        
        # Check how well the weights generalize for the validation set
        predictions_valid = [predict(x, w, b)[:, t_valid[i]].squeeze()
                             for i, x in enumerate(x_valid)]

        y_train.append(np.mean(predictions_train))
        y_valid.append(np.mean(predictions_valid))
    plt.plot(np.arange(1, k+1), y_train, label = "Training set")
    plt.plot(np.arange(1, k+1), y_valid, label = "Validation set")
    plt.legend()
    plt.xlabel("Iteration")
    plt.ylabel("Avg. log-probability")
    plt.show()
    return w, b
    
np.random.seed(1243)
w = np.zeros((28*28, 10))
b = np.zeros(10)

w,b = test_sgd(x_train, t_train, w, b)



1.2.2 Visualize weights (10 points)

Visualize the resulting parameters $\bW$ after a few iterations through the training set, by treating each column of $\bW$ as an image. If you want, you can use or edit the plot_digits(...) above.


In [12]:
def plot_digits(data, num_cols, targets=None, shape=(28,28)):
    num_digits = data.shape[0]

    num_rows = int(num_digits/num_cols)
    for i in range(num_digits):
        plt.subplot(num_rows, num_cols, i+1)
        plt.imshow(data[i].reshape(shape), interpolation='none', cmap='Greys')
        if targets is not None:
            plt.title(int(targets[i]))
        plt.colorbar()
        plt.axis('off')
    plt.tight_layout()
    plt.show()
    
plot_digits(np.transpose(w), num_cols=5)


Describe in less than 100 words why these weights minimize the loss.

OUR ANSWER

The plot illustrates that every column $j$ of $\bW$ is 'specialized' in recognising one digit. The colours represent the penalty or reward for pixel intensity at that position in a datapoint $\bx$: intensity in the grey area does not matter (it contributes 0 to the likelihood), intensity in the black area should be present if $j$ is $\bx$'s target (positive contribution), and intensity in the white area should be absent (negative contribution). If $\bx$ follows this pattern, the likelihood of its target is maximized and thus the loss is minimized. The fuzziness in the digit shapes accounts for variations in handwriting.

1.2.3. Visualize the 8 hardest and 8 easiest digits (10 points)

Visualize the 8 digits in the validation set with the highest probability of the true class label under the model. Also plot the 8 digits that were assigned the lowest probability. Ask yourself if these results make sense.


In [13]:
predictions_valid = np.array([predict(x, w, b)[:, t_valid[i]].squeeze() for i, x in enumerate(x_valid)])
ordered = predictions_valid.argsort()
ind_worst = ordered[:8]
ind_best = ordered[-8:]

print("Lowest probability for the true class label")
plot_digits(x_valid[ind_worst], 4, t_valid[ind_worst])

print("Highest probability for the true class label")
plot_digits(x_valid[ind_best], 4, t_valid[ind_best])


Lowest probability for the true class label
Highest probability for the true class label

We think these images make sense:

BAD IMAGES:

1. The very first bad digit is something that even we cannot really classify. It looks like an 'a' or a 0, but according to its target it should be a 4.
2. The two nines are very different from the 9 'prototype' that we have seen in the weights: the curly part at the bottom of the nine is not really visible. Therefore the log-likelihood for the target (9) is very low.
3. The diagonal stripe might be a 1 written diagonally, but it is very different from regular digit handwriting, thus resulting in a very low log-likelihood for class 1.
4. The curly 1 is very atypical when we compare it to the learned weights, resulting in a low log-likelihood for class 1.
5. The fifth digit is probably a 7, but we can see how its curliness results in a very low log-likelihood for 7, given the weights for class 7.

GOOD IMAGES: These are all zeros, probably because there is much less variation in zeros compared to the other digits. We can see this from the visualized weights. Additionally, the other digits do not really look like zeros, so a zero is unlikely to be misclassified.

Part 2. Multilayer perceptron

You discover that the predictions by the logistic regression classifier are not good enough for your application: the model is too simple. You want to increase the accuracy of your predictions by using a better model. For this purpose, you're going to use a multilayer perceptron (MLP), a simple kind of neural network. The perceptron will have a single hidden layer $\bh$ with $L$ elements. The parameters of the model are $\bV$ (connections between input $\bx$ and hidden layer $\bh$), $\ba$ (the biases/intercepts of $\bh$), $\bW$ (connections between $\bh$ and $\log \bq$) and $\bb$ (the biases/intercepts of $\log \bq$).

The conditional probability of the class label $j$ is given by:

$\log p(t = j \;|\; \bx, \bb, \bW) = \log q_j - \log Z$

where $q_j$ are again the unnormalized probabilities per class, and $Z = \sum_j q_j$ is again the probability normalizing factor. Each $q_j$ is computed using:

$\log q_j = \bw_j^T \bh + b_j$

where $\bh$ is a $L \times 1$ vector with the hidden layer activations (of a hidden layer with size $L$), and $\bw_j$ is the $j$-th column of $\bW$ (a $L \times 10$ matrix). Each element of the hidden layer is computed from the input vector $\bx$ using:

$h_j = \sigma(\bv_j^T \bx + a_j)$

where $\bv_j$ is the $j$-th column of $\bV$ (a $784 \times L$ matrix), $a_j$ is the $j$-th element of $\ba$, and $\sigma(.)$ is the so-called sigmoid activation function, defined by:

$\sigma(x) = \frac{1}{1 + \exp(-x)}$

Note that this model is almost equal to the multiclass logistic regression model, but with an extra 'hidden layer' $\bh$. The activations of this hidden layer can be viewed as features computed from the input, where the feature transformation ($\bV$ and $\ba$) is learned.
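As a reading aid, here is a minimal (unvectorized) numpy sketch of this forward pass for a single input, using the shapes above; the function name mlp_forward is ours and purely illustrative.

def mlp_forward(x, V, a, W, b):
    # x: (784,), V: (784, L), a: (L,), W: (L, 10), b: (10,)
    h = 1.0 / (1.0 + np.exp(-(V.T @ x + a)))     # hidden activations, shape (L,)
    logq = W.T @ h + b                           # unnormalized log-probabilities, shape (10,)
    m = np.max(logq)
    logz = m + np.log(np.sum(np.exp(logq - m)))  # log-sum-exp trick again
    return logq - logz                           # log p(t = j | x), shape (10,)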

2.1 Derive gradient equations (20 points)

State (shortly) why $\nabla_{\bb} \mathcal{L}^{(n)}$ is equal to the earlier (multiclass logistic regression) case, and why $\nabla_{\bw_j} \mathcal{L}^{(n)}$ is almost equal to the earlier case.

Like in multiclass logistic regression, you should use intermediate variables $\mathbf{\delta}_j^q$. In addition, you should use intermediate variables $\mathbf{\delta}_j^h = \frac{\partial \mathcal{L}^{(n)}}{\partial h_j}$.

Given an input image, roughly the following intermediate variables should be computed:

$ \log \bq \rightarrow Z \rightarrow \log \bp \rightarrow \mathbf{\delta}^q \rightarrow \mathbf{\delta}^h $

where $\mathbf{\delta}_j^h = \frac{\partial \mathcal{L}^{(n)}}{\partial \bh_j}$.

Give the equations for computing $\mathbf{\delta}^h$, and for computing the derivatives of $\mathcal{L}^{(n)}$ w.r.t. $\bW$, $\bb$, $\bV$ and $\ba$.

You can use the convenient fact that $\frac{\partial}{\partial x} \sigma(x) = \sigma(x) (1 - \sigma(x))$.

OUR ANSWER

  • $\nabla_{\bb} \mathcal{L}^{(n)}$ is equal to the multiclass logistic regression case, because $\log q_j$ depends on $b_j$ in exactly the same (additive) way in both models.
  • $\nabla_{\bw_j} \mathcal{L}^{(n)}$ is almost equal to that case: $\mathbf{\delta}_j^q$ has the same form, but $\bw_j$ now multiplies the hidden activations $\bh$ (sigmoid mappings of $\bx$) instead of $\bx$ itself, so $\bh$ takes the place of $\bx$ in the gradient.

For $j=t^{(n)}$: $\delta^q_j = \frac{\partial \mathcal{L}}{\partial \log p_j} \frac{\partial \log p_j}{\partial \log q_j} + \frac{\partial \mathcal{L}}{\partial \log Z} \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} = 1 - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j}$

For $j \neq t^{(n)}$: $\delta^q_j = - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j}$

$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial \log q_j} \frac{\partial \log q_j}{\partial W_{ij}} = \delta^q_j \frac{\partial}{\partial W_{ij}} \left( \sum^L_{i=1} W_{ij} h_i + b_j \right) = \delta^q_j h_i$

$\frac{\partial \mathcal{L}}{\partial b_j} = \frac{\partial \mathcal{L}}{\partial \log q_j} \frac{\partial \log q_j}{\partial b_j} = \delta_j^q \quad\rightarrow\quad \nabla_\bb \mathcal{L} = \mathbf{\delta}^q$

$\delta^h_i = \frac{\partial \mathcal{L}}{\partial h_i} = \sum_j \frac{\partial \mathcal{L}}{\partial \log q_j} \frac{\partial \log q_j}{\partial h_i} = \sum_j \delta^q_j W_{ij}$

$\frac{\partial \mathcal{L}}{\partial a_i} = \delta^h_i \frac{\partial h_i}{\partial a_i} = \delta^h_i h_i (1 - h_i)$

$\frac{\partial \mathcal{L}}{\partial V_{ki}} = \delta^h_i \frac{\partial h_i}{\partial V_{ki}} = \delta^h_i h_i (1 - h_i) x_k \quad\rightarrow\quad \nabla_{\bv_i} \mathcal{L} = \delta^h_i h_i (1 - h_i)\, \bx$
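For reference, a per-datapoint numpy sketch of these gradients (the name mlp_gradient and its signature are ours, not a required interface; t is the integer class label):

def mlp_gradient(x, t, V, a, W, b):
    # Forward pass
    h = 1.0 / (1.0 + np.exp(-(V.T @ x + a)))     # (L,)
    logq = W.T @ h + b                           # (10,)
    m = np.max(logq)
    logp = logq - (m + np.log(np.sum(np.exp(logq - m))))
    # Backward pass, following the equations above
    dq = -np.exp(logp)
    dq[t] += 1.0                                 # delta^q_j = [j == t] - p_j
    dh = W @ dq                                  # delta^h_i = sum_j delta^q_j W_ij
    grad_W = np.outer(h, dq)                     # dL/dW_ij = delta^q_j h_i
    grad_b = dq
    grad_a = dh * h * (1.0 - h)                  # dL/da_i = delta^h_i h_i (1 - h_i)
    grad_V = np.outer(x, grad_a)                 # dL/dV_ki = delta^h_i h_i (1 - h_i) x_k
    return logp[t], grad_V, grad_a, grad_W, grad_b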

2.2 MAP optimization (10 points)

You derived equations for finding the maximum likelihood solution of the parameters. Explain, in a few sentences, how you could extend this approach so that it optimizes towards a maximum a posteriori (MAP) solution of the parameters, with a Gaussian prior on the parameters.

OUR ANSWER

MAP estimation follows from Bayes' rule: the posterior is proportional to likelihood $\cdot$ prior. Applied to sequential learning, as we saw in class, this gives $p(W \mid (x_N, t_N), \{(x_i, t_i)\}_{i=1}^{N-1}, b) \propto p(t_N \mid x_N, W, b)\, p(W \mid \{(x_i, t_i)\}_{i=1}^{N-1})$, where for a Gaussian prior on the parameters $p(W \mid \{(x_i, t_i)\}_{i=1}^{N-1}) = \mathcal{N}(W \mid m_{N-1}, S_{N-1})$, with $m_{N-1}$ and $S_{N-1}$ computed from all datapoints seen up to that point. The same construction extends to the other unknowns $V$, $a$ and $b$.

In 2.1 we took derivatives of the log-likelihood. To extend this approach to the MAP solution of the parameters, we instead take derivatives of the log-posterior, i.e. of $\log(\text{likelihood} \cdot \text{prior}) = \log \text{likelihood} + \log \text{prior}$, w.r.t. $V$, $W$, $a$ and $b$.
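As a sketch of what this changes in the updates: with a zero-mean isotropic Gaussian prior $p(\bW) = \mathcal{N}(\mathbf{0}, \lambda^{-1}\bI)$ on the weights (and similarly for the other parameters), the log-prior contributes $-\frac{\lambda}{2}\|\bW\|^2$ plus a constant, so its gradient adds a weight-decay term $-\lambda \bW$ to the update; the precision $\lambda$ is a hyperparameter we introduce here only for illustration:

$ \bW \leftarrow \bW + \eta \left( \nabla_{\bW} \mathcal{L}^{(n)} - \tfrac{\lambda}{N} \bW \right) $

(in per-datapoint SGD the prior gradient is typically divided by the number of datapoints $N$, so that one full pass over the data accounts for the prior exactly once).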

2.3. Implement and train a MLP (15 points)

Implement an MLP model with a single hidden layer of 20 neurons. Train the model for 10 epochs. Plot (in one graph) the conditional log-probability of the training set and validation set after each two iterations, as well as the weights.

  • 10 points: Working MLP that learns with plots
  • +5 points: Fast, numerically stable, vectorized implementation

As suggested in errata_lab2.pdf we have visualized the average log probability to be able to compare the training and the validation set.

We have also experimented with different values of eta: 1e-1, 1e-2 and 1e-3. This is visualized in the image shown below this piece of text. eta = 1e-1 resulted in very unstable results, and eta = 1e-3 learned very slowly. Therefore, we have used eta = 1e-2 in this assignment.


In [16]:
# Write all helper functions here
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsig(s):
    # Derivative of the sigmoid, expressed in terms of its output s = sigmoid(x)
    return np.multiply(s, (1 - s))

def calc_h(x, v, a):
    return sigmoid(np.matmul(v.T, x) + a).squeeze()

In [18]:
# Write training code here:
# Plot the conditional loglikelihoods for the train and validation dataset after every iteration.
# Plot the weights of the first layer.

def sgd_iter(x_train, t_train, V, W, a, b): 
    """
    Go over all datapoints randomly and adapt the weights and bias terms
    according to the stochastic gradient descent algorithm.
    """
    indices = np.arange(0, len(x_train))
    random.shuffle(indices)

    for i in indices:
        h = calc_h(x_train[i], V, a)
        _, dL_dw, dL_db = logreg_gradient(h, [t_train[i]], W, b)

        dL_dh = np.matmul(W, dL_db)
        dL_da = np.multiply(dL_dh, dsig(h))
        dL_dv = np.multiply(np.matrix(x_train[i]).T, np.matrix(dL_da))

        V = V + np.multiply(eta, dL_dv)
        W = W + np.multiply(eta, dL_dw)
        a = a + np.multiply(eta, dL_da)
        b = b + np.multiply(eta, dL_db)

    predictions = []
    for i in indices:
        logp_train = predict(calc_h(x_train[i], V, a), W, b)[:, t_train[i]].squeeze()
        predictions.append(np.asscalar(logp_train))

    return predictions, V, W, a, b

def test_sgd(x_train, t_train, V, W, a, b):
    y_train = []
    y_valid = []
    k = 10
    eta = 1e-2
    
    for i in range(k):
        predictions_train, V, W, a, b = sgd_iter(x_train, t_train, V, W, a, b)
        
        # Check how well the weights generalize for the validation set
        predictions_valid = [predict(calc_h(x, V, a), W, b)[:, t_valid[i]].squeeze()
                             for i, x in enumerate(x_valid)]

        y_train.append(np.mean(predictions_train))
        y_valid.append(np.mean(predictions_valid))
        
    plt.plot(np.arange(1, k+1), y_train, label = "Training set")
    plt.plot(np.arange(1, k+1), y_valid, label = "Validation set")
    plt.legend()
    plt.xlabel("Iteration")
    plt.ylabel("Avg. log-probability")
    plt.show()
    return V, W, a, b

K = x_train.shape[1] # No. of input units
L = 20 # No. of hidden units
M = 10 # No. of output units
eta = 1e-2
# Initialize weights gaussian
W = np.random.normal(loc=0, scale=0.025, size=(L, M))
b = np.random.uniform(size=(10,))
a = np.random.uniform(size=(L,))
V = np.random.normal(loc=0, scale=0.025, size=(784, L))

V, W, a, b = test_sgd(x_train, t_train, V, W, a, b)
plot_digits(V.T, num_cols=5)


2.3.1. Explain the weights (5 points)

In less than 80 words, explain how and why the weights of the hidden layer of the MLP differ from the logistic regression model, and relate this to the stronger performance of the MLP.

OUR ANSWER

The weights represent parts of the digits that are captured: curves and lines. The hidden layer does not map digits to their class (as in 1.2.2), but recognizes features of digits. This allows the MLP to capture finer variations between digits, which results in a higher classification accuracy.

2.3.2. Less than 250 misclassifications on the test set (10 bonus points)

You receive an additional 10 bonus points if you manage to train a model with very high accuracy: at most 2.5% misclassified digits on the test set. Note that the test set contains 10000 digits, so your model should misclassify at most 250 digits. This should be achievable with an MLP model with one hidden layer. See results of various models at: http://yann.lecun.com/exdb/mnist/index.html. To reach such a low error rate, you probably need a very large $L$ (many hidden units), probably $L > 200$, and a strong Gaussian prior on the weights. In this case you are allowed to use the validation set for training. You are allowed to add additional layers, and use convolutional networks, although that is probably not required to reach 2.5% misclassifications.

OUR ANSWER

In an attempt to push the accuracy as high as possible, we used a learning rate that adapts itself. Because training takes quite a while, we were unable to reach 97.5% accuracy before 23:59. However, we believe that if you set eta to 0.01 and then decrease it along the way, you will get there within 25 iterations.


In [28]:
predict_test = np.zeros(len(t_test))
# Fill predict_test with the predicted targets from your model, don't cheat :-).
# YOUR CODE HERE

# Write training code here:
# Plot the conditional loglikelihoods for the train and validation dataset after every iteration.
# Plot the weights of the first layer.

def sgd_iter2(x_train, t_train, V, W, a, b, eta): 
    """
    Go over all datapoints randomly and adapt the weights and bias terms
    according to the stochastic gradient descent algorithm.
    """
    indices = np.arange(0, len(x_train))
    random.shuffle(indices)
    

    for i in indices:
        h = calc_h(x_train[i], V, a)
        _, dL_dw, dL_db = logreg_gradient(h, [t_train[i]], W, b)
        dL_dh = np.matmul(W, dL_db)
        dL_da = np.multiply(dL_dh, dsig(h))
        dL_dv = np.multiply(np.matrix(x_train[i]).T, np.matrix(dL_da))

        V = V + np.multiply(eta, dL_dv)
        W = W + np.multiply(eta, dL_dw)
        a = a + np.multiply(eta, dL_da)
        b = b + np.multiply(eta, dL_db)

    predictions = []
    for i in indices:
        logp_train = predict(calc_h(x_train[i], V, a), W, b)[:, t_train[i]].squeeze()
        predictions.append(np.asscalar(logp_train))

    return predictions, V, W, a, b

def test_sgd_super(x_train, t_train, V, W, a, b):
    y_train = []
    y_valid = []
    k = 25
    
    eta = 0.1
    for i in range(k):
        _, V, W, a, b = sgd_iter2(x_train, t_train, V, W, a, b, eta)
        _, V, W, a, b = sgd_iter2(x_valid, t_valid, V, W, a, b, eta)

        for j, x in enumerate(x_test):
            all_classes = predict(calc_h(x, V, a), W, b).tolist()[0]
            predict_test[j] = all_classes.index(max(all_classes))
        p = (10000 - np.sum(predict_test != t_test)) / 10000
        print("Iteration {} done for eta {:.5f}, {:.2f}% correct.".format(i+1, eta, p * 100))
        eta = eta * 0.5
        if (p*100) > 97:
            eta = 1e-4
        if (p*100) > 97.5:
            break
    return V, W, a, b

# init all values with gaussians
K = 10
L = 205

eta = 1e-4
W = np.random.normal(loc =0, scale =0.25, size=(L, K))
b = np.random.uniform(size=(K,))
a = np.random.uniform(size=(L,))
V = np.random.normal(loc=0, scale=0.25, size=(28*28, L))
V, W, a, b = test_sgd_super(x_train, t_train, V, W, a, b)


Iteration 1 done for eta 0.10000, 95.03% correct.
Iteration 2 done for eta 0.05000, 97.21% correct.
Iteration 3 done for eta 0.00010, 97.32% correct.
Iteration 4 done for eta 0.00010, 97.44% correct.
Iteration 5 done for eta 0.00010, 97.45% correct.
Iteration 6 done for eta 0.00010, 97.49% correct.
Iteration 7 done for eta 0.00010, 97.53% correct.

In [29]:
assert predict_test.shape == t_test.shape
n_errors = np.sum(predict_test != t_test)
print('Test errors: %d' % n_errors)


Test errors: 247

In [ ]: