Notes on implementation:
$\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bb}{\mathbf{b}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\ba}{\mathbf{a}}$ $\newcommand{\bz}{\mathbf{z}}$ $\newcommand{\bv}{\mathbf{v}}$ $\newcommand{\bq}{\mathbf{q}}$ $\newcommand{\bp}{\mathbf{p}}$ $\newcommand{\bh}{\mathbf{h}}$
$\newcommand{\bI}{\mathbf{I}}$ $\newcommand{\bX}{\mathbf{X}}$ $\newcommand{\bT}{\mathbf{T}}$ $\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bW}{\mathbf{W}}$ $\newcommand{\bV}{\mathbf{V}}$
In [2]:
%pylab inline
import gzip, cPickle
Scenario: you have a friend with one big problem: she's completely blind. You decide to help her: she has a special smartphone for blind people, and you are going to develop a mobile app that does machine vision with the phone's camera, converting a picture from the camera into the meaning of the image. You decide to start with an app that can read handwritten digits, i.e. convert an image of handwritten digits to text (e.g. it would enable her to read precious handwritten phone numbers).
A key building block for such an app would be a function predict_digit(x)
that returns the digit class of an image patch $\bx$. Since hand-coding this function is highly non-trivial, you decide to solve the problem with machine learning, so that the function's internal parameters are learned automatically from data.
The dataset you're going to use for this is the MNIST handwritten digits dataset (http://yann.lecun.com/exdb/mnist/
). You can load the data from mnist.pkl.gz
we provided, using:
In [3]:
def load_mnist():
    f = gzip.open('mnist.pkl.gz', 'rb')
    data = cPickle.load(f)
    f.close()
    return data
(x_train, t_train), (x_valid, t_valid), (x_test, t_test) = load_mnist()
The tuples represent train, validation and test sets. The first element (x_train
, x_valid
, x_test
) of each tuple is a $N \times M$ matrix, where $N$ is the number of datapoints and $M = 28^2 = 784$ is the dimensionality of the data. The second element (t_train
, t_valid
, t_test
) of each tuple is the corresponding $N$-dimensional vector of integers, containing the true class labels.
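For example, a quick shape check (an illustrative snippet; the exact number of training rows depends on the split in the provided file):
print(x_train.shape)   # (N_train, 784): one flattened 28x28 image per row
print(t_train.shape)   # (N_train,): integer class labels
print(x_test.shape)    # the test set contains 10000 digits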
Here's a visualisation of the first 8 digits of the trainingset:
In [4]:
def plot_digits(data, numcols, shape=(28,28)):
    numdigits = data.shape[0]
    numrows = int(numdigits/numcols)
    for i in range(numdigits):
        # subplot indices are 1-based in matplotlib
        plt.subplot(numrows, numcols, i+1)
        plt.axis('off')
        plt.imshow(data[i].reshape(shape), interpolation='nearest', cmap='Greys')
    plt.show()
plot_digits(x_train[0:8], numcols=4)
In multiclass logistic regression, the conditional probability of class label $j$ given the image $\bx$ for some datapoint is given by:
$ \log p(t = j \;|\; \bx, \bb, \bW) = \log q_j - \log Z$
where $\log q_j = \bw_j^T \bx + b_j$ (the log of the unnormalized probability of the class $j$), and $Z = \sum_k q_k$ is the normalizing factor. $\bw_j$ is the $j$-th column of $\bW$ (a matrix of size $784 \times 10$) corresponding to the class label, $b_j$ is the $j$-th element of $\bb$.
Given an input image, the multiclass logistic regression model first computes the intermediate vector $\log \bq$ (of size $10 \times 1$), using $\log q_j = \bw_j^T \bx + b_j$, containing the unnormalized log-probabilities per class.
The unnormalized probabilities are then normalized by $Z$ such that $\sum_j p_j = \sum_j \exp(\log p_j) = 1$. This is done by $\log p_j = \log q_j - \log Z$ where $Z = \sum_j \exp(\log q_j)$. This is known as the softmax transformation, and it is also used as the last layer of many neural network classification models, to ensure that the output of the network is a normalized distribution, regardless of the values of the second-to-last layer ($\log \bq$).
Warning: when computing $\log Z$, you are likely to encounter numerical problems. Save yourself countless hours of debugging and learn the log-sum-exp trick.
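For reference, a minimal sketch of the log-sum-exp trick in NumPy (not part of the provided code): subtracting the maximum before exponentiating leaves $\log Z$ unchanged but avoids overflow.
import numpy as np

def log_sum_exp(ln_q):
    # log(sum(exp(ln_q))), computed as a + log(sum(exp(ln_q - a))) with a = max(ln_q)
    a = np.max(ln_q)
    return a + np.log(np.sum(np.exp(ln_q - a)))

# np.log(np.sum(np.exp([1000., 1000.]))) overflows to inf,
# while log_sum_exp(np.array([1000., 1000.])) returns 1000.0 + log(2).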
The network's output $\log \bp$ of size $10 \times 1$ then contains the conditional log-probabilities $\log p(t = j \;|\; \bx, \bb, \bW)$ for each digit class $j$. In summary, the computations are done in this order:
$\bx \rightarrow \log \bq \rightarrow Z \rightarrow \log \bp$
Given some dataset with $N$ independent, identically distributed datapoints, the log-likelihood is given by:
$ \mathcal{L}(\bb, \bW) = \sum_{i=1}^N \mathcal{L}^{(i)}$
where we use $\mathcal{L}^{(i)}$ to denote the partial log-likelihood evaluated over a single datapoint. It is important to see that the log-probability of the class label $t^{(i)}$ given the image, is given by the $t^{(i)}$-th element of the network's output $\log \bp$, denoted by $\log p_{t^{(i)}}$:
$\mathcal{L}^{(i)} = \log p(t = t^{(i)} \;|\; \bx = \bx^{(i)}, \bb, \bW) = \log p_{t^{(i)}} = \log q_{t^{(i)}} - \log Z^{(i)}$
where $\bx^{(i)}$ and $t^{(i)}$ are the input (image) and class label (integer) of the $i$-th datapoint, and $Z^{(i)}$ is the normalizing constant for the distribution over $t^{(i)}$.
Derive the equations for computing the (first) partial derivatives of the log-likelihood w.r.t. all the parameters, evaluated at a single datapoint $i$.
You should start deriving the equations for $\frac{\partial \mathcal{L}^{(i)}}{\partial \log q_j}$ for each $j$. For clarity, we'll use the shorthand $\delta^q_j = \frac{\partial \mathcal{L}^{(i)}}{\partial \log q_j}$.
For $j = t^{(i)}$: $ \delta^q_j = \frac{\partial \left( \log q_{t^{(i)}} - \log Z \right)}{\partial \log q_j} = 1 - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} $
For $j \neq t^{(i)}$: $ \delta^q_j = \frac{\partial \mathcal{L}^{(i)}}{\partial \log Z} \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} = - \frac{\partial \log Z}{\partial Z} \frac{\partial Z}{\partial \log q_j} $
Complete the above derivations for $\delta^q_j$ by further developing $\frac{\partial \log Z}{\partial Z}$ and $\frac{\partial Z}{\partial \log q_j}$. Both are quite simple. For these it doesn't matter whether $j = t^{(i)}$ or not.
Given your equations for computing the gradients $\delta^q_j$ it should be quite straightforward to derive the equations for the gradients of the parameters of the model, $\frac{\partial \mathcal{L}^{(i)}}{\partial W_{ij}}$ and $\frac{\partial \mathcal{L}^{(i)}}{\partial b_j}$. The gradients for the biases $\bb$ are given by:
$ \frac{\partial \mathcal{L}^{(i)}}{\partial b_j} = \frac{\partial \mathcal{L}^{(i)}}{\partial \log q_j} \frac{\partial \log q_j}{\partial b_j} = \delta^q_j \cdot 1 = \delta^q_j $
The equation above gives the derivative of $\mathcal{L}^{(i)}$ w.r.t. a single element of $\bb$, so the vector $\nabla_\bb \mathcal{L}^{(i)}$ with all derivatives of $\mathcal{L}^{(i)}$ w.r.t. the bias parameters $\bb$ is:
$ \nabla_\bb \mathcal{L}^{(i)} = \mathbf{\delta}^q $
where $\mathbf{\delta}^q$ denotes the vector of size $10 \times 1$ with elements $\mathbf{\delta}_j^q$.
The (not fully developed) equation for computing the derivative of $\mathcal{L}^{(i)}$ w.r.t. a single element $W_{ij}$ of $\bW$ is:
$ \frac{\partial \mathcal{L}^{(i)}}{\partial W_{ij}} = \frac{\partial \mathcal{L}^{(i)}}{\partial \log q_j} \frac{\partial \log q_j}{\partial W_{ij}} = \mathbf{\delta}_j^q \frac{\partial \log q_j}{\partial W_{ij}} $
What is $\frac{\partial \log q_j}{\partial W_{ij}}$? Complete the equation above.
If you want, you can give the resulting equation in vector format ($\nabla_{\bw_j} \mathcal{L}^{(i)} = ...$), like we did for $\nabla_\bb \mathcal{L}^{(i)}$.
$\frac{\partial \log Z}{\partial Z}\frac{\partial Z}{\partial \log q_j} = \frac{1}{Z} \sum_k \frac{\partial \exp(\log q_k)}{\partial \log q_j} = \frac{1}{Z} \exp(\log q_j) = \frac{q_j}{\sum_k q_k}$
$ \frac{\partial \mathcal{L}^{(i)}}{\partial W_{ij}} = \frac{\partial \mathcal{L}^{(i)}}{\partial \log q_j} \frac{\partial \log q_j}{\partial W_{ij}} = \mathbf{\delta}_j^q \frac{\partial \log q_j}{\partial W_{ij}} = \mathbf{\delta}_{j}^q \frac{\partial (\bw_j^T \bx + b_j)}{\partial W_{ij}} = \mathbf{\delta}_{j}^q x_i $
$ \nabla_{\bw_j} \mathcal{L}^{(i)} = \mathbf{\delta}_{j}^q \bx $
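Putting the pieces together, the results above can be summarized compactly (this is exactly what the code below implements):
$ \delta^q_j = \begin{cases} 1 - p_j & \text{if } j = t^{(i)} \\ -p_j & \text{otherwise} \end{cases}, \qquad \nabla_\bb \mathcal{L}^{(i)} = \mathbf{\delta}^q, \qquad \nabla_{\bw_j} \mathcal{L}^{(i)} = \mathbf{\delta}^q_j \, \bx $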
Implement the gradient calculations you derived in the previous question. Write a function logreg_gradient(x, t, w, b)
that returns the gradients $\nabla_{\bw_j} \mathcal{L}^{(i)}$ (for each $j$) and $\nabla_{\bb} \mathcal{L}^{(i)}$, i.e. the first partial derivatives of the log-likelihood w.r.t. the parameters $\bW$ and $\bb$, evaluated at a single datapoint (x
, t
).
The computation will contain roughly the following intermediate variables:
$ \log \bq \rightarrow Z \rightarrow \log \bp \rightarrow \mathbf{\delta}^q $
followed by computation of the gradient vectors $\nabla_{\bw_j} \mathcal{L}^{(i)}$ (contained in a $784 \times 10$ matrix) and $\nabla_{\bb} \mathcal{L}^{(i)}$ (a $10 \times 1$ vector).
In [11]:
evaluate_ln_q = lambda W, x, b: W.T.dot(x) + b

def logreg_gradient(x, t, W, b):
    # x: 784 x 1 column vector, t: integer class label, W: 784 x 10, b: 10 x 1
    ln_q = evaluate_ln_q(W, x, b)        # 10 x 1 unnormalized log-probabilities
    # stable softmax: p_j = exp(ln_q_j - max_k ln_q_k) / sum_k exp(ln_q_k - max_k ln_q_k)
    p = exp(ln_q - np.max(ln_q))
    p /= sum(p)
    # delta_q_j = 1[j == t] - p_j
    delta_q = -p
    delta_q[t] += 1
    dW = x.dot(delta_q.T)                # 784 x 10 gradient w.r.t. W
    db = delta_q                         # 10 x 1 gradient w.r.t. b
    return dW, db
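As a sanity check (not required by the assignment), the analytical gradients can be compared against finite differences on a single datapoint. The following is an editorial sketch that assumes the definitions above; check_logreg_gradient and its inner helper ln_p_single are introduced here, not part of the provided code.
def check_logreg_gradient(x, t, W, b, eps=1e-5):
    # log p(t | x) for given parameters, computed with the log-sum-exp trick
    def ln_p_single(W_, b_):
        ln_q = evaluate_ln_q(W_, x, b_)
        m = np.max(ln_q)
        return (ln_q[t] - (m + log(sum(exp(ln_q - m)))))[0]

    dW, db = logreg_gradient(x, t, W, b)
    i, j = 100, 3                        # an arbitrary element of W
    W_plus = W.copy();  W_plus[i, j] += eps
    W_minus = W.copy(); W_minus[i, j] -= eps
    numeric_W = (ln_p_single(W_plus, b) - ln_p_single(W_minus, b)) / (2 * eps)
    print(abs(numeric_W - dW[i, j]))     # should be close to zero
    b_plus = b.copy();  b_plus[j] += eps
    b_minus = b.copy(); b_minus[j] -= eps
    numeric_b = (ln_p_single(W, b_plus) - ln_p_single(W, b_minus)) / (2 * eps)
    print(abs(numeric_b - db[j, 0]))     # should also be close to zero

# example usage (hypothetical), with zero-initialized parameters:
# W0 = zeros((784, 10)); b0 = zeros((10, 1))
# check_logreg_gradient(x_train[0:1, :].T, t_train[0], W0, b0)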
Write a function sgd_iter(x_train, t_train, w, b)
that performs one iteration of stochastic gradient descent (SGD), and returns the new weights. It should go through the trainingset once in randomized order, call logreg_gradient(x, t, w, b)
for each datapoint to get the gradients, and update the parameters using a small learning rate (e.g. 1E-4
). Note that in this case we're maximizing the likelihood function, so we should actually be performing gradient ascent.
In [6]:
def sgd_iter(x_train, t_train, W, b, alpha=1e-4):
    N = size(x_train, 0)
    indices = range(N)
    random.shuffle(indices)
    for index in indices:
        (x, t) = x_train[index:index+1, :], t_train[index]
        # x is a row vector, we pass it as a column vector
        dW, db = logreg_gradient(x.T, t, W, b)
        # Gradient ascent, so we add the gradients
        W += alpha * dW
        b += alpha * db
    return W, b
In [12]:
(x_train, t_train), (x_valid, t_valid), (x_test, t_test) = load_mnist()
K = 10
N, D = shape(x_train)
#W = random.rand(D, K)
#b = random.rand(K, 1)
# Perhaps we've got our gradients wrong, but right now
# the gradient in the weights for each class is a scalar times the input.
# Due to the nature of the data, if we start with randomised weights,
# the resulting filters/basis functions will be noisy.
# Although we're not sure if this harms performance, we'll start
# with zero-weights just in case.
W = zeros((D, K))
b = zeros((K, 1))
def evaluate_ln_p(x, W, b):
    ln_q = evaluate_ln_q(W, x, b)
    a = np.max(ln_q)
    ln_Z = a + log(sum(exp(ln_q - a)))   # log-sum-exp trick
    return ln_q - ln_Z
def ln_p_avg(X, t_eval, W, b):
    S = 0
    N = size(X, 0)
    for i in range(N):
        (x, t) = X[i:i+1, :], t_eval[i]
        ln_p = evaluate_ln_p(x.T, W, b)
        S += ln_p[t]
    return S/N
iterations = 5
train_likelihood = zeros((iterations,1))
test_likelihood = zeros((iterations,1))
for i in range(iterations):
    W, b = sgd_iter(x_train, t_train, W, b, alpha=1e-4)
    train_likelihood[i] = ln_p_avg(x_train, t_train, W, b)
    test_likelihood[i] = ln_p_avg(x_test, t_test, W, b)
    #print 'Iteration %d. Test log likelihood %f' % (i, test_likelihood[i])
plt.plot(range(iterations), train_likelihood, label='Train set')
plt.plot(range(iterations), test_likelihood, label='Test set')
plt.xlabel('Number of iterations through the train set')
plt.ylabel('Average log-likelihood')
plt.legend(bbox_to_anchor=(1, 1),
           bbox_transform=plt.gcf().transFigure)
plt.show()
In [8]:
plot_digits(W.T, numcols=5)
In [9]:
N = size(t_test, 0)
probs = [(evaluate_ln_p(x_test[i:i+1, :].T, W, b)[t_test[i]], x_test[i]) for i in range(N)]
ascending = sorted(probs, key=lambda x: x[0])
descending = reversed(ascending)
D = size(x_test, 1)
lowest = zeros((D, 8))
highest = zeros((D, 8))
for i in range(8):
    lowest[:, i] = ascending[i][1]
    highest[:, i] = descending.next()[1]
plot_digits(lowest.T, numcols=4)
plot_digits(highest.T, numcols=4)
You discover that the predictions by the logistic regression classifier are not good enough for your application: the model is too simple. You want to increase the accuracy of your predictions by using a better model. For this purpose, you're going to use a multilayer perceptron (MLP), a simple kind of neural network. The perceptron will have a single hidden layer $\bh$ with $L$ elements. The parameters of the model are $\bV$ (connections between input $\bx$ and hidden layer $\bh$), $\ba$ (the biases/intercepts of $\bh$), $\bW$ (connections between $\bh$ and $\log \bq$) and $\bb$ (the biases/intercepts of $\log \bq$).
The conditional probability of the class label $j$ is given by:
$\log p(t = j \;|\; \bx, \bb, \bW) = \log q_j - \log Z$
where $q_j$ are again the unnormalized probabilities per class, and $Z = \sum_j q_j$ is again the probability normalizing factor. Each $q_j$ is computed using:
$\log q_j = \bw_j^T \bh + b_j$
where $\bh$ is a $L \times 1$ vector with the hidden layer activations (of a hidden layer with size $L$), and $\bw_j$ is the $j$-th column of $\bW$ (a $L \times 10$ matrix). Each element of the hidden layer is computed from the input vector $\bx$ using:
$h_j = \sigma(\bv_j^T \bx + a_j)$
where $\bv_j$ is the $j$-th column of $\bV$ (a $784 \times L$ matrix), $a_j$ is the $j$-th element of $\ba$, and $\sigma(.)$ is the so-called sigmoid activation function, defined by:
$\sigma(x) = \frac{1}{1 + \exp(-x)}$
Note that this model is almost equal to the multiclass logistic regression model, but with an extra 'hidden layer' $\bh$. The activations of this hidden layer can be viewed as features computed from the input, where the feature transformation ($\bV$ and $\ba$) is learned.
State (shortly) why $\nabla_{\bb} \mathcal{L}^{(i)}$ is equal to the earlier (multiclass logistic regression) case, and why $\nabla_{\bw_j} \mathcal{L}^{(i)}$ is almost equal to the earlier case.
Like in multiclass logistic regression, you should use intermediate variables $\mathbf{\delta}_j^q$. In addition, you should use intermediate variables $\mathbf{\delta}_j^h = \frac{\partial \mathcal{L}^{(i)}}{\partial h_j}$.
Given an input image, roughly the following intermediate variables should be computed:
$ \log \bq \rightarrow Z \rightarrow \log \bp \rightarrow \mathbf{\delta}^q \rightarrow \mathbf{\delta}^h $
where $\mathbf{\delta}_j^h = \frac{\partial \mathcal{L}^{(i)}}{\partial h_j}$.
Give the equations for computing $\mathbf{\delta}^h$, and for computing the derivatives of $\mathcal{L}^{(i)}$ w.r.t. $\bW$, $\bb$, $\bV$ and $\ba$.
You can use the convenient fact that $\frac{\partial}{\partial x} \sigma(x) = \sigma(x) (1 - \sigma(x))$.
$\nabla_{\bb} \mathcal{L}^{(i)}$ is equal to the earlier case because we have the same conditional probability of the class label $j$, $\log p(t = j \;|\; \bx, \bb, \bW) = \log q_j - \log Z$ and hence the same derivative and gradient $\nabla_\bb \mathcal{L}^{(i)} = \mathbf{\delta}^q$
$\nabla_{\bw_j} \mathcal{L}^{(i)}$ differs slightly because in the MLP, $\bW$ connects to the hidden layer values $\bh$ instead of the input values $\bx$. Because of this, and following the derivation in 1.1.1, the gradient is: $ \nabla_{\bw_j} \mathcal{L}^{(i)} = \frac{\partial \mathcal{L}^{(i)}}{\partial \log q_j} \frac{\partial \log q_j}{\partial \bw_j} = \mathbf{\delta}_j^q \frac{\partial (\bw_j^T \bh + b_j)}{\partial \bw_j} = \mathbf{\delta}_{j}^q \bh $
Because $\delta^q$ is the same as before, we can easily find $\delta^h$:
$\delta^h_j = \sum_k \delta^q_k \frac{\partial \log q_k}{\partial h_j} = \sum_k \delta^q_k \frac{\partial (\bw_k^T \bh + b_k)}{\partial h_j} = \sum_k \delta^q_k W_{jk}, \qquad \text{i.e.} \qquad \mathbf{\delta}^h = \bW \mathbf{\delta}^q $
$\nabla_{\bv_j} \mathcal{L}^{(i)} = \delta_j^h \frac{\partial h_j}{\partial \bv_j} = \delta_j^h \, \sigma(\bv_j^T\bx + a_j)\left(1 - \sigma(\bv_j^T\bx + a_j)\right) \bx = \delta_j^h \, h_j (1 - h_j) \, \bx$
$\frac{\partial \mathcal{L}^{(i)}}{\partial a_j} = \delta_j^h \frac{\partial h_j}{\partial a_j} = \delta_j^h \, \sigma(\bv_j^T\bx + a_j)\left(1 - \sigma(\bv_j^T\bx + a_j)\right) = \delta_j^h \, h_j (1 - h_j)$
You derived equations for finding the maximum likelihood solution of the parameters. Explain, in a few sentences, how you could extend this approach so that it optimizes towards a maximum a posteriori (MAP) solution of the parameters, with a Gaussian prior on the parameters.
Answer: with a zero-mean Gaussian prior on the parameters, $p(\bW) \propto \exp(-\frac{\lambda}{2} \|\bW\|^2)$ (and similarly for $\bb$, $\bV$ and $\ba$), the MAP objective is the log-likelihood plus the log-prior: $\mathcal{L}(\bb, \bW) - \frac{\lambda}{2} \|\bW\|^2 + \text{const}$. We optimize it in exactly the same way, but each gradient gets an extra term $-\lambda \bW$ (respectively $-\lambda \bb$, etc.), so every SGD step also shrinks the parameters slightly towards zero (the prior mean). This is equivalent to L2 regularization, also known as weight decay.
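A minimal sketch of the corresponding SGD update (lambda_prior is a hypothetical hyperparameter introduced here; the same modification applies to the MLP updates below):
def sgd_iter_map(x_train, t_train, W, b, alpha=1e-4, lambda_prior=0.01):
    # like sgd_iter, but each step also follows the gradient of the Gaussian
    # log-prior, which is -lambda_prior * parameter (i.e. weight decay)
    N = size(x_train, 0)
    indices = range(N)
    random.shuffle(indices)
    for index in indices:
        (x, t) = x_train[index:index+1, :], t_train[index]
        dW, db = logreg_gradient(x.T, t, W, b)
        W += alpha * (dW - lambda_prior * W)
        b += alpha * (db - lambda_prior * b)
    return W, b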
In [125]:
sigma = lambda z: 1 / (1 + exp(-z))
MLP_evaluate_ln_q = lambda h, W, b: W.T.dot(h) + b
MLP_evaluate_h = lambda x, V, a: sigma(V.T.dot(x) + a)

def MLP_logreg_gradient(x, t, V, W, a, b):
    h = MLP_evaluate_h(x, V, a)            # L x 1 hidden activations
    sigma_gradient = h * (1 - h)           # L x 1, since sigma'(z) = sigma(z)(1 - sigma(z))
    ln_q = MLP_evaluate_ln_q(h, W, b)      # K x 1 unnormalized log-probabilities
    # stable softmax: delta_q_j = 1[j == t] - p_j
    p = exp(ln_q - np.max(ln_q))
    p /= sum(p)
    delta_q = -p
    delta_q[t] += 1
    delta_h = W.dot(delta_q)               # L x 1: dL/dh
    delta_z = delta_h * sigma_gradient     # L x 1: dL/d(V^T x + a)
    dW = h.dot(delta_q.T)                  # L x K
    db = delta_q                           # K x 1
    dV = x.dot(delta_z.T)                  # 784 x L
    da = delta_z                           # L x 1
    return dV, dW, da, db
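# Sanity check (an editorial sketch, not part of the assignment): compare
# MLP_logreg_gradient against finite differences for one element of V.
# Run it with small random parameters (with all-zero parameters the check is
# trivially zero on both sides), e.g. V0 = 0.01 * random.randn(784, 50),
# W0 = 0.01 * random.randn(50, 10), a0 = zeros((50, 1)), b0 = zeros((10, 1)).
def check_MLP_gradient(x, t, V, W, a, b, eps=1e-5):
    def MLP_ln_p_single(V_, W_, a_, b_):
        h = MLP_evaluate_h(x, V_, a_)
        ln_q = MLP_evaluate_ln_q(h, W_, b_)
        m = np.max(ln_q)
        return (ln_q[t] - (m + log(sum(exp(ln_q - m)))))[0]

    dV, dW, da, db = MLP_logreg_gradient(x, t, V, W, a, b)
    i, j = 200, 5                          # an arbitrary element of V
    V_plus = V.copy();  V_plus[i, j] += eps
    V_minus = V.copy(); V_minus[i, j] -= eps
    numeric = (MLP_ln_p_single(V_plus, W, a, b) - MLP_ln_p_single(V_minus, W, a, b)) / (2 * eps)
    print(abs(numeric - dV[i, j]))         # should be close to zero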
def MLP_sgd_iter(x_train, t_train, V, W, a, b, alpha=1e-4):
    N = size(x_train, 0)
    indices = range(N)
    random.shuffle(indices)
    for index in indices:
        (x, t) = x_train[index:index+1, :], t_train[index]
        # x is a row vector, we pass it as a column vector
        dV, dW, da, db = MLP_logreg_gradient(x.T, t, V, W, a, b)
        # Gradient ascent, so we add the gradients
        V += alpha * dV
        W += alpha * dW
        a += alpha * da
        b += alpha * db
    return V, W, a, b
(x_train, t_train), (x_valid, t_valid), (x_test, t_test) = load_mnist()
K = 10
N, D = shape(x_train)
print shape(x_train)
L = 50
V = random.rand(D, L)
W = random.rand(L, K)
a = random.rand(L, 1)
b = random.rand(K, 1)
# As in the logistic regression cell above, we start from zero weights.
# Note that with an all-zero initialization every hidden unit receives the
# same gradient, so the hidden units remain identical; the commented-out
# random initialization above would break this symmetry.
V = zeros((D, L))
W = zeros((L, K))
a = zeros((L, 1))
b = zeros((K, 1))
def MLP_evaluate_ln_p(x, V, W, a, b):
    h = MLP_evaluate_h(x, V, a)
    ln_q = MLP_evaluate_ln_q(h, W, b)
    m = np.max(ln_q)                       # log-sum-exp trick (m, not a, to avoid shadowing the bias argument)
    ln_Z = m + log(sum(exp(ln_q - m)))
    return ln_q - ln_Z
def MLP_ln_p_avg(X, t_eval, V, W, a, b):
    S = 0
    N = size(X, 0)
    for i in range(N):
        (x, t) = X[i:i+1, :], t_eval[i]
        ln_p = MLP_evaluate_ln_p(x.T, V, W, a, b)
        S += ln_p[t]
    return S/N
iterations = 10
train_likelihood = zeros((iterations,1))
test_likelihood = zeros((iterations,1))
for i in range(iterations):
    V, W, a, b = MLP_sgd_iter(x_train, t_train, V, W, a, b, alpha=1.5e-4)
    train_likelihood[i] = MLP_ln_p_avg(x_train, t_train, V, W, a, b)
    test_likelihood[i] = MLP_ln_p_avg(x_test, t_test, V, W, a, b)
    print 'Iteration %d. Test log likelihood %f' % (i, test_likelihood[i])
plt.plot(range(iterations), train_likelihood, label='Train set')
plt.plot(range(iterations), test_likelihood, label='Test set')
plt.xlabel('Number of iterations through the train set')
plt.ylabel('Average log-likelihood')
plt.legend(bbox_to_anchor=(1, 1),
           bbox_transform=plt.gcf().transFigure)
plt.show()
You receive an additional 10 bonus points if you manage to train a model with very high accuracy: at most 2.5% misclassified digits on the test set. Note that the test set contains 10000 digits, so your model should misclassify at most 250 digits. This should be achievable with an MLP model with one hidden layer. See results of various models at: http://yann.lecun.com/exdb/mnist/index.html
. To reach such a low error rate, you probably need a very large $L$ (many hidden units), probably $L > 200$, and a strong Gaussian prior on the weights. In this case you are allowed to use the validation set for training.
You are allowed to add additional layers, and to use convolutional networks, although that is probably not required to reach 2.5% misclassifications.
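To check whether the 2.5% target is met, here is a minimal sketch of how the test error rate could be computed from the MLP trained above (predict_class is a helper introduced here, not part of the provided code):
def predict_class(x, V, W, a, b):
    # predicted digit = class with the highest log-probability
    return argmax(MLP_evaluate_ln_p(x, V, W, a, b))

n_test = size(t_test, 0)
n_errors = 0
for i in range(n_test):
    if predict_class(x_test[i:i+1, :].T, V, W, a, b) != t_test[i]:
        n_errors += 1
print('Test error rate: %.2f%% (%d / %d misclassified)' % (100.0 * n_errors / n_test, n_errors, n_test))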
In [126]:
plot_digits(V[:, 40:46].T, numcols=3)
plt.show()
i = random.choice(range(size(t_test, 0)))
x, t = x_test[i:i+1, :], t_test[i]
h = MLP_evaluate_h(x.T, V, a)
ln_q = MLP_evaluate_ln_q(h, W, b)
Z = sum(exp(ln_q))
delta_q = - exp(ln_q) / Z
delta_q[t] = 1 + delta_q[t]
res = exp(MLP_evaluate_ln_p(x.T, V, W, a, b))
print t
for i, p in enumerate(res):
    print i, p,
    if p == max(res):
        print ' (*)',
    if i == t:
        print ' (T)',
    print