The main idea of this assignment is to allow you to undestand how neural networks (NNs) work. We will cover the main aspects such as the Backpropagation and Optimization Methods. All mathematical operations should be implemented in NumPy only.
The assignmnent consist of 2 notebooks:
Some of the concepts below have not (yet) been discussed durin the lecture. These will be discussed further during the next Wednesday lecture
In this assignment we will use classes and their instances (objects). It will allow us to write less code and make it more readable. However, you don't have to take care about the exact implementation of the classes. We did it for you. But if you are interested in it, here are some useful links:
The interface of the current blocks is mostly inspired by Torch / PyTorch. You can also take a look at the first implementation of Keras
We use automark to check if the answers are correct. Register first
In [3]:
import automark as am
username = 'sosnovik'
am.register_id(username, ('ivan sosnovik', 'i.sosnovik@uva.nl'))
am.get_progress(username)
Dense Layer (Fully-Connected, Multiplicative) is the basic layer of a neural network. It transforms input matrix of size (n_objects, n_in)
to the matrix of size (n_objects, n_out)
by performing the following operation:
$$
H = XW + b
$$
or in other words:
$$
H_{ij} = \sum\limits_{k=1}^{n_\text{in}} X_{ik}W_{kj} + b_j
$$
Example:
You have a model of just 1 layer. The input is a point in a 3D space. And you want to predict its label: $-1$ or $1$. You have $75$ objects in you training subset (or batch).
Therefore, $X$ has shape $75 \times 3$. $Y$ has shape $75 \times 1$. Weight $W$ of the layer has shape $3 \times 1$. And $b$ is a scalar.
In [3]:
from __future__ import print_function, absolute_import, division
import numpy as np
Implement the forward path: $$ H = XW + b $$
In [5]:
def dense_forward(x_input, W, b):
"""Perform the mapping of the input
# Arguments
x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
W: np.array of size `(n_in, n_out)`
b: np.array of size `(n_out,)`
# Output
the output of a dense layer
np.array of size `(n_objects, n_out)`
"""
#################
### YOUR CODE ###
#################
return output
Let's check your first function. We set the matrices $X, W, b$: $$ X = \begin{bmatrix} 1 & -1 \\ -1 & 0 \\ \end{bmatrix} \quad W = \begin{bmatrix} 4 & 0\\ 2 & -1\\ \end{bmatrix} \quad b = \begin{bmatrix} 1 & 2 \\ \end{bmatrix} $$
And then compute $$ XW = \begin{bmatrix} 1 & -1 \\ -1 & 0 \\ \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 2 & -1\\ \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ -4 & 0 \\ \end{bmatrix} \\ XW + b = \begin{bmatrix} 3 & 3 \\ -3 & 2 \\ \end{bmatrix} $$
In [6]:
X_test = np.array([[1, -1],
[-1, 0]])
W_test = np.array([[4, 0],
[2, -1]])
b_test = np.array([1, 2])
h_test = dense_forward(X_test, W_test, b_test)
print(h_test)
In [12]:
am.test_student_function(username, dense_forward, ['x_input', 'W', 'b'])
Implement the chain rule: $$ \frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial H} \frac{\partial H}{\partial X} $$
In [8]:
def dense_grad_input(x_input, grad_output, W, b):
"""Calculate the partial derivative of
the loss with respect to the input of the layer
# Arguments
x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
grad_output: partial derivative of the loss functions with
respect to the ouput of the dense layer
np.array of size `(n_objects, n_out)`
W: np.array of size `(n_in, n_out)`
b: np.array of size `(n_out,)`
# Output
the partial derivative of the loss with
respect to the input of the layer
np.array of size `(n_objects, n_in)`
"""
#################
### YOUR CODE ###
#################
return grad_input
In [9]:
am.test_student_function(username, dense_grad_input, ['x_input', 'grad_output', 'W', 'b'])
Compute the gradient of the weights: $$ \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial H} \frac{\partial H}{\partial W} \\ \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial H} \frac{\partial H}{\partial b} \\ $$
In [7]:
def dense_grad_W(x_input, grad_output, W, b):
"""Calculate the partial derivative of
the loss with respect to W parameter of the layer
# Arguments
x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
grad_output: partial derivative of the loss functions with
respect to the ouput of the dense layer
np.array of size `(n_objects, n_out)`
W: np.array of size `(n_in, n_out)`
b: np.array of size `(n_out,)`
# Output
the partial derivative of the loss
with respect to W parameter of the layer
np.array of size `(n_in, n_out)`
"""
#################
### YOUR CODE ###
#################
return grad_W
def dense_grad_b(x_input, grad_output, W, b):
"""Calculate the partial derivative of
the loss with respect to b parameter of the layer
# Arguments
x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
grad_output: partial derivative of the loss functions with
respect to the ouput of the dense layer
np.array of size `(n_objects, n_out)`
W: np.array of size `(n_in, n_out)`
b: np.array of size `(n_out,)`
# Output
the partial derivative of the loss
with respect to b parameter of the layer
np.array of size `(n_out,)`
"""
#################
### YOUR CODE ###
#################
return grad_b
In [8]:
am.test_student_function(username, dense_grad_W, ['x_input', 'grad_output', 'W', 'b'])
am.test_student_function(username, dense_grad_b, ['x_input', 'grad_output', 'W', 'b'])
In [9]:
class Layer(object):
def __init__(self):
self.training_phase = True
self.output = 0.0
def forward(self, x_input):
self.output = x_input
return self.output
def backward(self, x_input, grad_output):
return grad_output
def get_params(self):
return []
def get_params_gradients(self):
return []
In [10]:
class Dense(Layer):
def __init__(self, n_input, n_output):
super(Dense, self).__init__()
#Randomly initializing the weights from normal distribution
self.W = np.random.normal(size=(n_input, n_output))
self.grad_W = np.zeros_like(self.W)
#initializing the bias with zero
self.b = np.zeros(n_output)
self.grad_b = np.zeros_like(self.b)
def forward(self, x_input):
self.output = dense_forward(x_input, self.W, self.b)
return self.output
def backward(self, x_input, grad_output):
# get gradients of weights
self.grad_W = dense_grad_W(x_input, grad_output, self.W, self.b)
self.grad_b = dense_grad_b(x_input, grad_output, self.W, self.b)
# propagate the gradient backwards
return dense_grad_input(x_input, grad_output, self.W, self.b)
def get_params(self):
return [self.W, self.b]
def get_params_gradients(self):
return [self.grad_W, self.grad_b]
In [14]:
dense_layer = Dense(2, 1)
x_input = np.random.random((3, 2))
y_output = dense_layer.forward(x_input)
print(x_input)
print(y_output)
The dense layer is a linear layer. Combinging several linear (dense) layers is always equivalent to one dense layer see the proof below: $$ H_1 = XW_1 + b_1\\ H_2 = H_1W_2 + b_2\\ H_2 = (XW_1 + b_1)W_2 + b_2 = X(W_1W_2) + (b_1W_2 + b_2) = XW^* + b^* $$ Using multiple layers (ie a deep model) using only linear layers is equivalent to a single dense layer. A deep model using only linear layers is therefore ineffective.
In order to overcome this we need to add some non-linearities. Usually they are element-wise (ie per dimension). $$ H_1 = XW_1 + b_1\\ H_2 = f(H_1)\\ H_3 = H_2W_3 + b_3 = f(XW_1 + b_1)W_2 + b_2\neq XW^* + b^* $$
Nowadays, one of the most popular nonlinearity is ReLU: $$ \text{ReLU}(x) = \max(0, x) $$ It is so popular, given that it is very simple and has an easy gradient.
Example
$$ \text{ReLU} \Big( \begin{bmatrix} 1 & -0.5 \\ 0.3 & 0.1 \end{bmatrix} \Big) = \begin{bmatrix} 1 & 0 \\ 0.3 & 0.1 \end{bmatrix} $$It is a layer without trainable parameters. Just implement two functions to make it work
In [23]:
def relu_forward(x_input):
"""relu nonlinearity
# Arguments
x_input: np.array of size `(n_objects, n_in)`
# Output
the output of relu layer
np.array of size `(n_objects, n_in)`
"""
#################
### YOUR CODE ###
#################
return output
def relu_grad_input(x_input, grad_output):
"""relu nonlinearity gradient.
Calculate the partial derivative of the loss
with respect to the input of the layer
# Arguments
x_input: np.array of size `(n_objects, n_in)`
grad_output: np.array of size `(n_objects, n_in)`
# Output
the partial derivative of the loss
with respect to the input of the layer
np.array of size `(n_objects, n_in)`
"""
#################
### YOUR CODE ###
#################
return grad_input
In [ ]:
am.test_student_function(username, relu_forward, ['x_input'])
am.test_student_function(username, relu_grad_input, ['x_input', 'grad_output'])
In [24]:
class ReLU(Layer):
def forward(self, x_input):
self.output = relu_forward(x_input)
return self.output
def backward(self, x_input, grad_output):
return relu_grad_input(x_input, grad_output)
In [2]:
class SequentialNN(object):
def __init__(self):
self.layers = []
self.training_phase = True
def set_training_phase(is_training=True):
self.training_phase = is_training
for layer in self.layers:
layer.training_phase = is_training
def add(self, layer):
self.layers.append(layer)
def forward(self, x_input):
self.output = x_input
for layer in self.layers:
self.output = layer.forward(self.output)
return self.output
def backward(self, x_input, grad_output):
inputs = [x_input] + [l.output for l in self.layers[:-1]]
for input_, layer_ in zip(inputs[::-1], self.layers[::-1]):
grad_output = layer_.backward(input_, grad_output)
def get_params(self):
params = []
for layer in self.layers:
params.extend(layer.get_params())
return params
def get_params_gradients(self):
grads = []
for layer in self.layers:
grads.extend(layer.get_params_gradients())
return grads
Here is the simple neural netowrk. It takes an input of shape (Any, 10)
. Pass it through Dense(10, 4)
, ReLU
and Dense(4, 1)
. The output is a batch of size (Any, 1)
INPUT
|
##########
|
[ReLU]
|
####
|
#
In [54]:
nn = SequentialNN()
nn.add(Dense(10, 4))
nn.add(ReLU())
nn.add(Dense(4, 1))
Here we will define the loss functions. Each loss should be able to compute its value and compute its gradient with respect to the input.
In [4]:
# This is a basic class.
# All other losses will inherit it
class Loss(object):
def __init__(self):
self.output = 0.0
def forward(self, target_pred, target_true):
return self.output
def backward(self, target_pred, target_true):
return np.zeros_like(target_pred).reshape((-1, 1))
First of all, we will define Hinge loss function. $$ \mathcal{L}(T, \dot{T}) = \frac{1}{N}\sum\limits_{k=1}^{N}\max(0, 1 - \dot{t}_k \cdot t_k) $$
Let's implement the calculation of the loss.
In [30]:
def hinge_forward(target_pred, target_true):
"""Compute the value of Hinge loss
for a given prediction and the ground truth
# Arguments
target_pred: predictions - np.array of size `(n_objects,)`
target_true: ground truth - np.array of size `(n_objects,)`
# Output
the value of Hinge loss
for a given prediction and the ground truth
scalar
"""
#################
### YOUR CODE ###
#################
return output
In [ ]:
am.test_student_function(username, hinge_forward, ['target_pred', 'target_true'])
Now you should compute the gradient of the loss function with respect to its input. It is a vector with the same shape as the input. $$ \frac{\partial \mathcal{L}}{\partial \dot{T}} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial \dot{t}_1} \\ \frac{\partial \mathcal{L}}{\partial \dot{t}_2} \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial \dot{t}_N} \\ \end{bmatrix} $$
In [7]:
def hinge_grad_input(target_pred, target_true):
"""Compute the partial derivative
of Hinge loss with respect to its input
# Arguments
target_pred: predictions - np.array of size `(n_objects, 1)`
target_true: ground truth - np.array of size `(n_objects, 1)`
# Output
the partial derivative
of Hinge loss with respect to its input
np.array of size `(n_objects, 1)`
"""
#################
### YOUR CODE ###
#################
return grad_input
In [ ]:
am.test_student_function(username, hinge_grad_input, ['target_pred', 'target_true'])
In [5]:
class Hinge(Loss):
def forward(self, target_pred, target_true):
self.output = hinge_forward(target_pred, target_true)
return self.output
def backward(self, target_pred, target_true):
return hinge_grad_input(target_pred, target_true).reshape((-1, 1))
There are several ways of the regularization of a model. They are used to avoid learning models which behave well on the training subset and fail during testing. We will implement $L_2$ regularization also known as weight decay.
The key idea of $L_2$ regularization is to add an extra term to the loss functions: $$ \mathcal{L}^* = \mathcal{L} + \frac{\lambda}{2} \|\theta\|^2_2 $$
For some cases only the weights of a single layer are penalized, but we will penalize all the weights.
$$ \mathcal{L}^* = \mathcal{L} + \frac{\lambda}{2} \sum\limits_{m=1}^k \|\theta_m\|^2_2 $$Therefore, the updating scheme is also modified
$$ \theta_m \leftarrow \theta_m - \gamma \frac{\partial \mathcal{L}^*}{\partial \theta_m}\\ \frac{\partial \mathcal{L}^*}{\partial \theta_m} = \frac{\partial \mathcal{L}}{\partial \theta_m} + \lambda \theta_m\\ \theta_m \leftarrow \theta_m - \gamma \frac{\partial \mathcal{L}}{\partial \theta_m} - \lambda \theta_m $$As you can see, the updating scheme also gets an extra term. $\lambda$ is the coefficient of the weight decay.
The update of the weights would be implemented later in Optimizer
class. Here you should implement the computation of $L_2$ norm of the weights from the given list.
$$
f(\lambda, [\theta_1, \theta_2, \dots, \theta_k]) = \frac{\lambda}{2} \sum\limits_{m=1}^k \|\theta_m\|^2_2
$$
In [23]:
def l2_regularizer(weight_decay, weights):
"""Compute the L2 regularization term
# Arguments
weight_decay: float
weights: list of arrays of different shapes
# Output
sum of the L2 norms of the input weights
scalar
"""
#################
### YOUR CODE ###
#################
return 0.0
In [ ]:
am.test_student_function(username, l2_regularizer, ['weight_decay', 'weights'])
In [24]:
class Optimizer(object):
'''This is a basic class.
All other optimizers will inherit it
'''
def __init__(self, model, lr=0.01, weight_decay=0.0):
self.model = model
self.lr = lr
self.weight_decay = weight_decay
def update_params(self):
pass
class SGD(Optimizer):
'''Stochastic gradient descent optimizer
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
'''
def update_params(self):
weights = self.model.get_params()
grads = self.model.get_params_gradients()
for w, dw in zip(weights, grads):
update = self.lr * dw + self.weight_decay * w
# it writes the result to the previous variable instead of copying
np.subtract(w, update, out=w)
This is an optional section. If you liked the process of understanding NNs by implementing them from scratch, here are several more tasks for you.
Dropout is a method of regularization. It is also could be interpreted as the augmentation method. The key idea is to randomly drop some values of the input tensor. It avoids overfitting of the model. Its behaviour is different during the training and testing.
First of all, you should implement the method of calculating the binary mask. The binary mask has the same shape as the input. The mask could have the value 0 for the certain element with the probability drop_rate
and 1 - with 1.0 - drop_rate
. None, in $p$ from article is 1.0 - drop_rate
In [25]:
def dropout_generate_mask(shape, drop_rate):
"""Generate mask
# Arguments
shape: shape of the input array
tuple
drop_rate: probability of the element
to be multiplied by 0
scalar
# Output
binary mask
"""
#################
### YOUR CODE ###
#################
return mask
Now implement the above-described operation of mapping.
In [26]:
def dropout_forward(x_input, mask, drop_rate, training_phase):
"""Perform the mapping of the input
# Arguments
x_input: input of the layer
np.array of size `(n_objects, n_in)`
mask: binary mask
np.array of size `(n_objects, n_in)`
drop_rate: probability of the element to be multiplied by 0
scalar
training_phase: bool eiser `True` - training, or `False` - testing
# Output
the output of the dropout layer
np.array of size `(n_objects, n_in)`
"""
#################
### YOUR CODE ###
#################
return output
In [ ]:
am.test_student_function(username, dropout_forward, ['x_input', 'mask', 'drop_rate', 'training_phase'])
And, as usual, implement the calculation of the partial derivative of the loss function with respect to the input of a layer
In [27]:
def dropout_grad_input(x_input, grad_output, mask):
"""Calculate the partial derivative of
the loss with respect to the input of the layer
# Arguments
x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
grad_output: partial derivative of the loss functions with
respect to the ouput of the dropout layer
np.array of size `(n_objects, n_in)`
mask: binary mask
np.array of size `(n_objects, n_in)`
# Output
the partial derivative of the loss with
respect to the input of the layer
np.array of size `(n_objects, n_in)`
"""
#################
### YOUR CODE ###
#################
return grad_input
In [ ]:
am.test_student_function(username, dropout_grad_input, ['x_input', 'grad_output', 'mask'])
In [29]:
class Dropout(Layer):
def __init__(self, drop_rate):
super(Dropout, self).__init__()
self.drop_rate = drop_rate
self.mask = 1.0
def forward(self, x_input):
if self.training_phase:
self.mask = dropout_generate_mask(x_input.shape, self.drop_rate)
self.output = dropout_forward(x_input, self.mask,
self.drop_rate, self.training_phase)
return self.output
def backward(self, x_input, grad_output):
grad_input = dropout_grad_input(x_input, grad_output, self.mask)
return grad_input
MSE (Mean Squared Error) is a popular loss for the regression tasks.
$$ \mathcal{L}(T, \dot{T}) = \frac{1}{2N}\sum\limits_{k=1}^N(t_k - \dot{t}_k)^2 $$Let's implement the calculation of the loss.
In [8]:
def mse_forward(target_pred, target_true):
"""Compute the value of MSE loss
for a given prediction and the ground truth
# Arguments
target_pred: predictions - np.array of size `(n_objects, 1)`
target_true: ground truth - np.array of size `(n_objects, 1)`
# Output
the value of MSE loss
for a given prediction and the ground truth
scalar
"""
#################
### YOUR CODE ###
#################
return output
In [ ]:
am.test_student_function(username, mse_forward, ['target_pred', 'target_true'])
Now you should compute the gradient of the loss function with respect to its input.
In [6]:
def mse_grad_input(target_pred, target_true):
"""Compute the partial derivative
of MSE loss with respect to its input
# Arguments
target_pred: predictions - np.array of size `(n_objects, 1)`
target_true: ground truth - np.array of size `(n_objects, 1)`
# Output
the partial derivative
of MSE loss with respect to its input
np.array of size `(n_objects, 1)`
"""
#################
### YOUR CODE ###
#################
return grad_input
In [ ]:
am.test_student_function(username, mse_grad_input, ['target_pred', 'target_true'])
In [34]:
class MSE(Loss):
def forward(self, target_pred, target_true):
self.output = mse_forward(target_pred, target_true)
return self.output
def backward(self, target_pred, target_true):
return mse_grad_input(target_pred, target_true).reshape((-1, 1))
Let's check the progress one more time
In [ ]:
am.get_progress(username)