Assignment 1

Blocks

The main idea of this assignment is to allow you to undestand how neural networks (NNs) work. We will cover the main aspects such as the Backpropagation and Optimization Methods. All mathematical operations should be implemented in NumPy only.

The assignmnent consist of 2 notebooks:

Blocks - the place where the main building blocks of the NNs should be implemented
Experiments - a playground. There we will train the models

Note

Some of the concepts below have not (yet) been discussed durin the lecture. These will be discussed further during the next Wednesday lecture

0. Preliminaries
1. Backpropagation
2. Dense layer
3. ReLU nonlinearity
4. Sequential model
5. Loss
6. $L_2$ Regularization & Weight Decay
7. Optimizer
8. Advanced blocks
- 8.1 Dropout
- 8.2 MSE Loss

0. Preliminaries

In this assignment we will use classes and their instances (objects). It will allow us to write less code and make it more readable. However, you don't have to take care about the exact implementation of the classes. We did it for you. But if you are interested in it, here are some useful links:

The official documentation
Video by sentdex: Object Oriented Programming Introduction
Antipatterns in OOP: Stop Writing Classes

The interface of the current blocks is mostly inspired by Torch / PyTorch. You can also take a look at the first implementation of Keras

We use automark to check if the answers are correct. Register first



In [3]:

    
import automark as am

username = 'sosnovik'
am.register_id(username, ('ivan sosnovik', 'i.sosnovik@uva.nl'))
am.get_progress(username)









    



Username already registered.
http://ec2-18-194-63-174.eu-central-1.compute.amazonaws.com:8080/submissions/sosnovik

1. Backpropagation

Each layer is a function of several parameters (weights): $h = f(x, \theta)$
The layers could be chained. Therefore, the neural network $F$ is a composition of functions: $$ F = f_k \circ f_{k-1} \circ \dots \ f_1\\ h_1 = f_1(x, \theta_1)\\ h_2 = f_2(h_1, \theta_2)\\ \dots \\ \dot{t} = f_k(h_{k-1}, \theta_k) $$
The neural network is trained by minimizing the loss function $\mathcal{L}$. During class we have discussed the squared-loss for linear regression, where we used $\mathcal{L}_{\textrm{reg}} = \tfrac{1}{2}\sum_n (t_n - \dot{t}_n)^2$, where $t_n$ is the target-value of training example $n$ and $\dot{t}_n$ the predicted value by the network/regressor.
Currently, the most effective way of training is to use the variation of the Gradient descent called Stochastic gradient descent (and its improvements).
The parameters of the $m$-th layer are updated according to the following scheme: $$ \theta_m \leftarrow \theta_m - \gamma \frac{\partial \mathcal{L}}{\partial \theta_m} $$
Hyperparameter $\gamma$ is called learning rate
As the layers are chained, the computation of $\partial \mathcal{L}/\partial \theta_m$ in advance is a complicated task. However, it is easily computed, when the forward pass is finished using the chain rule.
The above-stated gradient is calculated using the chain rule: $$ \frac{\partial \mathcal{L}}{\partial \theta_m} = \frac{\partial \mathcal{L}}{\partial h_m} \frac{\partial h_m}{\partial \theta_m} = \frac{\partial \mathcal{L}}{\partial h_{m+1}} \frac{\partial h_{m+1}}{\partial h_m} \frac{\partial h_m}{\partial \theta_m} = \dots $$
Therefore, for each layer we have to be able to calculate several expressions:
- $h_m = f_m(h_{m-1}, \theta_m)$ - the forward inference
- $\partial h_{m} / \partial h_{m-1}$ - the partial derivative of the output with respect to the input
- $\partial h_{m} / \partial \theta_m$ - the partial derivative of the output with respect to the parameters
The algorithm of training of a NN using the chain rule is called Backpropagation

2. Dense layer

Dense Layer (Fully-Connected, Multiplicative) is the basic layer of a neural network. It transforms input matrix of size (n_objects, n_in) to the matrix of size (n_objects, n_out) by performing the following operation: $$ H = XW + b $$ or in other words: $$ H_{ij} = \sum\limits_{k=1}^{n_\text{in}} X_{ik}W_{kj} + b_j $$

Example:

You have a model of just 1 layer. The input is a point in a 3D space. And you want to predict its label: $-1$ or $1$. You have $75$ objects in you training subset (or batch).

Therefore, $X$ has shape $75 \times 3$. $Y$ has shape $75 \times 1$. Weight $W$ of the layer has shape $3 \times 1$. And $b$ is a scalar.



In [3]:

    
from __future__ import print_function, absolute_import, division
import numpy as np

Implement the forward path: $$ H = XW + b $$



In [5]:

    
def dense_forward(x_input, W, b):
    """Perform the mapping of the input
    # Arguments
        x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
        W: np.array of size `(n_in, n_out)`
        b: np.array of size `(n_out,)`
    # Output
        the output of a dense layer 
        np.array of size `(n_objects, n_out)`
    """
    #################
    ### YOUR CODE ###
    #################
    return output

Let's check your first function. We set the matrices $X, W, b$: $$ X = \begin{bmatrix} 1 & -1 \\ -1 & 0 \\ \end{bmatrix} \quad W = \begin{bmatrix} 4 & 0\\ 2 & -1\\ \end{bmatrix} \quad b = \begin{bmatrix} 1 & 2 \\ \end{bmatrix} $$

And then compute $$ XW = \begin{bmatrix} 1 & -1 \\ -1 & 0 \\ \end{bmatrix} \begin{bmatrix} 4 & 0 \\ 2 & -1\\ \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ -4 & 0 \\ \end{bmatrix} \\ XW + b = \begin{bmatrix} 3 & 3 \\ -3 & 2 \\ \end{bmatrix} $$



In [6]:

    
X_test = np.array([[1, -1],
                   [-1, 0]])

W_test = np.array([[4, 0],
                   [2, -1]])

b_test = np.array([1, 2])
h_test =  dense_forward(X_test, W_test, b_test)
print(h_test)



In [12]:

    
am.test_student_function(username, dense_forward, ['x_input', 'W', 'b'])









    



Test was successful. Congratulations!

Implement the chain rule: $$ \frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial H} \frac{\partial H}{\partial X} $$



In [8]:

    
def dense_grad_input(x_input, grad_output, W, b):
    """Calculate the partial derivative of 
        the loss with respect to the input of the layer
    # Arguments
        x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
        grad_output: partial derivative of the loss functions with 
            respect to the ouput of the dense layer 
            np.array of size `(n_objects, n_out)`
        W: np.array of size `(n_in, n_out)`
        b: np.array of size `(n_out,)`
    # Output
        the partial derivative of the loss with 
        respect to the input of the layer
        np.array of size `(n_objects, n_in)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_input



In [9]:

    
am.test_student_function(username, dense_grad_input, ['x_input', 'grad_output', 'W', 'b'])

Compute the gradient of the weights: $$ \frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial H} \frac{\partial H}{\partial W} \\ \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial H} \frac{\partial H}{\partial b} \\ $$



In [7]:

    
def dense_grad_W(x_input, grad_output, W, b):
    """Calculate the partial derivative of 
        the loss with respect to W parameter of the layer
    # Arguments
        x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
        grad_output: partial derivative of the loss functions with 
            respect to the ouput of the dense layer 
            np.array of size `(n_objects, n_out)`
        W: np.array of size `(n_in, n_out)`
        b: np.array of size `(n_out,)`
    # Output
        the partial derivative of the loss 
        with respect to W parameter of the layer
        np.array of size `(n_in, n_out)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_W

def dense_grad_b(x_input, grad_output, W, b):
    """Calculate the partial derivative of 
        the loss with respect to b parameter of the layer
    # Arguments
        x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
        grad_output: partial derivative of the loss functions with 
            respect to the ouput of the dense layer 
            np.array of size `(n_objects, n_out)`
        W: np.array of size `(n_in, n_out)`
        b: np.array of size `(n_out,)`
    # Output
        the partial derivative of the loss 
        with respect to b parameter of the layer
        np.array of size `(n_out,)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_b



In [8]:

    
am.test_student_function(username, dense_grad_W, ['x_input', 'grad_output', 'W', 'b'])
am.test_student_function(username, dense_grad_b, ['x_input', 'grad_output', 'W', 'b'])

Dense Layer Class

First of all we define the basic class Layer. And then inherit it.

We implement it for you. But Dense class is based on the above-written functions.



In [9]:

    
class Layer(object):
    
    def __init__(self):
        self.training_phase = True
        self.output = 0.0
        
    def forward(self, x_input):
        self.output = x_input
        return self.output
    
    def backward(self, x_input, grad_output):
        return grad_output
    
    def get_params(self):
        return []
    
    def get_params_gradients(self):
        return []



In [10]:

    
class Dense(Layer):
    
    def __init__(self, n_input, n_output):
        super(Dense, self).__init__()
        #Randomly initializing the weights from normal distribution
        self.W = np.random.normal(size=(n_input, n_output))
        self.grad_W = np.zeros_like(self.W)
        #initializing the bias with zero
        self.b = np.zeros(n_output)
        self.grad_b = np.zeros_like(self.b)
      
    def forward(self, x_input):
        self.output = dense_forward(x_input, self.W, self.b)
        return self.output
    
    def backward(self, x_input, grad_output):
        # get gradients of weights
        self.grad_W = dense_grad_W(x_input, grad_output, self.W, self.b)
        self.grad_b = dense_grad_b(x_input, grad_output, self.W, self.b)
        # propagate the gradient backwards
        return dense_grad_input(x_input, grad_output, self.W, self.b)
    
    def get_params(self):
        return [self.W, self.b]

    def get_params_gradients(self):
        return [self.grad_W, self.grad_b]



In [14]:

    
dense_layer = Dense(2, 1)
x_input = np.random.random((3, 2))
y_output = dense_layer.forward(x_input)
print(x_input)
print(y_output)

3. ReLU nonlinearity

The dense layer is a linear layer. Combinging several linear (dense) layers is always equivalent to one dense layer see the proof below: $$ H_1 = XW_1 + b_1\\ H_2 = H_1W_2 + b_2\\ H_2 = (XW_1 + b_1)W_2 + b_2 = X(W_1W_2) + (b_1W_2 + b_2) = XW^* + b^* $$ Using multiple layers (ie a deep model) using only linear layers is equivalent to a single dense layer. A deep model using only linear layers is therefore ineffective.

In order to overcome this we need to add some non-linearities. Usually they are element-wise (ie per dimension). $$ H_1 = XW_1 + b_1\\ H_2 = f(H_1)\\ H_3 = H_2W_3 + b_3 = f(XW_1 + b_1)W_2 + b_2\neq XW^* + b^* $$

Nowadays, one of the most popular nonlinearity is ReLU: $$ \text{ReLU}(x) = \max(0, x) $$ It is so popular, given that it is very simple and has an easy gradient.

Example

$$ \text{ReLU} \Big( \begin{bmatrix} 1 & -0.5 \\ 0.3 & 0.1 \end{bmatrix} \Big) = \begin{bmatrix} 1 & 0 \\ 0.3 & 0.1 \end{bmatrix} $$

It is a layer without trainable parameters. Just implement two functions to make it work



In [23]:

    
def relu_forward(x_input):
    """relu nonlinearity
    # Arguments
        x_input: np.array of size `(n_objects, n_in)`
    # Output
        the output of relu layer
        np.array of size `(n_objects, n_in)`
    """
    #################
    ### YOUR CODE ###
    #################
    return output

def relu_grad_input(x_input, grad_output):
    """relu nonlinearity gradient. 
        Calculate the partial derivative of the loss 
        with respect to the input of the layer
    # Arguments
        x_input: np.array of size `(n_objects, n_in)`
            grad_output: np.array of size `(n_objects, n_in)`
    # Output
        the partial derivative of the loss 
        with respect to the input of the layer
        np.array of size `(n_objects, n_in)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_input



In [ ]:

    
am.test_student_function(username, relu_forward, ['x_input'])
am.test_student_function(username, relu_grad_input, ['x_input', 'grad_output'])



In [24]:

    
class ReLU(Layer):
        
    def forward(self, x_input):
        self.output = relu_forward(x_input)
        return self.output
    
    def backward(self, x_input, grad_output):
        return relu_grad_input(x_input, grad_output)

4. Sequential model

In order to make the work with layers more comfortable, we create SequentialNN - a class, which stores all its layers and performs the basic manipulations.



In [2]:

    
class SequentialNN(object):

    def __init__(self):
        self.layers = []
        self.training_phase = True
        
    def set_training_phase(is_training=True):
        self.training_phase = is_training
        for layer in self.layers:
            layer.training_phase = is_training
        
    def add(self, layer):
        self.layers.append(layer)
        
    def forward(self, x_input):
        self.output = x_input
        for layer in self.layers:
            self.output = layer.forward(self.output)
            
        return self.output
    
    def backward(self, x_input, grad_output):
        inputs = [x_input] + [l.output for l in self.layers[:-1]]
        for input_, layer_ in zip(inputs[::-1], self.layers[::-1]):
            grad_output = layer_.backward(input_, grad_output)
            
    def get_params(self):
        params = []
        for layer in self.layers:
            params.extend(layer.get_params())
        return params
    
    def get_params_gradients(self):
        grads = []
        for layer in self.layers:
            grads.extend(layer.get_params_gradients())
        return grads

Here is the simple neural netowrk. It takes an input of shape (Any, 10). Pass it through Dense(10, 4), ReLU and Dense(4, 1). The output is a batch of size (Any, 1)

  INPUT
    |
##########
    |
  [ReLU]
    |
   ####
    |
    #



In [54]:

    
nn = SequentialNN()
nn.add(Dense(10, 4))
nn.add(ReLU())
nn.add(Dense(4, 1))

5. Loss

Here we will define the loss functions. Each loss should be able to compute its value and compute its gradient with respect to the input.



In [4]:

    
# This is a basic class. 
# All other losses will inherit it
class Loss(object):
    
    def __init__(self):
        self.output = 0.0
        
    def forward(self, target_pred, target_true):
        return self.output
    
    def backward(self, target_pred, target_true):
        return np.zeros_like(target_pred).reshape((-1, 1))

First of all, we will define Hinge loss function. $$ \mathcal{L}(T, \dot{T}) = \frac{1}{N}\sum\limits_{k=1}^{N}\max(0, 1 - \dot{t}_k \cdot t_k) $$

$N$ - number of objects
$\dot{T}$ and $T$ are the vectors of length $N$.
$\dot{t}_k$ is the predicted class of the $k$-th object. $\dot{t}_k \in {\rm I\!R}$
$t_k$ is the real class of this object. $t_k \in \{-1, 1\}$
This loss function is used to train SVM estimators.

Let's implement the calculation of the loss.



In [30]:

    
def hinge_forward(target_pred, target_true):
    """Compute the value of Hinge loss 
        for a given prediction and the ground truth
    # Arguments
        target_pred: predictions - np.array of size `(n_objects,)`
        target_true: ground truth - np.array of size `(n_objects,)`
    # Output
        the value of Hinge loss 
        for a given prediction and the ground truth
        scalar
    """
    #################
    ### YOUR CODE ###
    #################
    return output



In [ ]:

    
am.test_student_function(username, hinge_forward, ['target_pred', 'target_true'])

Now you should compute the gradient of the loss function with respect to its input. It is a vector with the same shape as the input. $$ \frac{\partial \mathcal{L}}{\partial \dot{T}} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial \dot{t}_1} \\ \frac{\partial \mathcal{L}}{\partial \dot{t}_2} \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial \dot{t}_N} \\ \end{bmatrix} $$



In [7]:

    
def hinge_grad_input(target_pred, target_true):
    """Compute the partial derivative 
        of Hinge loss with respect to its input
    # Arguments
        target_pred: predictions - np.array of size `(n_objects, 1)`
        target_true: ground truth - np.array of size `(n_objects, 1)`
    # Output
        the partial derivative 
        of Hinge loss with respect to its input
        np.array of size `(n_objects, 1)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_input



In [ ]:

    
am.test_student_function(username, hinge_grad_input, ['target_pred', 'target_true'])



In [5]:

    
class Hinge(Loss):
    
    def forward(self, target_pred, target_true):
        self.output = hinge_forward(target_pred, target_true)
        return self.output
    
    def backward(self, target_pred, target_true):
        return hinge_grad_input(target_pred, target_true).reshape((-1, 1))

6. $L_2$ Regularization & Weight Decay

There are several ways of the regularization of a model. They are used to avoid learning models which behave well on the training subset and fail during testing. We will implement $L_2$ regularization also known as weight decay.

The key idea of $L_2$ regularization is to add an extra term to the loss functions: $$ \mathcal{L}^* = \mathcal{L} + \frac{\lambda}{2} \|\theta\|^2_2 $$

For some cases only the weights of a single layer are penalized, but we will penalize all the weights.

$$ \mathcal{L}^* = \mathcal{L} + \frac{\lambda}{2} \sum\limits_{m=1}^k \|\theta_m\|^2_2 $$

Therefore, the updating scheme is also modified

$$ \theta_m \leftarrow \theta_m - \gamma \frac{\partial \mathcal{L}^*}{\partial \theta_m}\\ \frac{\partial \mathcal{L}^*}{\partial \theta_m} = \frac{\partial \mathcal{L}}{\partial \theta_m} + \lambda \theta_m\\ \theta_m \leftarrow \theta_m - \gamma \frac{\partial \mathcal{L}}{\partial \theta_m} - \lambda \theta_m $$

As you can see, the updating scheme also gets an extra term. $\lambda$ is the coefficient of the weight decay.

The update of the weights would be implemented later in Optimizer class. Here you should implement the computation of $L_2$ norm of the weights from the given list. $$ f(\lambda, [\theta_1, \theta_2, \dots, \theta_k]) = \frac{\lambda}{2} \sum\limits_{m=1}^k \|\theta_m\|^2_2 $$



In [23]:

    
def l2_regularizer(weight_decay, weights):
    """Compute the L2 regularization term
    # Arguments
        weight_decay: float
        weights: list of arrays of different shapes
    # Output
        sum of the L2 norms of the input weights
        scalar
    """
    #################
    ### YOUR CODE ###
    #################
    return 0.0



In [ ]:

    
am.test_student_function(username, l2_regularizer, ['weight_decay', 'weights'])

7. Optimizer

We implement the optimizer to perform the updates of the weights according to the certain scheme.



In [24]:

    
class Optimizer(object):
    '''This is a basic class. 
    All other optimizers will inherit it
    '''
    def __init__(self, model, lr=0.01, weight_decay=0.0):
        self.model = model
        self.lr = lr
        self.weight_decay = weight_decay
        
    def update_params(self):
        pass


class SGD(Optimizer):
    '''Stochastic gradient descent optimizer
    https://en.wikipedia.org/wiki/Stochastic_gradient_descent
    '''
        
    def update_params(self):
        weights = self.model.get_params()
        grads = self.model.get_params_gradients()
        for w, dw in zip(weights, grads):
            update = self.lr * dw + self.weight_decay * w
            # it writes the result to the previous variable instead of copying
            np.subtract(w, update, out=w)

8. Advanced blocks

This is an optional section. If you liked the process of understanding NNs by implementing them from scratch, here are several more tasks for you.

8.1 Dropout

Dropout is a method of regularization. It is also could be interpreted as the augmentation method. The key idea is to randomly drop some values of the input tensor. It avoids overfitting of the model. Its behaviour is different during the training and testing.

First of all, you should implement the method of calculating the binary mask. The binary mask has the same shape as the input. The mask could have the value 0 for the certain element with the probability drop_rate and 1 - with 1.0 - drop_rate. None, in $p$ from article is 1.0 - drop_rate



In [25]:

    
def dropout_generate_mask(shape, drop_rate):
    """Generate mask 
    # Arguments
        shape: shape of the input array 
            tuple 
        drop_rate: probability of the element 
            to be multiplied by 0
            scalar
    # Output
        binary mask 
    """
    #################
    ### YOUR CODE ###
    #################
    return mask

Now implement the above-described operation of mapping.



In [26]:

    
def dropout_forward(x_input, mask, drop_rate, training_phase):
    """Perform the mapping of the input
    # Arguments
        x_input: input of the layer 
            np.array of size `(n_objects, n_in)`
        mask: binary mask
            np.array of size `(n_objects, n_in)`
        drop_rate: probability of the element to be multiplied by 0
            scalar
        training_phase: bool eiser `True` - training, or `False` - testing
    # Output
        the output of the dropout layer 
        np.array of size `(n_objects, n_in)`
    """
    #################
    ### YOUR CODE ###
    #################
    return output



In [ ]:

    
am.test_student_function(username, dropout_forward, ['x_input', 'mask', 'drop_rate', 'training_phase'])

And, as usual, implement the calculation of the partial derivative of the loss function with respect to the input of a layer



In [27]:

    
def dropout_grad_input(x_input, grad_output, mask):
    """Calculate the partial derivative of 
        the loss with respect to the input of the layer
    # Arguments
        x_input: input of a dense layer - np.array of size `(n_objects, n_in)`
        grad_output: partial derivative of the loss functions with 
            respect to the ouput of the dropout layer 
            np.array of size `(n_objects, n_in)`
        mask: binary mask
            np.array of size `(n_objects, n_in)`
    # Output
        the partial derivative of the loss with 
        respect to the input of the layer
        np.array of size `(n_objects, n_in)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_input



In [ ]:

    
am.test_student_function(username, dropout_grad_input, ['x_input', 'grad_output', 'mask'])



In [29]:

    
class Dropout(Layer):
    
    def __init__(self, drop_rate):
        super(Dropout, self).__init__()
        self.drop_rate = drop_rate
        self.mask = 1.0
        
    def forward(self, x_input):
        if self.training_phase:
            self.mask = dropout_generate_mask(x_input.shape, self.drop_rate)
        self.output = dropout_forward(x_input, self.mask, 
                                      self.drop_rate, self.training_phase)
        return self.output
    
    def backward(self, x_input, grad_output):
        grad_input = dropout_grad_input(x_input, grad_output, self.mask)
        return grad_input

8.2 MSE Loss

MSE (Mean Squared Error) is a popular loss for the regression tasks.

$$ \mathcal{L}(T, \dot{T}) = \frac{1}{2N}\sum\limits_{k=1}^N(t_k - \dot{t}_k)^2 $$

$N$ - number of objects
$\dot{T}$ and $T$ are the vectors of length $N$.
$\dot{t}_k$ is the predicted target value of the $k$-th object. $\dot{t}_k \in {\rm I\!R}$
$t_k$ is the real target value of the $k$-th object. $t_k \in {\rm I\!R}$
This loss function is used to train regressors.