Assignment 2

The main idea of this assignment is to understand the convolutional neural networks and the basics of image filtering. Matrix convolution and convolutional layer will be implemented from scratch.

  1. Please copy the code from the previous assignment into a separate file Blocks.py. Put there the implementation of the building blocks.

  2. All functions should be implemented in NumPy.

1. Recap

During the previous assignment, you implemented the main building blocks of the neural networks: Dense Layer, nonlinearities, losses, and optimizers.

  • Dense layer is useful enough
  • Dense layer performs the following mapping of the input matrix $X$ (matrix of objects): $$ X \rightarrow XW + b $$
  • It allows one to build and train flexible models
  • Let's look precisely at image processing with Dense Layer
    • We have a grayscale image $x$ of size $N \times M$
    • We reshape it into a vector of length $NM$
    • Then we map it with a dense layer
    • And obtain the transformed vector $y$
    • Each element of $y$ depends on each element of $x$. That's why it is also called Fully-Connected
  • When we work with images, we assume that each pixel is correlated with its neighbours and close pixels. Distant pixels are not corellated. Various experiments demonstrate that this assumption is correct.
  • Dense layer captures these corellations, but it also captures noisy corellations.
  • There is a way to create Locally-Connected layer which will learn only local corellations with less number of parameters.
  • This layer is called Convolutional Layer and it is based on matrix convolution

2. Matrix Convolution

It is easier to understand the convolution when you see the image. Here is the image.

  • We "put" the kernel on the matrix. Each element from the kernel is multiplied by the correcponding element of the source matrix. The results are summed and are written to the new matrix.

  • The output matrix has smaller size than the source one. It is so because of the border effects.

  • In order to obtain the matrix of the same size, zero padding could be used.

  • We have a matrix $X$ of size $N \times M$ and a kernel K of size $(2p+1) \times (2q +1 )$.

  • We also define $X_{ij} = 0$ for $i > N, i < 1$ and $j > M, j < 1$. It is called zero padding

  • Therefore the convolution of matrix with the kernel is defined as follows:

$$ Y = X \star K \\ Y_{ij} = \sum\limits_{\alpha=0}^{2p} \sum\limits_{\beta=0}^{2q} K_{\alpha \beta} X_{i + \alpha - p, j+\beta - q} $$
  • In machine learning this operation is called convolution and in mathematics it is cross-corellation.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from Blocks import *
%matplotlib inline

In [2]:
import automark as am
username = 'sosnovik'
# if yor are not registered
am.register_id(username, ('ivan sosnovik', 'i.sosnovik@uva.nl'))


Username already registered.

Now you should implement matrix convolution


In [3]:
def conv_matrix(matrix, kernel):
    """Perform the convolution of the matrix 
        with the kernel using zero padding
    # Arguments
        matrix: input matrix np.array of size `(N, M)`
        kernel: kernel of the convolution 
            np.array of size `(2p + 1, 2q + 1)`
    # Output
        the result of the convolution
        np.array of size `(N, M)`
    """
    return output

Let's test the function

$$ X = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \\ \end{bmatrix} \quad K = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \\ \end{bmatrix} \quad X \star K = \begin{bmatrix} 7 & 10 & 3 \\ 10 & 14 & 6 \\ 3 & 6 & 8 \\ \end{bmatrix} $$

In [112]:
X = np.array([
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5]
])

K = np.eye(3)
K[-1, -1] = 2

In [113]:
print(conv_matrix(X, K))

In [117]:
am.test_student_function(username, conv_matrix, ['matrix', 'kernel'])

3. Basic Kernels

Matrix convolution could be used to process the image: to blur it, to shift the image, to get the edges etc. Here is the very interesting article. It is interactive. So you are able to get the better understanding of convolutions. Here are the examples of some kernels.

  • Sharpen Kernel $$ \begin{equation*} \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end {bmatrix} \end{equation*} $$
  • Edge detection Filter $$ \begin{equation*} \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end {bmatrix} \end{equation*} $$
  • Box blur of size 3 $$ \frac{1}{9} \begin{equation*} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end {bmatrix} \end{equation*} $$

Let's play with convolutions


In [114]:
rgb_img = plt.imread('./images/dog.png')
plt.imshow(rgb_img)

We will convert it to grayscale. RGB image is a 3 dimensional tensor. But grayscale image could be represented as a matrix


In [115]:
img = rgb_img.mean(axis=2)
plt.imshow(img, cmap='gray')

First of all, let's blur the image with box blur. It is just a convolution of a matrix with the kernel of size $N \times N$ of the following form:

$$ \frac{1}{N^2} \begin{bmatrix} 1 & \dots & 1\\ \vdots & \ddots & \vdots\\ 1 & \dots & 1\\ \end{bmatrix} $$

In [9]:
def box_blur(image, box_size):
    """Perform the blur of the image
    # Arguments
        image: input matrix - np.array of size `(N, M)`
        box_size: the size of the blur kernel - int > 0  
            the kernel is of size `(box_size, box_size)`
    # Output
        the result of the blur
            np.array of size `(N, M)`
    """   
    return output

In [116]:
am.test_student_function(username, box_blur, ['image', 'box_size'])

In [118]:
blur_dog = box_blur(img, box_size=5)
plt.imshow(blur_dog, cmap='gray')

Now we will get the vertical and horizontal gradients. To perform it we just calculate the convolution of the image with the following kernels:

$$ K_h = \begin{bmatrix} -1 & 0 & 1\\ \end{bmatrix} \quad K_v = \begin{bmatrix} 1 \\ 0 \\ -1\\ \end{bmatrix} \\ X_h = X \star K_h \quad X_v = X \star K_v\\ $$

And then we just calculate the amplitude of the gradient:

$$ X_\text{grad} = \sqrt{X_h^2 + X_v^2} $$

In [119]:
dog_h = conv_matrix(blur_dog, np.array([[-1, 0, 1]]))
dog_v = conv_matrix(blur_dog, np.array([[-1, 0, 1]]).T)
dog_grad = np.sqrt(dog_h ** 2 + dog_v ** 2)
plt.imshow(dog_grad, cmap='gray')

Now we have the edges we can work with. This is not the only way to get the edges. There are plenty of them:

If you convolve an image with the kernel, you will obtaine the map of responses. The more correlated the patch of the image with the kernel, the higher the reposponse. Let's demonstarte it.


In [120]:
pattern = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0]
])
# create the image
image = np.pad(pattern, [(12, 12), (10, 14)], mode='constant', constant_values=0)
plt.imshow(image, cmap='gray')
plt.title('original image')
plt.show()

# add some noise
image = 0.5 * image + 0.5 * np.random.random(image.shape)
plt.imshow(image, cmap='gray')
plt.title('noisy image')
plt.show()

# let's find the cross 
response = conv_matrix(image, pattern)
plt.imshow(response, cmap='gray')
plt.title('local response')
plt.show()

The brightest pixel is the place where the cross is. We can find the place where the image is locally close to the kernel. It allows one to find different patterns in the images such as eyes, legs, dogs, cats, etc.

We defined the kernels and then just used them. But we can also learn them such by minimizing sime loss and making the processing as effective as it is possible. To do it, we have to define Convolutional layer

4. Convolutional Layer

Convolutional Layer works with images. Each image is a 3-dimensional object $N_{\text{channels}} \times H \times W$. And therefore, the collection of images is 4-dimensioonal tensor of shape $N_{\text{objects}} \times N_{\text{channels}} \times H \times W$.

For example, 32 RGB images of size $224 \times 224$ are represented as a tensor of shape $32 \times 3 \times 224 \times 224$

Convolutional Layer takes the image as an input. Here the explanation of how it works:

  • The layer has n_in * n_out kernels. It is also a tensor of size (n_in, n_out, kernel_h, kernel_w)
  • It takes a 4-dimensional tensor of size n_objects, n_in, H, W as an input.
    • It is a collection of n_objects images.
    • Each of them has n_in channels
    • The resolution of the images is (H, W)
  • For each of the images the following operation is performed:
    • In order to get the 1st output channel, all the input are convolned with the corresponding kernels
    • Then the results are summed. And this is the output channel
    • Here is the code:
      for i in range(n_out):
        out_channel = 0.0
        for j in range(n_in):
            kernel_2d = K[i, j] # get the kernel from the collection of kernels
            input_channel = input_image[j] # get just one channel of the input iamge
            out_channel += conv_matrix(input_channel, kernel_2d) # convolve
        output_image.append(out_channel) # append the calculated channel to the output
      

Here we implement convolutional layer for you. The implementation of backward is based on the idea that convolution could be represented as matrix multiplication.


In [14]:
class ConvLayer(Layer):
    """
    Convolutional Layer. The implementation is based on 
        the representation of the convolution as matrix multiplication
    """
    
    def __init__(self, n_in, n_out, filter_size):
        super(ConvLayer, self).__init__()
        self.W = np.random.normal(size=(n_out, n_in, filter_size, filter_size))
        self.b = np.zeros(n_out)
        
    def forward(self, x_input):
        n_obj, n_in, h, w = x_input.shape
        n_out = len(self.W)
        
        self.output = []
        
        for image in x_input:
            output_image = []
            for i in range(n_out):
                out_channel = 0.0
                for j in range(n_in):
                    out_channel += conv_matrix(image[j], self.W[i, j])
                output_image.append(out_channel)
            self.output.append(np.stack(output_image, 0))

        self.output = np.stack(self.output, 0)
        return self.output

    
    def backward(self, x_input, grad_output):

        N, C, H, W = x_input.shape 
        F, C, HH, WW = self.W.shape
        
        pad = int((HH - 1) / 2)

        self.grad_b = np.sum(grad_output, (0, 2, 3)) 

        # pad input array
        x_padded = np.pad(x_input, ((0,0), (0,0), (pad, pad), (pad, pad)), 'constant')
        H_padded, W_padded = x_padded.shape[2], x_padded.shape[3]
        # naive implementation of im2col
        x_cols = None
        for i in range(HH, H_padded + 1):
            for j in range(WW, W_padded+1):
                for n in range(N):
                    field = x_padded[n, :, i-HH:i, j-WW:j].reshape((1,-1))    
                    if x_cols is None:
                        x_cols = field
                    else:
                        x_cols = np.vstack((x_cols, field))
                        
        x_cols = x_cols.T

        d_out = grad_output.transpose(1, 2, 3, 0) 
        dout_cols = d_out.reshape(F, -1) 

        dw_cols = np.dot(dout_cols, x_cols.T) 
        self.grad_W = dw_cols.reshape(F, C, HH, WW) 

        w_cols = self.W.reshape(F, -1) 
        dx_cols = np.dot(w_cols.T, dout_cols) 

        dx_padded = np.zeros((N, C, H_padded, W_padded))
        idx = 0
        for i in range(HH, H_padded + 1):
            for j in range(WW, W_padded + 1):
                for n in range(N):
                    dx_padded[n:n+1, :, i-HH:i, j-WW:j] += dx_cols[:, idx].reshape((1, C, HH, WW))
                    idx += 1
            dx = dx_padded[:, :, pad:-pad, pad:-pad]
        grad_input = dx
        return grad_input
    
    def get_params(self):
        return [self.W, self.b]

    def get_params_gradients(self):
        return [self.grad_W, self.grad_b]

In [15]:
# this layer transorms images with 3 channels 
# into images with 8 channels 
# by convolving them with kernels of size 3x3
conv_layer = ConvLayer(3, 8, filter_size=3)

5. Pooling Layer

Pooling layer perfroms the pooling operations. The pooling layer reduces the size of image. There are several types of pooling operations in theory but the most commonly used one is the max pooling.

In the following figure $2 \times 2$ pooling is applied on the image which reduced the size of image to half. It can be observed from the figure that the pooling operation does not effect the depth of the image.

During max pooling operation, the image is splitted into windows of size $2 \times 2$ and then just the maximum element is chosen from each window. You can use


In [16]:
def maxpool_forward(x_input):
    """Perform max pooling operation with 2x2 window
    # Arguments
        x_input: np.array of size (2 * W, 2 * H)
    # Output
        output: np.array of size (W, H)
    """
    #################
    ### YOUR CODE ###
    ################# 
    return output

In [121]:
am.test_student_function(username, maxpool_forward, ['x_input'])

In [122]:
# This function is implemented for you
# The implementation is not complicated, 
# so read it to understand the concept

def maxpool_grad_input(x_input, grad_output):
    """Calculate partial derivative of the loss with respect to the input
    # Arguments
        x_input: np.array of size (2 * W, 2 * H)
        grad_output: partial derivative of the loss 
            with respect to the output 
            np.array of size (W, H)
    # Output
        output: partial derivative of the loss 
            with respect to the input
            np.array of size (2 * W, 2 * H) 
    """
    height, width = x_input.shape
    # create the array of zeros of the required size
    grad_input = np.zeros(x_input.shape)
    
    # let's put 1 if the element with this position 
    # is maximal in the window
    for i in range(0, height, 2):
        for j in range(0, width, 2):
            window = x_input[i:i+2, j:j+2]
            i_max, j_max = np.unravel_index(np.argmax(window), (2, 2))
            grad_input[i + i_max, j + j_max] = 1
            
    # put corresponding gradient instead of 1       
    grad_input = grad_input.ravel()
    grad_input[grad_input == 1] = grad_output.ravel()
    grad_input = grad_input.reshape(x_input.shape)
    return grad_input

In [19]:
class MaxPool2x2(Layer):
    
    def forward(self, x_input):
        n_obj, n_ch, h, w = x_input.shape
        self.output = np.zeros((n_obj, n_ch, h // 2, w // 2))
        for i in range(n_obj):
            for j in range(n_ch):
                self.output[i, j] = maxpool_forward(x_input[i, j])
        return self.output 
    
    def backward(self, x_input, grad_output):
        n_obj, n_ch, _, _ = x_input.shape
        grad_input = np.zeros_like(x_input)
        for i in range(n_obj):
            for j in range(n_ch):
                grad_input[i, j] = maxpool_grad_input(x_input[i, j], grad_output[i, j])
        return grad_input

6. Flatten

Convolutional neural networks allow one to perform better image processing than fully connected neural networks. We will combine convolutional layers, wich deal with 4-dimensional tensors, with dense layers, wich work with matrices. To make it work, we should implement Flatten layer.

Flatten layer takes a 4-dimensional tensor of size (n_obj, n_channels, h, w) as an input and reshaes it to a 2-dimensional tensor (matrix) of size (n_obj, n_channels * h * w). The backward pass of this layer is not complicated. Think about it.

Please implement flatten_forward and flatten_grad_input functions using np.reshape. Here is the description of np.reshape.


In [23]:
def flatten_forward(x_input):
    """Perform the reshaping of the tensor of size `(K, L, M, N)` 
        to the tensor of size `(K, L*M*N)`
    # Arguments
        x_input: np.array of size `(K, L, M, N)`
    # Output
        output: np.array of size `(K, L*M*N)`
    """
    #################
    ### YOUR CODE ###
    #################
    return output

In [123]:
am.test_student_function(username, flatten_forward, ['x_input'])

In [26]:
def flatten_grad_input(x_input, grad_output):
    """Calculate partial derivative of the loss with respect to the input
    # Arguments
        x_input: partial derivative of the loss 
            with respect to the output
            np.array of size `(K, L*M*N)`
    # Output
        output: partial derivative of the loss 
            with respect to the input
            np.array of size `(K, L, M, N)`
    """
    #################
    ### YOUR CODE ###
    #################
    return grad_input

In [124]:
am.test_student_function(username, flatten_grad_input, ['x_input', 'grad_output'])

In [28]:
class FlattenLayer(Layer):
    
    def forward(self, x_input):
        self.output = flatten_forward(x_input)
        return self.output
    
    def backward(self, x_input, grad_output):
        output = flatten_grad_input(x_input, grad_output)
        return output

7. Experiments

Now we will conduct several experiments. We will train our neural networks with minibatches. So we split our dataset intro small portions. And feed these portions one-by-one to our neural network.


In [29]:
import sys

def iterate_minibatches(x, y, batch_size=16, verbose=True):
    assert len(x) == len(y)
    
    indices = np.arange(len(x))
    np.random.shuffle(indices)
    
    for i, start_idx in enumerate(range(0, len(x) - batch_size + 1, batch_size)):
        if verbose:
            print('\rBatch: {}/{}'.format(i + 1, len(x) // batch_size), end='')
            sys.stdout.flush()
        
        excerpt = indices[start_idx:start_idx + batch_size]
        yield x[excerpt], y[excerpt]

Let's import the data. Download it first


In [30]:
from dataset_utils import load_mnist

In [31]:
train = list(load_mnist())
train_images = np.array([im[1] for im in train])
train_targets = np.array([im[0] for im in train])
# we will train 0 vs 1 classifier
x_train = train_images[train_targets < 2][:1000]
y_train = train_targets[train_targets < 2][:1000]

y_train = y_train * 2 - 1
y_train = y_train.reshape((-1, 1))

In [32]:
x_train = x_train.astype('float32') / 255.0
x_train = x_train.reshape((-1, 1, 28, 28))

Now we will train simple convolutional neural network


In [38]:
def get_cnn():
    nn = SequentialNN()

    nn.add(ConvLayer(1, 2, filter_size=3)) # The output is of size N_obj 2 28 28
    nn.add(ReLU()) # The output is of size N_obj 2 28 28
    nn.add(MaxPool2x2()) # The output is of size N_obj 2 14 14

    nn.add(ConvLayer(2, 4, filter_size=3)) # The output is of size N_obj 4 14 14
    nn.add(ReLU()) # The output is of size N_obj 4 14 14
    nn.add(MaxPool2x2()) # The output is of size N_obj 4 7 7

    nn.add(FlattenLayer()) # The output is of size N_obj 196
    nn.add(Dense(4 * 7 * 7, 32))
    nn.add(ReLU())
    nn.add(Dense(32, 1))
    return nn

In [34]:
nn = get_cnn()
loss = Hinge()
optimizer = SGD(nn)

In [125]:
# It will train for about 5 minutes
num_epochs = 5 
batch_size = 32
# we will store the results here
history = {'loss': [], 'accuracy': []}

# we will make `num_epochs` iterations
for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch + 1, num_epochs))
    
    # we perform iteration by small batches one-by-one
    for x_batch, y_batch in iterate_minibatches(x_train, y_train, batch_size):
        # predict the target value
        y_pred = nn.forward(x_batch)
        # compute the gradient of the loss
        loss_grad = loss.backward(y_pred, y_batch)
        # perform backprop
        nn.backward(x_batch, loss_grad)
        # update the params
        optimizer.update_params()
        
        # save loss and accuracy value
        history['loss'].append(loss.forward(y_pred, y_batch))
        prediction_is_correct = (y_pred > 0) == (y_batch > 0)
        history['accuracy'].append(np.mean(prediction_is_correct))
        
    print()

In [127]:
# let's plot the results
plt.figure(figsize=(8, 5))

ax_1 = plt.subplot()
ax_1.plot(history['loss'], c='g', lw=2, label='train loss')
ax_1.set_ylabel('loss', fontsize=16)
ax_1.set_xlabel('#batches', fontsize=16)

ax_2 = plt.twinx(ax_1)
ax_2.plot(history['accuracy'], lw=3, label='train accuracy')
ax_2.set_ylabel('accuracy', fontsize=16)
  • Train the model with different batch_size
  • What would be if batch_size=1?
  • What would be if batch_size=1000?
  • Does the speed of the computations depend on this parameter? Why?

  • Train the model with different num_epochs

  • What would be if num_epochs=1?
  • What would be if num_epochs=1000?

Let's visualize the activations of the intermediate layers


In [97]:
viz_images = x_batch[:2]
_ = nn.forward(viz_images)

activations = {
    'conv_1': nn.layers[0].output,
    'relu_1': nn.layers[1].output,
    'pool_1': nn.layers[2].output,
    'conv_2': nn.layers[3].output,
    'relu_2': nn.layers[4].output,
    'pool_2': nn.layers[5].output,
}

Input Images


In [128]:
# Input
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(4, 8))

ax1.imshow(viz_images[0, 0], cmap=plt.cm.gray_r)
ax1.set_xticks([])
ax1.set_yticks([])

ax2.imshow(viz_images[1, 0], cmap=plt.cm.gray_r)
ax2.set_xticks([])
ax2.set_yticks([])

plt.show()

Activations of Conv 1


In [129]:
# Conv 1
f, axes = plt.subplots(2, 2, figsize=(8, 8))

for i in range(2):
    for j in range(2):
        ax = axes[i, j]
        ax.imshow(activations['conv_1'][i, j], cmap=plt.cm.gray_r)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Channel {}'.format(j + 1))

plt.show()

Activations of ReLU 1


In [130]:
# ReLU 1
f, axes = plt.subplots(2, 2, figsize=(8, 8))

for i in range(2):
    for j in range(2):
        ax = axes[i, j]
        ax.imshow(activations['relu_1'][i, j], cmap=plt.cm.gray_r)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Channel {}'.format(j + 1))

plt.show()

Activations of MaxPooling 1


In [131]:
# Max Pooling 1
f, axes = plt.subplots(2, 2, figsize=(8, 8))

for i in range(2):
    for j in range(2):
        ax = axes[i, j]
        ax.imshow(activations['pool_1'][i, j], cmap=plt.cm.gray_r)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Channel {}'.format(j + 1))

plt.show()

Activations of Conv 2


In [132]:
# Conv 2
f, axes = plt.subplots(2, 4, figsize=(16, 8))

for i in range(2):
    for j in range(4):
        ax = axes[i, j]
        ax.imshow(activations['conv_2'][i, j], cmap=plt.cm.gray_r)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Channel {}'.format(j + 1))

plt.show()

Activations of ReLU 2


In [133]:
# ReLU 2
f, axes = plt.subplots(2, 4, figsize=(16, 8))

for i in range(2):
    for j in range(4):
        ax = axes[i, j]
        ax.imshow(activations['relu_2'][i, j], cmap=plt.cm.gray_r)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Channel {}'.format(j + 1))

plt.show()

Activations of MaxPooling 2


In [134]:
# Max Pooling 2
f, axes = plt.subplots(2, 4, figsize=(16, 8))

for i in range(2):
    for j in range(4):
        ax = axes[i, j]
        ax.imshow(activations['pool_2'][i, j], cmap=plt.cm.gray_r)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title('Channel {}'.format(j + 1))

plt.show()

As we go deeper and deeper, images become less locally-corellated (the dependance between two neighbours decreases) and more semantically loaded. Each pixel now stores more information about the object. Very useful information. Which then will be analyzed with several Dense Layers.

  • Change the architecture of the neural network
  • Vary the number of filters
  • Vary the size of the kernels

In [ ]: