The main idea of this assignment is to understand the convolutional neural networks and the basics of image filtering. Matrix convolution and convolutional layer will be implemented from scratch.
Please copy the code from the previous assignment into a separate file Blocks.py
. Put there the implementation of the building blocks.
All functions should be implemented in NumPy.
During the previous assignment, you implemented the main building blocks of the neural networks: Dense Layer, nonlinearities, losses, and optimizers.
It is easier to understand the convolution when you see the image. Here is the image.
We "put" the kernel on the matrix. Each element from the kernel is multiplied by the correcponding element of the source matrix. The results are summed and are written to the new matrix.
The output matrix has smaller size than the source one. It is so because of the border effects.
In order to obtain the matrix of the same size, zero padding could be used.
We have a matrix $X$ of size $N \times M$ and a kernel K of size $(2p+1) \times (2q +1 )$.
We also define $X_{ij} = 0$ for $i > N, i < 1$ and $j > M, j < 1$. It is called zero padding
Therefore the convolution of matrix with the kernel is defined as follows:
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from Blocks import *
%matplotlib inline
In [2]:
import automark as am
username = 'sosnovik'
# if yor are not registered
am.register_id(username, ('ivan sosnovik', 'i.sosnovik@uva.nl'))
Now you should implement matrix convolution
In [3]:
def conv_matrix(matrix, kernel):
"""Perform the convolution of the matrix
with the kernel using zero padding
# Arguments
matrix: input matrix np.array of size `(N, M)`
kernel: kernel of the convolution
np.array of size `(2p + 1, 2q + 1)`
# Output
the result of the convolution
np.array of size `(N, M)`
"""
return output
Let's test the function
$$ X = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \\ \end{bmatrix} \quad K = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \\ \end{bmatrix} \quad X \star K = \begin{bmatrix} 7 & 10 & 3 \\ 10 & 14 & 6 \\ 3 & 6 & 8 \\ \end{bmatrix} $$
In [112]:
X = np.array([
[1, 2, 3],
[2, 3, 4],
[3, 4, 5]
])
K = np.eye(3)
K[-1, -1] = 2
In [113]:
print(conv_matrix(X, K))
In [117]:
am.test_student_function(username, conv_matrix, ['matrix', 'kernel'])
Matrix convolution could be used to process the image: to blur it, to shift the image, to get the edges etc. Here is the very interesting article. It is interactive. So you are able to get the better understanding of convolutions. Here are the examples of some kernels.
Let's play with convolutions
In [114]:
rgb_img = plt.imread('./images/dog.png')
plt.imshow(rgb_img)
We will convert it to grayscale. RGB image is a 3 dimensional tensor. But grayscale image could be represented as a matrix
In [115]:
img = rgb_img.mean(axis=2)
plt.imshow(img, cmap='gray')
First of all, let's blur the image with box blur. It is just a convolution of a matrix with the kernel of size $N \times N$ of the following form:
$$ \frac{1}{N^2} \begin{bmatrix} 1 & \dots & 1\\ \vdots & \ddots & \vdots\\ 1 & \dots & 1\\ \end{bmatrix} $$
In [9]:
def box_blur(image, box_size):
"""Perform the blur of the image
# Arguments
image: input matrix - np.array of size `(N, M)`
box_size: the size of the blur kernel - int > 0
the kernel is of size `(box_size, box_size)`
# Output
the result of the blur
np.array of size `(N, M)`
"""
return output
In [116]:
am.test_student_function(username, box_blur, ['image', 'box_size'])
In [118]:
blur_dog = box_blur(img, box_size=5)
plt.imshow(blur_dog, cmap='gray')
Now we will get the vertical and horizontal gradients. To perform it we just calculate the convolution of the image with the following kernels:
$$ K_h = \begin{bmatrix} -1 & 0 & 1\\ \end{bmatrix} \quad K_v = \begin{bmatrix} 1 \\ 0 \\ -1\\ \end{bmatrix} \\ X_h = X \star K_h \quad X_v = X \star K_v\\ $$And then we just calculate the amplitude of the gradient:
$$ X_\text{grad} = \sqrt{X_h^2 + X_v^2} $$
In [119]:
dog_h = conv_matrix(blur_dog, np.array([[-1, 0, 1]]))
dog_v = conv_matrix(blur_dog, np.array([[-1, 0, 1]]).T)
dog_grad = np.sqrt(dog_h ** 2 + dog_v ** 2)
plt.imshow(dog_grad, cmap='gray')
Now we have the edges we can work with. This is not the only way to get the edges. There are plenty of them:
If you convolve an image with the kernel, you will obtaine the map of responses. The more correlated the patch of the image with the kernel, the higher the reposponse. Let's demonstarte it.
In [120]:
pattern = np.array([
[0, 1, 0],
[1, 1, 1],
[0, 1, 0]
])
# create the image
image = np.pad(pattern, [(12, 12), (10, 14)], mode='constant', constant_values=0)
plt.imshow(image, cmap='gray')
plt.title('original image')
plt.show()
# add some noise
image = 0.5 * image + 0.5 * np.random.random(image.shape)
plt.imshow(image, cmap='gray')
plt.title('noisy image')
plt.show()
# let's find the cross
response = conv_matrix(image, pattern)
plt.imshow(response, cmap='gray')
plt.title('local response')
plt.show()
The brightest pixel is the place where the cross is. We can find the place where the image is locally close to the kernel. It allows one to find different patterns in the images such as eyes, legs, dogs, cats, etc.
We defined the kernels and then just used them. But we can also learn them such by minimizing sime loss and making the processing as effective as it is possible. To do it, we have to define Convolutional layer
Convolutional Layer works with images. Each image is a 3-dimensional object $N_{\text{channels}} \times H \times W$. And therefore, the collection of images is 4-dimensioonal tensor of shape $N_{\text{objects}} \times N_{\text{channels}} \times H \times W$.
For example, 32 RGB images of size $224 \times 224$ are represented as a tensor of shape $32 \times 3 \times 224 \times 224$
Convolutional Layer takes the image as an input. Here the explanation of how it works:
n_in * n_out
kernels. It is also a tensor of size (n_in, n_out, kernel_h, kernel_w)
n_objects, n_in, H, W
as an input. n_objects
images. n_in
channels(H, W)
for i in range(n_out):
out_channel = 0.0
for j in range(n_in):
kernel_2d = K[i, j] # get the kernel from the collection of kernels
input_channel = input_image[j] # get just one channel of the input iamge
out_channel += conv_matrix(input_channel, kernel_2d) # convolve
output_image.append(out_channel) # append the calculated channel to the output
Here we implement convolutional layer for you. The implementation of backward is based on the idea that convolution could be represented as matrix multiplication.
In [14]:
class ConvLayer(Layer):
"""
Convolutional Layer. The implementation is based on
the representation of the convolution as matrix multiplication
"""
def __init__(self, n_in, n_out, filter_size):
super(ConvLayer, self).__init__()
self.W = np.random.normal(size=(n_out, n_in, filter_size, filter_size))
self.b = np.zeros(n_out)
def forward(self, x_input):
n_obj, n_in, h, w = x_input.shape
n_out = len(self.W)
self.output = []
for image in x_input:
output_image = []
for i in range(n_out):
out_channel = 0.0
for j in range(n_in):
out_channel += conv_matrix(image[j], self.W[i, j])
output_image.append(out_channel)
self.output.append(np.stack(output_image, 0))
self.output = np.stack(self.output, 0)
return self.output
def backward(self, x_input, grad_output):
N, C, H, W = x_input.shape
F, C, HH, WW = self.W.shape
pad = int((HH - 1) / 2)
self.grad_b = np.sum(grad_output, (0, 2, 3))
# pad input array
x_padded = np.pad(x_input, ((0,0), (0,0), (pad, pad), (pad, pad)), 'constant')
H_padded, W_padded = x_padded.shape[2], x_padded.shape[3]
# naive implementation of im2col
x_cols = None
for i in range(HH, H_padded + 1):
for j in range(WW, W_padded+1):
for n in range(N):
field = x_padded[n, :, i-HH:i, j-WW:j].reshape((1,-1))
if x_cols is None:
x_cols = field
else:
x_cols = np.vstack((x_cols, field))
x_cols = x_cols.T
d_out = grad_output.transpose(1, 2, 3, 0)
dout_cols = d_out.reshape(F, -1)
dw_cols = np.dot(dout_cols, x_cols.T)
self.grad_W = dw_cols.reshape(F, C, HH, WW)
w_cols = self.W.reshape(F, -1)
dx_cols = np.dot(w_cols.T, dout_cols)
dx_padded = np.zeros((N, C, H_padded, W_padded))
idx = 0
for i in range(HH, H_padded + 1):
for j in range(WW, W_padded + 1):
for n in range(N):
dx_padded[n:n+1, :, i-HH:i, j-WW:j] += dx_cols[:, idx].reshape((1, C, HH, WW))
idx += 1
dx = dx_padded[:, :, pad:-pad, pad:-pad]
grad_input = dx
return grad_input
def get_params(self):
return [self.W, self.b]
def get_params_gradients(self):
return [self.grad_W, self.grad_b]
In [15]:
# this layer transorms images with 3 channels
# into images with 8 channels
# by convolving them with kernels of size 3x3
conv_layer = ConvLayer(3, 8, filter_size=3)
Pooling layer perfroms the pooling operations. The pooling layer reduces the size of image. There are several types of pooling operations in theory but the most commonly used one is the max pooling.
In the following figure $2 \times 2$ pooling is applied on the image which reduced the size of image to half. It can be observed from the figure that the pooling operation does not effect the depth of the image.
During max pooling operation, the image is splitted into windows of size $2 \times 2$ and then just the maximum element is chosen from each window. You can use
In [16]:
def maxpool_forward(x_input):
"""Perform max pooling operation with 2x2 window
# Arguments
x_input: np.array of size (2 * W, 2 * H)
# Output
output: np.array of size (W, H)
"""
#################
### YOUR CODE ###
#################
return output
In [121]:
am.test_student_function(username, maxpool_forward, ['x_input'])
In [122]:
# This function is implemented for you
# The implementation is not complicated,
# so read it to understand the concept
def maxpool_grad_input(x_input, grad_output):
"""Calculate partial derivative of the loss with respect to the input
# Arguments
x_input: np.array of size (2 * W, 2 * H)
grad_output: partial derivative of the loss
with respect to the output
np.array of size (W, H)
# Output
output: partial derivative of the loss
with respect to the input
np.array of size (2 * W, 2 * H)
"""
height, width = x_input.shape
# create the array of zeros of the required size
grad_input = np.zeros(x_input.shape)
# let's put 1 if the element with this position
# is maximal in the window
for i in range(0, height, 2):
for j in range(0, width, 2):
window = x_input[i:i+2, j:j+2]
i_max, j_max = np.unravel_index(np.argmax(window), (2, 2))
grad_input[i + i_max, j + j_max] = 1
# put corresponding gradient instead of 1
grad_input = grad_input.ravel()
grad_input[grad_input == 1] = grad_output.ravel()
grad_input = grad_input.reshape(x_input.shape)
return grad_input
In [19]:
class MaxPool2x2(Layer):
def forward(self, x_input):
n_obj, n_ch, h, w = x_input.shape
self.output = np.zeros((n_obj, n_ch, h // 2, w // 2))
for i in range(n_obj):
for j in range(n_ch):
self.output[i, j] = maxpool_forward(x_input[i, j])
return self.output
def backward(self, x_input, grad_output):
n_obj, n_ch, _, _ = x_input.shape
grad_input = np.zeros_like(x_input)
for i in range(n_obj):
for j in range(n_ch):
grad_input[i, j] = maxpool_grad_input(x_input[i, j], grad_output[i, j])
return grad_input
Convolutional neural networks allow one to perform better image processing than fully connected neural networks. We will combine convolutional layers, wich deal with 4-dimensional tensors, with dense layers, wich work with matrices. To make it work, we should implement Flatten layer.
Flatten layer takes a 4-dimensional tensor of size (n_obj, n_channels, h, w)
as an input and reshaes it to a 2-dimensional tensor (matrix) of size (n_obj, n_channels * h * w)
.
The backward pass of this layer is not complicated. Think about it.
Please implement flatten_forward
and flatten_grad_input
functions using np.reshape
. Here is the description of np.reshape
.
In [23]:
def flatten_forward(x_input):
"""Perform the reshaping of the tensor of size `(K, L, M, N)`
to the tensor of size `(K, L*M*N)`
# Arguments
x_input: np.array of size `(K, L, M, N)`
# Output
output: np.array of size `(K, L*M*N)`
"""
#################
### YOUR CODE ###
#################
return output
In [123]:
am.test_student_function(username, flatten_forward, ['x_input'])
In [26]:
def flatten_grad_input(x_input, grad_output):
"""Calculate partial derivative of the loss with respect to the input
# Arguments
x_input: partial derivative of the loss
with respect to the output
np.array of size `(K, L*M*N)`
# Output
output: partial derivative of the loss
with respect to the input
np.array of size `(K, L, M, N)`
"""
#################
### YOUR CODE ###
#################
return grad_input
In [124]:
am.test_student_function(username, flatten_grad_input, ['x_input', 'grad_output'])
In [28]:
class FlattenLayer(Layer):
def forward(self, x_input):
self.output = flatten_forward(x_input)
return self.output
def backward(self, x_input, grad_output):
output = flatten_grad_input(x_input, grad_output)
return output
In [29]:
import sys
def iterate_minibatches(x, y, batch_size=16, verbose=True):
assert len(x) == len(y)
indices = np.arange(len(x))
np.random.shuffle(indices)
for i, start_idx in enumerate(range(0, len(x) - batch_size + 1, batch_size)):
if verbose:
print('\rBatch: {}/{}'.format(i + 1, len(x) // batch_size), end='')
sys.stdout.flush()
excerpt = indices[start_idx:start_idx + batch_size]
yield x[excerpt], y[excerpt]
Let's import the data. Download it first
In [30]:
from dataset_utils import load_mnist
In [31]:
train = list(load_mnist())
train_images = np.array([im[1] for im in train])
train_targets = np.array([im[0] for im in train])
# we will train 0 vs 1 classifier
x_train = train_images[train_targets < 2][:1000]
y_train = train_targets[train_targets < 2][:1000]
y_train = y_train * 2 - 1
y_train = y_train.reshape((-1, 1))
In [32]:
x_train = x_train.astype('float32') / 255.0
x_train = x_train.reshape((-1, 1, 28, 28))
Now we will train simple convolutional neural network
In [38]:
def get_cnn():
nn = SequentialNN()
nn.add(ConvLayer(1, 2, filter_size=3)) # The output is of size N_obj 2 28 28
nn.add(ReLU()) # The output is of size N_obj 2 28 28
nn.add(MaxPool2x2()) # The output is of size N_obj 2 14 14
nn.add(ConvLayer(2, 4, filter_size=3)) # The output is of size N_obj 4 14 14
nn.add(ReLU()) # The output is of size N_obj 4 14 14
nn.add(MaxPool2x2()) # The output is of size N_obj 4 7 7
nn.add(FlattenLayer()) # The output is of size N_obj 196
nn.add(Dense(4 * 7 * 7, 32))
nn.add(ReLU())
nn.add(Dense(32, 1))
return nn
In [34]:
nn = get_cnn()
loss = Hinge()
optimizer = SGD(nn)
In [125]:
# It will train for about 5 minutes
num_epochs = 5
batch_size = 32
# we will store the results here
history = {'loss': [], 'accuracy': []}
# we will make `num_epochs` iterations
for epoch in range(num_epochs):
print('Epoch {}/{}'.format(epoch + 1, num_epochs))
# we perform iteration by small batches one-by-one
for x_batch, y_batch in iterate_minibatches(x_train, y_train, batch_size):
# predict the target value
y_pred = nn.forward(x_batch)
# compute the gradient of the loss
loss_grad = loss.backward(y_pred, y_batch)
# perform backprop
nn.backward(x_batch, loss_grad)
# update the params
optimizer.update_params()
# save loss and accuracy value
history['loss'].append(loss.forward(y_pred, y_batch))
prediction_is_correct = (y_pred > 0) == (y_batch > 0)
history['accuracy'].append(np.mean(prediction_is_correct))
print()
In [127]:
# let's plot the results
plt.figure(figsize=(8, 5))
ax_1 = plt.subplot()
ax_1.plot(history['loss'], c='g', lw=2, label='train loss')
ax_1.set_ylabel('loss', fontsize=16)
ax_1.set_xlabel('#batches', fontsize=16)
ax_2 = plt.twinx(ax_1)
ax_2.plot(history['accuracy'], lw=3, label='train accuracy')
ax_2.set_ylabel('accuracy', fontsize=16)
batch_size
batch_size=1
?batch_size=1000
?Does the speed of the computations depend on this parameter? Why?
Train the model with different num_epochs
num_epochs=1
?num_epochs=1000
?Let's visualize the activations of the intermediate layers
In [97]:
viz_images = x_batch[:2]
_ = nn.forward(viz_images)
activations = {
'conv_1': nn.layers[0].output,
'relu_1': nn.layers[1].output,
'pool_1': nn.layers[2].output,
'conv_2': nn.layers[3].output,
'relu_2': nn.layers[4].output,
'pool_2': nn.layers[5].output,
}
In [128]:
# Input
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(4, 8))
ax1.imshow(viz_images[0, 0], cmap=plt.cm.gray_r)
ax1.set_xticks([])
ax1.set_yticks([])
ax2.imshow(viz_images[1, 0], cmap=plt.cm.gray_r)
ax2.set_xticks([])
ax2.set_yticks([])
plt.show()
In [129]:
# Conv 1
f, axes = plt.subplots(2, 2, figsize=(8, 8))
for i in range(2):
for j in range(2):
ax = axes[i, j]
ax.imshow(activations['conv_1'][i, j], cmap=plt.cm.gray_r)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Channel {}'.format(j + 1))
plt.show()
In [130]:
# ReLU 1
f, axes = plt.subplots(2, 2, figsize=(8, 8))
for i in range(2):
for j in range(2):
ax = axes[i, j]
ax.imshow(activations['relu_1'][i, j], cmap=plt.cm.gray_r)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Channel {}'.format(j + 1))
plt.show()
In [131]:
# Max Pooling 1
f, axes = plt.subplots(2, 2, figsize=(8, 8))
for i in range(2):
for j in range(2):
ax = axes[i, j]
ax.imshow(activations['pool_1'][i, j], cmap=plt.cm.gray_r)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Channel {}'.format(j + 1))
plt.show()
In [132]:
# Conv 2
f, axes = plt.subplots(2, 4, figsize=(16, 8))
for i in range(2):
for j in range(4):
ax = axes[i, j]
ax.imshow(activations['conv_2'][i, j], cmap=plt.cm.gray_r)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Channel {}'.format(j + 1))
plt.show()
In [133]:
# ReLU 2
f, axes = plt.subplots(2, 4, figsize=(16, 8))
for i in range(2):
for j in range(4):
ax = axes[i, j]
ax.imshow(activations['relu_2'][i, j], cmap=plt.cm.gray_r)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Channel {}'.format(j + 1))
plt.show()
In [134]:
# Max Pooling 2
f, axes = plt.subplots(2, 4, figsize=(16, 8))
for i in range(2):
for j in range(4):
ax = axes[i, j]
ax.imshow(activations['pool_2'][i, j], cmap=plt.cm.gray_r)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Channel {}'.format(j + 1))
plt.show()
As we go deeper and deeper, images become less locally-corellated (the dependance between two neighbours decreases) and more semantically loaded. Each pixel now stores more information about the object. Very useful information. Which then will be analyzed with several Dense Layers.
In [ ]: