In [ ]:
"""This area sets up the Jupyter environment.
Please do not modify anything in this cell.
"""
import os
import sys
import time

# Add the project root to PYTHONPATH so local modules can be imported
sys.path.insert(1, os.path.join(sys.path[0], '..'))

# Import miscellaneous modules
from IPython.core.display import display, HTML

# Set CSS styling
with open('../admin/custom.css', 'r') as f:
    style = """<style>\n{}\n</style>""".format(f.read())
    display(HTML(style))

Convolutional Neural Networks

In this notebook we will become familiar with a type of *layer* for artificial neural networks called convolutional layers. The data we will attempt to model using these types of networks will be images.

Images

To a computer, an image is a matrix of data where each pixel is represented by one or more values:

Matrix with one value per pixel = a greyscale image

Matrix with three values per pixel = a colour image

The MNIST Dataset

We had a brief look at this dataset in the previous notebook, and here we will go through it in much more detail. As before, the MNIST database (Modified National Institute of Standards and Technology database) is a multiclass classification problem where we are tasked with classifying a digit ($0-9$) based on a $28\times 28$ greyscale image:

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

source

In the following example we will load data from MNIST.

  • input $\rightarrow$ 70000 samples of vectors
    • Each vector has 784 dimensions
    • Here presented as $28\times 28$ matrices $\rightarrow$ Greyscale images
  • target $\rightarrow$ 70000 integers indicating a digit from 0 to 9
In the following snippet of code we will:
  • Load the MNIST dataset
  • Plot the training sample at index 5

In [ ]:
# Display plots within the notebook
%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# NumPy is a package for manipulating N-dimensional array objects 
import numpy as np

# Pandas is a data analysis package
import pandas as pd

# Library to test/verify some of the tasks
import problem_unittests as tests  # Used to test our answers

# Mnist wrapper
from keras.datasets import mnist


# Code to load the data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Print data shape
print('Shape of x_train {}'.format(x_train.shape))
print('Shape of y_train {}'.format(y_train.shape))
print('Shape of x_test {}'.format(x_test.shape))
print('Shape of y_test {}'.format(y_test.shape))


# Code to plot the training sample at index 5
fig, ax1 = plt.subplots(1, 1, figsize=(7, 7))

ax1.imshow(x_train[5], cmap='gray')
title = 'Target = {}'.format(y_train[5])
ax1.set_title(title)
ax1.grid(which='major')
ax1.xaxis.set_major_locator(MaxNLocator(28))
ax1.yaxis.set_major_locator(MaxNLocator(28))
fig.canvas.draw()
time.sleep(0.1)

Data Pre-Processing

Before we start classifying digits we need to pre-process the data.

Your first task is to create a function that normalises 8-bit images from [0,255] to [0,1]:

Task I: Implement an Image Normalisation Function

**Task**: Implement a function that normalises the images to the interval [0,1].
  • Inputs are integers in the interval [0,255]
  • Outputs should be floats in the interval [0,1]
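As a sanity check, here is the effect we are after on a toy 8-bit array (a minimal sketch; your implementation may look different):

In [ ]:
import numpy as np

# A toy 8-bit "image" containing the extreme values
toy = np.array([[0, 127], [200, 255]], dtype=np.uint8)

# Dividing by the maximum 8-bit value maps [0, 255] to [0.0, 1.0]
print(toy / 255.0)  # floats between 0.0 and 1.0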

In [ ]:
def normalise_images(images):
    """Normalise input images.
    """
    # Normalise image here

    return images

### Do *not* modify the following lines ###
tests.test_normalize_images(normalise_images)

# Normalise the data for later use
x_train = normalise_images(x_train)
x_test = normalise_images(x_test)

Task II: Expand the Dimension of the Input

When we loaded the MNIST dataset each digit was represented by a matrix of size $(28, 28)$. However, the artificial neural network we will be building uses the concept of colour channels and feature maps even for greyscale images. This means that we have to transform $(28, 28)$ to $(28, 28, 1)$.

**Task**: Write a piece of code that adds a new dimension to `x_train` and `x_test`.
  • The shape of `x_train` should be $(60000, 28, 28, 1)$
  • The shape of `x_test` should be $(10000, 28, 28, 1)$
Take a look at [numpy.expand_dims()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html) for how you might do this.
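For example, on a toy batch of two $28 \times 28$ arrays (toy data, just to illustrate the shape change):

In [ ]:
# expand_dims inserts a size-1 axis; axis=-1 appends it at the end
toy_batch = np.zeros((2, 28, 28))
print(np.expand_dims(toy_batch, axis=-1).shape)  # (2, 28, 28, 1)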

In [ ]:
# Write your code here
x_train = None
x_test = None


### Do *not* modify the following lines ###
print('Shape of x_train {}'.format(x_train.shape))
print('Shape of y_train {}'.format(y_train.shape))
print('Shape of x_test {}'.format(x_test.shape))
print('Shape of y_test {}'.format(y_test.shape))

Target Pre-Processing

To classify our digits we need to use one-hot encoding to represent the target outputs. One-hot encoding is a simple yet robust way to represent multi-categorical targets.

This encoding is an ideal representation for training a model with a gradient descent algorithm and the softmax function we discussed in a previous notebook.

Example of one-hot encoding

Here is an example of what a one-hot encoding scheme looks like:

$$ \begin{equation*} \mathbf{y} = \left[ \begin{array}{c} 2 \\ 8 \\ 0 \\ 6 \\ \vdots\end{array} \right] \Longrightarrow \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \end{equation*} $$

On the left-hand side we have a vector of target labels $\mathbf{y}$ with $K=10$ classes. On the right-hand side we see the one-hot encoded version of $\mathbf{y}$, where each element of $\mathbf{y}$ has been transformed into a $K$-dimensional row vector that is 1 at the position given by the element's value and 0 everywhere else. For example, the first element of $\mathbf{y}$ is 2, which means its one-hot encoded vector is all zeros except at position 2 (0-indexed). Similarly, the third element of $\mathbf{y}$ is 0, which means its one-hot encoded vector is all zeros except at position 0.

The core idea is that we transform a multi-categorical label into a set of binary class indicators. For each example we can then read off, class by class, whether it belongs to that class: 1 indicates that it does and 0 that it does not.
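You can reproduce the matrix above with `keras.utils.to_categorical()` (shown here only as a reference; the next task asks you to implement the encoding yourself):

In [ ]:
from keras.utils import to_categorical

# The first four labels from the example above
labels = np.array([2, 8, 0, 6])
print(to_categorical(labels, num_classes=10))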

Task III: Implement a Function for One-Hot Encoding

**Task**: Implement a function that one-hot encodes a vector of numbers into a matrix of $K$ classes:
  • The first argument is a vector with $N$ samples
  • The second argument is a number $K$ signifying the number of classes
  • For each sample in the vector you will create an array with $K$ dimensions
  • The one-hot encoded matrix should have zeros in all positions except the position indicated by the current sample in the input vector
  • Try to implement this function by yourself. If you have doubts, ask for help
  • If you are running out of time, use `keras.utils.to_categorical(vector, number_classes)` like we did in the previous notebook and come back to this task later

In [ ]:
def one_hot(vector, number_classes):
    """Return a one-hot encoded matrix given the argument vector.
    """
    # Where we will store our one-hots
    one_hot = []

    # One-hot encode `vector` here


    # Transform list to numpy array and return it
    return np.array(one_hot)


### Do *not* modify the following line ###
tests.test_one_hot(one_hot)

# One-hot encode the MNIST target values
y_train = one_hot(y_train, 10)
y_test = one_hot(y_test, 10)

Now that we have both added an extra dimension to the input data as well as one-hot encoded the target values, let's take a look at the shapes of the data matrices.


In [ ]:
print('Shape of x_train {}'.format(x_train.shape))
print('Shape of y_train {}'.format(y_train.shape))
print('Shape of x_test {}'.format(x_test.shape))
print('Shape of y_test {}'.format(y_test.shape))

Build an Artificial Neural Network with Convolutions and Max-Pooling

Convolutions

If you have the task of recognising cats in an image, you want to recognise / classify the animal regardless of its position. To do that we rely on a statistical property: natural images are stationary.

So, if we calculate a statistic for some location in the input image, then that statistic might also be valuable to calculate at some other location. One can exploit this property to define small networks that learn features that can be applied on different parts of an image.

Convolutional neural networks exploit these properties to create very efficient networks.

In case you want to watch a short video (with captions) walking through the concepts behind convolutional networks, take a look at the following YouTube link:

Sliding windows

A sliding window defines a small region of interest in an image.

The region of interest is used to scan the whole image, as shown in the following animation:

If we use the sliding window to define the input seen by a small neural network, we have a so-called convolution.

Assume we have a colour image and a small neural network with $k$ outputs: for every possible position of the sliding window we get $k$ outputs.

After the sliding window has scanned the whole image, we have a three-dimensional array of outputs that can be processed further.
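To make these shapes concrete, here is a small sketch that pushes a toy colour input through a single Keras `Conv2D` layer with $k = 8$ outputs (the sizes are chosen only for illustration):

In [ ]:
from keras.models import Model
from keras.layers import Input, Conv2D

# A 32x32 colour image as input
inp = Input(shape=(32, 32, 3))

# A small network with k = 8 outputs, slid over the image
out = Conv2D(8, (3, 3))(inp)

probe = Model(inputs=inp, outputs=out)
print(probe.output_shape)  # (None, 30, 30, 8): 8 outputs per window position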

Convolution Example

Let's assume that we have an image of $5 \times 5=25$ pixels:

$$ \begin{equation*} \begin{array}{|c|c|c|c|c|} \hline 1 & 1 & 1 & 0 & 0 \\ \hline 0 & 1 & 1 & 1 & 0\\ \hline 0 & 0 & 1 & 1 & 1\\ \hline 0 & 0 & 1 & 1 & 0\\ \hline 0 & 1 & 1 & 0 & 0\\ \hline \end{array} \end{equation*} $$

Assume that we define a small neural network that has $3 \times 3$ weights and a single output. The weight matrix is

$$ \begin{equation*} \begin{array}{|c|c|c|} \hline 1 & 0 & 1 \\ \hline 0 & 1 & 0\\ \hline 1 & 0 & 1\\ \hline \end{array} \end{equation*} $$

By feeding the image through the network with a $3 \times 3$ sliding window we get the following convolved features (also known as a feature map):

$$ \begin{equation*} \begin{array}{|c|c|c|} \hline 4 & 3 & 4 \\ \hline 2 & 4 & 3\\ \hline 2 & 3 & 4\\ \hline \end{array} \end{equation*} $$

source

The number of so-called feature maps produced will depend on the number of outputs of the neural network. In this case we have just one feature map.
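The feature map above can be reproduced with a few lines of NumPy by sliding the window one pixel at a time (a minimal sketch of the operation, without padding, bias, or activation):

In [ ]:
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

k = kernel.shape[0]
out = np.zeros((image.shape[0] - k + 1, image.shape[1] - k + 1))

for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # Multiply the window by the weights and sum the result
        out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)

print(out)  # [[4. 3. 4.], [2. 4. 3.], [2. 3. 4.]]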

Stride and Padding

We can use padding and strides to control the size of feature maps. Below are four animations that showcase the convolution operation on an input matrix using different paddings and strides:

  • $padding = 0 \qquad stride = 1$
  • $padding = 0 \qquad stride = 2$
  • $padding = 1 \qquad stride = 1$
  • $padding = 1 \qquad stride = 2$
source, paper

For images, padding translates to how many new pixels we introduce around the edge of an image, while stride is how far the convolution kernel is shifted after each application.

Some terminology you will encounter in the literature:

  • Valid padding means that no padding is applied ($padding = 0$)
  • Same padding means that the padding is chosen as a function of the kernel size so that the output has the same size as the input
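In Keras the two schemes are selected with the `padding` argument of `Conv2D`. A quick sketch of the resulting shapes (toy sizes, for illustration only):

In [ ]:
from keras.models import Model
from keras.layers import Input, Conv2D

inp = Input(shape=(6, 6, 1))

valid = Model(inputs=inp, outputs=Conv2D(1, (3, 3), padding='valid')(inp))
same = Model(inputs=inp, outputs=Conv2D(1, (3, 3), padding='same')(inp))

print(valid.output_shape)  # (None, 4, 4, 1): the output shrinks
print(same.output_shape)   # (None, 6, 6, 1): the output keeps the input size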

Computing the Size of the Convolutions

To compute the size of the feature map resulting from a convolution we need to know the input size, the size of the kernel (filter), the stride, and the padding:

$$ \begin{equation*} output = \frac{1}{stride} (input - kernel + 2 * padding) + 1 \end{equation*} $$

For 2-d inputs the height and width can be calculated like this:

$$ \begin{equation*} height_{new} = \frac{1}{stride} (height_{input} - height_{kernel} + 2 * padding)+1 \end{equation*} $$


$$ \begin{equation*} width_{new} = \frac{1}{stride} (width_{input} - width_{kernel} + 2 * padding)+1 \end{equation*} $$

Example:

Let us assume that we have a $6 \times 6$ image. If we pad this image with a single pixel and then convolve it with a $3 \times 3$ kernel using a stride of $2$, we get a $3 \times 3$ feature map.

We can compute the output size using the equations above:

$$ \begin{equation*} \begin{aligned} height_{new} &= \frac{1}{stride} (height_{input} - height_{kernel} + 2 * padding) + 1 \\ &= \frac{1}{2} (6 - 3 + 2 * 1) + 1 \\ &= \frac{1}{2}(5)+1 \\ &= 3 \end{aligned} \end{equation*} $$
  • Notice that the result is rounded down (floor), so 3.5 becomes 3.
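The same arithmetic as a small helper function (a sketch; note the floor division):

In [ ]:
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution, rounded down."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# The example above: 6x6 input, 3x3 kernel, stride 2, padding 1
print(conv_output_size(6, 3, stride=2, padding=1))  # 3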

Pooling

It has become common practice to use pooling layers between convolutional layers.

Successful convolutional neural networks like AlexNet, VGG16, and VGG19 employ this technique.

Pooling layers also rely on sliding windows, but instead of feeding the window to a small neural network, the data inside the window goes through a max, mean, or some other fixed operator.

A Max-Pooling Example

Max-pooling has several advantages:

  • If images of the same class differ by small shifts in pixel position, max-pooling mitigates these small translations
  • It adds no parameters to the model, since the max and mean operators are fixed functions that do not depend on weights
  • It reduces the amount of data that needs to be processed in the next layer, while providing some invariance to pixel translation
  • Pooling layers are normally used without padding (valid padding)
  • The output size follows the same arithmetic as for convolutions

You can read more about pooling operators here
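As an illustration, here is $2 \times 2$ max-pooling with a stride of 2 on a toy $4 \times 4$ feature map (a sketch; real frameworks implement this far more efficiently):

In [ ]:
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])

pool, stride = 2, 2
out = np.zeros((feature_map.shape[0] // stride, feature_map.shape[1] // stride))

for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # Keep only the largest value inside each window
        window = feature_map[i * stride:i * stride + pool,
                             j * stride:j * stride + pool]
        out[i, j] = window.max()

print(out)  # [[6. 8.], [3. 4.]]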

Implementing a Convolutional Network

We will implement the following convolutional network:

The components of this network can be seen below:

Define an input:

  • input_x = Input(shape=sample_shape)
  • sample_shape is an input parameter

Generate 32 feature maps using a convolutional layer:

  • The convolution uses a $3 \times 3$ kernel, stride 1, valid padding, and ReLU activation
  • Use Conv2D from Keras
  • output_layer = Conv2D(PARAMETERS)(input_layer)

Generate 64 feature maps using a convolutional layer:

  • The convolution uses a $3 \times 3$ kernel, stride 1, valid padding, and ReLU activation
  • Use Conv2D
  • output_layer = Conv2D(PARAMETERS)(input_layer)

Reduce the feature maps using max-pooling:

  • The max-pooling should use a $2 \times 2$ kernel, stride `None` (which defaults to the pool size), and valid padding
  • Use MaxPooling2D
  • output_layer = MaxPooling2D(PARAMETERS)(input_layer)

Flatten the feature maps:

  • Use Flatten
  • output_layer = Flatten()(input_layer)

Fully-connected, i.e. Dense, to 128 dimensions:

  • Use Dense
  • output_layer = Dense(PARAMETERS)(input_layer)

Fully-connected, i.e. Dense, to $K$ classes (the `nb_classes` argument):

  • This final layer is already provided in the code stub below, using a softmax activation
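Before building the model, we can use the `conv_output_size()` helper sketched earlier to predict the shapes `model.summary()` should report for MNIST's $28 \times 28 \times 1$ inputs (a sketch, assuming valid padding and stride 1 as specified above):

In [ ]:
h = conv_output_size(28, 3)  # 26 -> (26, 26, 32) after the first Conv2D
h = conv_output_size(h, 3)   # 24 -> (24, 24, 64) after the second Conv2D
h = h // 2                   # 12 -> (12, 12, 64) after 2x2 max-pooling
print(h * h * 64)            # 9216 values after Flatten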

Task IV: Implement a Convolutional Neural Network Model

It is time to implement our first convolutional neural network.

Task: Create a function `net_1()` that implements the network specified above. Make sure to refer back to earlier notebooks if you are unsure about what to do.

In [ ]:
# Import Keras library
import keras
from keras.models import Model
from keras.layers import *


def net_1(sample_shape, nb_classes):
    # Define the network input to have `sample_shape` shape
    input_x = None
    
    # Create network internals here
    x = None
    
    # Dense `nb_classes`
    probabilities = Dense(nb_classes, activation='softmax')(x)
    
    # Define the output
    model = Model(inputs=input_x, outputs=probabilities)

    return model
In the following code snippet we will:
  • Create the network using the function you just made
  • Display a summary of the network

In [ ]:
# Shape of sample
sample_shape = x_train[0].shape 

# Construct net
model = net_1(sample_shape, 10)
model.summary()

Task V: Define Hyperparameters and Train the Network

We need to define hyperparameters so our network can learn.

Task: Tune the hyperparameters until your `loss` and `val_loss` both converge to low values:
  • Batch size
  • Number of training epochs

Keep in mind that training these kinds of networks will take longer than the ones we have looked at so far.


In [ ]:
# Define hyperparameters
batch_size = None
epochs = None

### Do *not* modify the following lines ###

# There is no learning rate to set because we are using the
# recommended values for the Adadelta optimiser; more information here:
# https://keras.io/optimizers/

# We need to compile our model
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])

# Train
logs = model.fit(x_train, y_train,
                 batch_size=batch_size,
                 epochs=epochs,
                 verbose=2,
                 validation_split=0.1)

# Plot our losses and accuracy
fig, ax = plt.subplots(1,1)

pd.DataFrame(logs.history).plot(ax=ax)
ax.grid(linestyle='dotted')
ax.legend()

plt.show()

# Assess performance
print('='*80)    
print('Assessing test dataset...')
print('='*80)    

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Should We Use Max-Pooling?

There is an ongoing discussion about whether max-pooling is a good way to reduce the amount of data between the layers of a network. Recent approaches show that similar, and sometimes better, performance can be achieved by using convolutions with strides larger than 1.

Getting rid of pooling. Many people dislike the pooling operation and think that we can get away without it. For example, Striving for Simplicity: The All Convolutional Net proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.

source
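We can sanity-check the idea with the size arithmetic from earlier: in `net_1()` the second convolution sees $26 \times 26$ feature maps, and both the pooled and the strided path reduce them to $12 \times 12$ (a sketch using the `conv_output_size()` helper from above):

In [ ]:
# Pooling path: conv with stride 1, then 2x2 max-pooling
print(conv_output_size(26, 3, stride=1) // 2)  # 24 // 2 = 12

# All-convolutional path: the same conv with stride 2, no pooling
print(conv_output_size(26, 3, stride=2))       # 12 as well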

Task VI: Implement a Convolutional Network Without Max-Pooling

Implement a convolutional neural network without pooling layers:

Task: Replicate the network we made before (`net_1()`), but this time:
  • Remove the max-pooling and instead use a stride of 2 in the second convolution block (see [Conv2D](https://keras.io/layers/convolutional/))

In [ ]:
def net_2(sample_shape, nb_classes):
    # Define the network input to have `sample_shape` shape
    input_x = None
    
    # Create network internals here
    x = None

    # Dense number_classes
    probabilities = Dense(nb_classes, activation='softmax')(x)

    # Define the output
    model = Model(inputs=input_x, outputs=probabilities)

    return model
In the following code snippet we will:
  • Create the network using the function you just made
  • Display a summary of the network

In [ ]:
# Shape of sample
sample_shape = x_train[0].shape 

# Construct net
model = net_2(sample_shape, 10)
model.summary()

Task VII: Define Hyperparameters and Train the New Network

As before, we need to define some hyperparameters and train the network. Feel free to reuse the hyperparameters you found before.

Task: Tune the hyperparameters until your `loss` and `val_loss` both converge to low values:
  • Batch size
  • Number of training epochs

In [ ]:
# Define hyperparameters
batch_size = None
epochs = None

### Do *not* modify the following lines ###

# As always we need to compile our model
model.compile(loss='categorical_crossentropy',
              optimizer='Adadelta',
              metrics=['accuracy'])

# Train
logs = model.fit(x_train, y_train,
                 batch_size=batch_size,
                 epochs=epochs,
                 verbose=2,
                 validation_split=0.1)

# Plot our losses and accuracy
fig, ax = plt.subplots(1,1)

pd.DataFrame(logs.history).plot(ax=ax)
ax.grid(linestyle='dotted')
ax.legend()
fig.canvas.draw()


# Assess performance
print('='*80)    
print('Assessing test dataset...')
print('='*80)    

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

CIFAR

The following explanation of CIFAR-10 comes from the official CIFAR page:

The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.

The CIFAR10 Dataset

The CIFAR-10 dataset consists of 60000 $32 \times 32$ colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

Here are the 10 classes in the dataset:

airplane
automobile
bird
cat
deer
dog
frog
horse
ship
truck

The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

In the following example we will load data from CIFAR10.

  • input $\rightarrow$ 60000 samples of 3072-dimensional vectors.
    • Here presented as $32 \times 32 \times 3$ matrices $\rightarrow$ Colour images
  • target $\rightarrow$ 60000 scalars indicating a class from 0 to 9
In the following code we will:
  • Load the CIFAR10 dataset
  • Plot the training sample at index 5

In [ ]:
from keras.datasets import cifar10

# The data, shuffled and split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

target_2_class = {0:'airplane',
                  1:'automobile',
                  2:'bird',
                  3:'cat',
                  4:'deer',
                  5:'dog',
                  6:'frog',
                  7:'horse',
                  8:'ship',
                  9:'truck'}

# Code to plot the training sample at index 5
fig, ax1 = plt.subplots(1, 1, figsize=(7, 7))
ax1.imshow(x_train[5])
target = y_train[5][0]
title = 'Target is {} - Class {}'.format(target_2_class[target], target)
ax1.set_title(title)
ax1.grid(which='major')
ax1.xaxis.set_major_locator(MaxNLocator(32))
ax1.yaxis.set_major_locator(MaxNLocator(32))
fig.canvas.draw()
time.sleep(0.1)

print('Shape of x_train {}'.format(x_train.shape))
print('Shape of y_train {}'.format(y_train.shape))
print('Shape of x_test {}'.format(x_test.shape))
print('Shape of y_test {}'.format(y_test.shape))

Task VIII: One-Hot Encode the Target Values

Task: Use the `one_hot()` function you created earlier to encode:
  • `y_test`
  • `y_train`

In [ ]:
y_train = None
y_test = None

### Do *not* modify the following line ###
# Print data sizes
print('Shape of x_train {}'.format(x_train.shape))
print('Shape of y_train {}'.format(y_train.shape))
print('Shape of x_test {}'.format(x_test.shape))
print('Shape of y_test {}'.format(y_test.shape))

Task IX: Normalise the Images

Task: Use the `normalise_images()` function you created earlier to normalise the images in:
  • `x_test`
  • `x_train`

In [ ]:
x_train = None
x_test = None

Task X: Create Your Convolutional Neural Network for CIFAR10

Task: Create a neural network model to train on CIFAR using what we have learned so far.
  • Create a new network using either `net_1()` or `net_2()`
  • Display a summary of the network
  • Compile the model using either `Adadelta`, `Adagrad`, or `Adam` as your optimiser
Some of the code is filled in already so alter what you need.

In [ ]:
# Shape of samples
sample_shape = x_train[0].shape 

# Construct net
model = None
model.summary()

# We need to compile our model network:
model.compile(loss='categorical_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

Task XI: Train Your Model

Task: Train the model created in the previous cell on CIFAR10.
  • Train the network using `epochs = 30`
  • Train the network using a `batch_size = 128`
  • Use `validation_split = 0.2` when calling `fit()`
  • Plot the losses and accuracy
  • Assess the performance on the test set
We recommend that you do not copy-paste from the cells above, but rather re-write the code yourself. You can always take a look at earlier cells if you are unsure about what to do.

In [ ]:
# Build the code within this cell

Topics to Think About

  • Which model performs better?
  • Which optimiser performs better?
  • Is there any evidence of overfitting?
  • How can we improve the performance even further?

If You Have Time


In [ ]: