In [ ]:
"""This area sets up the Jupyter environment.
Please do not modify anything in this cell.
"""
import os
import sys

# Add project to PYTHONPATH for future use
sys.path.insert(1, os.path.join(sys.path[0], '..'))

# Import miscellaneous modules
from IPython.core.display import display, HTML

# Set CSS styling
with open('../admin/custom.css', 'r') as f:
    style = """<style>\n{}\n</style>""".format(f.read())
    display(HTML(style))

import keras
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import as tools
import problem_unittests as tests

Multilayer Perceptron with Keras

In this notebook we will look at how to create a type of artificial neural network called a multilayer perceptron (MLP) using Keras. To do this we will again work through a few examples; this time, however, the focus will be on classification problems.

Similar to regression, in classification we aim to create models that fit a conditional probability distribution $\Pr(y\vert\mathbf{x})$, where $y$ are the target values and $\mathbf{x}$ is the data. However, in classification the model is trained to output a discrete value from a finite set of classes. For example, in binary classification we fit a probability distribution $\Pr(y\vert\mathbf{x})$ where $y \in \{0,1\}$. That is, the model outputs the probability that the input $\mathbf{x}$ belongs to the class $y=1$.

There are several different learning algorithms for classification, such as decision tree learning (DTL) and support vector machines (SVMs); for artificial neural networks, however, we will extend the idea of logistic regression.

The Basics of Multinomial Logistic Regression

Multinomial logistic regression, also called softmax regression, generalises logistic regression to multiclass classification problems, i.e. problems where the target $y$ can be one of $K$ classes, $y \in \{1, \ldots, K\}$. In logistic regression, $\Pr(y=1\vert\mathbf{x})$ is captured by the logistic function $\sigma(z)=\frac{1}{1+e^{-z}}$, where $z=\mathbf{w}^\intercal\mathbf{x}$. The multinomial variant instead uses the softmax function to capture a multiclass categorical probability distribution. The softmax function is defined as:

$$ \begin{equation*} \Pr(y = k\vert\mathbf{x}) = \sigma(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{i=1}^{K}e^{z_i}}, \qquad \text{where} \qquad \mathbf{z} = \mathbf{w}^\intercal\mathbf{x} \end{equation*} $$

That is, the softmax function $\sigma(\mathbf{z})$ outputs a $K$-dimensional vector of real values between 0 and 1 that sum to 1. The $k$-th value in this vector is the probability that $y=k$ given the input $\mathbf{x}$. The numerator exponentiates the predicted unnormalised log probability $z_k$, while the denominator normalises over all entries of $\mathbf{z}$ to get a valid probability distribution.

Just as with logistic regression, training the softmax function to output meaningful probabilities can be done using maximum likelihood: $\underset{\mathbf{w}}{\arg\max}\mathcal{L}(\mathbf{w})$. Maximising the likelihood is the same as minimising the negative likelihood. Additionally, we typically take the log of the likelihood as it does not change the result but simplifies some of the maths. Doing this yields the following error function when taken over all $N$ training examples:

$$ \begin{equation*} \begin{aligned} E(\mathbf{w}) &= -\frac{1}{N}\log\mathcal{L}(\mathbf{w}) \\ &= -\frac{1}{N}\sum_{i=1}^{N}\log\prod_{k=1}^{K}\Pr(y_i = k\vert\mathbf{x}_i)^{\mathbb{1}\{y_i = k\}} \\ &= -\frac{1}{N}\sum_{i=1}^{N}\log\prod_{k=1}^{K}\sigma(\mathbf{z}_i)_k^{\mathbb{1}\{y_i = k\}} \\ &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\mathbb{1}\{y_i = k\}\log\sigma(\mathbf{z}_i)_k \end{aligned} \end{equation*} $$

$\mathbb{1}$ is the indicator function where $\mathbb{1}\{y_i = k\}$ is 1 if and only if example $i$ belongs to class $k$. This version of the error function is typically called the categorical cross entropy error function. When $K=2$ multinomial logistic regression reduces to simple logistic regression.
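To make the definitions above concrete, here is a minimal NumPy sketch of the softmax function and the categorical cross entropy for a small batch. The function names `softmax` and `categorical_cross_entropy`, the example scores, and the max-subtraction trick for numerical stability are illustrative choices of ours and not part of the notebook's tooling. Class labels are zero-based here, unlike the $1, \ldots, K$ convention in the text.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the last axis of `z`."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def categorical_cross_entropy(probs, labels):
    """Mean negative log probability assigned to the true class.

    `probs` has shape (N, K); `labels` holds integer classes in 0..K-1.
    """
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels]))

# Hypothetical scores z for N=2 examples and K=3 classes
z = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 3.0]])
labels = np.array([0, 2])

probs = softmax(z)
error = categorical_cross_entropy(probs, labels)
print(probs.sum(axis=1))  # each row sums to 1
print(error)

# With K=2 and scores [z, 0], softmax reduces to the logistic function
two_class = softmax(np.array([[1.5, 0.0]]))[0, 0]
logistic_value = 1.0 / (1.0 + np.exp(-1.5))
print(two_class, logistic_value)
```

The last two lines illustrate the $K=2$ special case: fixing the second score to zero gives $e^{z}/(e^{z}+1)$, which is exactly the logistic function.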

The Forward Pass: Inference

Let us now focus on feedforward neural networks, a term which for all intents and purposes is synonymous with multilayer perceptrons. An example of a feedforward network can be seen below:

This network consists of two hidden layers, $\mathbf{u}$ and $\mathbf{v}$: the input $\mathbf{x}$ is connected to $\mathbf{u}$, $\mathbf{u}$ is connected to $\mathbf{v}$, and $\mathbf{v}$ is connected to the output $\mathbf{y}$. Layers of this type are typically referred to as fully-connected because every neuron in a layer is connected to every neuron in the preceding layer. Fully-connected layers are sometimes called densely-connected.

The output of a single neuron is typically computed in two steps: (i) integrate the input by taking a linear combination of the inputs and their associated weights, and (ii) feed this scalar into an activation function. This can be summarised as a dot product between inputs and weights wrapped in a function, such as the logistic function. To simplify all of these dot products, we can gather all weights linking neurons from one layer to another into a matrix called $\mathbf{W}^{(l)}$ for layer $l$, where element $\mathbf{W}^{(l)}_{ij}$ indicates the weight from neuron $i$ to neuron $j$ for layer $l$.

With this representation, the output activation $\mathbf{a}^{(l)}$ for layer $l$ is simply $\sigma(\mathbf{a}^{(l-1)}\mathbf{W}^{(l)})$, i.e. a vector-matrix product passed through an activation function $\sigma(\cdot)$ that is applied element-wise. That is, before we apply the activation function we take the activations from the preceding layer $(l-1)$ and multiply them with the weight matrix for the current layer. For example, for the network topology above we can calculate the activations for layer $\mathbf{u}$ by computing $\sigma(\mathbf{x}\mathbf{W}^{(1)})$, assuming $\mathbf{x}$ is a row vector. This makes sense, because $\mathbf{x}$ is a $1 \times 4$ row vector and $\mathbf{W}^{(1)}$ is a $4 \times 3$ matrix. Thus, the output activation for layer $\mathbf{u}$ has the size $1 \times 3$, i.e. there are three neurons with one value each.
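The shape bookkeeping can be checked with a few lines of NumPy. This is only a sketch: the helper `layer_forward`, the random input, and the random weights are hypothetical, and `np.tanh` is used as a stand-in activation so that the task below is left open.

```python
import numpy as np

def layer_forward(a_prev, W, activation):
    """Return activation(a_prev W) for a row vector (or batch of rows) a_prev."""
    z = np.dot(a_prev, W)   # linear combination, shape (1, n_out)
    return activation(z)    # element-wise non-linearity

rng = np.random.default_rng(0)
a0 = rng.normal(size=(1, 4))   # hypothetical input, a 1 x 4 row vector
W1 = rng.normal(size=(4, 3))   # hypothetical weights, 4 inputs -> 3 neurons

a1 = layer_forward(a0, W1, np.tanh)
print(a1.shape)  # (1, 3)
```

Stacking several such calls, each fed with the previous layer's output, gives the complete forward pass.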

Task I: Implementing the Forward Pass

Using the matrix multiplication approach outlined above, perform the forward pass (inference step) on the network topology we discussed in the previous section using the weight matrices below on the input vector $\mathbf{x}=[1,-4,0,7]$. The input vector is a row vector.

$$ \begin{equation*} \mathbf{W}^{(1)} = \begin{bmatrix} -0.1 & 0.2 & 0.1 \\ 0.1 & 0.4 & 0.1 \\ 0.0 & -0.7 & 0.2 \\ 0.6 & 0.3 & -0.4 \end{bmatrix} \quad \mathbf{W}^{(2)} = \begin{bmatrix} 0.3 & -0.8 & 0.1 & 0.0 \\ 0.0 & 0.1 & 0.2 & 0.8 \\ -0.2 & 0.7 & 0.4 & 0.1 \end{bmatrix} \quad \mathbf{W}^{(3)} = \begin{bmatrix} 0.2 \\ 0.1 \\ 0.5 \\ 0.4 \end{bmatrix} \end{equation*} $$

Every layer - that is $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{y}$ - uses the logistic function as its activation function, i.e. $\sigma(z)=\frac{1}{1+e^{-z}}$.

**Task**: Implement the logistic function $\sigma(\mathbf{z})$ and compute the output activations for each layer. Here are two functions that will be useful for solving this task:
  • `np.exp(x)` - exponentiates the argument `x`
  • `, b)` - performs matrix multiplications between matrix `a` and `b`
If you are unfamiliar with how to work with matrices in Python using NumPy, then it may be a good idea to take a look at the following guide: Getting started with Python.

In [ ]:
# Create input signal
x = np.array([[ 1, -4, 0, 7]])

# Create weight matrices for the three layers
W1 = np.array([[-0.1,  0.2,  0.1],
               [ 0.1,  0.4,  0.1],
               [ 0.0, -0.7,  0.2],
               [ 0.6,  0.3, -0.4]])

W2 = np.array([[ 0.3, -0.8,  0.1, 0.0],
               [ 0.0,  0.1,  0.2, 0.8],
               [-0.2,  0.7,  0.4, 0.1]])

W3 = np.array([[0.2],
               [0.1],
               [0.5],
               [0.4]])

# Create the logistic function
def logistic(z):
    out = None

    return out

# Compute activation of hidden layer `U`
U = None

# Compute activation of hidden layer `V`
V = None

# Compute activation of output layer `y`
y = None

### Do *not* modify the following line ###
# Test and see that the calculations have been done correctly
tests.test_forward_pass(logistic, U, V, y)

The Backward Pass: Backpropagation

Backpropagation is the name of one learning algorithm for training artificial neural networks. While one of several, it is currently the most popular due to how efficiently it can be applied on modern hardware. As with other learning algorithms, the goal of backpropagation is to minimise an error or loss function.

After having run the forward (inference) step for an artificial neural network and calculated the error, we would like to figure out how to perturb each of the weights in the network in order to reduce the error. Backpropagation does this by combining the gradient descent algorithm we have seen earlier with the chain rule of calculus. Using notation from the first notebook, we would like to apply the following update rule to every weight in the network: $\mathbf{W}^{(l)}(k+1)\leftarrow\mathbf{W}^{(l)}(k) - \eta\frac{\partial E}{\partial\mathbf{W}^{(l)}}$, where $k$ is the current iteration or epoch, $\eta$ is the learning rate, and $\mathbf{W}^{(l)}$ is the weight matrix for layer $l$. This means that for every weight matrix backpropagation finds $\frac{\partial E}{\partial\mathbf{W}^{(l)}}$ by applying the chain rule of calculus.

In order to find $\frac{\partial E}{\partial\mathbf{W}^{(l)}}$ let's start by dividing the total amount of error amongst the neurons. The division depends on the linear combination $\mathbf{z}^{(l)}$, i.e. before the activation function, and is typically defined as $\delta^{(l)}$ for layer $l$. When $l$ is the last layer, which we will call $L$, $\delta^{(L)}$ for a specific output neuron $i$ is defined like so:

$$ \begin{equation*} \begin{aligned} \delta_{i}^{(L)} &= \frac{\partial E}{\partial \mathbf{z}_{i}^{(L)}} \\ &= \frac{\partial E}{\partial \mathbf{a}_{i}^{(L)}}\frac{\partial \mathbf{a}_{i}^{(L)}}{\partial \mathbf{z}_{i}^{(L)}} \\ &= \frac{\partial E}{\partial \mathbf{a}_{i}^{(L)}}\sigma'(\mathbf{z}_{i}^{(L)}) \end{aligned} \end{equation*} $$

Here, $\sigma'$ is defined as the derivative of the relevant activation function. Similarly, the error for a specific neuron $i$ in an arbitrary layer $l$ is defined as:

$$ \begin{equation*} \begin{aligned} \delta_{i}^{(l)} &= \frac{\partial E}{\partial \mathbf{z}_{i}^{(l)}} \\ &= \sum_{j}\frac{\partial E}{\partial \mathbf{z}_{j}^{(l+1)}}\frac{\partial \mathbf{z}_{j}^{(l+1)}}{\partial \mathbf{z}_{i}^{(l)}} \\ &= \sum_{j}\frac{\partial \mathbf{z}_{j}^{(l+1)}}{\partial \mathbf{z}_{i}^{(l)}}\delta_{j}^{(l+1)} \\ &= \sum_{j}\mathbf{W}_{ij}^{(l+1)}\delta_{j}^{(l+1)}\sigma'(\mathbf{z}_{i}^{(l)}) \end{aligned} \end{equation*} $$

That is, the error attributed to neuron $i$ in layer $l$ is the error over all neurons in layer $l+1$ weighted and multiplied by $\sigma'$ for the neuron we are interested in. Now that we know the gist of how errors are distributed amongst the neurons we can define $\frac{\partial E}{\partial\mathbf{W}_{ij}^{(l)}}$ as:

$$ \begin{equation*} \frac{\partial E}{\partial\mathbf{W}_{ij}^{(l)}} = \mathbf{a}_{i}^{(l-1)}\delta_{j}^{(l)} \end{equation*} $$

In other words, the amount we alter weight $\mathbf{W}_{ij}^{(l)}$ by, i.e. edge from $i$ to $j$, is the error $\delta_{j}$ of layer $l$ weighted by the activation $a_{i}$ at layer $l-1$. Notice that $l-1$ may be the input layer $\mathbf{x}$.
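The delta equations can be sanity-checked numerically. The sketch below is a minimal illustration and not how Keras implements backpropagation. It assumes a network with one hidden layer, logistic activations, and a squared error $E=\frac{1}{2}\sum(\mathbf{a}^{(L)}-\mathbf{t})^2$ chosen only because its derivative is simple. All names, sizes, and random values are hypothetical.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass through one hidden layer and one output layer."""
    a1 = logistic(x @ W1)
    a2 = logistic(a1 @ W2)
    return a1, a2

def squared_error(x, t, W1, W2):
    _, a2 = forward(x, W1, W2)
    return 0.5 * np.sum((a2 - t) ** 2)

def backprop(x, t, W1, W2):
    """Return dE/dW1 and dE/dW2 using the delta equations."""
    a1, a2 = forward(x, W1, W2)
    # Output layer: delta = dE/da * sigma'(z), with sigma'(z) = a * (1 - a)
    delta2 = (a2 - t) * a2 * (1.0 - a2)
    # Hidden layer: weighted sum of next-layer deltas times sigma'(z)
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)
    # dE/dW_ij = a_i (previous layer) * delta_j (current layer)
    return x.T @ delta1, a1.T @ delta2

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))    # input as a row vector
t = np.array([[1.0]])          # target
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(2, 1))

grad_W1, grad_W2 = backprop(x, t, W1, W2)

# Central finite-difference estimate of dE/dW1[0, 0] as a sanity check
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (squared_error(x, t, W1_plus, W2)
           - squared_error(x, t, W1_minus, W2)) / (2 * eps)
print(grad_W1[0, 0], numeric)
```

The analytic gradient and the finite-difference estimate should agree to several decimal places. This kind of check is a common way to validate a hand-written backward pass.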

We will not dwell more on the intricacies of backpropagation here as there are many packages that implement it; Keras, or rather TensorFlow / Theano, being one of them.

Task II: Learning the XOR Function

The XOR function (exclusive or) is a logical operation over two binary inputs that returns 1 (true) when exactly one of its inputs is 1; otherwise it returns 0 (false). Learning the XOR function can therefore be viewed as a binary classification problem. The truth table for the XOR function can be seen below.

| $x_1$ | $x_2$ | Output |
|-------|-------|--------|
| 0     | 0     | 0      |
| 0     | 1     | 1      |
| 1     | 0     | 1      |
| 1     | 1     | 0      |

It is inherently a nonlinear problem because a line cannot separate the two output classes.

In the following code snippet we will:
  • Create the XOR dataset
  • Plot the data

In [ ]:
# Create data as NumPy arrays
DATA_X = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=np.float32)
DATA_y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# Plot data

# Class 0 in red, class 1 in blue (colour assignment assumed)
plt.scatter(DATA_X[np.where(DATA_y==0)[0], 0],
            DATA_X[np.where(DATA_y==0)[0], 1],
            c='r')
plt.scatter(DATA_X[np.where(DATA_y==1)[0], 0],
            DATA_X[np.where(DATA_y==1)[0], 1],
            c='b')

plt.xticks([0, 1])
plt.yticks([0, 1])

As we can see a simple line definitely cannot separate the blue and red points. Thankfully, a simple feedforward neural network with one hidden layer containing two neurons can solve the problem. The artificial neural network we will implement in Keras can be seen below. The two bias neurons are not necessary, but have been added for completeness.

The network above has a single hidden layer with two neurons, and all activation functions are assumed to be logistic functions. It works in two steps: (i) transform the input to a different feature space, namely the hidden layer; and (ii) perform classification in this new feature space. Our hope is that the feature space made by the hidden layer enables us to linearly separate the points.
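To see that such a feature space exists, the two steps can be carried out by hand in NumPy. The weights and biases below are hand-picked for illustration and are not learned. A trained network may well find a different solution. The first hidden neuron behaves roughly like OR and the second roughly like AND.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)

# Hand-picked (not learned) weights and biases
W_hidden = np.array([[20.0, 20.0],
                     [20.0, 20.0]])
b_hidden = np.array([-10.0, -30.0])   # neuron 1 ~ OR, neuron 2 ~ AND
W_out = np.array([[20.0], [-20.0]])
b_out = np.array([-10.0])

hidden = logistic(X @ W_hidden + b_hidden)   # step (i): new feature space
output = logistic(hidden @ W_out + b_out)    # step (ii): classify in that space

print(np.round(hidden, 2))
print(np.round(output, 2).ravel())  # close to [0, 1, 1, 0]
```

In the hidden feature space the inputs (0, 1) and (1, 0) map to the same point, and a single line separates it from the images of (0, 0) and (1, 1).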

**Task**: Build the model in the figure above with the help of the Keras functional guide. The relevant building blocks are:
  • Input()
  • Dense() - Bias nodes are on by default but can be removed by setting `use_bias` to `False`
  • Model()
All layers have to use the logistic function as their activation function. This is achieved by giving `Dense()` the argument `activation='sigmoid'`.

In [ ]:
# Import what we need
from keras.layers import (Input, Dense)
from keras.models import Model

def build_xor_model():
    """Return a Keras Model."""
    model = None

    return model

### Do *not* modify the following line ###
# Test and see that the model has been created correctly

Now that we have successfully created the artificial neural network model, let's train it on the XOR dataset we defined earlier and see if we can learn the function. Similarly to the previous notebooks, notice that we are using the same stochastic gradient descent optimiser as before; however, this time we are using the binary cross entropy function we briefly discussed earlier.

In the following code snippet we will:
  • Create a model using the `build_xor_model()` function we made earlier
  • Train the network with backpropagation
  • Print the final predictions
  • Plot the decision boundary learned by the artificial neural network
Because the initialisation of the weight matrices matters so much, we may be unlucky and get a solution which does not solve the XOR problem. If this is the case, then simply re-run the cell.

In [ ]:
"""Do not modify the following code. It is to be used as a reference for future tasks."""

# Create a XOR model
model = build_xor_model()

# Define hyperparameters
lr = 1.0
nb_epochs = 500

# Define optimiser
optimizer = keras.optimizers.sgd(lr=lr)

# Compile model, use binary_crossentropy
model.compile(loss='binary_crossentropy', optimizer=optimizer)

# Print model
model.summary()

# Train model (default batch size assumed)
model.fit(DATA_X, DATA_y, epochs=nb_epochs, verbose=0)

# Print predictions
y_hat = model.predict(DATA_X)
print('\nFinal predictions:')
for idx in range(len(y_hat)):
    print('{} ~> {:.2f} (Ground truth = {})'.format(DATA_X[idx], y_hat[idx][0], DATA_y[idx][0]))

# Plot XOR data and decision boundary
xx, yy = np.meshgrid(np.arange(-0.1, 1.1, 0.01), np.arange(-0.1, 1.1, 0.01))
grid = np.vstack((xx.ravel(), yy.ravel())).T

preds = model.predict(grid)[:, 0].reshape(xx.shape)

f, ax = plt.subplots()
ax.contour(xx, yy, preds, levels=[0.5], colors='k', linewidths=1.0)

# Class 0 in red, class 1 in blue (colour assignment assumed)
ax.scatter(DATA_X[np.where(DATA_y==0)[0], 0],
           DATA_X[np.where(DATA_y==0)[0], 1],
           c='r')
ax.scatter(DATA_X[np.where(DATA_y==1)[0], 0],
           DATA_X[np.where(DATA_y==1)[0], 1],
           c='b')

plt.xticks([0, 1])
plt.yticks([0, 1])

Let's visualise the output of the hidden layer to better understand the feature space representation. It just so happens that the layer only has two neurons, which means we can safely plot it as an image.

In the following code snippet we will:
  • Plot the data after it has been transformed to the feature space

In [ ]:
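# Get the output activation given the XOR data for the hidden layer
# (assumes the hidden Dense layer is `model.layers[1]`, directly after the input layer)
intermediate_layer = Model(inputs=model.input,
                           outputs=model.layers[1].output)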
feature_space = intermediate_layer.predict(DATA_X)

# Visualise representation as a scatter plot

# Class 0 in red, class 1 in blue (colour assignment assumed)
plt.scatter(feature_space[np.where(DATA_y==0)[0], 0],
            feature_space[np.where(DATA_y==0)[0], 1],
            c='r')
plt.scatter(feature_space[np.where(DATA_y==1)[0], 0],
            feature_space[np.where(DATA_y==1)[0], 1],
            c='b')

plt.xticks([0, 1])
plt.yticks([0, 1])

As we can see, the red and blue points are now linearly separable.

Task III: The Circle Dataset

The second dataset we will take a look at is also a nonlinear one. The circle dataset is a synthetic dataset with two output classes, which means we are yet again dealing with a binary classification problem.

In the following code snippets we will:
  • Load the circle dataset
  • Plot the data: Testing data has a darker colour compared to the training data

In [ ]:
# Load the circle data set
X_train, y_train, X_test, y_test = tools.load_csv_data(
    'resources/cl-train.csv', 'resources/cl-test.csv')

# Plot both the training and test set

tools.plot_2d_train_test(X_train, y_train, X_test, y_test)

As we can see from the plot above the coloured points cannot be separated by a single line, which means that the dataset is not linearly separable. While the dataset can be solved quite easily by, for example, applying the transformation $\phi(\mathbf{x})=(\mathbf{x} - 0.5)^2$, we will be using an artificial neural network to do the job for us.
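As a rough illustration of why the transformation $\phi$ helps, the sketch below applies it to a synthetic stand-in for the circle dataset. The real data is not used here. We assume, purely for illustration, that one class lies inside a circle of radius 0.3 centred at $(0.5, 0.5)$. The actual dataset may be shaped differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the circle dataset (assumption: class 1 lies inside
# a circle of radius 0.3 centred at (0.5, 0.5); the real data may differ)
radius = 0.3
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = (np.sum((X - 0.5) ** 2, axis=1) < radius ** 2).astype(int)

# Apply the transformation phi(x) = (x - 0.5)^2 element-wise
phi = (X - 0.5) ** 2

# In the transformed space the classes are split by the straight line
# phi_1 + phi_2 = radius^2, i.e. a linear decision rule
y_linear = (phi @ np.array([1.0, 1.0]) < radius ** 2).astype(int)
accuracy = (y_linear == y).mean()
print(accuracy)
```

The squared distance to the centre becomes a plain sum of the two new features, so a linear classifier in $\phi$-space is enough. A hidden layer has to discover a comparable representation on its own.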

This task is experimentation-based which means you will have to come up with most of the network topology yourself.

**Task**: Build an artificial neural network model that can solve the circle dataset. As before, you only need to use fully-connected layers (Dense()) with logistic activation to solve the problem.

The model input and model output have been created for you, so you can focus on the hidden representation(s).

In [ ]:
def build_circle_model():
    """Return a Keras Model."""
    model = None

    # Define model inputs
    inputs = Input(shape=(2,))

    # Come up with your hidden representation here (one or more layers)
    hidden = None
    # Define model output (make sure to link it with the correct previous layer)
    outputs = Dense(1, activation='sigmoid')(inputs)
    # Build model
    model = Model(inputs=inputs, outputs=outputs)

    return model

Now that we have a model, let's continue on and train it using Keras. This time however, you will have to set up everything yourself.

**Task**: Train the model you just created:
  • Call `build_circle_model()` to build the model
  • Set up an optimiser
  • Compile and fit the model
We have picked some sensible hyperparameters for you, but you are free to experiment. It is strongly recommended that you set `verbose` to $0$ just like we did in the XOR example when calling the `fit()` method.

In [ ]:
# Create a circle model
model = None

# Define hyperparameters
lr = 3.0
nb_epochs = 500
batch_size = 10

# Define optimiser
optimizer = None

# Compile model, use binary_crossentropy

# Train model (make sure you input the correct data)

"""Do not modify the following code. It is to be used as a reference for future tasks."""
# Plot circle dataset and decision boundary
xx, yy = np.meshgrid(np.arange(-0.1, 1.1, 0.01), np.arange(-0.1, 1.1, 0.01))
grid = np.vstack((xx.ravel(), yy.ravel())).T

preds = model.predict(grid)[:, 0].reshape(xx.shape)

plt.contour(xx, yy, preds, levels=[0.5], colors='k', linewidths=1.0)

tools.plot_2d_train_test(X_train, y_train, X_test, y_test)

plt.xticks([0, 1])
plt.yticks([0, 1])

The task is considered finished when you have managed to separate the red and blue points by the decision boundary.

Task IV: Digit Classification with MNIST

Before we end this notebook we will take a look at a multiclass classification problem called the MNIST database (Modified National Institute of Standards and Technology database):

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.


We will go more in-depth on images and how to train on images in the next notebook on convolutional networks, so for now, we will hand-wave most of the details and explanations.

In the following code snippets we will:
  • Load the MNIST dataset
  • Normalise the images between 0 and 1 to simplify the training
  • Ensure that the target outputs are one-hot encoded (see next notebook)
  • Plot a few images from the dataset

In [ ]:
# Load data using Keras
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

# Normalise images
X_train = X_train / 255
X_test = X_test / 255

# One-hot encode targets
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

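# Plot a few images
fig, ax = plt.subplots(1, 4)
# Show the first four training images (the choice of images is an assumption)
for i in range(4):
    ax[i].imshow(X_train[i], cmap='gray')
    ax[i].axis('off')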

The images in MNIST are fairly small, just $28\times 28$ pixels big. To be able to train a multilayer perceptron model the images first have to be flattened to $28\times 28 = 784$ values between 0 and 1. There are a total of 10 classes (digits 0 through 9) which means our neural network will have 784 inputs and 10 outputs. Your programming task will focus on coming up with a sensible hidden representation - one or more layers - to be able to classify the digits well.
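The flattening and one-hot encoding steps can be illustrated without downloading MNIST. The sketch below uses a hypothetical mini-batch of random images and made-up labels. It uses `np.eye` as a NumPy-only stand-in for the one-hot encoding, whereas the notebook itself relies on `keras.utils.to_categorical`.

```python
import numpy as np

# Hypothetical mini-batch: 5 greyscale images of 28 x 28 pixels in [0, 255]
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))
digits = np.array([3, 0, 9, 1, 3])   # made-up labels

# Normalise to [0, 1] and flatten each image to a vector of 784 values
flat = (images / 255).reshape(images.shape[0], -1)

# One-hot encode the labels: row i has a single 1 in column digits[i]
one_hot = np.eye(10)[digits]

print(flat.shape)      # (5, 784)
print(one_hot.shape)   # (5, 10)
```

Each row of `flat` is what the 784 input neurons receive, and each row of `one_hot` is the target distribution that the 10 softmax outputs are compared against.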

This is the first task where the number of output classes is more than two. Thus, for this problem we will be using the softmax function as the activation function for the output layer.

**Task**: Experiment with different hidden representations for solving the MNIST classification problem. As usual, you only need to use fully-connected layers (Dense()), but feel free to try out different activation functions.

For this task you will need to set up the input layer, hidden representation, output layer, and model yourself. Refer back to earlier code and notebooks if you are unsure about how to do this. The most important thing to keep in mind is the output layer must use `activation='softmax'`.

In [ ]:
def mlp_mnist_model(nb_inputs, nb_outputs):
    """Return a Keras Model."""
    model = None

    return model

Now that we have our model, the next thing we have to do is to test it and see whether we are able to learn something useful from the MNIST dataset.

In the following code snippets we will:
  • Train MNIST with the categorical cross entropy error function we discussed in the beginning of this notebook
The hyperparameters we have set below are fairly sensible, and you should be able to get decent performance with them; however, feel free to experiment.

In [ ]:
"""Do not modify the following code."""

# Create flattened version for the MLP
X_trainf = X_train.reshape(X_train.shape[0], -1)
X_testf = X_test.reshape(X_test.shape[0], -1)

# Create MNIST MLP model
model = mlp_mnist_model(X_trainf.shape[1], 10)

# Define hyperparameters
lr = 0.01
nb_epochs = 20
batch_size = 128

# Define optimiser
optimizer = keras.optimizers.sgd(lr=lr, nesterov=True)

# Compile model, use categorical_crossentropy
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

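# Print model
model.summary()

# Train and record history
logs = model.fit(X_trainf, y_train,
                 batch_size=batch_size,
                 epochs=nb_epochs,
                 validation_data=(X_testf, y_test))

# Plot the error (training and validation loss per epoch)
fig, ax = plt.subplots(1,1)
ax.plot(logs.history['loss'], label='Training')
ax.plot(logs.history['val_loss'], label='Validation')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss / Error')
ax.legend()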

We can get fairly good performance on MNIST by using just a simple MLP; however, we may be able to get even better performance by using a type of artificial neural network called a convolutional network. This will be the topic of the next Jupyter notebook.

Digression: What is Deep Learning?

The core idea of deep learning is to build complex concepts from simple concepts. We typically think of it as learning hierarchical representations of concepts, i.e. each new representation builds on the previous representation.

In "shallow" learning, such as linear and logistic regression, we construct simple models over the feature domain. These are either just the raw inputs, or some hand-engineered features found via feature extraction. In deep learning, the feature extraction procedure is a part of the learning process, and as we saw above: the features are learned as a hierarchy of simple-to-complex representations.

Common misunderstanding: Deep learning is not synonymous with artificial neural networks (ANNs). While ANNs are the quintessential deep learning models, they are not the only choice. For example, probabilistic graphical models are also a common choice.
