Neural Networks

The Perceptron

To get an intuitive idea of how neural networks work, let us code an elementary perceptron. In this example we will illustrate some of the concepts we have seen, build a small perceptron, and make the link between the perceptron and linear classification.

Learning Activity 1: Generating some data

Before working with the MNIST dataset, you'll first test your perceptron implementation on a "toy" dataset with just a few data points. This allows you to test your implementations with data you can easily inspect and visualise without getting lost in the complexities of the dataset itself.

Start by loading two basic libraries: matplotlib for plotting graphs (http://matplotlib.org/contents.html) and numpy for numerical computing with vectors, matrices, etc. (http://docs.scipy.org/doc/).


In [ ]:
# Load the libraries

import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Then let us generate some points in 2-D that will form our dataset:


In [ ]:
# Create some data points
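# (The original points are not given; the values below are an assumed example of
#  two small, linearly separable groups of points in 2-D.)
X = np.array([[0.2, 0.7], [0.5, 0.8], [0.8, 0.9],     # crosses (class 1)
              [0.3, -0.3], [0.6, 0.0], [0.9, 0.1]])   # circles (class 0)
y = np.array([1, 1, 1, 0, 0, 0])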

Let's visualise these points in a scatterplot using the plot function from matplotlib


In [ ]:
# Visualise the points in a scatterplot
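# Crosses for class 1, circles for class 0 (X and y as created in the cell above)
plt.plot(X[y == 1, 0], X[y == 1, 1], 'x', markersize=10)
plt.plot(X[y == 0, 0], X[y == 0, 1], 'o', markersize=10)
plt.show()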

Here, imagine that the purpose is to build a classifier which, for a given new point, returns whether it belongs to the crosses (class 1) or to the circles (class 0).

Learning Activity 2: Computing the output of a Perceptron

Let’s now define a function which returns the output of a Perceptron for a single input point.


In [ ]:
# Now let's build a perceptron for our points

def outPerceptron(x,w,b):
    innerProd = np.dot(x,w)    # computes the weighted sum of input
    output    = 0
    if innerProd > b:
        output = 1
    return output

It’s useful to define a function which returns the sequence of outputs of the Perceptron for a sequence of input points:


In [ ]:
# Define a function which returns the sequence of outputs for a sequence of input points

def multiOutPerceptron(X,w,b):
    nInstances = X.shape[0]
    outputs    = np.zeros(nInstances)
    for i in range(0,nInstances):
        outputs[i] = outPerceptron(X[i,:],w,b)
    return outputs

Bonus Activity: Efficient coding of multiOutPerceptron

In the above implementation, the simple outPerceptron function is called for every single instance. It is cleaner and more efficient to code everything in one function using matrices:


In [ ]:
# Optimise the multiOutPerceptron function
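# A possible vectorised version (the function name is an assumption): compute all the
# weighted sums at once with a matrix-vector product, then threshold against b.
def multiOutPerceptron2(X, w, b):
    return (np.dot(X, w) > b).astype(float)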

Learning Activity 4: Playing with weights and thresholds

Let’s try some weights and thresholds, and see what happens:


In [ ]:
# Try some initial weights and thresholds
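# The original values are not given; these are arbitrary values for illustration.
w = np.array([-1.0, 1.0])
b = 0.4
print(multiOutPerceptron(X, w, b))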

So this is clearly not great! It classifies the first point in one category and all the others in the other one. Let's try something else (an educated guess this time).


In [ ]:
# Try an "educated guess"

This is much better! To obtain these values, we found a separating hyperplane (here a line) between the points. The equation of the line is

$y = 0.5x - 0.2$

Quiz

  • Can you explain why this line corresponds to the weights and bias we used?
  • Is this separating line unique? What does that mean?

Can you check that the perceptron will indeed classify any point above the red line as a 1 (cross) and every point below as a 0 (circle)?

Learning Activity 5: Illustration of the output of the Perceptron and the separating line


In [ ]:
# Visualise the separating line
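# Plot the points together with the separating line y = 0.5x - 0.2 (in red);
# the plotting range is an arbitrary choice.
xs = np.linspace(-0.5, 1.5, 100)
plt.plot(X[y == 1, 0], X[y == 1, 1], 'x', markersize=10)
plt.plot(X[y == 0, 0], X[y == 0, 1], 'o', markersize=10)
plt.plot(xs, 0.5 * xs - 0.2, color='red')
plt.show()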

Now try adding new points to see how they are classified:


In [ ]:
# Add new points and test

Visualise the new test points in the graph and plot the separating lines.


In [ ]:
# Visualise the new points and line

Note here that the two sets of parameters classify the squares identically but not the triangle. You can now ask yourself: which of the two sets of parameters makes more sense? How would you classify that triangle? Points of this type are frequent in realistic datasets, and the question of how to classify them "accurately" is often very hard to answer...

Gradient Descent

Learning Activity 6: Coding a simple gradient descent

Definition of a function and its gradient

$f(x) = \exp(-\sin(x))x^2$

$f'(x) = -x \exp(-\sin(x)) (x\cos(x)-2)$

It is convenient to define Python functions which return the value of the function and its gradient at an arbitrary point $x$:


In [ ]:
def function(x):
    return np.exp(-np.sin(x))*(x**2)

def gradient(x):
    return -x*np.exp(-np.sin(x))*(x*np.cos(x)-2) # use wolfram alpha!

Let's see what the function looks like:


In [ ]:
# Visualise the function
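# Plot f on an arbitrary range, here [-10, 10]
xx = np.linspace(-10, 10, 500)
plt.plot(xx, function(xx))
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()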

Now let us implement a simple gradient descent that uses constant stepsizes. We define two functions: the first is the simplest version, which doesn't store the intermediate steps that are taken; the second does store the steps, which is useful for visualising what is going on and for explaining some of the typical behaviour of GD.


In [ ]:
def simpleGD(x0,stepsize,nsteps):
    x    = x0
    for k in range(0,nsteps):
        x -= stepsize*gradient(x)
    return x

def simpleGD2(x0,stepsize,nsteps):
    x    = np.zeros(nsteps+1)
    x[0] = x0
    for k in range(0,nsteps):
        x[k+1] = x[k]-stepsize*gradient(x[k])
    return x

Let's see what it looks like. We start from $x_0 = 3$, use a (constant) stepsize of $\delta=0.1$, and run for 100 steps.


In [ ]:
# Try the first given values
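# First case: x0 = 3, stepsize 0.1, 100 steps
x_final = simpleGD(3.0, 0.1, 100)
print(x_final)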

Simple inspection of the figure above shows that this is close to the true minimum ($x^\star=0$).

A few standard situations:


In [ ]:
# Try the second given values

OK, so that's still alright.


In [ ]:
# Try the third given values

This one is not... Visual inspection of the figure above shows that we got stuck in a local minimum.

Below we define a simple visualisation function to show where the GD algorithm takes us. It can safely be skipped on a first read.


In [ ]:
def viz(x, a=-10, b=10):
    # Plot the function on [a, b], the GD iterates (red line), the starting point
    # (green dot) and the final point (red dot).
    xx  = np.linspace(a, b, 100)
    yy  = function(xx)
    ygd = function(x)
    plt.plot(xx, yy)
    plt.plot(x, ygd, color='red')
    plt.plot(x[0], ygd[0], marker='o', color='green', markersize=10)
    plt.plot(x[-1], ygd[-1], marker='o', color='red', markersize=10)
    plt.show()

Let's show the steps that were taken in the various cases that we considered above:


In [ ]:
# Visualise the steps taken in the previous cases
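# Shown here for the first case only (x0 = 3, stepsize 0.1, 100 steps); the starting
# points of the other two cases are not given above, so they are omitted.
steps = simpleGD2(3.0, 0.1, 100)
viz(steps)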

To summarise these three cases:

  • In the first case, we start from a sensible point (not far from the optimal value $x^\star = 0$ and on a slope that leads directly to it) and we get to a very satisfactory point.
  • In the second case, we start from a less sensible point (on a slope that does not lead directly to it) and yet the algorithm still gets us to a very satisfactory point.
  • In the third case, we also start from a bad location, but this time the algorithm gets stuck in a local minimum.

Attacking MNIST

Learning Activity 7: Loading the Python libraries

Import statements for the Keras library


In [ ]:
from keras.datasets import mnist 
from keras.models import Sequential 
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD, RMSprop
from keras.utils import np_utils

# Some generic parameters for the learning process
batch_size = 100   # number of instances each noisy gradient will be evaluated upon
nb_classes = 10    # 10 classes 0-1-...-9
nb_epoch   = 10    # computational budget: 10 passes through the whole dataset

Learning Activity 8: Loading the MNIST dataset

Keras loads the data itself and shuffles it randomly. This is useful since the difficulty of the examples in the dataset is not uniform (the last examples are harder than the first ones).


In [ ]:
# Load the MNIST data
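# Keras downloads/caches MNIST and returns it already split into train and test sets
# (the variable names below are an assumption, matching images_train used further down)
(images_train, labels_train), (images_test, labels_test) = mnist.load_data()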

You can also display a sample from either the training or the test set using the imshow() function:


In [ ]:
# Display the first image
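# Show the first training image and its label (variable names as assumed above)
plt.imshow(images_train[0], cmap='gray')
plt.show()
print('label:', labels_train[0])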

OK, the label 5 does indeed seem to correspond to the digit shown! Let's check the dimensions of the dataset.

Learning Activity 9: Reshaping the dataset

Each image in MNIST has 28 by 28 pixels, which results in a $28\times 28$ array. As a next step, and prior to feeding the data into our NN classifier, we need to flatten each array into a $28\times 28 = 784$ dimensional vector. Each component of the vector holds an integer value between 0 (black) and 255 (white), which we need to normalise to the range 0 to 1.


In [ ]:
# Reshaping of vectors in a format that works with the way the layers are coded
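# Flatten each 28x28 image into a 784-vector, cast to float32 and scale to [0, 1]
# (variable names follow those assumed above and used further below)
images_train = images_train.reshape(images_train.shape[0], 784).astype('float32') / 255
images_test  = images_test.reshape(images_test.shape[0], 784).astype('float32') / 255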

Remember, it is always good practice to check the dimensionality of your train and test data using the shape attribute prior to constructing any classification model:


In [ ]:
# Check the dimensionality of train and test
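# Shapes after reshaping: (n_samples, 784) for the images, (n_samples,) for the labels
print(images_train.shape, labels_train.shape)
print(images_test.shape, labels_test.shape)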

So we have 60,000 training samples and 10,000 test samples, and each sample (instance) is originally a 28x28 array. We reshape these instances as vectors (of 784 = 28x28 components). For storage efficiency, the values of the components are stored as uint8; we cast them to float32 so that Keras can deal with them. Finally, we normalise the values to the 0-1 range.

The labels are stored as integer values from 0 to 9. We need to tell Keras that these form the output categories via the function to_categorical.


In [ ]:
# Set y categorical
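# One-hot encode the integer labels 0-9 into 10-dimensional category vectors
Y_train = np_utils.to_categorical(labels_train, nb_classes)
Y_test  = np_utils.to_categorical(labels_test, nb_classes)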

Learning Activity 10: Building a NN classifier

A neural network model consists of artificial neurons arranged in a sequence of layers. Each layer receives a vector of inputs and converts these into some output. The interconnection pattern is "dense", meaning each layer is fully connected to the previous one. Note that the first hidden layer needs to specify the size of the input, which amounts to implicitly having an input layer.


In [ ]:
# First, declare a model with a sequential architecture

# Then add a first layer with 500 nodes and 784 inputs (the pixels of the image)

# Define the activation function to use on the nodes of that first layer

# Second hidden layer with 300 nodes

# Output layer with 10 categories (+using softmax)
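# A possible sketch following the comments above (the hidden-layer activation is not
# specified in the text; 'relu' is assumed here, 'sigmoid' would also be a common choice).
model = Sequential()
model.add(Dense(500, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dense(300))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))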

Learning Activity 11: Training and testing of the model

Here we define a fairly standard optimizer for neural networks. It is based on Stochastic Gradient Descent (SGD) with a standard choice for the annealing (decay) of the learning rate.


In [ ]:
# Definition of the optimizer.
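# A common configuration (the exact values here are an assumption): SGD with a small
# learning rate, learning-rate decay and Nesterov momentum, compiled with the
# cross-entropy loss; metrics=['accuracy'] lets us read off the accuracy later.
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])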

Finding the right arguments here is non-trivial, but the choice suggested here will work well. The only parameter we will explain is the first one (the learning rate), which can be understood as an initial scaling of the gradients.

At this stage, launch the learning (fit the model). The model.fit function takes all the necessary arguments and trains the model. We describe below what these arguments are:

  • the training set (points and labels)
  • global parameters for the learning (batch size and number of epochs)
  • whether or not we want to show output during the learning
  • the test set (points and labels)

In [ ]:
# Fit the model
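# Train on the training set, reporting progress (verbose=1) and evaluating on the
# test set after each epoch. nb_epoch is the Keras 1.x argument name (newer versions
# use epochs); variable names as defined in the cells above.
model.fit(images_train, Y_train,
          batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(images_test, Y_test))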

Obviously we care far more about the results on the test set, since it is data that the NN has not used during its training. Good results on the test set mean the model generalises well to unseen data.


In [ ]:
# Display the results; the accuracy (over the test set) should be around 98%
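# (assuming the model was compiled with metrics=['accuracy'] as sketched above)
score = model.evaluate(images_test, Y_test, verbose=0)
print('Test loss:    ', score[0])
print('Test accuracy:', score[1])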

Bonus: Does it work?


In [ ]:
def whatAmI(img):
    # Print the network's score (in %) for each of the 10 digit classes
    score = model.predict(img, batch_size=1, verbose=0)
    for s in range(10):
        print('Am I a', s, '? -- score:', np.around(score[0][s]*100, 3))

In [ ]:
index = 1004 # here use anything between 0 and 9999
test  = np.reshape(images_train[index,],(1,784))
plt.imshow(np.reshape(test,(28,28)), cmap="gray")
whatAmI(test)

Does it work? (experimental, part 2)


In [ ]:
from scipy import misc

In [ ]:
test  = misc.imread('data/ex7.jpg')
test  = np.reshape(test,(1,784))
test  = test.astype('float32')
test /= 255.
plt.imshow(np.reshape(test,(28,28)), cmap="gray")
whatAmI(test)
