To build intuition about neural networks, let us code an elementary perceptron. In this example we will illustrate some of the concepts we have seen, build a small perceptron, and link the perceptron to linear classification.
Before working with the MNIST dataset, you'll first test your perceptron implementation on a "toy" dataset with just a few data points. This allows you to test your implementations with data you can easily inspect and visualise without getting lost in the complexities of the dataset itself.
Start by loading two basic libraries: matplotlib for plotting graphs (http://matplotlib.org/contents.html) and numpy for numerical computing with vectors, matrices, etc. (http://docs.scipy.org/doc/).
In [ ]:
# Load the libraries
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Then let us generate some points in 2-D that will form our dataset:
In [ ]:
# Create some data points
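One possible toy dataset is sketched below. The coordinates and labels are hypothetical values chosen only for illustration (three crosses, three circles), not the points used originally in this notebook; the names `X` and `y` are assumptions as well.

```python
import numpy as np

# Hypothetical toy dataset: six points in 2-D (illustrative values only)
X = np.array([
    [0.0, 1.0],    # class 1 (cross)
    [1.0, 1.5],    # class 1 (cross)
    [2.0, 2.0],    # class 1 (cross)
    [1.0, -1.0],   # class 0 (circle)
    [2.0, 0.0],    # class 0 (circle)
    [3.0, 0.5],    # class 0 (circle)
])

# Labels: 1 = cross, 0 = circle (one label per row of X)
y = np.array([1, 1, 1, 0, 0, 0])

print(X.shape, y.shape)
```

Storing one point per row means that `X[i, :]` is a single input and `X.shape[0]` is the number of instances.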
Let's visualise these points in a scatterplot using the plot function from matplotlib:
In [ ]:
# Visualise the points in a scatterplot
Here, imagine that the purpose is to build a classifier that for a given new point will return whether it belongs to the crosses (class 1) or circles (class 0).
Let’s now define a function which returns the output of a Perceptron for a single input point.
In [ ]:
# Now let's build a perceptron for our points
def outPerceptron(x,w,b):
    innerProd = np.dot(x,w) # computes the weighted sum of input
    output = 0
    if innerProd > b:
        output = 1
    return output
It’s useful to define a function which returns the sequence of outputs of the Perceptron for a sequence of input points:
In [ ]:
# Define a function which returns the sequence of outputs for a sequence of input points
def multiOutPerceptron(X,w,b):
    nInstances = X.shape[0]
    outputs = np.zeros(nInstances)
    for i in range(0,nInstances):
        outputs[i] = outPerceptron(X[i,:],w,b)
    return outputs
In [ ]:
# Optimise the multiOutPerceptron function
In the above implementation, the simple outPerceptron function is called for every single instance. It is cleaner and more efficient to code everything in one function using matrices.
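A minimal sketch of such a matrix-based version follows. The function name `multiOutPerceptronVec` and the sample inputs are assumptions made for illustration; the decision rule (output 1 when the weighted sum exceeds `b`) is the one used above.

```python
import numpy as np

def multiOutPerceptronVec(X, w, b):
    # One matrix-vector product gives the weighted sum for every row of X
    innerProds = X.dot(w)
    # Compare all sums to the threshold at once and cast booleans to 0./1.
    return (innerProds > b).astype(float)

# Hypothetical inputs, chosen only to exercise the function
X_demo = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
w_demo = np.array([1.0, -1.0])
b_demo = 0.5

outputs = multiOutPerceptronVec(X_demo, w_demo, b_demo)
print(outputs)  # weighted sums are -1, 2, 0, so the outputs are [0. 1. 0.]
```

The loop over instances disappears because the comparison `innerProds > b` is applied element-wise to the whole vector of weighted sums.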
Let’s try some weights and thresholds, and see what happens:
In [ ]:
# Try some initial weights and thresholds
So this is clearly not great! It puts the first point in one category and all the others in the other one. Let's try something else (an educated guess this time).
In [ ]:
# Try an "educated guess"
This is much better! To obtain these values, we found a separating hyperplane (here a line) between the points. The equation of the line is
$y = 0.5x - 0.2$
Quiz
Can you check that the perceptron will indeed classify any point above the red line as a 1 (cross) and every point below as a 0 (circle)?
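One way to approach the quiz numerically is sketched below. It assumes the weights $w = (-0.5, 1)$ and threshold $b = -0.2$, which is one possible parameter choice for this line (not necessarily the values used in the cell above): the condition $-0.5x + y > -0.2$ is the same as $y > 0.5x - 0.2$.

```python
import numpy as np

def outPerceptron(x, w, b):
    # Same rule as in the text: output 1 when the weighted sum exceeds b
    return 1 if np.dot(x, w) > b else 0

# Assumed parameters: -0.5*x + y > -0.2 is equivalent to y > 0.5*x - 0.2
w = np.array([-0.5, 1.0])
b = -0.2

# Random test points (x, y) in the square [-5, 5] x [-5, 5]
rng = np.random.default_rng(0)
points = rng.uniform(-5, 5, size=(1000, 2))

# Geometric test: is the point strictly above the line?
above = points[:, 1] > 0.5 * points[:, 0] - 0.2

# Perceptron output for every point
predicted = np.array([outPerceptron(p, w, b) for p in points])

agree = np.array_equal(predicted, above.astype(int))
print(agree)
```

A random check like this is not a proof, but the algebraic rearrangement given before the code is.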
In [ ]:
# Visualise the separating line
Now try adding new points to see how they are classified:
In [ ]:
# Add new points and test
Visualise the new test points in the graph and plot the separating lines.
In [ ]:
# Visualise the new points and line
Note here that the two sets of parameters classify the squares identically but not the triangle. You can now ask yourself: which of the two sets of parameters makes more sense? How would you classify that triangle? These types of points are frequent in realistic datasets, and the question of how to classify them "accurately" is often very hard to answer.
Definition of a function and its gradient
$f(x) = \exp(-\sin(x))x^2$
$f'(x) = -x \exp(-\sin(x)) (x\cos(x)-2)$
It is convenient to define Python functions which return the value of the function and its gradient at an arbitrary point $x$:
In [ ]:
def function(x):
    return np.exp(-np.sin(x))*(x**2)

def gradient(x):
    return -x*np.exp(-np.sin(x))*(x*np.cos(x)-2) # use wolfram alpha!
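A quick sanity check of the analytic gradient is to compare it with a central finite difference. The sketch below does this on a few sample points; the step `h`, the grid `xs` and the variable names are assumptions made for illustration.

```python
import numpy as np

def function(x):
    return np.exp(-np.sin(x)) * (x**2)

def gradient(x):
    return -x * np.exp(-np.sin(x)) * (x * np.cos(x) - 2)

# Sample points and finite-difference step (illustrative choices)
xs = np.linspace(-5, 5, 11)
h = 1e-6

# Central difference approximation of f'(x)
numeric = (function(xs + h) - function(xs - h)) / (2 * h)

# Largest disagreement between the numerical and analytic gradients
max_err = np.max(np.abs(numeric - gradient(xs)))
print(max_err)
```

A small `max_err` suggests the formula for $f'(x)$ was transcribed correctly.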
Let's see what the function looks like
In [ ]:
# Visualise the function
Now let us implement a simple Gradient Descent (GD) that uses a constant stepsize. We define two functions. The first is the simplest version, which does not store the intermediate steps. The second stores every step, which is useful to visualise what is going on and to explain some of the typical behaviour of GD.
In [ ]:
def simpleGD(x0,stepsize,nsteps):
    x = x0
    for k in range(0,nsteps):
        x -= stepsize*gradient(x)
    return x

def simpleGD2(x0,stepsize,nsteps):
    x = np.zeros(nsteps+1)
    x[0] = x0
    for k in range(0,nsteps):
        x[k+1] = x[k]-stepsize*gradient(x[k])
    return x
Let's see what it looks like. Let's start from $x_0 = 3$, use a (constant) stepsize of $\delta=0.1$ and let's go for 100 steps.
In [ ]:
# Try the first given values
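A minimal, self-contained sketch of this first run is given below, using the values stated in the text ($x_0 = 3$, stepsize $0.1$, 100 steps). The variable name `x_final` is an assumption.

```python
import numpy as np

def gradient(x):
    return -x * np.exp(-np.sin(x)) * (x * np.cos(x) - 2)

def simpleGD(x0, stepsize, nsteps):
    # Constant-stepsize gradient descent, keeping only the current iterate
    x = x0
    for k in range(nsteps):
        x -= stepsize * gradient(x)
    return x

# Values from the text: start at 3, stepsize 0.1, 100 steps
x_final = simpleGD(3.0, 0.1, 100)
print(x_final)
```

On the interval $(0, 3]$ the gradient is positive, so each step moves the iterate towards 0 without crossing it.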
Simple inspection of the figure above shows that the result is close enough to the true minimum ($x^\star=0$).
A few standard situations:
In [ ]:
# Try the second given values
OK, so that's still alright.
In [ ]:
# Try the third given values
This time it is not alright: visual inspection of the figure above shows that we got stuck in a local optimum.
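The sketch below reproduces this kind of behaviour. The starting point $x_0 = 6$, the stepsize $0.01$ and the 500 steps are hypothetical values picked for illustration, not necessarily those used in the cell above. From this start, GD settles near $x \approx 7.59$, where $x\cos(x) = 2$ and the gradient vanishes, instead of reaching the global minimum at $x^\star = 0$.

```python
import numpy as np

def function(x):
    return np.exp(-np.sin(x)) * (x**2)

def gradient(x):
    return -x * np.exp(-np.sin(x)) * (x * np.cos(x) - 2)

def simpleGD(x0, stepsize, nsteps):
    x = x0
    for k in range(nsteps):
        x -= stepsize * gradient(x)
    return x

# Hypothetical run: start to the right of the local maximum near x = 5.1
x_local = simpleGD(6.0, 0.01, 500)

print(x_local)            # final iterate, far from the global minimum at 0
print(function(x_local))  # function value there, well above f(0) = 0
```

The gradient is (numerically) zero at the final iterate, so constant-stepsize GD has no reason to move away from it even though a much lower value exists elsewhere.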
Below we define a simple visualisation function to show where the GD algorithm takes us. Its details can be skipped.
In [ ]:
def viz(x,a=-10,b=10):
    xx = np.linspace(a,b,100)
    yy = function(xx)
    ygd = function(x)
    plt.plot(xx,yy)
    plt.plot(x,ygd,color='red')
    plt.plot(x[0],ygd[0],marker='o',color='green',markersize=10)
    plt.plot(x[len(x)-1],ygd[len(x)-1],marker='o',color='red',markersize=10)
    plt.show()
Let's show the steps that were taken in the various cases that we considered above
In [ ]:
# Visualise the steps taken in the previous cases
To summarise these three cases:
Import statements for the Keras library
In [ ]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.optimizers import SGD, RMSprop
from keras.utils import np_utils
# Some generic parameters for the learning process
batch_size = 100 # number of instances each noisy gradient will be evaluated upon
nb_classes = 10 # 10 classes 0-1-...-9
nb_epoch = 10 # computational budget: 10 passes through the whole dataset
Keras does the loading of the data itself and shuffles the data randomly. This is useful since the difficulty of the examples in the dataset is not uniform (the last examples are harder than the first ones)
In [ ]:
# Load the MNIST data
You can also depict a sample from either the training or the test set using the imshow() function:
In [ ]:
# Display the first image
OK, the label 5 does indeed seem to correspond to that number! Let's check the dimensions of the dataset.
Each image in MNIST has 28 by 28 pixels, which results in a $28\times 28$ array. As a next step, and prior to feeding the data into our NN classifier, we need to flatten each array into a $28\times 28 = 784$ dimensional vector. Each component of the vector holds an integer value between 0 (black) and 255 (white), which we need to normalise to the range between 0 and 1.
In [ ]:
# Reshaping of vectors in a format that works with the way the layers are coded
Remember, it is always good practice to check the dimensionality of your train and test data using the shape command prior to constructing any classification model:
In [ ]:
# Check the dimensionality of train and test
So we have 60,000 training samples and 10,000 test samples, and each sample (instance) is a 28x28 array. We need to reshape these instances as vectors (of 784=28x28 components). For storage efficiency, the values of the components are stored as uint8; we need to cast them to float32 so that Keras can deal with them. Finally, we normalise the values to the 0-1 range.
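The three steps (reshape, cast, normalise) can be sketched with NumPy alone. The array `images` below is a small random stand-in for the MNIST images, introduced only so that the snippet runs on its own; the variable names are assumptions.

```python
import numpy as np

# Hypothetical stand-in for the image data: 5 images of 28x28 uint8 pixels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# 1. Flatten each 28x28 array into a vector of 784 components
flat = images.reshape(images.shape[0], 28 * 28)

# 2. Cast from uint8 to float32
flat = flat.astype('float32')

# 3. Normalise pixel values from [0, 255] to [0, 1]
flat /= 255.0

print(flat.shape, flat.dtype, flat.min(), flat.max())
```

The cast has to come before the division: dividing a uint8 array in place by a float is not allowed, since the result cannot be stored back as integers.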
The labels are stored as integer values from 0 to 9. We need to tell Keras that these form the output categories via the function to_categorical.
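The result of this conversion is a "one-hot" encoding: each label becomes a vector of 10 entries with a single 1 at the position of the class. The sketch below builds the same kind of encoding with NumPy only. It illustrates the output format and is not Keras's own implementation; the sample labels are hypothetical.

```python
import numpy as np

# Hypothetical integer labels between 0 and 9
labels = np.array([5, 0, 4, 1, 9])
nb_classes = 10

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing the identity with the labels encodes all of them at once
one_hot = np.eye(nb_classes, dtype='float32')[labels]

print(one_hot.shape)
print(one_hot[0])  # label 5: a single 1 at index 5
```

Taking `one_hot.argmax(axis=1)` recovers the original integer labels.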
In [ ]:
# Set y categorical
A neural network model consists of artificial neurons arranged in a sequence of layers. Each layer receives a vector of inputs and converts it into some output. The interconnection pattern is "dense", meaning that each layer is fully connected to the previous one. Note that the first hidden layer needs to specify the size of the input, which amounts to implicitly having an input layer.
In [ ]:
# First, declare a model with a sequential architecture
# Then add a first layer with 500 nodes and 784 inputs (the pixels of the image)
# Define the activation function to use on the nodes of that first layer
# Second hidden layer with 300 nodes
# Output layer with 10 categories (+using softmax)
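To make the layer sizes concrete, here is a NumPy-only sketch of what a forward pass through such a dense 784-500-300-10 architecture computes. It is not the Keras model itself: the weights are random and untrained, and the ReLU activation for the hidden layers is an assumption, since the text does not say which activation is used. All names are hypothetical.

```python
import numpy as np

def relu(z):
    # Assumed hidden-layer activation (not specified in the text)
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes from the text: 784 inputs, 500 and 300 hidden nodes, 10 outputs
sizes = [784, 500, 300, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.01, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    # Hidden layers: dense (matrix product plus bias) followed by activation
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    # Output layer: dense followed by softmax, giving one score per class
    return softmax(h @ weights[-1] + biases[-1])

batch = rng.random((4, 784))  # 4 hypothetical flattened images
probs = forward(batch)
print(probs.shape, probs.sum(axis=1))
```

"Dense" shows up here as the full weight matrices: every one of the 784 inputs is connected to every one of the 500 nodes of the first hidden layer, and so on.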
In [ ]:
# Definition of the optimizer.
Finding the right arguments here is non-trivial, but the choice suggested here will work well. The only parameter we can explain here is the first one, which can be understood as an initial scaling of the gradients.
At this stage, launch the learning (fit the model). The model.fit function takes all the necessary arguments and trains the model. We describe below what these arguments are:
In [ ]:
# Fit the model
Obviously we care far more about the results on the validation set, since it is data that the NN has not used for its training. Good results on the test set mean the model is robust.
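For reference, classification accuracy is simply the fraction of samples whose highest-scoring class matches the true label. The sketch below computes it with NumPy on a tiny hypothetical example (4 samples, 3 classes); the scores and labels are made up for illustration and are not results from the model.

```python
import numpy as np

# Hypothetical class scores: one row per sample, one column per class
scores = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.2, 0.5, 0.3]])
true_labels = np.array([1, 0, 2, 2])

# Predicted class = index of the largest score in each row
predicted = scores.argmax(axis=1)

# Accuracy = fraction of correct predictions (3 out of 4 here)
accuracy = np.mean(predicted == true_labels)
print(accuracy)  # 0.75
```

The same computation applies to the 10-class MNIST scores, with one row per test image.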
In [ ]:
# Display the results; the accuracy (over the test set) should be around 98%
In [ ]:
def whatAmI(img):
    score = model.predict(img,batch_size=1,verbose=0)
    for s in range(0,10):
        print('Am I a ', s, '? -- score: ', np.around(score[0][s]*100,3))
In [ ]:
index = 1004 # here use anything between 0 and 9999
test = np.reshape(images_train[index,],(1,784))
plt.imshow(np.reshape(test,(28,28)), cmap="gray")
whatAmI(test)
In [ ]:
from scipy import misc
In [ ]:
test = misc.imread('data/ex7.jpg')
test = np.reshape(test,(1,784))
test = test.astype('float32')
test /= 255.
plt.imshow(np.reshape(test,(28,28)), cmap="gray")
whatAmI(test)
In [ ]: