A neuron is an electrically excitable cell that processes and transmits information in the brain. Typically, a neuron is made of a cell body, dendrites, and an axon. Neurons in the brain are interconnected through junctions called synapses: synaptic signals from other neurons are received at the dendrites, while signals to other neurons are transmitted along the axon.
Perceptrons were first introduced in the 1950s and 1960s as artificial neurons.
The analogy between the two is easy to see: in an artificial neuron (perceptron), several inputs are combined (mimicking the role of the dendrites) to produce an output (mimicking the role of the axon).
A simple way to combine the inputs into a single output is to compute a weighted sum; comparing this sum with a given threshold gives the output.
\begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right. \end{eqnarray}We can actually replace the threshold by introducing what we call a bias term, $b \equiv -\mbox{threshold}$:
$$\begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \end{eqnarray}$$This will prove handy in the following.
It is quite easy to visualise the perceptron as a way of weighing information. Suppose we want to answer the question: will it rain tomorrow? We can gather some data about the weather today:
The perceptron combines all this information with varying importance (the weights) and eventually fires a yes or no answer.
The weights can be determined using existing data and minimising the prediction error on this data with gradient descent.
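To make this concrete, here is a minimal sketch of such a perceptron in Python. The inputs, weights and bias chosen for the "will it rain tomorrow?" question are made up for illustration rather than learned from data:
In [ ]:
import numpy as np

def perceptron(x, w, b):
    """ Fire 1 if the weighted sum of the inputs plus the bias is positive, 0 otherwise """
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative inputs: [cloudy today, humidity above 80%, pressure dropping]
x = np.array([1, 0, 1])
# Made-up weights and bias (in practice, they would be learned from data)
w = np.array([2., 1., 3.])
b = -4.
print(perceptron(x, w, b))  # prints 1: this perceptron predicts rain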
Why stop at a single neuron? After all, the brain is made up of about 100 billion neurons! We can combine several perceptrons together (see figure) in the hope of creating greater levels of abstraction, which will allow for more complex decisions.
In the following, we will see how the weights and biases in a complicated network can be tuned to adapt automatically to any data.
Recall that perceptrons take decisions based on hard thresholding. This makes the network difficult to tune. Let's give an example. Say we want to improve the network's performance on misclassified data points. We can make a small change to one of the weights and see if classification improves. However, with hard thresholding, small changes may completely alter the behaviour of the network, as the output may abruptly fall on the other side of the threshold. This problem is alleviated by the use of a continuous activation function, like the sigmoid (soft thresholding).
With the sigmoid, the output of the neuron becomes:
$$\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}. \end{eqnarray}$$The differences between hard and soft thresholding are illustrated below.
In fact, any function with a shape similar to the sigmoid (a smooth transition from 0 to 1) would suit our need for soft thresholding. It just so happens that the sigmoid has properties which make it particularly attractive: for instance, its derivative takes the simple form $\sigma'(z) = \sigma(z)(1-\sigma(z))$, which we will use when training the network.
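As a quick standalone illustration (not part of the network we build below), the snippet compares the perceptron's hard threshold with the sigmoid's soft threshold for weighted inputs $z = w \cdot x + b$ close to the decision boundary:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

# Small changes of z = w.x + b around 0 flip the hard threshold,
# but only nudge the sigmoid output
for z in [-0.1, -0.01, 0.01, 0.1]:
    hard = 1 if z > 0 else 0
    print("z = %5.2f   hard threshold = %d   sigmoid = %.3f" % (z, hard, sigmoid(z)))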
Now that we have the base unit of our network figured out, how do we build the full network? There are arbitrarily complicated ways of building a network (think feedback loops, arbitrary connections from one layer to another...). We are going to focus on a subclass of networks which will prove handy to parameterize: feedforward, fully connected neural networks.
Such networks are organised in layers, each unit of each layer being connected to each unit of the following layer.
From now on, we'll work with the MNIST dataset to illustrate our discussion. It is split into two parts: a training set (60 000 images, each represented as a 28*28 array of grayscale pixel values) and a test set (10 000 similar images). Each image corresponds to a single handwritten digit.
We will use a simple design: 3 layers (an input layer, one hidden layer and one output layer). One such network is shown below, with 15 cells in the hidden layer.
$x$ is the input, a 28*28 = 784-dimensional vector.
$y$ is the desired output, a 10-dimensional vector such that (1,0,0,0,0,0,0,0,0,0) represents 0, (0,1,0,0,0,0,0,0,0,0) represents 1, and so on.
$a$ is the 10-dimensional output of the neural network.
We will quantify the success of our model with the well-known squared loss:
$$\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \end{eqnarray}$$The reason behind this choice is that the squared loss is a smooth function of the parameters of the network (i.e. its weights and biases), which makes it easier to understand how small changes in the weights and biases affect the classification performance.
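As a quick sketch (with made-up targets $y$ and network outputs $a$ rather than real MNIST predictions), this cost can be computed for a small batch as follows:
In [ ]:
import numpy as np

def squared_loss(ys, outputs):
    """ C = 1/(2n) * sum_x || y(x) - a ||^2 over the n examples """
    n = len(ys)
    return sum(0.5 * np.sum((y - a) ** 2) for y, a in zip(ys, outputs)) / n

# Two made-up 10-dimensional targets and network outputs
y1, a1 = np.eye(10)[3], np.full(10, 0.1)   # one-hot target for a 3, poor almost-uniform output
y2, a2 = np.eye(10)[7], np.eye(10)[7]      # one-hot target for a 7, perfect output
print(squared_loss([y1, y2], [a1, a2]))    # 0.225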
In order to minimise the error $C$ identified above, we are going to use gradient descent.
In fact, we will use a slightly modified version of gradient descent to ease the computational burden. Indeed, vanilla gradient descent computes the gradient of $C$ using all data points at each update. If we have a large training set, this may prove intractable.
That's why we'll implement mini-batch stochastic gradient descent: at each update, the gradient is estimated on a small random subset (a mini-batch) of the training data.
The size of the batch can vary, with a small size meaning fast computations at the cost of a loss of accuracy in the computation of the gradient.
Let us recapitulate: our goal is to tailor the weights $w$ and the biases $b$ of the neural network so that we minimise the cost $C$.
Gradient descent helps us achieve this goal by providing a way to progressively update all $w$ and $b$ to their ideal values.
Mathematically:
$$ \begin{eqnarray} w & \rightarrow & w' = w-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w} \\ b & \rightarrow & b' = b-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b}, \end{eqnarray}$$where the sum runs over a mini-batch $X_1, \ldots, X_m$ of size $m$, and $\eta$ is the learning rate.
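Schematically, this update rule looks like the sketch below. The helper names (sgd, dCdw, dCdb) and the toy one-dimensional model are purely illustrative; the per-example gradients are passed in as functions, and for a real network they are exactly what backpropagation, discussed next, provides:
In [ ]:
import random
import numpy as np

def sgd(w, b, training_data, grad_w, grad_b, eta, batch_size, epochs):
    """ Mini-batch stochastic gradient descent: average the per-example gradients
    over a small random batch and move the parameters a step of size eta against them """
    for epoch in range(epochs):
        random.shuffle(training_data)
        for k in range(0, len(training_data), batch_size):
            batch = training_data[k:k + batch_size]
            m = float(len(batch))
            dw = sum(grad_w(w, b, x, y) for x, y in batch) / m
            db = sum(grad_b(w, b, x, y) for x, y in batch) / m
            w, b = w - eta * dw, b - eta * db
    return w, b

# Toy usage: fit y = 2x + 1 with a single linear unit and the squared loss
data = [(x, 2 * x + 1) for x in np.linspace(-1, 1, 100)]
dCdw = lambda w, b, x, y: (w * x + b - y) * x   # dC/dw for C = 0.5*(w*x + b - y)^2
dCdb = lambda w, b, x, y: (w * x + b - y)       # dC/db
w_fit, b_fit = sgd(0., 0., data, dCdw, dCdb, eta=0.5, batch_size=10, epochs=50)
print("w = %.3f, b = %.3f" % (w_fit, b_fit))    # close to w = 2, b = 1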
Let's move on to the more difficult bit.
The backpropagation algorithm provides us with an efficient way of computing the gradient of $C$. Let's see how.
$b^l_j$ is the bias of the jth neuron in the lth layer.
$a^l_j$ is the activation of the jth neuron in the lth layer.
$w^l_{jk}$ is the weight for the connection from the kth neuron in the (l−1)th layer to the jth neuron in the lth layer.
Let us write down a first equation for the output of a layer:
\begin{eqnarray} a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \end{eqnarray}This equation expresses the architecture of our network: the sum is on all the neurons of the previous layer, thereby making our network fully connected.
We will also use the shorthand notation $z^l_j \equiv \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j$ for the weighted input to a neuron (in vectorised form, $z^l \equiv w^l a^{l-1} + b^l$).
This immediately gives:
$$a^l = \sigma(z^l)$$Now that the notations are in place, recall that we want to compute the derivatives of $C$: $\hspace{2mm}\partial C / \partial b^l_j$ and $\partial C / \partial w^l_{jk}$.
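Before turning to these derivatives, note that with this notation a forward pass through the whole network is just a loop over layers. Here is a minimal sketch, with randomly initialised weights and biases for a hypothetical [784, 30, 10] architecture (the output is meaningless, but it has the right shape):
In [ ]:
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def forward_pass(x, weights, biases):
    """ Apply a^l = sigmoid(w^l a^(l-1) + b^l), layer after layer """
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return a

# Hypothetical [784, 30, 10] network with random parameters
sizes = [784, 30, 10]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]
x = np.random.randn(784, 1)
print(forward_pass(x, weights, biases).shape)  # (10, 1)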
Let's start with $\partial C / \partial b^l_j$. Using the chain rule, we write:
\begin{equation} \frac{\partial C}{\partial b^l_j} = \sum_k \frac{\partial C}{\partial z^l_k}\frac{\partial z^l_k}{\partial b^l_j } \end{equation}Since $z^l_k$ depends on $b^l_j$ only when $k = j$, and $\partial z^l_j / \partial b^l_j = 1$, this reduces to:
\begin{equation} \frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \equiv \delta^l_j \end{equation}where we have introduced the intermediate quantity $\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}$.
Let's compute it in two steps.
For the final layer $L$, using the chain rule,
\begin{eqnarray} \frac{\partial C}{\partial z^L_j} & = & \sum_i \frac{\partial C}{\partial a^L_i}\frac{\partial a^L_i}{\partial z^L_{j} }\\ & = & \frac{\partial C}{\partial a^L_j}\frac{\partial a^L_j}{\partial z^L_{j}} \\ & = & \frac{\partial C}{\partial a^L_j} \sigma^{\prime}(z^L_j) \end{eqnarray}where the sum reduces to the $i = j$ term because $a^L_i$ only depends on $z^L_i$. For any other layer $l$, using the chain rule:
\begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \end{eqnarray}Note that:
\begin{eqnarray} z^{l+1}_k & = & \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k\\ & = & \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k \end{eqnarray}hence:
\begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \end{eqnarray}thus:
\begin{eqnarray} \delta^l_j & = & \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j) \\ \delta^l & = & ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \nonumber \end{eqnarray}The last line expresses the result in vectorised form using the Hadamard (elementwise) product $\odot$.
All that's left is to compute $\partial C / \partial w^l_{jk}$. Once again, we make use of the chain rule:
\begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} & = & \sum_i \frac{\partial C}{\partial z^l_i}\frac{\partial z^l_i}{\partial w^l_{jk} }\\ & = & a^{l-1}_k \delta^l_j \end{eqnarray}since only $z^l_j$ depends on $w^l_{jk}$, with $\partial z^l_j / \partial w^l_{jk} = a^{l-1}_k$. The formulas we established guide us towards the backpropagation algorithm: for each training example, (1) perform a forward pass to compute all the $z^l$ and $a^l$, (2) compute $\delta^L$ in the output layer, (3) propagate the error backwards with $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$, and (4) read off the gradients $\partial C / \partial b^l_j = \delta^l_j$ and $\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$.
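Before looking at the full implementation below, here is a compact, single-example sketch of the formulas we just derived, in vectorised form. The sizes, weights and biases are random placeholders, and the derivative $\partial C / \partial a^L = a^L - y$ corresponds to the quadratic cost chosen above:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def dsigmoid(z):
    return sigmoid(z) * (1. - sigmoid(z))

def backprop(x, y, weights, biases):
    """ Return (dC/db, dC/dw) for a single example, following the formulas above """
    # Forward pass, storing every z^l and a^l
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # delta^L = dC/da^L * sigma'(z^L), with dC/da^L = a^L - y for the quadratic cost
    delta = (activations[-1] - y) * dsigmoid(zs[-1])
    dCdb, dCdw = [delta], [np.dot(delta, activations[-2].T)]
    # delta^l = ((w^(l+1))^T delta^(l+1)) * sigma'(z^l), moving backwards through the layers
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * dsigmoid(zs[-l])
        dCdb.insert(0, delta)
        dCdw.insert(0, np.dot(delta, activations[-l - 1].T))
    return dCdb, dCdw

# Random placeholders for a [784, 30, 10] network and a single training example
sizes = [784, 30, 10]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]
x = np.random.randn(784, 1)
y = np.zeros((10, 1)); y[3] = 1.   # one-hot target for the digit 3
dCdb, dCdw = backprop(x, y, weights, biases)
print([dw.shape for dw in dCdw])   # [(30, 784), (10, 30)]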
The code below is an implementation of a neural network trained by backpropagation, in plain Python and NumPy.
In [5]:
# Let's start by importing relevant packages :
import numpy as np
import os
import cPickle as pickle
import gzip
import subprocess
In [10]:
# Define a function to unzip the mnist data and load it
def get_mnist_data():
# Unzip and load the data set
f = gzip.open("../data/mnist.pkl.gz", "rb")
train, val, test = pickle.load(f)
f.close()
return train, val, test
# Format the mnist target function
def format_mnist(y):
""" Convert the 1D to 10D """
# convert to 10 categeories
y_new = np.zeros((10, 1))
y_new[y] = 1
return y_new
# Let's load and format the data for our neural network
train, valid, test = get_mnist_data()
training_inputs = [np.reshape(x, (784, 1)) for x in train[0]]
training_results = [format_mnist(y) for y in train[1]]
training_data = zip(training_inputs, training_results)
test_fm = []
for i in range(len(test[0])):
test_fm.append((test[0][i], test[1][i]))
test = test_fm
In [11]:
# Let's check the dimensions of our data
print len(training_data), len(test)
In [12]:
# Neural Network code per se
# All documentation is inline
class ActivationFunction(object):
""" Activation function class
We define the activation function and its derivative.
Only one choice : the sigmoid.
Feel free to add more !
"""
@staticmethod
def sigmoid():
return lambda x: 1. / (1 + np.exp(-x))
@staticmethod
def dsigmoid():
return lambda x: 1. / (1 + np.exp(-x)) * (1 - 1. / (1 + np.exp(-x)))
class CostFunction(object):
""" Cost function class
We define one function : the mse
Feel free to add your own (like the cross entropy)
"""
@staticmethod
def mse():
return lambda y,a: 0.5 * np.power(y-a, 2)
@staticmethod
def dmse():
return lambda y,a: a - y
class BasicNeuralNet(object):
"""
Neural network class
Implemented in python
Main parts :
__init__ = initialisation of the neural networks
the weights and biases are randomly initialised
_forward_pass() = operates the feedforward pass discussed above
_backward_pass() = operates the backpropagation pass discussed above
_update_gradient() = computes dCdw and dCdb for a mini batch as explained above
fit_SGD() = fit the neural network to the data
        it loops over epochs, creates mini-batches of the data
and minimises the cost function by gradient descent
score() = evaluate the classification performance on specified test_data
"""
def __init__(
self,
sizes,
lmbda = 0,
actfuncname="sigmoid",
costfuncname="mse",
verbose=False):
self._nlayers = len(sizes)
self._sizes = sizes
self._lmbda = lmbda
# Random initialisation of biases and weights.
        # For the weights, use a Gaussian with std = 1/sqrt(# of weights connecting to a neuron)
        # so that, by the CLT, the weighted input to each neuron has std of order 1
# Add [0] for clearer indexing
self._biases = [np.array([0])] + [np.random.randn(size, 1)
for size in self._sizes[1:]]
self._weights = [np.array([0])] + [np.random.randn(self._sizes[i], self._sizes[i - 1])/ \
np.sqrt(self._sizes[i-1])
for i in range(1, self._nlayers)]
# Initialisation of z
self._z = [np.array([0])] + [np.zeros((size, 1))
for size in self._sizes[1:]]
# Activation function
self._actfuncname = actfuncname
if self._actfuncname == "sigmoid":
self._actfunc = ActivationFunction.sigmoid()
self._dactfunc = ActivationFunction.dsigmoid()
# Cost function
self._costfuncname = costfuncname
if self._costfuncname == "mse":
self._costfunc = CostFunction.mse()
self._dcostfunc = CostFunction.dmse()
def _forward_pass(self, x):
# Initialisation of activation matrix
self._a = [x]
for layer in range(1, self._nlayers):
self._z[layer] = np.dot(self._weights[layer], self._a[layer-1]) \
+ self._biases[layer]
a = self._actfunc(self._z[layer])
self._a.append(a)
# For scoring
return self._a[-1]
def _backward_pass(self, y):
# Initialisation of error matrix
delta_L = self._dcostfunc(y, self._a[-1]) \
* self._dactfunc(self._z[-1])
self._delta = [delta_L]
for layer in range(1, self._nlayers - 1)[::-1]:
            # self._delta[0] is always the delta of layer + 1, since deltas are prepended as we go backwards
            delta_l = np.dot(
                self._weights[layer + 1].T, self._delta[0]) \
                * self._dactfunc(self._z[layer])
self._delta = [delta_l] + self._delta
self._delta = [np.array([0])] + self._delta
def _update_gradient(self, batch, n_training):
n = len(batch)
# Initialise derivative of cost wrt bias and weights
dCdb = [np.array([0])] + [np.zeros((size,1)) for size in self._sizes[1:]]
dCdw = [np.array([0])] + [np.zeros((self._sizes[i], self._sizes[i - 1]))
for i in range(1, self._nlayers)]
# Loop over batch
for X, y in batch:
self._forward_pass(X)
self._backward_pass(y)
# Loop over layers
for layer in range(1, self._nlayers):
dCdb[layer] += self._delta[layer]/float(n)
dCdw[layer] += np.dot(self._delta[layer], self._a[layer - 1].T)/float(n) + self._lmbda * self._weights[layer]/float(n_training)
return dCdb, dCdw
def fit_SGD(self, training_data, learning_rate, batch_size, epochs, test_data = None):
n_samples = len(training_data)
# Loop over epochs
for ep in range(epochs):
#Shuffle data
np.random.shuffle(training_data)
for k in xrange(0, n_samples, batch_size):
batch = training_data[k:k+batch_size]
dCdb, dCdw = self._update_gradient(batch, n_samples)
# Update bias and weights
self._biases = [self._biases[layer] - learning_rate * dCdb[layer]
for layer in range(self._nlayers)]
self._weights = [self._weights[layer] - learning_rate * dCdw[layer]
for layer in range(self._nlayers)]
print "Epoch %s:" %ep, self.score(test_data), "/", len(test_data)
def score(self, test_data):
""" Score """
test_results = [(np.argmax(self._forward_pass(np.reshape(x, (len(x), 1)))), y)
for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)
In [13]:
# Define the parameters of the neural network
d_NN = {"sizes": [784, 30, 10],
"actfuncname": "sigmoid",
"costfuncname": "mse",
"batch_size": 10,
"learning_rate": 3,
"epochs": 30,
"lambda":0,
"verbose": True}
# Set the seed for reproducibility
np.random.seed(10)
As advertised above, we use the sigmoid as our activation function and the mean squared error as our cost function. The other important parameters are: the network architecture (sizes = [784, 30, 10], i.e. 784 input units, 30 hidden units and 10 output units), the mini-batch size (10), the learning rate (3), the number of epochs (30) and the regularisation parameter lambda (0, i.e. no regularisation here).
In [14]:
# Create NN model and fit
NeuralNet = BasicNeuralNet(
d_NN["sizes"],
lmbda = d_NN["lambda"],
actfuncname=d_NN["actfuncname"],
costfuncname=d_NN["costfuncname"],
verbose=d_NN["verbose"])
NeuralNet.fit_SGD(
training_data,
d_NN["learning_rate"],
d_NN["batch_size"],
d_NN["epochs"],
test_data=test)
# This may take a while to train ...
You can now play around with the parameters and see whether you can improve the classification accuracy! (Change the number of layers, the number of units, the learning_rate...)