A neural network from first principles

The code below was adpated from the code supplied in Andrew Ng's Coursera course on machine learning. The original code was written in Matlab/Octave and in order to further my understanding and enhance my Python skills, I ported it over to Python.

To start, various libraries are imported.

Specifically, the fmin_l_bfgs_b function will be used to optimize the cost function.


In [3]:
# Import libraries
import numpy as np
import scipy.io
import matplotlib.pyplot as plt
import math
from scipy.optimize import fmin_l_bfgs_b
from sklearn.metrics import accuracy_score
import pickle

First, we load the data. For details, please see the accompanying notebook MNIST-loader.ipynb for details.


In [4]:
# Load data
with open('./data/pickled/xtrain.pickle', 'rb') as f:
    xtrain = pickle.load(f)

with open('./data/pickled/ytrain.pickle', 'rb') as f:
    ytrain = pickle.load(f)

with open('./data/pickled/xtest.pickle', 'rb') as f:
    xtest = pickle.load(f)

with open('./data/pickled/ytest.pickle', 'rb') as f:
    ytest = pickle.load(f)

with open('./data/pickled/xval.pickle', 'rb') as f:
    xval = pickle.load(f)

with open('./data/pickled/yval.pickle', 'rb') as f:
    yval = pickle.load(f)

Now let's define some useful functions for the neural network to use. First is the sigmoid activation function:


In [5]:
# Sigmoid function
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

The neural network

Below is the real meat of this notebook: the neural network funcion. The function, as defined, takes only one argument: theta. These are the weights as they are known in the moment when the function is called. At the start, these will be randomly initialized, however as the network learns these values will be changed and improved in order to minimize the cost function.

Let's walk through the basics. Firstly, the weights are provided in a "rolled" format, in other words instead of two matrices of weights, we have a single long list of weights. The first job is to "unroll" the weights into two matrices that can be efficiently used by the numpy matrix multiplication methods. To do this, we take the first (n+1)hidden (i.e. 40120 = 820) values, and reshape them into a 20X401 matrix. Next, we take the remaining 210 values (i.e. classes(hidden+1)) and reshape them into a 10X21 matrix. The n+1 and hidden+1* take into account the bias term, which I'll discuss below.

Forward propagation

Next, we perform forward propagation. This is relatively simple: we multiply the input values in the layer by the weights for that layer, sum the total and add the bias term, and then and apply the sigmoid function. Recall from the reading in week one:

$$ z = w.x+b $$$$ a = \sigma(z) = \frac{1}{(1+e^{-z})} $$

Since w.x is the dot product, it implies the sum. Basic matrix multiplication says that when multiplying two matrices that have the same internal dimension (ie number of columns in matrix one is the same as number of rows in matrix two), each element in row i of matrix one is multiplied by each element in column i of matrix two, and all of those products are summed. This value goes into the first row of the resulting matrix. Subsequently, the same is repeated for the remaining columns of the second matrix, and the first row of the output matrix is completed. This process goes on for all remaining rows in the first matrix.

In our example, the first matrix contains MNIST images in a row-wise format. The second matrix contains the weights for each connection between the input layers and the hidden layers. So following the logic from above, the first row of the input matrix (i.e. the first image in the data set) is multiplied by each of 10 sets of weights (10 columns in the weights matrix), one for each hidden layer. Because it's matrix mulitplication, all of these products are automatically summed.

A quick note about bias

If you look at the code below (and elsewhere in this notebook) you'll find a number of n+1's and hidden+1's, etc. These account for bias, the b term in the equation above. Every time forward propagation is run, and extra column of ones is appended onto the end of the matrix (these are not a part of the actual data). When the weights are randomly initialized, they too have an extra weight included for this bias term (i.e. the dimensions are n+1Xhidden). These two values, bias in the input matrix and bias in the weights matrix, are multiplied during matrix multiplication and their product is added to the total sum for that neuron. Because the value for bias in the input matrix is always 1, the actual value of the bias is thus coded in the weights and can be learned just like a regular weight.

So to sum it all up, for each connection between a node in the input (i.e. a feature, a pixel in the image) and a node in the hidden layer, the input value is multiplied by the weight of each connection and these products for all features are added. To incorporate bias, we include an extra input value of 1 and multiply is by it's own weight. The sigmoid function is applied to this sum, and generates the value of the hidden layer for this particular data point.

Continuing on with forward propagation

Now we have the values of the hidden layer, we repeat this process once again with the weights for the connections between the hidden layer and the output layer. Nothing changes here, except for the sizes of the matrices. Recall that we had n input nodes and, say, 20 hidden layer nodes. That means we had n+1 weights (adding 1 for the bias term), so here we will have hidden+1 weights.

At the end of the second forward propagation, we will have a matrix with a row for each example in the data set and a column for each output class in the neural network (i.e. 10). The columns will contain the value the neural network determined for each class. If the network learned how to identify handwritten digits, the highest of these values will correspond with the correct output. At this point, however, our network has done no learning so we wouldn't expect anything better than a random guess (since the weights were randomly initialized!)

The cost function

Next comes the cost function. Here, we implement the cross entropy cost function with weight decay or L2 regularization. I implemented this in two lines for clarity's sake. First, the unregularized cost is determined and subsequently the regularization term is added.

And finally, back propagation

First, we find the difference between the output values in the 10 classes and the real value. In this case, the real value for each of the 10 possible digits is a vector or length 10, with a 1 in the position representing the number in question and 0's everywhere else. For example, the number 3 is represented by [0 0 0 1 0 0 0 0 0 0]. Since the outputs are from sigmoid neurons, the values will be between 0 and 1 with the highest value indicating the number the model predicted. Sticking with the above example, we might expect our model to output something like [0.1 0.2 0.1 0.6 0.1 0.2 0.2 0.1 0.2 0.3]. Subtracting these two will give a measure of the error. The larger the value, the more incorrect that class prediction was.

To perform backpropagation, first we find the delta for the final layer (in this case, d3). This is simply the actual value (which is one-hot encoded) subtracted from the neural networks prediction for that value.

Next, we multiple the error from layer 3 by the weights that produced layer three's activation values. This is a matrix multiplication which automatically sums the totals. In this case, the matrices have the dimensions [3750X10] and [10X11], for a dot product of [3750X11]. We can simply perform an elementwise multiplication with the derivative of the sigmoid function with respect to the activations in layer 2 to get the error at layer 2.

Since we only have three layers here, we're done. There is no error on layer 1 since this was the input layer. We can't find the error on the raw values that are input to the network!

Typically, now we would use these two delta values to update the weights and biases and then run the network again. Rinse and repeat until the cost function is appreciably minimized. But instead of going that route, I decided to implement an optimizer from the scipy package, specifically fmin_l_bfgs_b, which requires gradients of the objective function and a set of initial weights. As such, instead of actually updating the weights, we'll simply have the function return rate of change of the cost with respect to the biases and weights. The optimizer will handle the process of actually updating the weights (more on this later).

Wrapping this function up

As you can see, the nn function takes in a set of weights, performs forward propagation to predict the output, calculates the regularized cost using cross entropy and L2 regularization, and then performs backpropagation to determine the rate of change of the cost function with respect to the biases and weights. The gradients are rolled into a vector which can be passed directly to the optimizer. We skipped past this at the beginning of the function, but you can see that the first thing done in the function is to unroll the weights and reshape them into matrices for efficient matrix multiplication during forward and backpropagation.


In [6]:
def nn(weights,x,y):
    
    n = x.shape[0]
    bias = np.ones((n,1))

    # Unroll weights from a single vector
    w2 = weights[:(m+1)*hidden].reshape((hidden,(m+1))) # [Hiddden X m + 1]
    w3 = weights[(m+1)*hidden:].reshape((classes,(hidden+1))) # [classes X Hidden + 1]

    ### Forward propagation ###

    a1 = np.concatenate((bias,x),axis=1) # [nXm+1]

    a2 = sigmoid(np.dot(a1,w2.T)) # [nXm+1] . [m+1XHidden] = [nXHidden]
    a2 = np.concatenate((bias,a2),axis=1) # [nXHidden+1]

    a3 = sigmoid(np.dot(a2,w3.T)) # [nXHidden+1] . [Hidden+1Xclasses] = [nXclasses]

    # Cost function: regularized cross entropy
    C = np.sum(np.nan_to_num(-y*np.log(a3) - (1-y)*(np.log(1-a3))))/n
    C += ((Lambda/(2*n)) * (np.sum(w2[:,1:]**2) + np.sum(w3[:,1:]**2))) # Add regularization to the cost function

    ### Backpropagation ###

    d3 = (a3 - y) # [nXclasses]
    d2 = np.dot(d3,w3) * (a2*(1-a2)) # [nXclasses] . [classesXHidden+1] = [nXHidden+1]
    
    # We don't need to do this for the input layer, obviously.
    # There were no calculations that produced the input layer, that is raw data

    # Find the gradients for the weights
    grad2 = np.dot(d2[:,1:].T,a1)/n # [HiddenXn] . [nXm+1] = [HiddenXm+1]
    grad3 = np.dot(d3.T,a2)/n # [classesXn] . [nXHidden+1] = [classesXHidden+1]

    # Regularize the weights only
    if Lambda !=0:
        grad2[:,1:] -= (Lambda/n * w2[:,1:])
        grad3[:,1:] -= (Lambda/n * w3[:,1:])

    # Roll weights and biases back into a single vector
    grads = np.concatenate((grad2.reshape(((m+1)*hidden)),grad3.reshape(((hidden+1)*classes))))
    
    return C, grads

Next is the predict function. This function takes the learned weights and performs forward propagation through the netwwork using the x values supplied in the arguments. The effect of this is essentially to predict the output class of the given data using the weights that have been learned. This function will only be called by the accuracy tools at the end, and thus does not perform backpropagation to actually learn the weights.


In [7]:
# Predict function
def predict(weights,x):
    
    # Establish some useful variables
    n = np.int(x.shape[0])
    bias = np.ones((n,1))
    
    # Unroll weights
    w2 = weights[:(m+1)*hidden].reshape((hidden,(m+1)))
    w3 = weights[(m+1)*hidden:].reshape((classes,(hidden+1)))
    
    ### Forward propagation ###
    a1 = np.concatenate((bias,x),axis=1)

    a2 = sigmoid(np.dot(a1,w2.T))
    a2 = np.concatenate((bias,a2),axis=1)

    a3 = sigmoid(np.dot(a2,w3.T))
    
    return np.argmax(a3, axis=1)

We initialize theta with a set of random weights with a standard deviation of $ 1/\sqrt{n} $


In [8]:
def weight_init(L_in,L_out):
    return np.random.normal(scale=1/np.sqrt(L_in), size=(L_out,L_in+1))

Next, we specify the model parameters. Lambda is the regularization parameter, which protects against overfitting.

The variable hidden specifies the number of nodes in the hidden layer. The input layer is, of course, defined by the number of features. The variable classes specifies the number of nodes in the output layer. Finally, maxiter is the maximum number of iterations the optimizer will perform. Keeping this small simply lets the model run more quickly during the testing phase.


In [9]:
# Model parameters
hidden = int(50) # Number of hidden layers
Lambda = 3 # Lambda regularization paramter
classes = int(10) # Number of output classes
maxiter = int(800) # Maximum number of iterations the optimizer is allowed to perform
m = np.int(xtrain.shape[1]) # Number of features in each example

Finally, we train the model

Now that all of the various elements have been coded, and the parameters have been set, we're ready to train the model using the training set. Below, the weights are initialized and the model is passed to the fmin_l_bfgs_b optimizer. Finally, the learned weights (stored in model) are passed to the predict function to see how well the model is predicting the correct digits, and accuracy_score from sklearn is used to quickly assess the accuracy of the model.


In [10]:
# Initialize weights
weights = np.concatenate((weight_init(m,hidden).reshape(((m+1)*hidden)),weight_init(hidden,classes).reshape(((hidden+1)*classes))))

# Train the model
model = fmin_l_bfgs_b(nn, x0=weights, args=(xtrain,ytrain),maxiter=maxiter)

print("The accuracy is on the training set is %f" %(accuracy_score(np.argmax(ytrain,axis=1), predict(model[0],xtrain))))
print("The accuracy is on the test set is %f" %(accuracy_score(np.argmax(ytest,axis=1), predict(model[0],xtest))))


The accuracy is on the training set is 0.963060
The accuracy is on the test set is 0.952700

Visualizing the data

Here are two quick functions to visualize the data. First, we randomly select 100 data points and plot them. The second function grabs a single random data point, plots the image and uses the model above to predict the output.


In [11]:
# Visualize the data
def drawplot(draw,x,y):
    if draw:
        n = x.shape[0]
        idx = np.random.randint(0,n,size=100) # Make an array of random integers between 0 and n
        fig, ax = plt.subplots(10, 10) # make the plots
        img_size = math.sqrt(m) # Specify the image size (in these case sqrt(m) = 28)
        for i in range(10):
            for j in range(10):
                Xi = x[idx[i*10+j],:].reshape(int(img_size), int(img_size)) # get each example and resize
                ax[i,j].set_axis_off() # Turns off the axes for all the subplots for clarity
                ax[i,j].imshow(Xi, aspect='auto',cmap='gray') # plots the current image in the correct position
        plt.show()
        
drawplot(True,xtrain,ytrain)



In [14]:
# Interactive printer function
def printer(x,y,weights):
    idx = np.random.randint(len(x),size=1)
    img_size = int(math.sqrt(m))
    xi = x[idx,:].reshape(img_size,img_size)
    yi = predict(weights,x[idx,:])
    plt.title('The predicted value is %i\n The true value is %i' %(yi,np.argmax(y[idx,:],axis=1)))
    plt.imshow(xi, aspect='auto',cmap='gray')
    plt.axis('off')
    plt.show()

In [15]:
printer(xtrain,ytrain,model[0])