Neural Nets for Digit Classification

by Khaled Nasr as a part of a GSoC 2014 project mentored by Theofanis Karaletsos and Sergey Lisitsyn

This notebook illustrates how to use the NeuralNets module to teach a neural network to recognize digits. It also explores the different optimization and regularization methods supported by the module. Convolutional neural networks are also discussed.

Introduction

An Artificial Neural Network is a machine learning model inspired by the way biological nervous systems, such as the brain, process information. The basic building block of a neural network is the neuron. All a neuron does is take a weighted sum of its inputs and pass it through a non-linear function (the activation function) to produce its output. A (feed-forward) neural network is a collection of neurons arranged in layers, where each neuron in layer i takes its input from all the neurons in layer i-1. For more information on how neural networks work, follow this link.
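
As a quick illustration of what a single neuron computes, here's a minimal NumPy sketch of a logistic neuron (the inputs, weights, and bias below are made up for the example; this is not Shogun code):


In [ ]:
import numpy as np

# a minimal sketch of a single logistic neuron (made-up inputs, weights and bias)
x = np.array([0.5, -1.2, 3.0])   # inputs coming from the previous layer
w = np.array([0.1, 0.4, -0.2])   # the neuron's weights, one per input
b = 0.05                         # the neuron's bias

# weighted sum of the inputs, passed through the logistic activation function
output = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
print output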

In this notebook, we'll look at how a neural network can be used to recognize digits. We'll train the network on the USPS dataset of handwritten digits.

We'll start by loading the data and dividing it into a training set, a validation set, and a test set. The USPS dataset has 9298 examples of handwritten digits. We'll intentionally use just a small portion (1000 examples) of the dataset for training. This is to keep training time small and to illustrate the effects of the different regularization methods.


In [ ]:
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from scipy.io import loadmat
from shogun import RealFeatures, MulticlassLabels, Math

# load the dataset
dataset = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))

Xall = dataset['data']
# the usps dataset has the digits labeled from 1 to 10 
# we'll subtract 1 to make them in the 0-9 range instead
Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1 

# 1000 examples for training
Xtrain = RealFeatures(Xall[:,0:1000])
Ytrain = MulticlassLabels(Yall[0:1000])

# 4000 examples for validation
Xval = RealFeatures(Xall[:,1000:5000])
Yval = MulticlassLabels(Yall[1000:5000])

# the rest for testing
Xtest = RealFeatures(Xall[:,5000:])
Ytest = MulticlassLabels(Yall[5000:])

# initialize the random number generator with a fixed seed, for repeatability
Math.init_random(10)

Creating the network

To create a neural network in shogun, we'll first create an instance of NeuralNetwork and then initialize it by telling it how many inputs it has and what type of layers it contains. To specify the layers of the network, a DynamicObjectArray is used. The array contains instances of NeuralLayer-based classes that determine the type of neurons each layer consists of. Some of the supported layer types are: NeuralLinearLayer, NeuralLogisticLayer and NeuralSoftmaxLayer.

We'll create a feed-forward, fully connected (every neuron is connected to all neurons in the layer below) neural network with 2 logistic hidden layers and a softmax output layer. The network will have 256 inputs, one for each pixel (16x16 images). The first hidden layer will have 256 neurons, the second will have 128 neurons, and the output layer will have 10 neurons, one for each digit class. Note that this is a fairly large network compared with the size of the training set; this is to emphasize the effects of the different regularization methods. We'll try training the network with:

  • No regularization
  • L2 regularization
  • L1 regularization
  • Dropout regularization

Therefore, we'll create 4 versions of the network, train each one of them differently, and then compare the results on the validation set.


In [ ]:
from shogun import NeuralNetwork, NeuralInputLayer, NeuralLogisticLayer, NeuralSoftmaxLayer
from shogun import DynamicObjectArray

# setup the layers
layers = DynamicObjectArray()
layers.append_element(NeuralInputLayer(256)) # input layer, 256 neurons
layers.append_element(NeuralLogisticLayer(256)) # first hidden layer, 256 neurons
layers.append_element(NeuralLogisticLayer(128)) # second hidden layer, 128 neurons
layers.append_element(NeuralSoftmaxLayer(10)) # output layer, 10 neurons

# create the networks
net_no_reg = NeuralNetwork(layers)
net_no_reg.quick_connect()
net_no_reg.initialize_neural_network()

net_l2 = NeuralNetwork(layers)
net_l2.quick_connect()
net_l2.initialize_neural_network()

net_l1 = NeuralNetwork(layers)
net_l1.quick_connect()
net_l1.initialize_neural_network()

net_dropout = NeuralNetwork(layers)
net_dropout.quick_connect()
net_dropout.initialize_neural_network()

We can also visualize what the network would look like. To do that we'll draw a smaller network using networkx. The network we'll draw will have 8 inputs (labeled X), 8 neurons in the first hidden layer (labeled H), 4 neurons in the second hidden layer (labeled U), and 6 neurons in the output layer (labeled Y). Each neuron will be connected to all neurons in the layer that precedes it.


In [ ]:
# import networkx, install if necessary
try:
    import networkx as nx
except ImportError:
    import pip
    pip.main(['install', '--user', 'networkx'])
    import networkx as nx
  
G = nx.DiGraph()
pos = {}

for i in range(8):
    pos['X'+str(i)] = (i,0) # 8 neurons in the input layer
    pos['H'+str(i)] = (i,1) # 8 neurons in the first hidden layer
    
    for j in range(8): G.add_edge('X'+str(j),'H'+str(i))
        
    if i<4:
        pos['U'+str(i)] = (i+2,2) # 4 neurons in the second hidden layer
        for j in range(8): G.add_edge('H'+str(j),'U'+str(i))
    
    if i<6:
        pos['Y'+str(i)] = (i+1,3) # 6 neurons in the output layer
        for j in range(4): G.add_edge('U'+str(j),'Y'+str(i))

nx.draw(G, pos, node_color='y', node_size=750)

Training

NeuralNetwork supports two methods for training: LBFGS (default) and mini-batch gradient descent.

LBFGS is a full-batch optimization method: it looks at the entire training set before each update of the network's parameters. This makes it slow on large datasets. However, it works very well with small/medium-sized datasets and is very easy to use, as it requires no parameter tuning.

Mini-batch Gradient Descent looks at only a small portion of the training set (a mini-batch) before each step, which makes it suitable for large datasets. However, it's a bit harder to use than LBFGS because it requires some tuning of its parameters (learning rate, learning rate decay, ...).
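
To make the difference concrete, here's a rough NumPy sketch of what one epoch of mini-batch gradient descent does. The compute_gradient function below is a hypothetical placeholder, not part of Shogun's API; Shogun runs a loop like this internally:


In [ ]:
# a conceptual sketch of one epoch of mini-batch gradient descent,
# not Shogun's actual implementation; compute_gradient is a hypothetical placeholder
def gd_epoch(params, X, Y, compute_gradient, learning_rate=0.1, batch_size=100):
    num_examples = X.shape[1]   # examples are stored as columns
    for start in range(0, num_examples, batch_size):
        batch = slice(start, min(start + batch_size, num_examples))
        # gradient of the error computed on this mini-batch only
        grad = compute_gradient(params, X[:, batch], Y[batch])
        # take a small step in the direction that reduces the error
        params = params - learning_rate * grad
    return params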

Training in NeuralNetwork stops when:

  • The number of epochs (iterations over the entire training set) exceeds max_num_epochs
  • The (percentage) difference in error between the current and previous iterations is smaller than epsilon, i.e. training is no longer reducing the error

To see all the options supported for training, check the documentation.

We'll first write a small function to calculate the classification accuracy on the validation set, so that we can compare different models:


In [ ]:
from shogun import MulticlassAccuracy

def compute_accuracy(net, X, Y):
    predictions = net.apply_multiclass(X)

    evaluator = MulticlassAccuracy()
    accuracy = evaluator.evaluate(predictions, Y)
    return accuracy*100

Training without regularization

We'll start by training the first network without regularization using LBFGS optimization. Note that LBFGS is suitable because we're using a small dataset.


In [ ]:
net_no_reg.set_epsilon(1e-6)
net_no_reg.set_max_num_epochs(600)

# uncomment this line to allow the training progress to be printed on the console
#from shogun import MSG_INFO; net_no_reg.io.set_loglevel(MSG_INFO)

net_no_reg.set_labels(Ytrain)
net_no_reg.train(Xtrain) # this might take a while, depending on your machine

# compute accuracy on the validation set
print "Without regularization, accuracy on the validation set =", compute_accuracy(net_no_reg, Xval, Yval), "%"

Training with L2 regularization

We'll train another network, but with L2 regularization. This type of regularization attempts to prevent overfitting by penalizing large weights. This is done by adding $\frac{1}{2} \lambda \Vert W \Vert_2^2$ to the optimization objective that the network tries to minimize, where $\lambda$ is the regularization coefficient and $\Vert W \Vert_2$ is the L2 norm of the network's weights.
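
As a quick NumPy illustration of what this penalty adds to the objective and to the gradient (using a made-up weight matrix; Shogun applies this internally once an L2 coefficient is set):


In [ ]:
# a small sketch of the L2 penalty and its gradient for a made-up weight matrix;
# Shogun adds these terms internally when set_l2_coefficient is used
lam = 3e-4
W = np.random.randn(128, 256) * 0.01     # hypothetical weight matrix of one layer

l2_penalty  = 0.5 * lam * np.sum(W**2)   # added to the training objective
l2_gradient = lam * W                    # added to the gradient w.r.t. W
print l2_penalty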


In [ ]:
# turn on L2 regularization
net_l2.set_l2_coefficient(3e-4)

net_l2.set_epsilon(1e-6)
net_l2.set_max_num_epochs(600)

net_l2.set_labels(Ytrain)
net_l2.train(Xtrain) # this might take a while, depending on your machine

# compute accuracy on the validation set
print "With L2 regularization, accuracy on the validation set =", compute_accuracy(net_l2, Xval, Yval), "%"

Training with L1 regularization

We'll now try L1 regularization. It works by adding $\lambda \Vert W \Vert_1$ to the optimization objective. This penalizes all non-zero weights equally, which pushes many of the weights to be exactly zero and yields sparse weight matrices.
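
The corresponding NumPy sketch for L1 shows why it produces sparse weights: the (sub)gradient has constant magnitude $\lambda$, so small weights get pushed all the way to zero (again a conceptual illustration, not Shogun's code):


In [ ]:
# a small sketch of the L1 penalty and its subgradient for a made-up weight matrix;
# Shogun adds these terms internally when set_l1_coefficient is used
lam = 3e-5
W = np.random.randn(128, 256) * 0.01       # hypothetical weight matrix of one layer

l1_penalty     = lam * np.sum(np.abs(W))   # added to the training objective
l1_subgradient = lam * np.sign(W)          # added to the gradient w.r.t. W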


In [ ]:
# turn on L1 regularization
net_l1.set_l1_coefficient(3e-5)

net_l1.set_epsilon(1e-6)
net_l1.set_max_num_epochs(600)

net_l1.set_labels(Ytrain)
net_l1.train(Xtrain) # this might take a while, depending on your machine

# compute accuracy on the validation set
print "With L1 regularization, accuracy on the validation set =", compute_accuracy(net_l1, Xval, Yval), "%"

Training with dropout

The idea behind dropout is very simple: randomly ignore some neurons during each training iteration. When used on neurons in the hidden layers, it has the effect of forcing each neuron to learn to extract features that are useful in any context, regardless of what the other hidden neurons in its layer decide to do. Dropout can also be used on the inputs to the network by randomly omitting a small fraction of them during each iteration.

When using dropout, it's usually helpful to also limit the L2 norm of each neuron's incoming weight vector to some constant value (max-norm regularization).

Due to the stochastic nature of dropout, LBFGS optimization doesn't work well with it, so we'll use mini-batch gradient descent instead.
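
Conceptually, dropout multiplies each layer's activations by a random binary mask on every training iteration, and max-norm rescales any weight vector that grows too long. The following NumPy sketch shows both ideas; it is only an illustration of the concepts, since Shogun handles all of this internally once the dropout probabilities and max norm are set:


In [ ]:
# conceptual sketches of dropout and max-norm, not Shogun's internal code
def dropout(activations, p_drop=0.5):
    # zero each activation independently with probability p_drop
    mask = np.random.rand(*activations.shape) > p_drop
    return activations * mask

def max_norm(W, max_len=15.0):
    # rescale any incoming weight vector (assumed here to be a row of W)
    # whose L2 norm exceeds max_len
    norms = np.sqrt(np.sum(W**2, axis=1, keepdims=True)) + 1e-12
    return W * np.minimum(1.0, max_len / norms)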


In [ ]:
from shogun import NNOM_GRADIENT_DESCENT

# set the dropout probability for neurons in the hidden layers
net_dropout.set_dropout_hidden(0.5)
# set the dropout probability for the inputs
net_dropout.set_dropout_input(0.2)
# limit the L2 norm of each neuron's incoming weight vector (max-norm regularization)
net_dropout.set_max_norm(15)

net_dropout.set_epsilon(1e-6)
net_dropout.set_max_num_epochs(600)

# use gradient descent for optimization
net_dropout.set_optimization_method(NNOM_GRADIENT_DESCENT)
net_dropout.set_gd_learning_rate(0.5)
net_dropout.set_gd_mini_batch_size(100)

net_dropout.set_labels(Ytrain)
net_dropout.train(Xtrain) # this might take a while, depending on your machine

# compute accuracy on the validation set
print "With dropout, accuracy on the validation set =", compute_accuracy(net_dropout, Xval, Yval), "%"

Convolutional Neural Networks

Now we'll look at a different type of network, namely convolutional neural networks. A convolutional net operates on two principles:

  • Local connectivity: Convolutional nets work with inputs that have some sort of spatial structure, where the arrangement of the input features matters, e.g. images. Local connectivity means that each neuron will be connected only to a small neighbourhood of pixels.
  • Weight sharing: Different neurons use the same set of weights. This greatly reduces the number of free parameters, and therefore makes the optimization process easier and acts as a good regularizer.

With that in mind, each layer in a convolutional network consists of a number of feature maps. Each feature map is produced by convolving a small filter with the layer's inputs, adding a bias, and then applying some non-linear activation function. The convolution operation satisfies the local connectivity and the weight sharing constraints. Additionally, a max-pooling operation can be performed on each feature map by dividing it into small non-overlapping regions and taking the maximum over each region. This adds some translation invariance and improves the performance.
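
To make the convolution and pooling operations concrete, here is a naive NumPy sketch that produces a single feature map from a single-channel image (for illustration only; border handling and efficiency differ from Shogun's actual NeuralConvolutionalLayer):


In [ ]:
# a naive sketch of one feature map: 2D convolution + rectified linear + 2x2 max-pooling
# (for illustration only, not how Shogun implements it)
def feature_map(image, filt, bias):
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # local connectivity: each output depends only on a small patch;
            # weight sharing: the same filter is applied at every position
            out[i, j] = np.sum(image[i:i+fh, j:j+fw] * filt) + bias
    return np.maximum(out, 0)   # rectified linear activation

def max_pool_2x2(fmap):
    h2, w2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    # take the maximum over each non-overlapping 2x2 region
    return fmap[:2*h2, :2*w2].reshape(h2, 2, w2, 2).max(axis=(1, 3))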

Convolutional nets in Shogun are handled through the NeuralNetwork class along with the NeuralConvolutionalLayer class. A NeuralConvolutionalLayer represents a convolutional layer with multiple feature maps, optional max-pooling, and support for different types of activation functions.

Now we'll create a convolutional neural network with two convolutional layers and a softmax output layer. We'll use the rectified linear activation function for the convolutional layers:


In [ ]:
from shogun import NeuralConvolutionalLayer, CMAF_RECTIFIED_LINEAR

# prepare the layers
layers_conv = DynamicObjectArray()

# input layer, a 16x16 single-channel image
layers_conv.append_element(NeuralInputLayer(16,16,1)) 

# the first convolutional layer: 10 feature maps, filters with radius 2 (5x5 filters)
# and max-pooling in a 2x2 region: its output will be 10 8x8 feature maps
layers_conv.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 10, 2, 2, 2, 2))

# the second convolutional layer: 15 feature maps, filters with radius 2 (5x5 filters)
# and max-pooling in a 2x2 region: its output will be 15 4x4 feature maps
layers_conv.append_element(NeuralConvolutionalLayer(CMAF_RECTIFIED_LINEAR, 15, 2, 2, 2, 2))

# output layer
layers_conv.append_element(NeuralSoftmaxLayer(10))

# create and initialize the network
net_conv = NeuralNetwork(layers_conv)
net_conv.quick_connect()
net_conv.initialize_neural_network()

Now we can train the network. Like in the previous section, we'll use gradient descent with dropout and max-norm regularization:


In [ ]:
# 50% dropout in the input layer
net_conv.set_dropout_input(0.5)

# max-norm regularization
net_conv.set_max_norm(1.0)

# set gradient descent parameters
net_conv.set_optimization_method(NNOM_GRADIENT_DESCENT)
net_conv.set_gd_learning_rate(0.01)
net_conv.set_gd_mini_batch_size(100)
net_conv.set_epsilon(0.0)
net_conv.set_max_num_epochs(100)

# start training
net_conv.set_labels(Ytrain)
net_conv.train(Xtrain)

# compute accuracy on the validation set
print "With a convolutional network, accuracy on the validation set =", compute_accuracy(net_conv, Xval, Yval), "%"

Evaluation

According to the accuracy on the validation set, the convolutional network works best in our case. Now we'll measure its performance on the test set:


In [ ]:
print "Accuracy on the test set using the convolutional network =", compute_accuracy(net_conv, Xtest, Ytest), "%"

We can also look at some of the images and the network's response to each of them:


In [ ]:
predictions = net_conv.apply_multiclass(Xtest)

_=figure(figsize=(10,12))
# plot some images, with the predicted label as the title of each image
# this code is borrowed from the KNN notebook by Chiyuan Zhang and Sören Sonnenburg 
for i in range(100):
    ax=subplot(10,10,i+1)
    title(int(predictions[i]))
    ax.imshow(Xtest[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)
    ax.set_xticks([])
    ax.set_yticks([])
