My previous post on implementing a basic Neural Network on python got a lot of attention staying one whole day on HN front page. I was very happy about that but more about the feedback I got. The community gave me a lot of tips and tricks on how to improve. So now I am presenting an improved version which supports multiple hidden layers, more optimization options using minibatches and a more maintainable/understandable code (or so I believe).

I kept reading a lot about Neural Networks mainly by watching some videos from Geoffrey's Hinton Neural Network course on coursera. Also reading more on and trying to read some papers. That last task was definitely the hardest one because of the complexity of the papers. If I cannot even read them I can just imagine how hard is to write them.

About the Neural networks course I didn't like it as much as the Machine Learning course. The main reason is that I had to watch most videos 3 or 4 times before understanding (something). This is definitely my fault because the material is great but it focuses more on the theory (math) which I am not that good at and not on the implementation which I am more interested. But I take it as a learning experience and I will try to finish those videos.

Some people criticized (besides my orthography, sorry about my spanglish :P) that I didn't use any cross-validation and that the iris and digits examples are way to simple datasets. I agree that both things are necessary but I am not trying to make this post as a scientific paper with complete benchmarks of different optimization algorithms I leave that to the researchers who have spent years studying neural networks. But in on this case I am using the MNIST dataset to compare.

My objective was to learn about them and to have some personal implementation of the algorithm, just to say that I have done it :P


I was very tempted to use theano and but at the end I decided to do a pure numpy implementation.

It is defenitly possible (and not that hard neither that easy) to create a theano implementation and I took some of the ideas from their site such as creating a class for each layer because it makes easier to combine and create different implementations and makes the code more mantainable. But other than the optimization I still cannot see the value of using theano, of course the speed up is important but I thing I prefer the Numba approach.

The implementation is very similar to the previous post, the differences are:

  1. Support for multiple hidden layers. I still train the network using a big 1d-array of weights because that allows to use the scipy optimizations.
  2. A more "smart" random initialization for the arrays.
  3. Support for mini-batch optimization, which is a must have with bigger datasets.
  4. Created a Layer class which weights are mapped to slices of the big weight array that is optimized so they are sharing that part of memory. That allows to a more maintainable code and probably to include pre-training in a future.

One feature that I think is nice is that when fitting one can actually interrupt (not restart) the ipython kernel and the last weights are going to be saved in the class (NN.coef_) so one doesn't have to start the training from nothing, also I added a callback on each epoch so one can for example see the score in a validation dataset.

Basic classes and functions

In [1]:
import math
import numpy as np
from scipy import optimize

In [2]:
class Layer(object):
    def __init__(self, n_in=None, n_out=None, W=None, random_state=None, activation=None):
        if random_state is None:
            rnd = np.random.RandomState()
            rnd = random_state
        if W is None:
            self.W = rnd.uniform(size=(n_out, n_in + 1))
            self.W = W
        self.activation = activation
    def output(self, input):
        data = np.insert(input, 0, 1, axis=1)
        linear_output =, self.W)
        return self.activation(linear_output)

In [3]:
def sigmoid(z):
    return np.divide(1., (1 + np.exp(-z)))

In [4]:
class SigmoidLayer(Layer):
    def __init__(self, n_in=None, n_out=None, W=None, random_state=None):
        Layer.__init__(self, n_in, n_out, W, random_state, activation=sigmoid)

Cost function and gradient

In [5]:
def unpack_weigths(weights, weights_meta):
    start_pos = 0
    for layer in weights_meta:
        end_pos = start_pos + layer[0] * (layer[1])
        yield weights[start_pos:end_pos].reshape((layer[0], layer[1]))
        start_pos = end_pos

In [6]:
def cost(weights, X, y, weights_meta, num_labels):
    # Forward
    act_prev = np.insert(X, 0, 1, axis=1)
    for weight in unpack_weigths(weights, weights_meta):
        z =, weight)
        activation = sigmoid(z)
        act_prev = np.insert(activation, 0, 1, axis=1)
    Y = np.eye(num_labels)[y]
    h = activation
    costPositive = -Y * np.log(h)
    costNegative = (1 - Y) * np.log(1 - h)
    J = np.sum(costPositive - costNegative) / X.shape[0]
    return J

In [7]:
def unpack_weigths_inv(weights, weights_meta):
    end_pos = len(weights)
    for layer in reversed(weights_meta):
        start_pos = end_pos - layer[0] * (layer[1])
        yield weights[start_pos:end_pos].reshape((layer[0], layer[1]))
        end_pos = start_pos

In [8]:
def cost_prime(weights, X, y, weights_meta, num_labels):
    Y = np.eye(num_labels)[y]
    Deltas = [np.zeros(shape) for shape in weights_meta]
    data = np.insert(X, 0, 1, axis=1)
    for i, row in enumerate(data):
        # Forward
        #row = np.array([row])
        act_prev = row
        activations = (act_prev, )
        for weight in unpack_weigths(weights, weights_meta):
            z =, weight)
            activation = sigmoid(z)
            act_prev = np.append(1, activation)
            activations = activations + (act_prev, )
        # Backprop
        prev_delta = activations[-1][1:] - Y[i, :].T  # last delta
        deltas = (prev_delta, )  # deltas[0] == delta2
        for act, weight in zip(reversed(activations[1:-1]), unpack_weigths_inv(weights, weights_meta)):
            delta =, prev_delta)[1:] * (act[1:] * (1 - act[1:])).T
            deltas = (delta, ) + deltas
            prev_delta = delta

        # Accumulate errors
        for delta, act, i in zip(deltas, activations[:-1], range(len(Deltas))):
            Deltas[i] = Deltas[i] +[np.newaxis].T, act[np.newaxis]).T
    for i in range(len(Deltas)):
        Deltas[i] = Deltas[i] / X.shape[0]
    return np.concatenate(tuple([D.reshape(-1) for D in Deltas]))


This class is simulating a MinibatchOpti module.

In [9]:
class MinibatchOpti(object):
    def minibatches(X, y=None, batch_size=50, random_state=None):
        if random_state is None:
            rnd = np.random.RandomState()
        elif isinstance(random_state, int):
            rnd = np.random.RandomState(random_state)
           rnd = random_state

        m = X.shape[0]
        batch_size = batch_size if batch_size >= 1 else int(math.floor(m * batch_size))
        max_batchs = int(math.floor(m / batch_size))
        while True:
            random_indices = rnd.choice(np.arange(m), m, replace=False)
            for i in range(max_batchs):
                batch_indices = np.arange(i * batch_size, (i + 1) * batch_size)
                indices = random_indices[batch_indices]
                if y is None:
                    yield X[indices]
                    yield X[indices], y[indices]
    def GD(fun, weights, jac, X, y, options, args=()):
        weights -= options['learning_rate'] * jac(weights, X, y, *args)
        options['learning_rate'] = options['learning_rate'] * options['learning_rate_decay']
    def GD_momentum(fun, weights, jac, X, y, options, args=()):
        bigjump = options['momentum'] * options['step']
        weights -= bigjump
        correction = options['learning_rate'] * jac(weights, X, y, *args)
        weights -= correction
        options['step'] = bigjump + correction
        options['learning_rate'] = options['learning_rate'] * options['learning_rate_decay']
        options['momentum'] = options['momemtum_decay'] * options['momentum']
    def RMSPROP(fun, weights, jac, X, y, options, args=()):
        gradient = jac(weights, X, y, *args)
        options['moving_mean_squared'] = options['decay'] * options['moving_mean_squared'] \
                                         + (1 - options['decay']) * gradient ** 2
        weights -= gradient / np.sqrt(options['moving_mean_squared'] + 1e-8)
    def CG(fun, weights, jac, X, y, options, args=()):
        ans = optimize.minimize(fun, weights, jac=jac, method='CG', args=(X, y) + args, options={'maxiter': options['mb_maxiter']})
        weights[:] = ans.x
    def LBFGSB(fun, weights, jac, X, y, options, args=()):
        ans = optimize.minimize(fun, weights, jac=jac, method='L-BFGS-B', args=(X, y) + args, options={'maxiter': options['mb_maxiter']})
        weights[:] = ans.x
    def minimize(fun, weights, jac, X, y, method, batch_size=50, tol=1e-6, maxiter=100, args=None, 
                 verbose=False, options=None, random_state=None, callback=None):
        if method == 'GD':
            assert 'learning_rate' in options, 'GD needs a learning rate'
            if 'learning_rate_decay' not in options:
                options['learning_rate_decay'] = 1
            if 'momentum' in options:
                if 'momemtum_decay' not in options:
                    options['momemtum_decay'] = 1
                options['step'] = 0
                update = MinibatchOpti.GD_momentum
                update = MinibatchOpti.GD
        elif method == 'RMSPROP':
            options['moving_mean_squared'] = 1
            update = MinibatchOpti.RMSPROP
        elif method == 'CG':
            update = MinibatchOpti.CG
        elif method == 'L-BFGS-B':
            update = MinibatchOpti.LBFGSB
            raise Exception('Optimization method not found')

        i = 1
        prev_cost = 1e8
        for _X, _y in MinibatchOpti.minibatches(X, y, batch_size, random_state=random_state):
            update(fun, weights, jac, _X, _y, options, args=args)
            new_cost = fun(weights, X, y, *args)
            diff = new_cost - prev_cost
            if np.abs(diff) < tol:
                if verbose >= 1:
                    print 'Minimum tolerance reached in %i iterations' % i
            if i >= maxiter:
                if verbose >= 1 :
                    print 'Maximum number of iterations reached'
            if verbose >= 2:
                print i, newJ    
            if callback is not None:
                stop = callback(i, weights)
                if stop == True:
            prev_cost = new_cost
            i = i + 1

The sklearn-stlye class

In [10]:
class NN(object):
    def __init__(self, hidden_layers, coef0=None, random_state=None,
                 opti_method='GD', batch_size=50, maxiter=100, tol=1e-6, verbose=1, 
                 opti_options=None, callback=None):
        self.hidden_layers = hidden_layers
        self.coef_ = None if coef0 is None else np.copy(coef0)
        if random_state is None:
            self.rnd = np.random.RandomState()
        elif isinstance(random_state, int):
            self.rnd = np.random.RandomState(random_state)
            self.rnd = random_state
        self.opti_method = opti_method
        self.batch_size = batch_size
        self.verbose = verbose
        self.maxiter = maxiter
        self.tol = tol
        self.opti_options = {} if opti_options is None else opti_options
        self.callback = callback
    def rand_init(self, weights_info, random_state):
        w_sizes = []
        for layer_info in weights_info:
            w_sizes.append(layer_info[0] * layer_info[1])
        ans = np.zeros(sum(w_sizes))
        # "Smart" random initialization
        start_pos = 0
        for i, layer in enumerate(weights_info):
            end_pos = start_pos + layer[0] * (layer[1])
            gap = 4 * np.sqrt(6. / (layer[0] + layer[1]))
            ans[start_pos:end_pos] = random_state.uniform(low=-gap, high=gap, size=w_sizes[i])
            start_pos = end_pos 
        return ans
    def predict_proba(self, X):
        output = self.layers[0].output(X)
        for layer in self.layers[1:]:
            output = layer.output(output)
        return output
    def predict(self, X):
        return self.predict_proba(X).argmax(1)
    def fit(self, X, y):
        layers = list(self.hidden_layers)  # Copy
        layers.insert(0, X.shape[1])
        layers.insert(len(layers), len(np.unique(y)))
        self.weights_info = [(layers[i] + 1, layers[i + 1]) for i in range(len(layers) - 1)]
        self.opti_options = self.opti_options.copy()
        if self.coef_ is None:
            self.coef_ = self.rand_init(self.weights_info, self.rnd)

        # Unpack the weights and assign them to the layers
        self.layers = []
        start_pos = 0
        for w_info in self.weights_info:
            end_pos = start_pos + w_info[0] * (w_info[1])
            weight = self.coef_[start_pos:end_pos].reshape((w_info[0], w_info[1]))
            start_pos = end_pos
        args = (self.weights_info, len(np.unique(y)))
        MinibatchOpti.minimize(cost, self.coef_, cost_prime, X, y, method=self.opti_method,
                               random_state=self.rnd, batch_size=self.batch_size, maxiter=self.maxiter, 
                               tol=self.tol, args=args, verbose=self.verbose, options=self.opti_options,


Lets see how it good it does the using the MINST dataset: 50k rows for training and 10k validations, 748 features.

For all the optimization algorithms I am going to use 100 iterations using a batch size of 100. Also I only timed the fitting once because it takes a long time and I am lazy, also the difference between iterations when the times are that long are not that significant.

My setup is: Macbook Pro 2.5 GHz Inter Core i5. 16 GB RAM.

In [11]:
import cPickle, gzip, numpy
f ='mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)

In [12]:
X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]

In [13]:
from sklearn.metrics import accuracy_score

Gradient decent

First let's try using a simple gradient decent, the bad part is that one needs to input that pesky learning rate.

In [14]:
options = {}
options['learning_rate'] = 0.3
options['learning_rate_decay'] = 1

In [15]:
nn = NN([25], opti_method='GD', maxiter=100, opti_options=options, random_state=1234)

In [16]:
%timeit -n1 -r1, y_train)

Maximum number of iterations reached
1 loops, best of 1: 284 s per loop

In [17]:
accuracy_score(nn.predict(X_valid), y_valid)



To remove the learning rate an easy solution is to use RMSPROP.

In [18]:
options = {}
options['decay'] = 0.9

In [19]:
nn = NN([25], opti_method='RMSPROP', maxiter=100, opti_options=options, random_state=1234)

In [20]:
%timeit -n1 -r1, y_train)

Maximum number of iterations reached
1 loops, best of 1: 297 s per loop

In [21]:
accuracy_score(nn.predict(X_valid), y_valid)


Conjugate gradient

Finally my personal favorite is just to use the scipy optimization methods, one has to be careful though because most are for full-batch optimization. But the Conjugate Gradient and L-BFGS-B works on mini-batches. Personally I prefer CG because gave me better results.

In [22]:
options = {}
options['mb_maxiter'] = 10

In [23]:
nn = NN([25], opti_method='CG', maxiter=100, opti_options=options, random_state=1234)

In [24]:
%timeit -n1 -r1, y_train)

Maximum number of iterations reached
1 loops, best of 1: 349 s per loop

In [25]:
accuracy_score(nn.predict(X_valid), y_valid)



Let's look at the results in a beautiful pandas DataFrame.

In [26]:
import pandas as pd

In [27]:
pd.DataFrame([[284, 0.8316], [297, 0.1863], [314, 0.8607]], index=['GD', 'RMSPROP', 'CG'], columns=['Time [s]', 'Accuracy'])

Time [s] Accuracy
GD 284 0.8316
RMSPROP 297 0.1863
CG 314 0.8607


Even though gradient decent did well I still prefer the scipy optimization methods, maybe because they are just more robust and don't have a learning rate. It takes more time but is not the end of the world. I don't know what is wrong with my RMSPROP implementation I looked at some implementations online and couldn't found the error, if someone sees anything suspicious let me know.

Some improvements I would love to do and ideas I want to test:

  1. Use Numba to speed up the process, in theory having the cost function, its gradient and optimization in different "well" defined functions it should be easy.
  2. Try more optimization methods or why not use an existing implementation because there are many python implementations of these and more algorithms online, just to name a few I found: python-rl and climin.
  3. Do pretraining using RBMs: I am just starting to read about them.
  4. Use minibatches to some point and then switch to bigger minibatches or even full batch optimization.
  5. People suggested me to use a lookup table for the sigmoid. I tried but the accuracy I got was really bad, I have to check that code again.

I have to admit that I cheated a little bit on the post. Because the neural network supports multiple hidden layers but I only used one! Why?: Mainly because it takes a long time to train those networks and I am lazy. Second, I couldn't find a way to make those networks converge without using the scipy advanced optimization methods.

From what I understand the problem are the initial weights, and as a side note I spent almost 2 days trying to find a bug while training the MNIST dataset and it was that I started all the layers weights with similar values. The initial weights are very important.

I think I have learned enough about neural networks for my personal satisfaction, so this is probably the last iteration of my Neural network. I will definitely keep looking at the updates of NN research but not as almost fulltime as I been doing the past weeks. Is an imposible mission to catch up that amount of information in my free time. Having spent almost a month reading about NN my (obvious) conclusion is that researchers are just scratching the surface of what is possible, and the next years are going to be exciting.

If you are looking for a more complete implementation of a deep belif network there is free python (gpu) implementation by one person on Geoffrey Hinton group which I haven't look but seams promising and hopefully he will release more code.