It is really difficult to be part of the data science / machine learning community and not have heard about Deep Neural Networks: everybody is talking about them. It is even harder for a person like me, without a PhD or a deep computer science or mathematics education, to learn about them, because 1. machine learning uses quite heavy math and 2. there are no Neural Networks in sklearn.

Libraries like sklearn hide all that machinery and let you use machine learning and get amazing results. But sometimes you need more, and understanding how an algorithm is implemented can also help you understand how to improve its results. The learning curve is steep: you can easily spend hours on DeepLearning.net and not understand anything. But there is hope.

I took Andrew Ng's Machine Learning course on Coursera wanting to get a better understanding of how the algorithms I use almost every day work. I learned a lot of useful tricks and a lot more about simple machine learning implementations. It is important to start with the easy stuff, otherwise you get overwhelmed easily and eventually give up. Hopefully in a few more weeks I will be able to understand two or three more words on DeepLearning.net.

The course is amazing, probably the best I have taken on Coursera (this was my 5th course): amazing videos, amazing content. The course is quite hard, but it gives you enough help to NOT give up.

The "bad" part is that the course uses Matlab/Octave. There is nothing like it for matrix stuff (specially Matlab) but Python is my to go language for almost everything so I decide to translate some Matlab code hoping to undestand more about the Neural Network implementation. And in the process obtain some reusable code for a neural network.

The implementation follows the sklearn syntax: fit, predict, predict_proba.

Also note that this is a fixed one-hidden-layer neural network, trained via backpropagation. It is basically the same Neural Network Andrew teaches in the course, just in Python; nothing deep here.
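
Just to preview the API (a sketch, assuming X_train, y_train and X_test already hold your features and integer labels; the class itself is defined below):

nn = NN_1HL(hidden_layer_size=25, maxiter=500)
nn.fit(X_train, y_train)             # learns both weight matrices via backprop + scipy.optimize
labels = nn.predict(X_test)          # predicted class index for each row
probas = nn.predict_proba(X_test)    # raw sigmoid activations of the output layer (one column per sample)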

Implementation


In [1]:
from __future__ import division

import numpy as np
from scipy import optimize

In [2]:
class NN_1HL(object):
    
    def __init__(self, reg_lambda=0, epsilon_init=0.12, hidden_layer_size=25, opti_method='TNC', maxiter=500):
        self.reg_lambda = reg_lambda
        self.epsilon_init = epsilon_init
        self.hidden_layer_size = hidden_layer_size
        self.activation_func = self.sigmoid
        self.activation_func_prime = self.sigmoid_prime
        self.method = opti_method
        self.maxiter = maxiter
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))
    
    def sigmoid_prime(self, z):
        sig = self.sigmoid(z)
        return sig * (1 - sig)
    
    def sumsqr(self, a):
        return np.sum(a ** 2)
    
    def rand_init(self, l_in, l_out):
        # Random weights uniformly in [-epsilon_init, epsilon_init], bias column included
        return np.random.rand(l_out, l_in + 1) * 2 * self.epsilon_init - self.epsilon_init
    
    def pack_thetas(self, t1, t2):
        return np.concatenate((t1.reshape(-1), t2.reshape(-1)))
    
    def unpack_thetas(self, thetas, input_layer_size, hidden_layer_size, num_labels):
        t1_start = 0
        t1_end = hidden_layer_size * (input_layer_size + 1)
        t1 = thetas[t1_start:t1_end].reshape((hidden_layer_size, input_layer_size + 1))
        t2 = thetas[t1_end:].reshape((num_labels, hidden_layer_size + 1))
        return t1, t2
    
    def _forward(self, X, t1, t2):
        # Works for a single example (1-D X) as well as for a full data matrix (2-D X)
        m = X.shape[0]
        if len(X.shape) == 1:
            ones = np.array(1).reshape(1,)
        else:
            ones = np.ones(m).reshape(m, 1)
        
        # Input layer
        a1 = np.hstack((ones, X))
        
        # Hidden Layer
        z2 = np.dot(t1, a1.T)
        a2 = self.activation_func(z2)
        a2 = np.hstack((ones, a2.T))
        
        # Output layer
        z3 = np.dot(t2, a2.T)
        a3 = self.activation_func(z3)
        return a1, z2, a2, z3, a3
    
    def function(self, thetas, input_layer_size, hidden_layer_size, num_labels, X, y, reg_lambda):
        t1, t2 = self.unpack_thetas(thetas, input_layer_size, hidden_layer_size, num_labels)
        
        m = X.shape[0]
        Y = np.eye(num_labels)[y]
        
        _, _, _, _, h = self._forward(X, t1, t2)
        costPositive = -Y * np.log(h).T
        costNegative = (1 - Y) * np.log(1 - h).T
        cost = costPositive - costNegative
        J = np.sum(cost) / m
        
        if reg_lambda != 0:
            t1f = t1[:, 1:]
            t2f = t2[:, 1:]
            reg = (reg_lambda / (2 * m)) * (self.sumsqr(t1f) + self.sumsqr(t2f))  # use the lambda that was passed in
            J = J + reg
        return J
        
    def function_prime(self, thetas, input_layer_size, hidden_layer_size, num_labels, X, y, reg_lambda):
        t1, t2 = self.unpack_thetas(thetas, input_layer_size, hidden_layer_size, num_labels)
        
        m = X.shape[0]
        t1f = t1[:, 1:]
        t2f = t2[:, 1:]
        Y = np.eye(num_labels)[y]
        
        Delta1, Delta2 = 0, 0
        for i, row in enumerate(X):
            a1, z2, a2, z3, a3 = self._forward(row, t1, t2)
            
            # Backprop
            d3 = a3 - Y[i, :].T
            d2 = np.dot(t2f.T, d3) * self.activation_func_prime(z2)
            
            Delta2 += np.dot(d3[np.newaxis].T, a2[np.newaxis])
            Delta1 += np.dot(d2[np.newaxis].T, a1[np.newaxis])
            
        Theta1_grad = (1 / m) * Delta1
        Theta2_grad = (1 / m) * Delta2
        
        if reg_lambda != 0:
            Theta1_grad[:, 1:] = Theta1_grad[:, 1:] + (reg_lambda / m) * t1f
            Theta2_grad[:, 1:] = Theta2_grad[:, 1:] + (reg_lambda / m) * t2f
        
        return self.pack_thetas(Theta1_grad, Theta2_grad)
    
    def fit(self, X, y):
        input_layer_size = X.shape[1]
        num_labels = len(set(y))
        
        theta1_0 = self.rand_init(input_layer_size, self.hidden_layer_size)
        theta2_0 = self.rand_init(self.hidden_layer_size, num_labels)
        thetas0 = self.pack_thetas(theta1_0, theta2_0)
        
        options = {'maxiter': self.maxiter}
        # Pass self.reg_lambda through so the regularization parameter actually takes effect
        _res = optimize.minimize(self.function, thetas0, jac=self.function_prime, method=self.method, 
                                 args=(input_layer_size, self.hidden_layer_size, num_labels, X, y, self.reg_lambda), options=options)
        
        self.t1, self.t2 = self.unpack_thetas(_res.x, input_layer_size, self.hidden_layer_size, num_labels)
    
    def predict(self, X):
        return self.predict_proba(X).argmax(0)
    
    def predict_proba(self, X):
        _, _, _, _, h = self._forward(X, self.t1, self.t2)
        return h
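
Before using it on real data, a quick sanity check (not part of the original class, just an illustrative snippet on a tiny random problem) is to compare the analytic gradient from function_prime against a numerical one with scipy's check_grad:

# Tiny synthetic problem: 10 examples, 4 features, 5 hidden units, 3 classes.
nn_check = NN_1HL(hidden_layer_size=5)
X_small = np.random.rand(10, 4)
y_small = np.random.randint(0, 3, 10)
thetas0 = nn_check.pack_thetas(nn_check.rand_init(4, 5), nn_check.rand_init(5, 3))

# check_grad returns the norm of the difference between the analytic and the
# finite-difference gradient; if backprop is right it should be a tiny number.
print(optimize.check_grad(nn_check.function, nn_check.function_prime, thetas0,
                          4, 5, 3, X_small, y_small, 0))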

How good is it?

Let's test it with the always good iris dataset, using a single train/test split for validation.


In [3]:
import sklearn.datasets as datasets
from sklearn import cross_validation

In [4]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [5]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4)

In [6]:
nn = NN_1HL()
nn.fit(X_train, y_train)


-c:64: RuntimeWarning: divide by zero encountered in log
-c:64: RuntimeWarning: invalid value encountered in multiply
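
Those warnings come from np.log inside the cost function: when the sigmoid saturates, h can be exactly 0 or 1 in floating point, so log(h) or log(1 - h) becomes -inf (and 0 * -inf then produces a nan). Training still ended up fine here, but a common guard is to clip the activations before taking logs; a minimal sketch (safe_cross_entropy is a hypothetical helper, not part of the class above):

def safe_cross_entropy(Y, h, eps=1e-10):
    # Y (one-hot targets) and h (predictions) are assumed to have matching shapes;
    # clipping keeps both log terms finite.
    h = np.clip(h, eps, 1 - eps)
    return np.sum(-Y * np.log(h) - (1 - Y) * np.log(1 - h)) / Y.shape[0]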

In [7]:
from sklearn.metrics import accuracy_score

In [8]:
accuracy_score(y_test, nn.predict(X_test))


Out[8]:
0.96666666666666667

Nice, almost 97% on the validation set!

But wait, the iris dataset is quite simple; how well do other algorithms do? What about my other favorite algorithm, Random Forest?


In [9]:
from sklearn.ensemble import RandomForestClassifier

In [10]:
rfc = RandomForestClassifier(n_estimators=50)

In [11]:
rfc.fit(X_train, y_train)


Out[11]:
RandomForestClassifier(bootstrap=True, compute_importances=False,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=0.1, min_samples_leaf=1, min_samples_split=2,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)

In [12]:
accuracy_score(y_test, rfc.predict(X_test))


Out[12]:
0.96666666666666667

OK, the same result, which is good. Sometimes you get 96% or 98% because of the random state of the algorithms, but that is fine.

Really, how good is it?!

Let's try again with a more complex example, the same one used in the Coursera course: recognizing handwritten digits [5000 rows, 400 dimensions].


In [13]:
from scipy.io import loadmat
data = loadmat('ex3data1.mat')
X, y = data['X'], data['y']
y = y.reshape(X.shape[0], )
y = y - 1  # Fix notation (make labels 0-indexed)  # TODO: Automatically fix that in the class
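
A sketch of that TODO (not wired into the class, just how it could look inside fit): np.unique can map arbitrary labels to 0..k-1, so callers never have to shift y by hand.

classes, y_encoded = np.unique(y, return_inverse=True)
# y_encoded is y re-coded as 0..k-1; keeping `classes` around would let
# predict() map the argmax indices back to the original labels.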

In [14]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4)

In [15]:
nn = NN_1HL(maxiter=200)
nn.fit(X_train, y_train)

In [16]:
accuracy_score(y_test, nn.predict(X_test))


Out[16]:
0.91349999999999998

In [17]:
rfc = RandomForestClassifier(n_estimators=50)
rfc.fit(X_train, y_train)


Out[17]:
RandomForestClassifier(bootstrap=True, compute_importances=False,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=0.1, min_samples_leaf=1, min_samples_split=2,
            n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)

In [18]:
accuracy_score(y_test, rfc.predict(X_test))


Out[18]:
0.91600000000000004

Conclusions

Basically the same result for both algorithms. Let's decide a winner by the time they took to train.

It took a long time to train the neural network on the second example. I don't have an exact number, but... OK, let's get that number (god I love blogging with ipython notebooks):


In [19]:
nn = NN_1HL(maxiter=200)
%timeit nn.fit(X_train, y_train)


1 loops, best of 3: 119 s per loop

So it took 119 s, that is 1.98 minutes, with the maximum number of iterations set to 200; more iterations would probably improve the results a little bit.

On the other hand the Random Forest took:


In [20]:
rfc = RandomForestClassifier(n_estimators=50)
%timeit rfc.fit(X_train, y_train)


1 loops, best of 3: 7.44 s per loop

Just 7.44 seconds, definitely an improvement. The accuracy of both algorithms was very similar on the validation dataset, so Random Forest is the winner of this round.

I am not completely sure whether the delay comes from my implementation or from Python; probably a combination of both, but my guess is that my implementation carries some of the blame. Honestly, though, I don't think it is that much: the Matlab implementation I did (and compared to some I found online) is pretty vectorized, and I don't see any obvious mistake in it. I honestly believe that Neural Networks just take longer to train. But, as should be obvious from this post, I am not an expert.
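
One concrete thing to try: function_prime above still loops over the training examples one by one, while the cost itself is computed for the whole batch. A rough, untested sketch of what a fully vectorized gradient could look like (backprop_vectorized and sigmoid are names I am introducing here, and Y is the one-hot target matrix; none of this is in the class above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_vectorized(t1, t2, X, Y, reg_lambda=0):
    # Same gradients as NN_1HL.function_prime, computed for the whole batch at once.
    m = X.shape[0]
    a1 = np.hstack((np.ones((m, 1)), X))                 # (m, n_features + 1)
    z2 = np.dot(a1, t1.T)                                # (m, n_hidden)
    a2 = np.hstack((np.ones((m, 1)), sigmoid(z2)))       # (m, n_hidden + 1)
    a3 = sigmoid(np.dot(a2, t2.T))                       # (m, n_labels)

    d3 = a3 - Y                                          # output layer error
    d2 = np.dot(d3, t2[:, 1:]) * sigmoid(z2) * (1 - sigmoid(z2))  # hidden layer error

    Theta1_grad = np.dot(d2.T, a1) / m                   # same shape as t1
    Theta2_grad = np.dot(d3.T, a2) / m                   # same shape as t2
    if reg_lambda != 0:
        Theta1_grad[:, 1:] += (reg_lambda / m) * t1[:, 1:]
        Theta2_grad[:, 1:] += (reg_lambda / m) * t2[:, 1:]
    return Theta1_grad, Theta2_grad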

Probably the next step is to generalize the implementation to an arbitrary number of hidden layers and to test the algorithm on another dataset; both iris and digit recognition are pretty standard.

Another is to try a tool like numba and see if it speeds up the code with some magic. I don't have high expectations, because the heavy matrix work is already handled by numpy, but who knows.

As a final conclusion, I was pretty amazed by the results: a really simple implementation got quite decent results.

I was also amazed by the state of the scipy library; the optimization routines are state of the art from my point of view. Amazing work.