Neural Networks are a widely used tool in many fields. A neural network connects a collection of neurons into a network that can learn and predict patterns. This notebook walks through the basics of Artificial Neural Networks and builds a typical three layer network in Python from scratch.

First, let's discuss the architecture of Neural Networks. Below is a diagram depicting a three layer network.

Neural Networks usually have three types of layers. The first layer is called the input layer. The last layer is the output layer, which produces the values computed by the hypothesis. The layers in the middle are referred to as hidden layers; a network can have more than one hidden layer.

The nodes in the input layer represent the features of the model, i.e. $X_1, X_2,..., X_n$. It is also good practice to add a bias node, $X_0$, to the input layer.

Each node in a hidden layer 1) takes its inputs from either the input layer or the previous hidden layer, 2) computes its output value based on the weights and the activation function associated with the node, and 3) sends that output to the next layer.

The nodes in the output layer calculate the final values for the hypothesis. In a classification problem, the number of nodes in the output layer is equal to the number of classes.

Now, let's define some notations.

$L$: total number of layers

$S_l$: total number of nodes in layer $l$

$\Theta^{(l)}$: weight matrix controlling the mapping from layer $l$ to layer $l+1$

We can write down the elements of this matrix $\Theta^{(l)}$:

$$\begin{bmatrix} \Theta_{10}^{(l)} & \Theta_{11}^{(l)} & \dots & \Theta_{1S_l}^{(l)} \\ \Theta_{20}^{(l)} & \Theta_{21}^{(l)} & \dots & \Theta_{2S_l}^{(l)} \\ \vdots & \vdots & \ddots & \vdots \\ \Theta_{S_{l+1}0}^{(l)} & \Theta_{S_{l+1}1}^{(l)} & \dots & \Theta_{S_{l+1}S_l}^{(l)} \end{bmatrix}$$

Specifically, element $\Theta_{ji}^{(l)}$ stands for the weight (parameter) from the $i^{th}$ node in layer $l$ to the $j^{th}$ node in layer $l+1$.
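To make the dimensions concrete, here is a minimal sketch (the 4-10-3 layer sizes are just an illustration, not part of the model we build below):

import numpy as np

# Theta^(1) maps layer 1 (4 nodes + bias) to layer 2 (10 nodes): shape (10, 5)
theta1 = np.zeros((10, 4 + 1))
# Theta^(2) maps layer 2 (10 nodes + bias) to layer 3 (3 nodes): shape (3, 11)
theta2 = np.zeros((3, 10 + 1))
# in general, Theta^(l) has shape (S_{l+1}, S_l + 1); the extra column corresponds to the bias node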

We can then define the "activation" of the node:

$a_i^{(l)}$: activation of node $i$ in layer $l$

Here the "activation" means the computed value output by each node.

To calculate the activation of each node, we need three things:

  • The inputs
  • The weights
  • The activation function

For a random node $j$ in layer $l+1$, we can plot this individual node and its input/output wires:

We can then define the linear combination of the inputs and weights to be $z_j^{(l+1)}$:

$$z_j^{(l+1)}=\sum_{i=0}^{S_l}a_i^{(l)}\Theta_{ji}^{(l)}$$

Note that $a_0^{(l)}$ here is the added bias unit (which takes the value 1).

Thus the activation $a_j^{(l+1)}$ can be expressed as:

$$a_j^{(l+1)}=g\big(z_j^{(l+1)}\big)$$

Here $g(z)$ is called the activation function. It can take various forms; we will use the same sigmoid function as in Logistic Regression:

$$g(z)=\frac{1}{1+e^{-z}}$$
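To make the computation of a single node concrete, here is a minimal sketch (the numbers and variable names are purely illustrative):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# activations of layer l, with the bias unit a_0^(l) = 1 prepended
a_prev = np.array([1.0, 0.5, -0.3, 2.0])
# weights into node j of layer l+1, one weight per input (including the bias)
theta_j = np.array([0.1, -0.4, 0.25, 0.7])

z_j = theta_j.dot(a_prev)   # z_j^(l+1): linear combination of inputs and weights
a_j = sigmoid(z_j)          # a_j^(l+1): activation of the node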

Now we can put all these notations together in matrix (vectorized) form.

Let's define:

$$a^{(1)}=\begin{bmatrix} X_0 \\ X_1 \\ \vdots \\ X_n \end{bmatrix}$$

For a three layer neural network, we can calculate the activation of each node as follows:

$$z^{(2)}=\Theta^{(1)}a^{(1)}$$
$$a^{(2)}=g\big(z^{(2)}\big)$$
$$a^{(2)}=\big[a_0^{(2)};\ a^{(2)}\big]\ \text{(prepend the bias unit } a_0^{(2)}=1\text{)}$$
$$z^{(3)}=\Theta^{(2)}a^{(2)}$$
$$a^{(3)}=g\big(z^{(3)}\big)$$

For a three layer network, the third layer is also the output layer. We can define the model hypothesis output as:

$$h_\Theta(X)=a^{(3)}$$

This process of calculating the activation for each node is also known as "forward propagation".
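Here is a minimal sketch of this forward pass for a single example (assuming theta1 and theta2 already hold the weight matrices and x is a 1d feature vector; the class we build below does the same thing vectorized over all examples):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_prop(theta1, theta2, x):
    a1 = np.r_[1.0, x]             # a^(1): features with the bias unit prepended
    z2 = theta1.dot(a1)            # layer 1 -> layer 2
    a2 = np.r_[1.0, sigmoid(z2)]   # prepend the bias unit a_0^(2) = 1
    z3 = theta2.dot(a2)            # layer 2 -> layer 3 (output layer)
    a3 = sigmoid(z3)               # h_Theta(x)
    return a3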

With forward propagation, we can compute the output values of the hypothesis given the features and the weight matrix. Now let's discuss how to formulate the cost function.

Generally, we want to minimize the "difference" between the predicted and actual targets. For classification problems, we used the "log-loss" in Logistic Regression; the same idea applies to Neural Networks.

Suppose we have a total of $K$ classes. The output layer then produces $K$ outputs:

$$h_\Theta(X)=\begin{bmatrix} (h_\Theta(X))_1 \\ (h_\Theta(X))_2 \\ \vdots \\ (h_\Theta(X))_K \end{bmatrix} $$

To match this format, we must convert the target $y$ to the same $K$-dimensional vector. For example, if $y$ belongs to the third class among a total of four classes, we could convert it to:

$$\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} $$
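A minimal sketch of this conversion (assuming the labels are integers 0, 1, ..., K-1; the _label_expansion method of the class below does the same thing):

import numpy as np

def to_one_hot(y, num_classes):
    # y: integer labels 0 .. K-1; returns a (K, m) matrix, one column per example
    m = len(y)
    y_mat = np.zeros((num_classes, m))
    y_mat[y, np.arange(m)] = 1
    return y_mat

y_example = to_one_hot(np.array([2]), 4)   # the single column is [0, 0, 1, 0]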

Given $m$ training examples, we can formulate the cost function as follows (with L2 regularization):

$$J(\Theta)=-\frac{1}{m}\Big[\sum_{i=1}^{m}\sum_{k=1}^{K}y_k^{(i)}\log\Big(\big(h_\Theta(X^{(i)})\big)_k\Big)+(1-y_k^{(i)})\log\Big(1-\big(h_\Theta(X^{(i)})\big)_k\Big)\Big]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2$$

The first term is the sum of the log-loss over the $K$ output nodes. The second term is the sum of the squares of the weights (parameters), excluding the bias weights, and serves the purpose of regularization.
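Given the $(K, m)$ matrices of predictions and one-hot targets, the cost can be sketched as follows (this mirrors the _cost_calc method of the class below; the function and variable names here are illustrative):

import numpy as np

def cost(a3, y_mat, theta1, theta2, lamda):
    # a3, y_mat: (K, m) matrices of predicted and one-hot actual outputs
    m = y_mat.shape[1]
    log_loss = -np.sum(y_mat * np.log(a3) + (1 - y_mat) * np.log(1 - a3)) / m
    # regularize every weight except the bias column (index 0)
    reg = lamda / (2.0 * m) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    return log_loss + reg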

As we discussed in Linear Regression and Logistic Regression, we want to calculate the partial derivative term $\frac{\partial J(\Theta)}{\partial \Theta_{ji}^{(l)}}$ for cost minimization algorithms.

Back propagation is designed to compute this partial derivative term.

Now let's define the key notation in "back propagation": $\delta_j^{(l)}$

During forward propagation, we've computed the activation of each node. We would certainly expect some error in those activations compared to the "real" values. Thus, the delta term $\delta_j^{(l)}$ intuitively captures this error for the activation of node $j$ in layer $l$.

For the output layer $L$, this delta is straightforward. It is the difference between the activation and the actual observed value:

$$\delta_j^{(L)}=a_j^{(L)}-y_j$$

Let's use matrix format:

$$\delta^{(L)}=\begin{bmatrix} a_1^{(L)}-y_1 \\ a_2^{(L)}-y_2 \\ \vdots \\ a_K^{(L)}-y_K \end{bmatrix} $$

With $\delta^{(L)}$, we can then calculate $\delta^{(L-1)}, \delta^{(L-2)}, ...$ as follows (here $.*$ denotes element-wise multiplication):

$$\delta^{(L-1)}=(\Theta^{(L-1)})^T\delta^{(L)}.*g'(z^{(L-1)})$$
$$\delta^{(L-2)}=(\Theta^{(L-2)})^T\delta^{(L-1)}.*g'(z^{(L-2)})$$
$$\vdots$$
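For our three layer network this recursion takes only one step, from $\delta^{(3)}$ back to $\delta^{(2)}$. A minimal sketch for a single training example (variable names are illustrative; the row corresponding to the bias unit is dropped after multiplying by $(\Theta^{(2)})^T$):

import numpy as np

def sigmoid_grad(z):
    g = 1.0 / (1.0 + np.exp(-z))
    return g * (1 - g)

def deltas(theta2, a3, y_vec, z2):
    # output layer: difference between the activation and the one-hot target
    delta3 = a3 - y_vec
    # hidden layer: propagate the error back through Theta^(2), drop the bias row,
    # and scale element-wise by g'(z^(2))
    delta2 = theta2.T.dot(delta3)[1:] * sigmoid_grad(z2)
    return delta2, delta3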

With the above "back propagation" process, we can obtain delta terms for all the nodes. This is critical in computing the gradient. It turns out, for a single training example, the gradient is:

$$\frac{\partial}{\partial \Theta_{ji}^{(l)}}J(\Theta)=a_j^{(l)}\delta_i^{(l+1)}+\lambda\Theta_{ji}^{(l)}$$
Please note that I did not include the detailed calculation process on how to reduce the complex expression to its simplest form used with back propagation. It's pure Calculus. But if you are really interested in that aspect, please feel free to shoot me an email at FeifeiYu1024@gmail.com. I will send you a copy of my notes.
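Before putting everything together, here is a minimal sketch of how the activations and deltas combine into the gradient (vectorized over $m$ examples as outer products; this mirrors what the _grad_calc method of the class below does, and the names here are illustrative):

import numpy as np

def gradients(a1, a2, delta2, delta3, theta1, theta2, lamda, m):
    # a1: (S_1+1, m), a2: (S_2+1, m), delta2: (S_2, m), delta3: (S_3, m)
    grad1 = delta2.dot(a1.T) / float(m)
    grad2 = delta3.dot(a2.T) / float(m)
    # add the regularization term for every weight except the bias column
    grad1[:, 1:] += lamda / float(m) * theta1[:, 1:]
    grad2[:, 1:] += lamda / float(m) * theta2[:, 1:]
    return grad1, grad2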

To put it together: for each example, we first use forward propagation to calculate the activation of each node. We then perform back propagation, moving back from the output layer through the previous layers, to compute the delta terms for all the nodes. Since the partial derivatives of the cost function are determined by the corresponding activation and delta terms, we can loop through all training examples to get the gradient of the cost function. Finally, with the cost function and its gradient, we can run optimization algorithms like Gradient Descent or BFGS to learn the values of the weights (parameters).

The following class ThreeLayerNeuralNetwork() implements the neural network with input, hidden, and output layers.


In [1]:
import numpy as np
import scipy.optimize

class ThreeLayerNeuralNetwork():

    def __init__(self, layer1_size, layer2_size, layer3_size, lamda):

        self.layer1_size = layer1_size
        self.layer2_size = layer2_size
        self.layer3_size = layer3_size
        self.lamda = lamda
        self._mu = None
        self._sigma = None
        self._params = None

    def _feature_norm(self, X):

        # it is good practice to normalize features before training a neural network
        mu = np.mean(X, axis=0)
        sigma = np.std(X, axis=0)
        X_norm = (X - mu) / sigma

        return X_norm, mu, sigma

    def _param_reshape(self, params):

        # the input of this function is 1d array with all the parameters for all layers
        # the output will be two 2d array with parameters from layer1 to layer2, 
        # and layer 2 to layer 3
        theta1_num = (self.layer1_size + 1) * self.layer2_size
        theta1_size = (self.layer2_size, self.layer1_size + 1)
        theta2_size = (self.layer3_size, self.layer2_size + 1)
        theta1 = params[:theta1_num].reshape(theta1_size)
        theta2 = params[theta1_num:].reshape(theta2_size)
        return theta1, theta2

    def _sigmoid(self, z):

        # formulate sigmoid function
        return 1.0 / (1.0 + np.exp(-z))

    def _sigmoid_grad(self, z):

        # formulate gradient of the sigmoid function
        g = self._sigmoid(z)
        return g * (1 - g)

    def _label_expansion(self, y):

        # the input of this function is the label with one single value for each example
        # the output is a matrix with each column corresponding to a training example
        # the size of the output matrix is (num_classes, num_examples)
        m = len(y)
        num_classes = len(np.unique(y))
        y_mat = np.zeros((num_classes, m))
        for i in range(m):
            y_mat[y[i], i] = 1

        return y_mat

    def _forward_prop(self, theta1, theta2, X):

        # forward propagation given the current weights
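        # prepend a row of ones (bias units); a1 has shape (layer1_size + 1, num_examples)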
        a1 = np.r_[np.ones((1, X.shape[0])), X.T]
        z2 = theta1.dot(a1)
        a2 = self._sigmoid(z2)
        a2 = np.r_[np.ones((1, a2.shape[1])), a2]
        z3 = theta2.dot(a2)
        a3 = self._sigmoid(z3)

        return a1, a2, a3, z2, z3

    def _back_prop(self, theta2, a3, z2, y):

        # back propagation
        m = len(y)
        y_mat = self._label_expansion(y)
        sigma3 = a3 - y_mat
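        # propagate the output error back through theta2 and scale by g'(z2);
        # z2 is padded with a row of ones so that the bias row can be dropped afterwards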
        sigma2 = theta2.T.dot(sigma3) * self._sigmoid_grad(np.r_[np.ones((1, m)), z2])
        sigma2 = sigma2[1:, :]

        return sigma2, sigma3

    def _cost_calc(self, params, X, y):

        # formulate the cost function
        m = X.shape[0]
        theta1, theta2 = self._param_reshape(params)
        y_mat = self._label_expansion(y)
        _, _, a3, _, _ = self._forward_prop(theta1, theta2, X)
        term1 = -1.0 / m * sum((y_mat * np.log(a3) + (1 - y_mat) * np.log(1 - a3)).ravel())
        term2 = 1.0 * self.lamda / (2.0 * m) * (sum((theta1[:, 1:] ** 2).ravel()) + 
                                                sum((theta2[:, 1:] ** 2).ravel()))

        return term1 + term2

    def _grad_calc(self, params, X, y):

        # formulate gradient
        m = X.shape[0]
        theta1, theta2 = self._param_reshape(params)
        a1, a2, a3, z2, _ = self._forward_prop(theta1, theta2, X)
        sigma2, sigma3 = self._back_prop(theta2, a3, z2, y)
        grad1 = sigma2.dot(a1.T) / (1.0 * m)
        grad2 = sigma3.dot(a2.T) / (1.0 * m)
        grad1[:, 1:] = grad1[:, 1:] + theta1[:, 1:] * float(self.lamda) / m
        grad2[:, 1:] = grad2[:, 1:] + theta2[:, 1:] * float(self.lamda) / m

        return np.r_[grad1.ravel(), grad2.ravel()]

    def _rand_initialize_param(self, layer_in, layer_out):

        # initialize parameters (weights) for neural networks
        epsilon = 0.12
        num = layer_out * (layer_in + 1)
        params = np.array([np.random.random() * 2 * epsilon - epsilon for i in range(num)])
        params = params.reshape((layer_out, layer_in + 1))

        return params

    def _numeric_grad_calc(self, params, X, y):

        # calculate numerical gradient
        # used to check the gradient
        # please turn it off after confirming the gradient is correct
        numeric_grad = np.zeros(params.shape)
        perturb = np.zeros(params.shape)
        e = 1e-4
        for i in range(numeric_grad.shape[0]):
            perturb[i] = e
            loss1 = self._cost_calc(params - perturb, X, y)
            loss2 = self._cost_calc(params + perturb, X, y)
            numeric_grad[i] = (loss2 - loss1) / (2.0 * e)
            perturb[i] = 0

        return numeric_grad

    def _debug_matrix_init(self, num_row, num_col):

        # used to randomly initialize X and y for debug purposes
        mat = np.array([np.sin(x) / 10 for x in range(1, num_row * num_col + 1)])
        mat = mat.reshape(num_row, num_col)

        return mat

    def grad_check(self):

        # check the correctness of the gradient calculation
        # use 10 examples for debug purpose only
        m = 10
        theta1 = self._rand_initialize_param(self.layer1_size, self.layer2_size)
        theta2 = self._rand_initialize_param(self.layer2_size, self.layer3_size)
        X = self._debug_matrix_init(m, self.layer1_size)
        tempY = np.array([x for x in range(1, m + 1)])
        y = np.mod(tempY, self.layer3_size)
        y = y.reshape((m, 1))
        params = np.r_[theta1.ravel(), theta2.ravel()]
        grad = self._grad_calc(params, X, y)
        numeric_grad = self._numeric_grad_calc(params, X, y)

        # if gradient calculation is correct, 
        # the difference between gradient and numerical gradient should be small
        diff = np.linalg.norm(numeric_grad - grad) / np.linalg.norm(numeric_grad + grad)
        print diff

    def fit(self, X, y):

        # train the model
        X, self._mu, self._sigma = self._feature_norm(X)
        theta1 = self._rand_initialize_param(self.layer1_size, self.layer2_size)
        theta2 = self._rand_initialize_param(self.layer2_size, self.layer3_size)
        init_params = np.r_[theta1.ravel(), theta2.ravel()]
        result = scipy.optimize.minimize(fun=self._cost_calc, x0=init_params, args=(X, y),
                                         method='BFGS', jac=self._grad_calc,
                                         options={"maxiter": 200, "disp": False})
        self._params = result.x

        return self

    def predict(self, X):

        # predict with the trained model
        theta1, theta2 = self._param_reshape(self._params)
        X = (X - self._mu) / self._sigma

        # case 1: predict for a single example
        if len(X.shape) == 1:
            a1 = np.r_[np.ones((1, 1)), X.reshape(X.shape[0], 1)]
            a2 = self._sigmoid(theta1.dot(a1))
            a2 = np.r_[np.ones((1, 1)), a2]
            a3 = self._sigmoid(theta2.dot(a2))

            return np.argmax(a3)

        # case 2: predict for multiple examples
        else:
            _, _, a3, _, _ = self._forward_prop(theta1, theta2, X)

            return np.argmax(a3, axis=0)

There is one more thing to handle in practice. As we can see, the gradient calculation is rather complex, so we should make sure it is correct before we move on to the parameter learning / model fitting process. For this check, we can use the "numerical gradient".

Specifically, the cost function $J$ is a function of the parameter vector $\Theta$. We can numerically approximate each partial derivative as below:

$$\frac{\partial}{\partial \Theta}J(\Theta)\approx\frac{J(\Theta+\varepsilon)-J(\Theta-\varepsilon)}{2\varepsilon}$$

Here the perturbation $\varepsilon$ needs to be very small (the code above uses $10^{-4}$) so that the slope approximates the derivative well.
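For intuition, the same central-difference idea applied to a simple one-dimensional function looks like this (the _numeric_grad_calc method above applies it to each element of the parameter vector in turn):

def numeric_derivative(f, x, eps=1e-4):
    # central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

approx = numeric_derivative(lambda x: x ** 2, 3.0)   # d/dx x^2 at x = 3 is approximately 6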

In practice, we can first calculate the partial derivatives using back propagation, and compare them with the numerical gradients. If they are approximately the same, we can continue to use back propagation to perform gradient-based learning.

Also, we need to be aware that the computational cost of the numerical gradient is high. Thus we should turn off gradient checking once we have confirmed the correctness of the back propagation.

Let's perform gradient checking by initializing a simple $3 \times 5 \times 3$ network:


In [2]:
debug_nn = ThreeLayerNeuralNetwork(layer1_size=3, layer2_size=5, layer3_size=3, lamda=1)
debug_nn.grad_check()


3.42567634726e-11

We can see that the difference between the numerical gradient and the gradient computed via back propagation is very small. Thus we can conclude that our back propagation is working correctly.

Let's now load the real data for demo:


In [3]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris['data']
y = iris['target']

print X.shape
print y.shape
print "Number of Classes: {}".format(len(np.unique(y)))


(150L, 4L)
(150L,)
Number of Classes: 3

Split into train and test sets:


In [4]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=26)

Train the neural network with 10 nodes in the hidden layer:


In [5]:
nn = ThreeLayerNeuralNetwork(layer1_size=X.shape[1], layer2_size=10, layer3_size=len(np.unique(y)), lamda=1)
nn = nn.fit(X_train, y_train)
y_predict = nn.predict(X_test)
print "True Values:      {}".format(y_test)
print "Predicted Values: {}".format(y_predict)
print "Prediction Accuracy: {:.2%}".format(np.mean((y_predict == y_test).astype(float)))


True Values:      [1 1 0 0 2 1 0 2 2 2 0 1 1 0 1 0 2 1 2 0 0 2 2 2 2 2 1 0 0 2 1 1 1 2 0 0]
Predicted Values: [1 1 0 0 2 1 0 2 2 2 0 1 1 0 1 0 2 1 2 0 0 2 2 2 2 1 1 0 0 2 1 1 1 2 0 0]
Prediction Accuracy: 97.22%

Other things to consider:

  • Deep learning
  • GPU computing to expedite the model learning process