Neural Networks are a widely used tool in many fields. A neural network connects a collection of neurons into a network that can learn and predict patterns. This notebook walks through the basics of Artificial Neural Networks and builds a typical three-layer network in Python from scratch.
First, let's discuss the architecture of a Neural Network. Below is a diagram depicting a three-layer network.
Neural Networks usually have three types of layers. The first layer is called the input layer. The last layer is called the output layer, which produces the values computed by the hypothesis. The layers in the middle are referred to as hidden layers; a network can have more than one hidden layer.
The nodes in the input layer represent the features of the model, i.e. $X_1, X_2, ..., X_n$. It is also good practice to add a bias node, $X_0$, to the input layer.
Each node in a hidden layer 1) takes inputs from either the input layer or the previous hidden layer, 2) computes its output value based on the weights and activation function associated with the node, and 3) sends that output to the next layer.
The nodes in the output layer calculate the final values of the hypothesis. In a classification problem, the number of nodes in the output layer is equal to the number of classes.
Now, let's define some notations.
$L$: total number of layers
$S_l$: total number of nodes in layer $l$
$\Theta^{(l)}$: weight matrix controlling the mapping from layer $l$ to layer $l+1$
We can write down the elements of this matrix $\Theta^{(l)}$:
Specifically, the element $\Theta_{ji}^{(l)}$ stands for the weight (parameter) from the $i^{th}$ node in layer $l$ to the $j^{th}$ node in layer $l+1$. Since layer $l$ has $S_l$ nodes plus a bias node and layer $l+1$ has $S_{l+1}$ nodes, $\Theta^{(l)}$ has dimension $S_{l+1} \times (S_l + 1)$.
We can then define the "activation" of the node:
$a_i^{(l)}$: activation of node $i$ in layer $l$
Here the "activation" means the computed value output by each node.
To calculate the activation of a node, we need three things: the activations coming in from the previous layer, the weights on those incoming connections, and the activation function.
For an arbitrary node $j$ in layer $l+1$, we can draw this individual node and its input/output wires:
We can then define the linear combination of the inputs and weights to be $z_j^{(l+1)}$:

$$z_j^{(l+1)} = \sum_{i=0}^{S_l} \Theta_{ji}^{(l)} a_i^{(l)}$$
Note that $a_0^{(l)}$ here is the added bias unit (which takes the value 1).
Thus the activation $a_j^{(l+1)}$ can be expressed as:

$$a_j^{(l+1)} = g\big(z_j^{(l+1)}\big)$$
Here $g(z)$ is called the activation function, and it can take various forms. We will use the same Sigmoid Function as in Logistic Regression:

$$g(z) = \frac{1}{1 + e^{-z}}$$
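To make this concrete, here is a minimal NumPy sketch of what a single node computes. The names `a_prev` and `theta_j` (and the values) are purely illustrative and are not part of the implementation below:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical example: previous layer with bias unit a_0 = 1 and two activations
a_prev = np.array([1.0, 0.5, -0.3])    # [a_0, a_1, a_2]
theta_j = np.array([0.1, -0.2, 0.4])   # weights into node j of the next layer

z_j = theta_j.dot(a_prev)              # linear combination z_j^{(l+1)}
a_j = sigmoid(z_j)                     # activation a_j^{(l+1)} = g(z_j^{(l+1)})
print(a_j)
```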
Let's also define the input layer activations to be the features themselves, $a^{(1)} = x$, with the bias unit $a_0^{(1)} = 1$ added.
For a three-layer neural network, we can then calculate the activation of each layer as follows:

$$z^{(2)} = \Theta^{(1)} a^{(1)}, \qquad a^{(2)} = g\big(z^{(2)}\big)$$

$$z^{(3)} = \Theta^{(2)} a^{(2)}, \qquad a^{(3)} = g\big(z^{(3)}\big)$$

(where the bias unit $a_0^{(2)} = 1$ is added to $a^{(2)}$ before computing $z^{(3)}$)
For a three-layer network, the third layer is also the output layer. We can define the model's hypothesis output as:

$$h_\Theta(x) = a^{(3)}$$
This process of calculating the activation for each node is also known as "forward propagation".
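As a quick illustration of forward propagation, here is a standalone NumPy sketch, separate from the full class implementation later in this notebook. The layer sizes and the randomly initialized weights are made up just for demonstration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical sizes: 3 input features, 5 hidden nodes, 3 output classes
rng = np.random.RandomState(0)
theta1 = rng.randn(5, 3 + 1) * 0.1    # maps layer 1 -> layer 2 (includes bias column)
theta2 = rng.randn(3, 5 + 1) * 0.1    # maps layer 2 -> layer 3 (includes bias column)

x = np.array([0.2, -0.1, 0.7])        # one training example

a1 = np.r_[1.0, x]                    # add bias unit a_0^{(1)} = 1
z2 = theta1.dot(a1)
a2 = np.r_[1.0, sigmoid(z2)]          # add bias unit a_0^{(2)} = 1
z3 = theta2.dot(a2)
a3 = sigmoid(z3)                      # hypothesis output h_Theta(x)
print(a3)
```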
With forward propagation, we can compute the output values of the hypothesis given the features and the weight matrix. Now let's discuss how to formulate the cost function.
Generally, we want to minimize the "difference" between predicted and actual targets. For classification problems, we've used "log-loss" in the Logistic Regression. This idea can be implemented similarly in Neural Networks.
Suppose we have a total of $K$ different classes. The output layer then produces $K$ outputs:

$$h_\Theta(x) = \begin{bmatrix} (h_\Theta(x))_1 \\ (h_\Theta(x))_2 \\ \vdots \\ (h_\Theta(x))_K \end{bmatrix} \in \mathbb{R}^K$$
To match this format, we must convert the target $y$ into the same $K$-dimensional vector. For example, if $y$ belongs to the third class among a total of four classes, we could convert it to:

$$y = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

The cost function over $m$ training examples can then be written as:

$$J(\Theta) = -\frac{1}{m}\sum_{t=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(t)} \log\big(h_\Theta(x^{(t)})\big)_k + \big(1 - y_k^{(t)}\big)\log\big(1 - (h_\Theta(x^{(t)}))_k\big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{S_l}\sum_{j=1}^{S_{l+1}}\big(\Theta_{ji}^{(l)}\big)^2$$

The first term is the sum of the log-loss over the $K$ output nodes. The second term is the sum of squares of the weights (parameters), added for regularization purposes (the bias weights are not regularized).
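To make the two terms concrete, here is a minimal NumPy sketch with made-up arrays. The names `a3`, `y_mat`, `theta1`, and `theta2` mirror the class implementation later in this notebook, but the values are purely illustrative:

```python
import numpy as np

m = 4                                   # number of (hypothetical) training examples
lam = 1.0                               # regularization strength

# hypothetical forward-propagation outputs (K = 3 classes, m = 4 examples)
a3 = np.array([[0.7, 0.2, 0.1, 0.3],
               [0.2, 0.6, 0.3, 0.3],
               [0.1, 0.2, 0.6, 0.4]])

y = np.array([0, 1, 2, 2])              # integer labels
y_mat = np.zeros((3, m))
y_mat[y, np.arange(m)] = 1              # one-hot encode, one column per example

# hypothetical weight matrices (first column corresponds to the bias units)
theta1 = np.full((3, 4), 0.1)
theta2 = np.full((3, 4), 0.1)

log_loss = -np.sum(y_mat * np.log(a3) + (1 - y_mat) * np.log(1 - a3)) / m
reg = lam / (2.0 * m) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
print(log_loss + reg)                   # total cost J(Theta)
```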
As we discussed in Linear Regression and Logistic Regression, we want to calculate the partial derivative term $\frac{\partial J(\Theta)}{\partial \Theta_{ji}^{(l)}}$ for cost minimization algorithms.
Now let's define the key notation in "back propagation": $\delta_j^{(l)}$
In forward propagation, we computed the activation of each node. We would certainly expect some error in those activations compared to the "real" values. The delta term $\delta_j^{(l)}$ intuitively captures this error for the activation of node $j$ in layer $l$.
For the output layer $L$, this delta is straightforward. It is the difference between the activation and the actual observed value:

$$\delta_j^{(L)} = a_j^{(L)} - y_j$$
In matrix form:

$$\delta^{(L)} = a^{(L)} - y$$
With $\delta^{(L)}$, we can then calculate $\delta^{(L-1)}, \delta^{(L-2)}, \ldots$ as follows:

$$\delta^{(l)} = \big(\Theta^{(l)}\big)^T \delta^{(l+1)} \circ g'\big(z^{(l)}\big)$$

where $\circ$ denotes element-wise multiplication and $g'(z) = g(z)\big(1 - g(z)\big)$ is the derivative of the Sigmoid Function.
To put it all together: for each example, we first use forward propagation to calculate the activation of each node. Then we perform back propagation, moving from the output layer back through the previous layers, to compute the delta terms for all the nodes. The partial derivative of the cost function with respect to each weight is determined by the corresponding activation and delta terms, so we can loop through all training examples to accumulate the gradient of the cost function. Finally, with the cost function and its gradient, we can run optimization algorithms such as Gradient Descent or BFGS to learn the values of the weights (parameters).
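For reference, the per-weight partial derivative that back propagation accumulates takes the standard form below. The superscript $(t)$ indexing training examples is notation added here for clarity, and the regularization term applies only to $i \neq 0$, i.e. not to the bias weights:

$$\frac{\partial J(\Theta)}{\partial \Theta_{ji}^{(l)}} = \frac{1}{m}\sum_{t=1}^{m} a_i^{(l),(t)}\,\delta_j^{(l+1),(t)} + \frac{\lambda}{m}\Theta_{ji}^{(l)}$$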
The following class ThreeLayerNeuralNetwork() implements the neural network with input, hidden, and output layers.
In [1]:
import numpy as np
import scipy.optimize


class ThreeLayerNeuralNetwork():

    def __init__(self, layer1_size, layer2_size, layer3_size, lamda):
        self.layer1_size = layer1_size
        self.layer2_size = layer2_size
        self.layer3_size = layer3_size
        self.lamda = lamda
        self._mu = None
        self._sigma = None
        self._params = None

    def _feature_norm(self, X):
        # it is good practice to normalize features before training a neural network
        mu = np.mean(X, axis=0)
        sigma = np.std(X, axis=0)
        X_norm = (X - mu) / sigma
        return X_norm, mu, sigma

    def _param_reshape(self, params):
        # the input of this function is a 1d array with the parameters for all layers
        # the output is two 2d arrays: the parameters from layer 1 to layer 2,
        # and from layer 2 to layer 3
        theta1_num = (self.layer1_size + 1) * self.layer2_size
        theta1_size = (self.layer2_size, self.layer1_size + 1)
        theta2_size = (self.layer3_size, self.layer2_size + 1)
        theta1 = params[:theta1_num].reshape(theta1_size)
        theta2 = params[theta1_num:].reshape(theta2_size)
        return theta1, theta2

    def _sigmoid(self, z):
        # sigmoid activation function
        return 1.0 / (1.0 + np.exp(-z))

    def _sigmoid_grad(self, z):
        # gradient of the sigmoid function
        g = self._sigmoid(z)
        return g * (1 - g)

    def _label_expansion(self, y):
        # the input of this function is the label with one single value for each example
        # the output is a matrix with each column corresponding to a training example
        # the size of the output matrix is (num_classes, num_examples)
        m = len(y)
        num_classes = len(np.unique(y))
        y_mat = np.zeros((num_classes, m))
        for i in range(m):
            y_mat[y[i], i] = 1
        return y_mat

    def _forward_prop(self, theta1, theta2, X):
        # forward propagation given the current weights
        a1 = np.r_[np.ones((1, X.shape[0])), X.T]
        z2 = theta1.dot(a1)
        a2 = self._sigmoid(z2)
        a2 = np.r_[np.ones((1, a2.shape[1])), a2]
        z3 = theta2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, a2, a3, z2, z3

    def _back_prop(self, theta2, a3, z2, y):
        # back propagation
        m = len(y)
        y_mat = self._label_expansion(y)
        sigma3 = a3 - y_mat
        sigma2 = theta2.T.dot(sigma3) * self._sigmoid_grad(np.r_[np.ones((1, m)), z2])
        sigma2 = sigma2[1:, :]
        return sigma2, sigma3

    def _cost_calc(self, params, X, y):
        # formulate the cost function
        m = X.shape[0]
        theta1, theta2 = self._param_reshape(params)
        y_mat = self._label_expansion(y)
        _, _, a3, _, _ = self._forward_prop(theta1, theta2, X)
        term1 = -1.0 / m * sum((y_mat * np.log(a3) + (1 - y_mat) * np.log(1 - a3)).ravel())
        term2 = 1.0 * self.lamda / (2.0 * m) * (sum((theta1[:, 1:] ** 2).ravel()) +
                                                sum((theta2[:, 1:] ** 2).ravel()))
        return term1 + term2

    def _grad_calc(self, params, X, y):
        # formulate the gradient
        m = X.shape[0]
        theta1, theta2 = self._param_reshape(params)
        a1, a2, a3, z2, _ = self._forward_prop(theta1, theta2, X)
        sigma2, sigma3 = self._back_prop(theta2, a3, z2, y)
        grad1 = sigma2.dot(a1.T) / (1.0 * m)
        grad2 = sigma3.dot(a2.T) / (1.0 * m)
        grad1[:, 1:] = grad1[:, 1:] + theta1[:, 1:] * float(self.lamda) / m
        grad2[:, 1:] = grad2[:, 1:] + theta2[:, 1:] * float(self.lamda) / m
        return np.r_[grad1.ravel(), grad2.ravel()]

    def _rand_initialize_param(self, layer_in, layer_out):
        # randomly initialize the parameters (weights) for the neural network
        epsilon = 0.12
        num = layer_out * (layer_in + 1)
        params = np.array([np.random.random() * 2 * epsilon - epsilon for i in range(num)])
        params = params.reshape((layer_out, layer_in + 1))
        return params

    def _numeric_grad_calc(self, params, X, y):
        # calculate the numerical gradient
        # used only to check the analytical gradient;
        # turn it off after confirming the gradient is correct
        numeric_grad = np.zeros(params.shape)
        perturb = np.zeros(params.shape)
        e = 1e-4
        for i in range(numeric_grad.shape[0]):
            perturb[i] = e
            loss1 = self._cost_calc(params - perturb, X, y)
            loss2 = self._cost_calc(params + perturb, X, y)
            numeric_grad[i] = (loss2 - loss1) / (2.0 * e)
            perturb[i] = 0
        return numeric_grad

    def _debug_matrix_init(self, num_row, num_col):
        # used to deterministically initialize X and y for debugging purposes
        mat = np.array([np.sin(x) / 10 for x in range(1, num_row * num_col + 1)])
        mat = mat.reshape(num_row, num_col)
        return mat

    def grad_check(self):
        # check the correctness of the gradient calculation
        # use 10 examples for debugging purposes only
        m = 10
        theta1 = self._rand_initialize_param(self.layer1_size, self.layer2_size)
        theta2 = self._rand_initialize_param(self.layer2_size, self.layer3_size)
        X = self._debug_matrix_init(m, self.layer1_size)
        tempY = np.array([x for x in range(1, m + 1)])
        y = np.mod(tempY, self.layer3_size)
        y = y.reshape((m, 1))
        params = np.r_[theta1.ravel(), theta2.ravel()]
        grad = self._grad_calc(params, X, y)
        numeric_grad = self._numeric_grad_calc(params, X, y)
        # if the gradient calculation is correct, the difference between
        # the analytical and numerical gradients should be very small
        diff = np.linalg.norm(numeric_grad - grad) / np.linalg.norm(numeric_grad + grad)
        print(diff)

    def fit(self, X, y):
        # train the model
        X, self._mu, self._sigma = self._feature_norm(X)
        theta1 = self._rand_initialize_param(self.layer1_size, self.layer2_size)
        theta2 = self._rand_initialize_param(self.layer2_size, self.layer3_size)
        init_params = np.r_[theta1.ravel(), theta2.ravel()]
        result = scipy.optimize.minimize(fun=self._cost_calc, x0=init_params, args=(X, y),
                                         method='BFGS', jac=self._grad_calc,
                                         options={"maxiter": 200, "disp": False})
        self._params = result.x
        return self

    def predict(self, X):
        # predict with the trained model
        theta1, theta2 = self._param_reshape(self._params)
        X = (X - self._mu) / self._sigma
        # case 1: predict for a single example
        if len(X.shape) == 1:
            a1 = np.r_[np.ones((1, 1)), X.reshape(X.shape[0], 1)]
            a2 = self._sigmoid(theta1.dot(a1))
            a2 = np.r_[np.ones((1, 1)), a2]
            a3 = self._sigmoid(theta2.dot(a2))
            return np.argmax(a3)
        # case 2: predict for multiple examples
        else:
            _, _, a3, _, _ = self._forward_prop(theta1, theta2, X)
            return np.argmax(a3, axis=0)
There is one more thing to take care of in practice. As we can see, the gradient calculation is rather complex, so we should make sure it is correct before we move on to parameter learning / model fitting. For this check, we can use the "numerical gradient".
Specifically, the cost function $J$ is a function of the parameter vector $\Theta$. We can numerically approximate each partial derivative with a central difference:

$$\frac{\partial J(\Theta)}{\partial \Theta_i} \approx \frac{J(\Theta + \varepsilon) - J(\Theta - \varepsilon)}{2\varepsilon_i}$$

where $\varepsilon$ is a vector that is zero everywhere except for a small value $\varepsilon_i$ in the $i^{th}$ position. This nonzero element needs to be very small (the code uses $10^{-4}$) for the slope to approximate the derivative well.
Also, we need to be aware that the computational cost of the numerical gradient is high. Thus we should turn off this gradient checking once we have confirmed the correctness of back propagation.
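As a standalone sketch of this check (separate from the `_numeric_grad_calc` method above; the quadratic `cost` function here is a made-up example with a known gradient):

```python
import numpy as np

def numerical_gradient(cost, params, e=1e-4):
    # approximate dJ/d(theta_i) by perturbing one parameter at a time
    grad = np.zeros(params.shape)
    perturb = np.zeros(params.shape)
    for i in range(params.shape[0]):
        perturb[i] = e
        grad[i] = (cost(params + perturb) - cost(params - perturb)) / (2.0 * e)
        perturb[i] = 0
    return grad

# made-up cost function whose true gradient is 2 * theta, for comparison
cost = lambda theta: np.sum(theta ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(cost, theta))   # should be close to [2.0, -4.0, 1.0]
```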
Let's do the gradient check by initializing a simple $3 \times 5 \times 3$ network:
In [2]:
debug_nn = ThreeLayerNeuralNetwork(layer1_size=3, layer2_size=5, layer3_size=3, lamda=1)
debug_nn.grad_check()
We can see that the difference between the numerical gradient and the back-propagation gradient is very tiny. Thus we can conclude that our back propagation is working correctly. Now let's apply the network to a real dataset. We will use the classic Iris dataset from scikit-learn:
In [3]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris['data']
y = iris['target']
print(X.shape)
print(y.shape)
print("Number of Classes: {}".format(len(np.unique(y))))
Split into train and test sets:
In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=26)
Train the neural network with 10 nodes in the hidden layer:
In [5]:
nn = ThreeLayerNeuralNetwork(layer1_size=X.shape[1], layer2_size=10, layer3_size=len(np.unique(y)), lamda=1)
nn = nn.fit(X_train, y_train)
y_predict = nn.predict(X_test)
print "True Values: {}".format(y_test)
print "Predicted Values: {}".format(y_predict)
print "Prediction Accuracy: {:.2%}".format(np.mean((y_predict == y_test).astype(float)))