# Applications of Machine Learning with Artificial Neural Networks and Supervised Regression

Lucas Barbosa

### Training Time

Neural networks are like lawyers, useless unless taught how to do good hence the training algorithms have to be pretty good to round up a decent combination of weights. The most optimal way as mentioned before is using a cost function, that cost function again being

$\eta = \sum \frac{1}{2} (y - \hat{y})^{2}$

The one half coefficient is there to omit the power when differentiation the cost function later for simplicity. The squared also exploits the convex nature of the cost function to avoid having to use Stochastic Gradient Descent as opposed to Batch Gradient Descent to train the ANN.

 Stochastic Gradient Descent(SGD) Batch Gradient Descent(BGD) Error is computed randomly through each individual input and its corresponding output’s error. Error is computed through a summation of the total error of all the inputs and corresponding outputs.

The advantage to using Gradient Descent is that it doesn’t matter how many dimensions our data has, the curse of dimensionality doesn’t effect the overall algorithm in anyway since it knows which was is right, which way convergence lies. The only thing that actually needs to be done is for an algorithm to be written which computes the gradient of the cost function in respects to both the weights W1 and W2 being used. The way to compute all these numbers is to start with the initial error which is in our output, then work backward from there hence why applying Gradient Descent to ANN’s is referred to back propagation.

Since computing the gradient of the cost function involves more than one term (both weights) partial derivatives must be taken in respect to each weight individually then applied individually too. Since we’re working backwards here the first term in respect to the cost will be W2. The gradient descent algorithm will be derived using the chain rule in calculus, which makes it much simples to connect all the dots.

$\frac{\partial \eta}{\partial W^{(2)}} = \sum \frac{\partial \eta}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W^{(2)}}$

$\frac{\partial \eta}{\partial W^{(2)}} = \sum \frac{\partial \eta}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial W^{(2)}}$

where:

$\frac{\partial \eta}{\partial \hat{y}} = -(y - \hat{y})$

As yhat is the result of the activation function on z3, its rate of change will equate to its derivative.

$\hat{y} = \varphi(z^{(3)})$

$\frac{\partial \hat{y}}{\partial z^{(3)}} = {\varphi}'(z^{(3)})$

This is where z3 was a product of a2 and W2.

$z^{(3)} = a^{(2)} W^{(2)}$

$\frac{\partial z^{(3)}}{\partial W^{(2)}} = a^{(2)}$

For the sake of simplity, concider:

$\delta^{(3)} = -(y - \hat{y}) \ {\varphi}'(z^{(3)})$

This putting all of the parts of the puzzle together and transposing matrix a2 to make sure all of its entries are mutiplied by delta3 to yield the total back propagation error for the second set of weights.

$\frac{\partial \eta}{\partial W^{(2)}} = \sum -(y - \hat{y}) \ \varphi(z^{(3)}) \ (a^{(2)})^{T}$

$\frac{\partial \eta}{\partial W^{(2)}} = \sum \delta^{(3)} \ (a^{(2)})^{T}$

The same partial derivative needs to be computed for the first set of weights W1. The same approach with the chain rule will be taken, however there will be some different aspects to it.

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \frac{\partial \eta}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W^{(1)}}$

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \frac{\partial \eta}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial W^{(1)}}$

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \frac{\partial \eta}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(3)}} \frac{\partial a^{(2)}}{\partial W^{(1)}} \frac{\partial z^{(3)}}{\partial a^{(2)}}$

where:

$\delta^{(3)} = \frac{\partial \eta}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(3)}}$

Making sure things stay simple and kept in as few terms as possible:

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \delta^{(3)} \frac{\partial a^{(2)}}{\partial W^{(1)}} \frac{\partial z^{(3)}}{\partial a^{(2)}}$

where also:

$z^{(3)} = a^{(2)} W^{(2)}$

$\frac{\partial z^{(3)}}{\partial a^{(2)}} = W^{(2)}$

For the last piece of the puzzle, In order to make sense of the rest, the activity of the second layer must be used to connect to the first set of weights W1.

$\frac{\partial a^{(2)}}{\partial W^{(1)}} = \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(1)}}$

Similiar to the sigmoid prime of z3 above, the same differentiating happens with a2 and z2.

$a^{(2)} = \varphi(z^{(2)})$

$\frac{\partial a^{(2)}}{\partial z^{(2)}} = {\varphi}'(z^{(2)})$

The final term has arrived all the way back to the beginning of the network, the X input matrix.

$z^{(2)} = X W^{(1)}$

$\frac{\partial z^{(2)}}{\partial W^{(1)}} = X$

To bring it all together now in terms of W1:

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \delta^{(3)} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(1)}} (W^{(2)})^{T}$

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \delta^{(3)} {\varphi}'(z^{(2)}) \frac{\partial z^{(2)}}{\partial W^{(1)}} (W^{(2)})^{T}$

Both the X and W2 matrices required to be transposed in order for their respective entries to be successfully multiplied by every value of delta3 and delta2, the back propagating error of each layer of weights.

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \delta^{(3)} {\varphi}'(z^{(2)}) X^T (W^{(2)})^{T}$

where:

$\delta^{(2)} = \delta^{(3)} {\varphi}'(z^{(2)}) (W^{(2)})^{T}$

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \delta^{(2)} X^T$

Therefore the algorithms which will compute the respective gradients of the weights W1 and W2 ultimately evaluate to:

$\frac{\partial \eta}{\partial W^{(1)}} = \sum \delta^{(2)} X^T$

$\frac{\partial \eta}{\partial W^{(2)}} = \sum \delta^{(3)} \ (a^{(2)})^{T}$

Now that the algorithms are clear and ready to be put in use its time to code them up. These Gradient Descent computations will be implemented into the previous version of the Neural Network class. A new feature added to this class is the regularization parameter used to make sure that the data isn't overfitted and still models the real world.



In [1]:

class Neural_Network(object):
def __init__(self, learning_rate=0):
# define hyperparameters
self.input_layer_size = 2
self.hidden_layer_size = 3
self.output_layer_size = 1

# define parameters
self.W1 = np.random.randn(self.input_layer_size, self.hidden_layer_size)
self.W2 = np.random.randn(self.hidden_layer_size, self.output_layer_size)

# regularization parameter
self.learning_rate = learning_rate

# forward propagation
def forward(self, X):
self.z2 = np.dot(X, self.W1)
self.a2 = self.sigmoid(self.z2)
self.z3 = np.dot(self.a2, self.W2)
prediction = self.sigmoid(self.z3)
return prediction

# activation functions
def sigmoid(self, z):
return 1 / (1 + np.exp(-z))

# derivative of sigmoid function
def sigmoid_prime(self, z):
return np.exp(-z) / ((1 + np.exp(-z))**2)

# efficient backprop
def cost_function(self, X, desired_output):
self.prediction = self.forward(X)
total_error = ((1/2) * sum((desired_output - self.prediction)**2)) / X.shape[0] + \
(self.learning_rate / 2) * (np.sum(self.W1**2) + np.sum(self.W2**2))

def cost_function_prime(self, X, desired_y):
self.prediction = self.forward(X)

# layer 3 backprop error
l3_backprop_error   = np.multiply(-(desired_y - self.prediction), \
self.sigmoid_prime(self.z3))
# divide by X.shape[0] to account for the scale of the data
cost_in_terms_of_W2 = np.dot(self.a2.T, l3_backprop_error) / X.shape[0] + \
(self.learning_rate * self.W2)

# layer 2 backprop error
l2_backprop_error   = np.dot(l3_backprop_error, self.W2.T) * \
self.sigmoid_prime(self.z2)
# divide by X.shape[0] to account for the scale of the data
cost_in_terms_of_W1 = np.dot(X.T, l2_backprop_error) / X.shape[0] + \
(self.learning_rate * self.W1)

return cost_in_terms_of_W1, cost_in_terms_of_W2

# altering and setting the parameters during training
def get_params(self):
# get W1 & W2 rolled into a vector
params = np.concatenate((self.W1.ravel(), self.W2.ravel()))
return params

def set_params(self, params):
# set W1 & W2 using single parameter vector
W1_start = 0
W1_end   = self.hidden_layer_size * self.input_layer_size
# reshape the W1 weights
self.W1  = np.reshape(params[W1_start : W1_end], \
(self.input_layer_size, self.hidden_layer_size))
W2_end   = W1_end + self.hidden_layer_size * self.output_layer_size
# reshape the W2 weights
self.W2  = np.reshape(params[W1_end : W2_end], \
(self.hidden_layer_size, self.output_layer_size))

cost_in_terms_of_W1, cost_in_terms_of_W2 = self.cost_function_prime(X, desired_y)
return np.concatenate((cost_in_terms_of_W1.ravel(), cost_in_terms_of_W2.ravel()))



Before any further progress in terms of training the network is done one more thing is crucial. It’s not all the time that the ANN’s algorithms actually work and converge to nice set of weight parameters. In fact some of the time there could be an underlying issue which involves the incorrect calculation towards a gradient. The problem with this type of error is that it won’t stop the network from still propagating, so there’s no way to know if something’s wrong…unless there’s a way of making sure that the gradients being outputted are at least slightly accurate. This is achieved through Numerical Gradient Checking.



In [2]:

class Helper(object):
def __init__(self, Local_Ref):
# set a local reference to NN class
self.Local_Ref = Local_Ref

# normalize data to account for different units
def scale_data(self, hours, test_score):
MAX_SCORE = 100.
hours      /= np.amax(hours, axis=0)
test_score /= MAX_SCORE
return hours, test_score

initial_params     = self.Local_Ref.get_params()
perturb            = np.zeros(initial_params.shape)

# epsilon value needs to be small enough act as a 'zero'
epsilon = 1e-4

for i in range(len(initial_params)):
# set perturbation vector to alter the original state of the initial params
perturb[i] = epsilon
self.Local_Ref.set_params(initial_params + perturb)
loss_2 = self.Local_Ref.cost_function(X, desired_y)

self.Local_Ref.set_params(initial_params - perturb)
loss_1 = self.Local_Ref.cost_function(X, desired_y)

numerical_gradient[i] = (loss_2 - loss_1) / (2 * epsilon)

perturb[i] = 0

self.Local_Ref.set_params(initial_params)




In [3]:

import numpy as np
from scipy import optimize

# training data
train_x = np.array(([3,5],[5,1],[10,2],[6,1.5]), dtype=float)
train_y = np.array(([75],[82],[93],[70]), dtype=float)

# testing data
test_x = np.array(([4, 5.5],[4.5, 1],[9,2.5],[6,2]), dtype=float)
test_y = np.array(([70],[89],[85],[75]), dtype=float)

NN = Neural_Network(learning_rate=0.0001)
Aux = Helper(NN)

# normalize data
train_x, train_y = Aux.scale_data(train_x, train_y)
test_x, test_y   = Aux.scale_data(test_x, test_y)

# check to see gradients have been correctly calculated




In [4]:




Out[4]:

array([ 0.00236028,  0.00088974, -0.00020404,  0.00248351,  0.00092107,
-0.00025142,  0.00720993,  0.00430147,  0.00402797])




In [5]:




Out[5]:

array([ 0.00236028,  0.00088974, -0.00020404,  0.00248351,  0.00092107,
-0.00025142,  0.00720993,  0.00430147,  0.00402797])



The numerical gradient values seem to be very much the same to the computed gradients values which indicate that our ANN is actually working. Now that the gradients are actually being outputed its time to write up a Trainer class for this ANN. The trainer class will also incorporate the Quasi-Newton BFGS Mathematical Optimization algorithm to change it up a little.



In [6]:

class Trainer(object):
def __init__(self, Local_Ref):
# make local reference to NN
self.Local_Ref = Local_Ref

def cost_function_wrapper(self, params, X, desired_y):
self.Local_Ref.set_params(params)
total_cost = self.Local_Ref.cost_function(X, desired_y)

# track cost function value as training progresses
def callback(self, params):
self.Local_Ref.set_params(params)
self.cost_list.append(self.Local_Ref.cost_function(self.train_x, self.train_y))
self.test_cost_list.append(self.Local_Ref.cost_function(self.test_x, self.test_y))

def train(self, train_x, train_y, test_x, test_y):

# internal variable for callback function
self.train_x = train_x
self.train_y = train_y

self.test_x = test_x
self.test_y = test_y

# empty lists to store costs
self.cost_list = []
self.test_cost_list = []

initial_params =  self.Local_Ref.get_params()

options = {"maxiter": 200, "disp": True}
_result = optimize.minimize(self.cost_function_wrapper, initial_params, jac=True, \
method="BFGS", args=(train_x, train_y), options=options, \
callback=self.callback)

# once the training is complete finally set the new values in
self.Local_Ref.set_params(_result.x)
self.optimization_results = _result




In [7]:

T = Trainer(NN)
T.train(train_x, train_y, test_x, test_y)




Optimization terminated successfully.
Current function value: 0.002618
Iterations: 85
Function evaluations: 91




In [8]:

%matplotlib inline
import matplotlib.pyplot as plt

# graphing the cost in respect to interations
plt.plot(T.cost_list)
plt.plot(T.test_cost_list)
plt.grid(1)
plt.xlabel("Iterations")
plt.ylabel("Cost")




Out[8]:

<matplotlib.text.Text at 0x10e28f5c0>




In [9]:

train_y




Out[9]:

array([[ 0.75],
[ 0.82],
[ 0.93],
[ 0.7 ]])




In [10]:

NN.forward(train_x)




Out[10]:

array([[ 0.74741042],
[ 0.78130396],
[ 0.85711074],
[ 0.80038587]])



Thanks to the early regularisation hyper parameter there is no overfitting of data and the training and testing data still model the real world. This just about concludes the training process of the ANN and in fact the life span of this project.