In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn as skl
import sklearn.datasets
import sklearn.linear_model
%matplotlib inline

In [2]:
# Generate data
X, y = sklearn.datasets.make_moons(300, noise=0.22)
plt.figure(figsize=(7, 5))
plt.scatter(X[:, 0], X[:, 1], s=15, c=y, cmap=plt.cm.Spectral)
plt.show()


 

Feedforward Neural Network


In [3]:
# import feedforward neural net 
from mlnn import neural_net

Let's build a 4-layer neural network. Our network has one input layer, two hidden layers and one output layer. The model can be represented as a directed acyclic graph in which each node in a layer is connected to every node in the successive layer. The neural net is shown below.

Each node in the hidden layers uses a nonlinear activation function $f(x)$, which computes an output from its inputs and transfers this output to the successive layer. Here we've used $f(x) = tanh(x)$ as our nonlinear activation. Its derivative is given by $f'(x) = 1 - tanh(x)^2$.

Our network graph can be represented as:

| Layer No. | Notation | Value | Variable |
| --- | --- | --- | --- |
| 1 | X | $X$ | X |
| 2 | W1(~)+b1 | $W1*X + b1$ | pre_act1 |
| 2 | tanh | $tanh(W1*X + b1)$ | act1 |
| 3 | W2(~)+b2 | $W2*tanh(W1*X + b1) + b2$ | pre_act2 |
| 3 | tanh | $tanh(W2*tanh(W1*X + b1) + b2)$ | act2 |
| 4 | W3(~)+b3 | $W3*tanh(W2*tanh(W1*X + b1) + b2) + b3$ | pre_act3 |
| 4 | softmax | $softmax(W3*tanh(W2*tanh(W1*X + b1) + b2) + b3)$ | act3 |
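
To make the table concrete, here is a minimal NumPy sketch of the forward pass. The row-vector layout, weight shapes, and numerically stabilized softmax are assumptions made for illustration; the variable names mirror the table, not the actual mlnn implementation.

def forward(X, W1, b1, W2, b2, W3, b3):
    # Sketch only; assumes X has one example per row and weights shaped to match.
    pre_act1 = X.dot(W1) + b1                 # layer 2 affine transform
    act1 = np.tanh(pre_act1)                  # layer 2 nonlinearity
    pre_act2 = act1.dot(W2) + b2              # layer 3 affine transform
    act2 = np.tanh(pre_act2)                  # layer 3 nonlinearity
    pre_act3 = act2.dot(W3) + b3              # output layer affine transform
    scores = np.exp(pre_act3 - pre_act3.max(axis=1, keepdims=True))
    act3 = scores / scores.sum(axis=1, keepdims=True)   # softmax probabilities
    return act1, act2, act3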

Backpropagation

Now we formulate the backpropagation algorithm, or backprop, for training the network. For a derivation of backprop, please see Dr. Hugo Larochelle's excellent course on neural networks.

$ \large\frac{\partial L}{\partial Pred} = \frac{\partial L}{\partial L} * \frac{\partial L}{\partial Pred} $

$ \large\frac{\partial L}{\partial act3} = \frac{\partial L}{\partial Pred} * \frac{\partial Pred}{\partial act3} $

$ \large\frac{\partial L}{\partial pre\_act3} = \frac{\partial L}{\partial act3} * \frac{\partial act3}{\partial pre\_act3}= \delta4$

$ \large\frac{\partial L}{\partial act2} = \frac{\partial L}{\partial pre\_act3} * \frac{\partial pre\_act3}{\partial act2} $

$ \large\frac{\partial L}{\partial pre\_act2} = \frac{\partial L}{\partial act2} * \frac{\partial act2}{\partial pre\_act2}= \delta3$

$ \large\frac{\partial L}{\partial act1} = \frac{\partial L}{\partial pre\_act2} * \frac{\partial pre\_act2}{\partial act1} $

$ \large\frac{\partial L}{\partial pre\_act1} = \frac{\partial L}{\partial act1} * \frac{\partial act1}{\partial pre\_act1}= \delta2$

$ \large\frac{\partial L}{\partial W3} = \delta4 * \frac{\partial pre\_act3}{\partial W3}$

$ \large\frac{\partial L}{\partial W2} = \delta3 * \frac{\partial pre\_act2}{\partial W2}$

$ \large\frac{\partial L}{\partial W1} = \delta2 * \frac{\partial pre\_act1}{\partial W1}$

$ \large\frac{\partial L}{\partial b3} = \delta4 * \frac{\partial pre\_act3}{\partial b3} = \delta4 * 1$

$ \large\frac{\partial L}{\partial b2} = \delta3 * \frac{\partial pre\_act2}{\partial b2}= \delta3 * 1$

$ \large\frac{\partial L}{\partial b1} = \delta2 * \frac{\partial pre\_act1}{\partial b1} = \delta2 * 1$
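
Assuming a cross-entropy loss on the softmax output (so that $\delta4 = act3 - y$ in one-hot form), the chain above translates to NumPy roughly as in the sketch below. It reuses the hypothetical forward-pass variables from the earlier sketch and is not the actual mlnn code.

def backward(X, y, W2, W3, act1, act2, act3):
    n = X.shape[0]
    delta4 = act3.copy()
    delta4[np.arange(n), y] -= 1                  # dL/dpre_act3 for softmax + cross-entropy
    dW3 = act2.T.dot(delta4)                      # dL/dW3
    db3 = delta4.sum(axis=0, keepdims=True)       # dL/db3
    delta3 = delta4.dot(W3.T) * (1 - act2 ** 2)   # tanh'(pre_act2) = 1 - act2^2
    dW2 = act1.T.dot(delta3)
    db2 = delta3.sum(axis=0, keepdims=True)
    delta2 = delta3.dot(W2.T) * (1 - act1 ** 2)   # tanh'(pre_act1) = 1 - act1^2
    dW1 = X.T.dot(delta2)
    db1 = delta2.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2, dW3, db3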


In [4]:
# Visualize tanh and its derivative
x = np.linspace(-np.pi, np.pi, 120)
plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)
plt.plot(x, np.tanh(x))
plt.title("tanh(x)")
plt.xlim(-3, 3)
plt.subplot(1, 2, 2)
plt.plot(x, 1 - np.square(np.tanh(x)))
plt.xlim(-3, 3)
plt.title("tanh\'(x)")
plt.show()


It can be seen from the figure above that as the magnitude of the input grows, the activation starts to saturate, which can in turn kill gradients. This can be mitigated by using rectified activation functions. Other problems we encounter when training deep neural networks with backpropagation are vanishing and exploding gradients. Observe that the derivative of our nth activation, $\large\frac{\partial act\_n}{\partial pre\_act\_n}$, takes its largest values near zero. If we assume that the weights are $< 1$, we will usually have $|w_{i}*tanh'(x)| < 1$. The successive product of such values across layers decreases exponentially, leading to vanishing gradients. This is not a rigorous explanation of the vanishing gradient problem; for more information refer to this article.

Similarly, if the weights are large (e.g. 100, 40, ...), we arrive at the exploding gradient problem.
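
The back-of-the-envelope sketch below (illustrative values only, not taken from the network above) shows how repeated products of $|w*tanh'(x)|$ shrink or blow up with depth:

# Illustrative only: how per-layer factors |w * tanh'(x)| compound over depth
x = 0.5                                   # an arbitrary pre-activation value
tanh_grad = 1 - np.tanh(x) ** 2           # tanh'(x)

depth = 10
small_w, large_w = 0.5, 100.0             # made-up weight magnitudes
print((small_w * tanh_grad) ** depth)     # shrinks toward 0 -> vanishing gradients
print((large_w * tanh_grad) ** depth)     # grows rapidly -> exploding gradients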


In [5]:
# Training the neural network

my_nn = neural_net([2, 4, 2])  # [2, 4, 2] = [input nodes, hidden nodes, output nodes]

my_nn.train(X, y, 0.001, 0.0001)  # weight regularization lambda = 0.001, epsilon = 0.0001


Loss after iteration 0: 0.550828
Loss after iteration 1000: 0.312276
Loss after iteration 2000: 0.310907
Loss after iteration 3000: 0.310098
Loss after iteration 4000: 0.309545
Loss after iteration 5000: 0.309141
Loss after iteration 6000: 0.308830
Loss after iteration 7000: 0.308582
Loss after iteration 8000: 0.308379
Loss after iteration 9000: 0.308209
Loss after iteration 10000: 0.308063
Loss after iteration 11000: 0.307935
Loss after iteration 12000: 0.307822
Loss after iteration 13000: 0.307721
Loss after iteration 14000: 0.307629
Loss after iteration 15000: 0.307544
Loss after iteration 16000: 0.307465
Loss after iteration 17000: 0.307390
Loss after iteration 18000: 0.307319
Loss after iteration 19000: 0.307250
Loss after iteration 20000: 0.307183
Loss after iteration 21000: 0.307116
Loss after iteration 22000: 0.307048
Loss after iteration 23000: 0.306980
Loss after iteration 24000: 0.306908
Out[5]:
{'W1': array([[ 0.11827437,  0.09296233,  0.10304653,  0.11684045],
        [-0.20240277, -0.38252876, -0.43947934, -0.4297747 ]]),
 'W2': array([[ 1.73731779, -0.63048788],
        [ 1.83758017, -1.33090351],
        [ 0.09501323, -0.22903616],
        [-0.70044285,  1.49809302]]),
 'W3': array([[ 1.73731779, -0.63048788],
        [ 1.83758017, -1.33090351],
        [ 0.09501323, -0.22903616],
        [-0.70044285,  1.49809302]]),
 'b1': array([[-0.31322707,  0.10078616,  0.25341526, -0.24919222]]),
 'b2': array([[-0.20185746,  0.04269616, -0.01080139, -0.14951606]]),
 'b3': array([[-0.09947524,  0.09947524]])}
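
For reference, the sketch below shows how a regularization strength lambda and a step size epsilon typically enter a gradient-descent update with L2 weight decay. This is only an assumption about how mlnn uses these two arguments, not its actual code.

reg_lambda, epsilon = 0.001, 0.0001

def sgd_step(W, b, dW, db):
    # Hypothetical update step (mlnn's internals may differ)
    dW = dW + reg_lambda * W    # add the L2 regularization term to the weight gradient
    W = W - epsilon * dW        # gradient-descent step on the weights
    b = b - epsilon * db        # biases are typically left unregularized
    return W, b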

In [6]:
### visualize predictions
my_nn.visualize_preds(X, y)


Animate Training:


In [7]:
X_, y_ = sklearn.datasets.make_circles(n_samples=400, noise=0.18, factor=0.005, random_state=1)
plt.figure(figsize=(7, 5))
plt.scatter(X_[:, 0], X_[:, 1], s=15, c=y_, cmap=plt.cm.Spectral)
plt.show()



In [8]:
'''
Uncomment the code below to see classification process for above data.
To stop training early reduce no. of iterations.
'''

#new_nn = neural_net([2, 6, 2])
#new_nn.animate_preds(X_, y_, 0.001, 0.0001) # max iterations = 35000


Out[8]:
'\nUncomment the code below to see classification process for above data.\nTo stop training early reduce no. of iterations.\n'