In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn as skl
import sklearn.datasets
import sklearn.linear_model
%matplotlib inline
In [2]:
# Generate data
X, y = sklearn.datasets.make_moons(300, noise=0.22)
plt.figure(figsize=(7, 5))
plt.scatter(X[:, 0], X[:, 1], s=15, c=y, cmap=plt.cm.Spectral)
plt.show()
In [3]:
# import feedforward neural net
from mlnn import neural_net
Let's build a 4-layer neural network. Our network has one input layer, two hidden layers, and one output layer. The model can be represented as a directed acyclic graph in which each node in a layer is connected to every node in the successive layer. The neural net is shown below.
Each node in the hidden layers applies a nonlinear activation function $f(x)$, which computes an output from its inputs and passes it on to the next layer. Here we've used $f(x) = \tanh(x)$ as our nonlinear activation. Its derivative is given by $f'(x) = 1 - \tanh(x)^2$.
Our network graph can be represented as:
Layer No. | Notation | Value | Variable |
---|---|---|---|
1 | X | $X$ | X |
2 | W1(~)+b1 | $W1*X+b1$ | pre_act1 |
2 | tanh | $\tanh(W1*X+b1)$ | act1 |
3 | W2(~)+b2 | $W2*\tanh(W1*X+b1)+b2$ | pre_act2 |
3 | tanh | $\tanh(W2*\tanh(W1*X+b1)+b2)$ | act2 |
4 | W3(~)+b3 | $W3*\tanh(W2*\tanh(W1*X+b1)+b2)+b3$ | pre_act3 |
4 | softmax | $\mathrm{softmax}(W3*\tanh(W2*\tanh(W1*X+b1)+b2)+b3)$ | act3 |
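To make the table concrete, here is a minimal NumPy sketch of this forward pass. It is not the `mlnn` implementation: the function names, the samples-as-rows convention (so we write `X @ W1` rather than $W1*X$), and the numerically stabilized softmax are assumptions made purely for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2, W3, b3):
    # Layer 2: affine transform followed by tanh
    pre_act1 = X @ W1 + b1
    act1 = np.tanh(pre_act1)
    # Layer 3: affine transform followed by tanh
    pre_act2 = act1 @ W2 + b2
    act2 = np.tanh(pre_act2)
    # Layer 4 (output): affine transform followed by softmax
    pre_act3 = act2 @ W3 + b3
    act3 = softmax(pre_act3)
    return pre_act1, act1, pre_act2, act2, pre_act3, act3
```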
Now we formulate the backpropagation algorithm, or backprop, for training the network. For a derivation of backprop, please see Dr. Hugo Larochelle's excellent course on neural networks.
$ \large\frac{\partial L}{\partial Pred} = \frac{\partial L}{\partial L} * \frac{\partial L}{\partial Pred} $
$ \large\frac{\partial L}{\partial act3} = \frac{\partial L}{\partial Pred} * \frac{\partial Pred}{\partial act3} $
$ \large\frac{\partial L}{\partial pre\_act3} = \frac{\partial L}{\partial act3} * \frac{\partial act3}{\partial pre\_act3}= \delta4$
$ \large\frac{\partial L}{\partial act2} = \frac{\partial L}{\partial pre\_act3} * \frac{\partial pre\_act3}{\partial act2} $
$ \large\frac{\partial L}{\partial pre\_act2} = \frac{\partial L}{\partial act2} * \frac{\partial act2}{\partial pre\_act2}= \delta3$
$ \large\frac{\partial L}{\partial act1} = \frac{\partial L}{\partial pre\_act2} * \frac{\partial pre\_act2}{\partial act1} $
$ \large\frac{\partial L}{\partial pre\_act1} = \frac{\partial L}{\partial act1} * \frac{\partial act1}{\partial pre\_act1}= \delta2$
$ \large\frac{\partial L}{\partial W3} = \delta4 * \frac{\partial pre\_act3}{\partial W3}$
$ \large\frac{\partial L}{\partial W2} = \delta3 * \frac{\partial pre\_act2}{\partial W2}$
$ \large\frac{\partial L}{\partial W1} = \delta2 * \frac{\partial pre\_act1}{\partial W1}$
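Continuing the sketch above, the backward pass can be written out directly from these equations. Here I assume a cross-entropy loss over the softmax output with one-hot labels `Y`, so that $\delta4 = act3 - Y$ (averaged over the batch); again this is an illustrative sketch, not the `mlnn` code.

```python
def backward(X, Y, act1, act2, act3, W2, W3):
    N = X.shape[0]
    # delta4 = dL/dpre_act3; for softmax + cross-entropy this is (act3 - Y)
    delta4 = (act3 - Y) / N
    dW3 = act2.T @ delta4           # dL/dW3 = delta4 * dpre_act3/dW3
    db3 = delta4.sum(axis=0)
    # delta3 = dL/dpre_act2, using tanh'(x) = 1 - tanh(x)^2 = 1 - act2^2
    delta3 = (delta4 @ W3.T) * (1 - act2 ** 2)
    dW2 = act1.T @ delta3           # dL/dW2 = delta3 * dpre_act2/dW2
    db2 = delta3.sum(axis=0)
    # delta2 = dL/dpre_act1
    delta2 = (delta3 @ W2.T) * (1 - act1 ** 2)
    dW1 = X.T @ delta2              # dL/dW1 = delta2 * dpre_act1/dW1
    db1 = delta2.sum(axis=0)
    return dW1, db1, dW2, db2, dW3, db3
```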
In [4]:
# Visualize tanh and its derivative
x = np.linspace(-np.pi, np.pi, 120)
plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)
plt.plot(x, np.tanh(x))
plt.title("tanh(x)")
plt.xlim(-3, 3)
plt.subplot(1, 2, 2)
plt.plot(x, 1 - np.square(np.tanh(x)))
plt.xlim(-3, 3)
plt.title("tanh\'(x)")
plt.show()
It can be seen from the figure above that as the magnitude of the input grows, the activation saturates, which can in turn kill gradients. This can be mitigated by using rectified activation functions. Another problem we encounter when training deep neural networks with backpropagation is vanishing and exploding gradients. Notice that the derivative of the nth activation, $\large\frac{\partial act\_n}{\partial pre\_act\_n}$, is at most 1, attained near zero. If the weights are $< 1$, we will usually have $|w_{i}*\tanh'(x)| < 1$, and the product of such factors over successive layers decreases exponentially, leading to vanishing gradients. This is not a rigorous explanation of the vanishing gradient problem; for more information, refer to this article.
Similarly, if the weights are large (say 100, 40, ...), the repeated products grow exponentially and we get the exploding gradient problem.
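A small numerical illustration of both effects (the weight values and the depth of 20 layers below are arbitrary choices, used only to show the trend):

```python
import numpy as np

# Repeatedly multiplying |w * tanh'(x)| across layers shrinks or grows the gradient
x = 0.5                       # an arbitrary pre-activation value
deriv = 1 - np.tanh(x) ** 2   # tanh'(x), which is at most 1

for w in (0.5, 1.5):          # a small weight vs. a large weight
    grad = 1.0
    for _ in range(20):       # pretend the network is 20 layers deep
        grad *= w * deriv
    print(f"w = {w}: gradient factor after 20 layers ~ {grad:.3e}")
```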
In [5]:
# Training the neural network
my_nn = neural_net([2, 4, 2]) # [2,4,2] = [input nodes, hidden nodes, output nodes]
my_nn.train(X, y, 0.001, 0.0001) # weight regularization lambda = 0.001, epsilon = 0.0001
Out[5]:
In [6]:
# Visualize predictions
my_nn.visualize_preds(X, y)
In [7]:
X_, y_ = sklearn.datasets.make_circles(n_samples=400, noise=0.18, factor=0.005, random_state=1)
plt.figure(figsize=(7, 5))
plt.scatter(X_[:, 0], X_[:, 1], s=15, c=y_, cmap=plt.cm.Spectral)
plt.show()
In [8]:
'''
Uncomment the code below to see the classification process for the data above.
To stop training early, reduce the number of iterations.
'''
#new_nn = neural_net([2, 6, 2])
#new_nn.animate_preds(X_, y_, 0.001, 0.0001) # max iterations = 35000
Out[8]: