Deep neural networks (DNN) are those that feature a large number of hidden layers $L$. While there is no specific threshold above which a network is categorised as "deep", $L$ is typically large enough that the network can no longer be trained effectively with the traditional Backpropagation (BP) method and sigmoid activation functions.
In [74]:
import NeuralNetwork
import numpy as np
# Load Iris dataset
from sklearn import datasets as dset
import copy
iris = dset.load_iris()
# build a network with 4 inputs, three hidden layers of 4 units, and 1 output
nn = NeuralNetwork.MLP([4,4,4,4,1])
# keep original for further experiments
orig = copy.deepcopy(nn)
# Divide the targets by 2 so the classes 0, 1, 2 become 0, 0.5, 1, within the sigmoid output range
idat, itar = iris.data, iris.target/2.0
# regularisation parameter of 0.1
tcost = NeuralNetwork.MLP_Cost(nn, idat, itar, 0.1)
# Cost value for an untrained network
print("J(ini) = " + str(tcost))
# Train with backpropagation (gradient descent), 50 epochs
# learning rate is 0.01
NeuralNetwork.MLP_Backprop(nn, idat, itar, 0.1, 50, 0.01)
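The NeuralNetwork module used above is custom code. As a rough, minimal sketch of what such a network computes (assuming the usual MLP forward pass and a regularised mean-squared-error cost; the actual MLP and MLP_Cost implementations may differ), the following standalone NumPy cell forward-propagates the Iris data through a randomly initialised [4, 4, 4, 4, 1] logistic network and evaluates a cost of that form:
In [ ]:
import numpy as np
from sklearn import datasets as dset

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# NOTE: illustrative sketch only, independent of the NeuralNetwork module
rng = np.random.default_rng(0)
layers = [4, 4, 4, 4, 1]              # 4 inputs, three hidden layers of 4 units, 1 output
W = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(layers[:-1], layers[1:])]
b = [np.zeros(n_out) for n_out in layers[1:]]

iris = dset.load_iris()
X, y = iris.data, iris.target / 2.0   # targets 0, 0.5, 1 to fit the sigmoid range

a = X.T                               # activations, one column per sample
for Wl, bl in zip(W, b):
    a = sigmoid(Wl @ a + bl[:, None]) # forward pass, layer by layer

lam, m = 0.1, X.shape[0]              # regularisation parameter and number of samples
cost = np.mean((a.ravel() - y) ** 2) + lam / (2 * m) * sum(np.sum(Wl ** 2) for Wl in W)
print("J(ini) approx. " + str(cost))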
The reason for this loss of learning ability lies in the nature of BP with sigmoid activations. BP propagates the derivative of the error backwards in order to apply the gradient descent rule to all the weights. Note that this backward propagation of the error, which makes use of the chain rule, involves at every layer a product with the derivative of the sigmoid activation function $g'(z)$; e.g., for the logistic function:
$g'(z) = g(z) (1- g(z))$
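To see where this product enters, the error (delta) that BP assigns to a hidden layer $l$ is obtained from the error at the following layer roughly as (a generic sketch in standard notation; the weight matrices $W^{(l+1)}$ and the element-wise product $\odot$ are not used elsewhere in this notebook):
$\delta^{(l)} = \left( (W^{(l+1)})^{T} \, \delta^{(l+1)} \right) \odot g'(z^{(l)})$
so by the time the error reaches an early layer it has been multiplied by one $g'$ factor per traversed layer.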
However, this derivative only takes significant values when $z$ is close to zero (its maximum is $0.25$, at $z = 0$), as the plot below shows:
In [69]:
z = np.arange(-8, 8, 0.1)
g = NeuralNetwork.Sigmoid(z)
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
plt.plot(z, g*(1.0-g), 'b-', label="g'(z)")
plt.legend(loc='upper left')
plt.xlabel('Input [z]')
plt.ylabel("Output [g']")
plt.title('Derivative of the logistic sigmoid activation function')
plt.show()
Therefore, after the error feedback has traversed several layers, the gradient tends to vanish by the time it reaches the initial layers (those farthest from the output); this is the well-known vanishing gradient problem.
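The cumulative effect can be illustrated with a few lines of standalone NumPy (not part of the NeuralNetwork module): ignoring the weight factors, the backpropagated signal picks up one $g'(z)$ factor per layer, each at most $0.25$, so it shrinks geometrically with depth.
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative sketch: track the size of the error signal as it moves backwards
rng = np.random.default_rng(1)
signal = 1.0                          # error magnitude at the output layer
for layer in range(1, 11):            # walk back through 10 hidden layers
    z = rng.standard_normal()         # pre-activation of a unit in this layer
    g = sigmoid(z)
    signal *= g * (1.0 - g)           # one factor of g'(z) <= 0.25 per layer
    print("after %2d layers: %.2e" % (layer, signal))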
In order to solve this issue, a rectifier, or ReLU (Rectified Linear Unit), is used as the activation function:
$g(z) = \begin{cases} 0 & z < 0 \\ z & z \geq 0 \end{cases} $
the derivative of which is the step function:
$g'(z) = \begin{cases} 0 & z < 0 \\ 1 & z \geq 0 \end{cases} $
Now, the error feedback is no longer shrunk towards zero at every layer and can reach all the hidden units, all the way back to the input layer. This was one of the key ingredients behind the original success of Deep Learning.
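The same back-of-the-envelope comparison (again a standalone sketch, independent of the NeuralNetwork module) makes the difference explicit: each ReLU derivative factor is either $0$ or $1$, so along a path of active units the error is passed back unattenuated instead of being multiplied by at most $0.25$ per layer.
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-8, 8, 0.1)
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))       # logistic derivative, peaks at 0.25
relu_grad = (z >= 0).astype(float)               # ReLU derivative: the step function
print("max sigmoid g'(z): " + str(sig_grad.max()))
print("max ReLU    g'(z): " + str(relu_grad.max()))
# product of 10 derivative factors along a path of active units
print("sigmoid bound, 10 layers: " + str(0.25 ** 10))   # ~1e-6: vanished
print("ReLU,          10 layers: " + str(1.0 ** 10))    # 1.0: passes through unchanged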
There is a caveat with using the ReLU, though. Since this new activation function is no longer bounded, the weights and activations can grow very large, causing the DNN to become numerically unstable and training to break down. In order to work around this issue, it is advisable to initialise the network with very small weights and to use small learning rates.
In [73]:
# restore initial network
nn = copy.deepcopy(orig)
# scale the weights down to small values
for l in nn:
    l *= 0.1
# regularisation parameter of 0.1
tcost = NeuralNetwork.MLP_Cost(nn, idat, itar, 0.1, af=NeuralNetwork.ReLU)
# Cost value for an untrained network
print("J(ini) = " + str(tcost))
# Train with backpropagation (gradient descent), 50 epochs
# learning rate is 0.01
NeuralNetwork.MLP_Backprop(nn, idat, itar, 0.1, 50, 0.01, af=NeuralNetwork.ReLU)
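As a rough illustration of why the weights are scaled down above (a standalone NumPy sketch, not the NeuralNetwork module), the magnitude of the activations in a deep ReLU network grows or shrinks with the scale of the weights: with unit-scale random weights the forward pass blows up after a few layers, whereas scaled-down weights keep it small.
In [ ]:
import numpy as np

def relu_depth_norm(scale, depth=10, width=16, seed=2):
    # mean activation magnitude after `depth` ReLU layers with weights of a given scale
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width))
        a = np.maximum(0.0, W @ a)    # ReLU forward pass through one layer
    return np.abs(a).mean()

for scale in (1.0, 0.1):
    print("weight scale %.1f: mean |activation| = %.3g" % (scale, relu_depth_norm(scale)))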