(C) 2019 by Damir Cavar
Version: 0.1, November 2019
Download: This and various other Jupyter notebooks are available from my GitHub repo.
For more details on Backpropagation and its use in Neural Networks see Rumelhart, Hinton, and Williams (1986a) and Rumelhart, Hinton & Williams (1986b). A detailed overview is also provided in Goodfellow, Bengio, and Courville (2016).
The ideas and initial versions of this Python-based notebook have been inspired by many open and public tutorials and articles, but in particular by these three:
Many of the code examples and much of the discussion here have been compiled from these sources.
This notebook uses nbextensions with python-markdown/main enabled. These extensions might not work in JupyterLab, so some variable references in the markdown cells might not render.
We will use numpy in the following demo. Let us import it and assign the np alias to it:
In [17]:
import numpy as np
For plots of curves and functions we will use pyplot from matplotlib. We will import it here:
In [18]:
from matplotlib import pyplot as plt
The Sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

We can specify it in Python as:
In [19]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
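As a quick sanity check of this definition, the Sigmoid is exactly 0.5 at $x = 0$ and saturates toward 1 and 0 for large positive and negative inputs. A minimal sketch using the `sigmoid` function above:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# sigmoid(0) = 1 / (1 + 1) is exactly 0.5
print(sigmoid(0))    # 0.5
# large inputs saturate toward 1 and 0
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```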
We can now plot the sigmoid function for x values between -10 and 10:
In [20]:
%matplotlib inline
x = np.arange(-10, 10, 0.2)
y = sigmoid(x)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.plot(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Sigmoid")
print()
In the following, Backpropagation will make use of the Derivative of the Sigmoid function. The Derivative of the Sigmoid is defined as:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

We can derive this equation as follows. Assume that:

$$g(x) = 1 + e^{-x} \quad\text{so that}\quad \sigma(x) = \frac{1}{g(x)}$$

We can invert the fraction using a negative exponent:

$$\sigma(x) = \left(1 + e^{-x}\right)^{-1}$$

We can apply the reciprocal rule, which is, the numerator is the derivative of the function ($g'(x)$) times $-1$, divided by the square of the denominator $g(x)$:

$$\frac{d}{dx}\left(\frac{1}{g(x)}\right) = \frac{-g'(x)}{g(x)^2}$$

In our Derivative of Sigmoid derivation, we can now reformulate as:

$$\sigma'(x) = \frac{-\frac{d}{dx}\left(1 + e^{-x}\right)}{\left(1 + e^{-x}\right)^2}$$

With $\alpha$ and $\beta$ constants, the Rule of Linearity says that:

$$\frac{d}{dx}\left(\alpha f(x) + \beta g(x)\right) = \alpha f'(x) + \beta g'(x)$$

This means, using the Rule of Linearity and given that the derivative of a constant is 0, we can rewrite our equation as:

$$\sigma'(x) = \frac{-\frac{d}{dx}\,e^{-x}}{\left(1 + e^{-x}\right)^2}$$

The Exponential Rule says that:

$$\frac{d}{dx}\,e^{f(x)} = e^{f(x)}\,\frac{d}{dx}f(x)$$

We can thus rewrite:

$$\sigma'(x) = \frac{-e^{-x}\,\frac{d}{dx}(-x)}{\left(1 + e^{-x}\right)^2}$$

This is equivalent to:

$$\sigma'(x) = \frac{e^{-x}\,\frac{d}{dx}(x)}{\left(1 + e^{-x}\right)^2}$$

Given that the derivative of a variable is 1, we can rewrite as:

$$\sigma'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2}$$

We can rewrite the derivative as:

$$\sigma'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}$$

We can simplify this to:

$$\sigma'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{\left(1 + e^{-x}\right) - 1}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \cdot \left(1 - \frac{1}{1 + e^{-x}}\right)$$

This means that we can express the Derivative of the Sigmoid function as:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
We can specify the Python function of the Derivative of the Sigmoid function as:
In [21]:
def sigmoidDerivative(x):
    return sigmoid(x) * (1 - sigmoid(x))
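The closed form can be checked numerically against a central-difference approximation of the slope of the Sigmoid; a small self-contained sketch:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoidDerivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# central-difference approximation of the derivative
x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# the maximum deviation from the closed form is tiny
print(np.max(np.abs(numeric - sigmoidDerivative(x))))
```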
We can plot the Derivative of the Sigmoid Function as follows:
In [22]:
%matplotlib inline
x = np.arange(-10, 10, 0.2)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
y = sigmoidDerivative(x)
ax.plot(x, y, color="red", label='Derivative of Sigmoid')
y = sigmoid(x)
ax.plot(x, y, color="blue", label='Sigmoid')
fig.legend(loc='center right')
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Derivative of the Sigmoid Function")
print()
We will define a simple network that takes an input as defined in X and generates a corresponding output as defined in y. The input array X is:
In [23]:
X = np.array( [ [0, 0, 1],
                [0, 1, 1],
                [1, 0, 1],
                [1, 1, 1] ] )
The rows in $X$ are the input vectors for our training or learning phase. Each vector has 3 dimensions.
The output array y represents the target output that the network is expected to learn from the input data. It is defined as a column vector with 4 rows and 1 column:
In [24]:
y = np.array( [0, 0, 1, 1] ).reshape(-1, 1)
np.shape(y)
Out[24]:
(4, 1)
We will define a weight matrix W and initialize it with random weights:
In [25]:
W = 2 * np.random.random((3, 1)) - 1
print(W)
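The expression `2 * np.random.random((3, 1)) - 1` draws weights uniformly from $[-1, 1)$, i.e. with mean 0. As a small sketch (seeding the generator here is an illustrative addition; the notebook seeds it further below), a fixed seed makes the initialization reproducible:

```python
import numpy as np

np.random.seed(1)
W1 = 2 * np.random.random((3, 1)) - 1
np.random.seed(1)
W2 = 2 * np.random.random((3, 1)) - 1

# same seed, same initial weights, all within [-1, 1)
print(np.array_equal(W1, W2))
print((W1 >= -1).all() and (W1 < 1).all())
```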
In this simple example W is the weight matrix that connects two layers, the input (X) and the output layer (O).
The optimization or learning phase consists of a fixed number of iterations:
In [26]:
iterations = 4000
Let us keep track of the output error (as becomes clear below) in the following variable:
In [27]:
error = 0.0
Repeat the following computations for a specific number of iterations. Initially we take the entire set of training examples in X and process them all at the same time; this is called full batch training, indicated by the dot-product between the full matrix X and W. Computing O is the first prediction step: we take the dot-product of X and W and compute the sigmoid function over it:
In [28]:
for i in range(iterations):
    Z = np.dot(X, W)  # pre-activation: dot-product of input and weights
    O = sigmoid(Z)    # prediction: sigmoid over the pre-activation
    O_error = y - O
    error = np.mean(np.abs(O_error))
    if (i % 100) == 0:
        print("Error:", error)
    # compute the delta: the error weighted by the slope of the
    # sigmoid at the pre-activation Z
    O_delta = O_error * sigmoidDerivative(Z)
    # update weights
    W += np.dot(X.T, O_delta)
print("O:", O)
The matrix X has 4 rows and 3 columns. The weight matrix W has 3 rows and 1 column. The output is therefore a column vector with 4 rows and 1 column, representing the output that we want to align as closely as possible with y.
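The shape arithmetic can be checked directly with numpy; a small sketch using the shapes of X and W from above:

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
W = 2 * np.random.random((3, 1)) - 1

# (4, 3) dot (3, 1) -> (4, 1): one output value per input vector
print(np.dot(X, W).shape)  # (4, 1)
```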
O_error is the difference between y and the initial guess in O. We want O to reflect y as closely as possible. After {{ iterations }} iterations in the loop above, we see that O resembles y very well, with an error of {{ error }}.
In the next step we weight the error by the derivative of the sigmoid. If the slope is shallow (close to or approaching 0), the guess was quite good, that is, the network was confident about the output for a given input. If the slope is higher, as for example at x = 0, the prediction was not very good. Such bad predictions get updated significantly, while the confident predictions get updated only minimally, since their error is multiplied by a small number close to 0.
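To illustrate this weighting, here is a small sketch comparing the slope of the Sigmoid at $x = 0$ (the maximum, 0.25) with the slope far from 0 (close to 0):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoidDerivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

print(sigmoidDerivative(0))   # 0.25, the maximum: uncertain prediction, large update
print(sigmoidDerivative(10))  # near 0: confident prediction, tiny update
```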
For every single weight, we compute the update as the dot-product of the input values and the delta, and add it to the current weight.
In the following example we will slightly change the ground truth. Compare the following definition of y with the definition above:
In [30]:
y = np.array([[0],
              [1],
              [1],
              [0]])
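Comparing with the definition of X above, the new target is the XOR of the first two input dimensions, a function that is not linearly separable. A small check, recomputing y from X:

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

# the target equals XOR of the first two columns of X
xor = np.logical_xor(X[:, 0], X[:, 1]).astype(int).reshape(-1, 1)
print(np.array_equal(xor, y))  # True
```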
In the following network specification we introduce a second layer: a hidden layer H between the input X and the output O.
In [31]:
np.random.seed(1)
# randomly initialize our weights with mean 0
Wh = 2 * np.random.random((3, 4)) - 1
Wo = 2 * np.random.random((4, 1)) - 1
Xt = X.T  # precompute the transpose of X for the loop
for i in range(80000):
    # feed forward through layers X, H, and O
    Zh = np.dot(X, Wh)
    H = sigmoid(Zh)
    Zo = np.dot(H, Wo)
    O = sigmoid(Zo)
    # how much did we miss the target value?
    O_error = y - O
    error = np.mean(np.abs(O_error))
    if (i % 10000) == 0:
        print("Error:", error)
    # compute the direction of the optimization for the output layer,
    # evaluating the sigmoid derivative at the pre-activation Zo
    O_delta = O_error * sigmoidDerivative(Zo)
    # how much did each H value contribute to the O error (according to the weights)?
    H_error = O_delta.dot(Wo.T)
    # compute the direction of the optimization for the hidden layer
    H_delta = H_error * sigmoidDerivative(Zh)
    Wo += H.T.dot(O_delta)
    Wh += Xt.dot(H_delta)
print(O)
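Once trained, the two weight matrices fully define the network, so the forward pass can be wrapped into a prediction function. A self-contained sketch of the same two-layer network (the helper name `predict`, the reduced iteration count, and the derivative expressed directly in the activations, $O(1-O)$ and $H(1-H)$, are illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

np.random.seed(1)
Wh = 2 * np.random.random((3, 4)) - 1
Wo = 2 * np.random.random((4, 1)) - 1

for i in range(10000):
    H = sigmoid(np.dot(X, Wh))
    O = sigmoid(np.dot(H, Wo))
    O_error = y - O
    # derivative of the sigmoid expressed in its output: O * (1 - O)
    O_delta = O_error * O * (1 - O)
    H_error = O_delta.dot(Wo.T)
    H_delta = H_error * H * (1 - H)
    Wo += H.T.dot(O_delta)
    Wh += np.dot(X.T, H_delta)

def predict(x):
    """Forward pass through the trained two-layer network."""
    return sigmoid(np.dot(sigmoid(np.dot(x, Wh)), Wo))

# rounding the predictions recovers the XOR targets in y
print(np.round(predict(X)))
```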