Backpropagation Tutorial

(C) 2019 by Damir Cavar

Version: 0.1, November 2019

Download: This and various other Jupyter notebooks are available from my GitHub repo.

Introduction

For more details on Backpropagation and its use in Neural Networks see Rumelhart, Hinton, and Williams (1986a) and Rumelhart, Hinton & Williams (1986b). A detailed overview is also provided in Goodfellow, Bengio, and Courville (2016).

The ideas and initial versions of this Python-based notebook have been inspired by many open and public tutorials and articles, but in particular by these three:

A lot of code examples and discussion has been compiled here using these sources.

Preliminaries

This notebook uses nbextensions with python-markdown/main enabled. These extensions might not work in Jupyter Lab, thus some variable references in the markdown cells might not display.

We will use numpy in the following demo. Let us import it and assign the np alias to it:


In [17]:
import numpy as np

For plots of curves and functions we will use pyplot from matplotlib. We will import it here:


In [18]:
from matplotlib import pyplot as plt

Non-linearity Function and Derivatives

The Sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}} $$

We can specify it in Python as:


In [19]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

We can now plot the sigmoid function for x values between -10 and 10:


In [20]:
%matplotlib inline
x = np.arange(-10, 10, 0.2)
y = sigmoid(x)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
ax.plot(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Sigmoid")
print()



In the following for Backpropagation we will make use of the Derivative of the Sigmoid function. The Derivative of Sigmoid is defined as:

$$\frac{d}{dx}\sigma(x) = \sigma(x) (1 - \sigma(x))$$

We can derive this equation as follows. Assume that:

$$\frac{d}{dx} \sigma(x) = \frac{d}{dx} \frac{1}{1 + e^{-x}} $$

We can invert the fraction using a negative exponent:

$$\frac{d}{dx} \sigma(x) = \frac{d}{dx} \frac{1}{1 + e^{-x}} = \frac{d}{dx} (1 + e^{-x})^{-1}$$

We can apply the reciprocal rule, which is, the numerator is the derivative of the function ($g'(x)$) times -1 divided by the square of the denominator $g(x)$:

$$\frac{d}{dx} \left[ \frac{1}{g(x)} \right] = \frac{-g'(x)}{[g(x)]^2} = -g(x)^{-2} g'(x)$$

In our Derivative of Sigmoid derivation, we can now reformulate as:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = -(1 + e^{-x})^{-2} \frac{d}{dx} (1 + e^{-x})$$

With $\alpha$ and $\beta$ constants, the Rule of Linearity says that:

$$\frac{d}{dx} \left( \alpha f(x) + \beta g(x) \right) = \frac{d}{dx} \left( \alpha f(x) \right) + \frac{d}{dx} \left( \beta g(x) \right) = \alpha f'(x) + \beta g'(x)$$

This means, using the Rule of Linearity and given that the derivative of a constant is 0, we can rewrite our equation as:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = -(1 + e^{-x})^{-2} \frac{d}{dx} (1 + e^{-x}) = -(1 + e^{-x})^{-2} \left( \frac{d}{dx}[1] + \frac{d}{dx}[e^{-x}] \right) = -(1 + e^{-x})^{-2} \left( 0 + \frac{d}{dx}[e^{-x}] \right) = -(1 + e^{-x})^{-2} \frac{d}{dx}[e^{-x}] $$

The Exponential Rule says that:

$$\frac{d}{dx} e^{u(x)} = e^{u(x)} \frac{d}{dx} x$$

We can thus rewrite:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = -(1 + e^{-x})^{-2} e^{-x} \frac{d}{dx}[-x] $$

This is equivalent to:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = -(1 + e^{-x})^{-2} e^{-x} -\frac{d}{dx}[x]$$

Given that a derivative of a variable is 1, we can rewrite as:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = -(1 + e^{-x})^{-2} e^{-x} -1 = (1 + e^{-x})^{-2} e^{-x} = \frac{e^{-x}}{(1 + e^{-x})^2}$$

We can rewrite the derivative as:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = \frac{1 e^{-x}}{(1 + e^{-x}) (1 + e^{-x})} = \frac{1}{1 + e^{-x}} \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \frac{e^{-x} + 1 - 1}{1 + e^{-x}} = \frac{1}{1 + e^{-x}} \left( \frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}} \right)$$

We can simplify this to:

$$\frac{d}{dx} (1 + e^{-x})^{-1} = \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right)$$

This means that we can derive the Derivative of the Sigmoid function as:

$$\frac{d}{dx} \sigma(x) = \sigma(x) ( 1 - \sigma(x) )$$

We can specify the Python function of the Derivative of the Sigmoid function as:


In [21]:
def sigmoidDerivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

We can plot the Derivative of the Sigmoid Function as follows:


In [22]:
%matplotlib inline
x = np.arange(-10, 10, 0.2)
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
y = sigmoidDerivative(x)
ax.plot(x, y, color="red", label='Derivative of Sigmoid'.format(1))
y = sigmoid(x)
ax.plot(x, y, color="blue", label='Sigmoid'.format(1))
fig.legend(loc='center right')
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Derivative of the Sigmoid Function")
print()



Forward- and Backpropagation

We will define a simple network that takes an input as defined for X and that generates a corresponding output as defined in y. The input array X is:


In [23]:
X = np.array( [ [0, 0, 1],
                [0, 1, 1],
                [1, 0, 1],
                [1, 1, 1] ] )

The rows in $X$ are the input vectors for our training or learning phase. Each vector has 3 dimensions.

The output array y represents the expected output that the network is expected to learn from the input data. It is defined as a row-vector with 4 rows and 1 column:


In [24]:
y = np.array( [0, 0, 1, 1] ).reshape(-1, 1)
np.shape(y)


Out[24]:
(4, 1)

We will define a weight matrix W and initialize it with random weights:


In [25]:
W = 2 * np.random.random((3, 1)) - 1
print(W)


[[-0.1653904 ]
 [ 0.11737966]
 [-0.71922612]]

In this simple example W is the weight matrix that connects two layers, the input (X) and the output layer (O).

The optimization or learning phase consists of a certain number of iterations that :


In [26]:
iterations = 4000

Let us keep track of the output error (as becomes clear below) in the following variable:


In [27]:
error = 0.0

Repeat for a specific number of iterations the following computations. Initially we take the entire set of training examples in X and process them all at the same time. This is called full batch training, inidicated by the dot-product between X and W. Computing O is the first prediction step by taking the dot-product of X and W and computing the sigmoid function over it:


In [28]:
for i in range(iterations):
    O = sigmoid(np.dot(X, W))
    
    O_error = y - O
    error = np.mean(np.abs(O_error))
    if (i % 100) == 0:
        print("Error:", error)
    
    # Compute the delta
    O_delta = O_error * sigmoidDerivative(O)
    
    # update weights
    W += np.dot(X.T, O_delta)

print("O:", O)


Error: 0.5180464037927469
Error: 0.05534622055219526
Error: 0.02812928803534369
Error: 0.01877488911128993
Error: 0.014070439001015959
Error: 0.011244556807545731
Error: 0.009361012941052013
Error: 0.00801647956636641
Error: 0.007008851676297921
Error: 0.006225752933087808
Error: 0.005599744249940878
Error: 0.005087916048536661
Error: 0.00466167128726863
Error: 0.004301220381145958
Error: 0.0039924345571583485
Error: 0.003724957786285008
Error: 0.0034910268701690925
Error: 0.0032847081265180403
Error: 0.0031013886507480163
Error: 0.002937428340221097
Error: 0.0027899163929838468
Error: 0.002656497439046738
Error: 0.0025352451353524907
Error: 0.002424568769275856
Error: 0.0023231432360394703
Error: 0.002229855840119956
Error: 0.0021437653872587153
Error: 0.002064070377780582
Error: 0.001990084023778811
Error: 0.0019212144414435684
Error: 0.0018569488098080936
Error: 0.001796840599364327
Error: 0.001740499198323901
Error: 0.0016875814273992033
Error: 0.0016377845538804781
Error: 0.0015908405048289896
Error: 0.0015465110459704156
Error: 0.001504583743392433
Error: 0.0014648685636853335
Error: 0.0014271949978083263
O: [[0.00180933]
 [0.00120496]
 [0.9989796 ]
 [0.99846766]]

The matrix X has 4 rows and 3 columns. The weight matrix W has 3 rows and 1 column. The output will be a row vector with 4 rows and 1 column, representing the output that we want to align as close as possible to y.

O_error is the difference between y and the initial guess in O. We want to see O to reflect y as closely as possible. After {{ iterations }} in the loop above, we see that O is resembling y very well, with an error of {{ error }}.

In the next step we compute the derivative of the sigmoid function for the initial guess vector. The Derivative is weighted by the error, which means that if the slope was shallow (close to or approaching 0), the guess was quite good, that is the network was confident about the output for a given input. If the slope was higher, as for example for x = 0, the prediction was not very good. Such bad predictions get updated significantly, while the confident predictions get updated minimally, multiplying them with some small number close to 0.

For every single weight, we

Adding a Layer

In the following example we will slightly change the ground truth. Compare the following definition of y with the definition above:


In [30]:
y = np.array([[0],
            [1],
            [1],
            [0]])

In the following network specification we introduce a second layer


In [31]:
np.random.seed(1)

# randomly initialize our weights with mean 0
Wh = 2 * np.random.random((3, 4)) - 1
Wo = 2 * np.random.random((4, 1)) - 1

Xt = X.T # precomputing the transform of X for the loop

for i in range(80000):

    # Feed forward through layers X, H, and O
    H = sigmoid(np.dot(X, Wh))
    O = sigmoid(np.dot(H, Wo))

    # how much did we miss the target value?
    O_error = y - O
    error = np.mean(np.abs(O_error))
    if (i % 10000) == 0:
        print("Error:", error)

    # compute the direction of the optimization for the output layer
    O_delta = O_error * sigmoidDerivative(O)

    # how much did each H value contribute to the O error (according to the weights)?
    H_error = O_delta.dot(Wo.T)
    
    # compute the directions of the optimization for the hidden layer
    H_delta = H_error * sigmoidDerivative(H)

    Wo += H.T.dot(O_delta)
    Wh += Xt.dot(H_delta)

print(O)


Error: 0.4964100319027255
Error: 0.3956678881357977
Error: 0.02807194397668463
Error: 0.009825413967651232
Error: 0.005642137542640457
Error: 0.0037447378038104823
Error: 0.002659632966548794
Error: 0.0020172148236547597
[[0.00154124]
 [0.99805853]
 [0.99828522]
 [0.00135168]]

In [ ]:

References

  • Goodfellow, Ian, Yoshua Bengio, Aaron Courville (2016). Deep Learning. MIT Press.
  • Nielsen, Michael A. (2015). "Chapter 2: How the backpropagation algorithm works". Neural Networks and Deep Learning. Determination Press.
  • Rumelhart, David E., Geoffrey E. Hinton, Ronald J. Williams (1986a). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536. doi:10.1038/323533a0.
  • Rumelhart, David E., Geoffrey E. Hinton, Ronald J. Williams (1986b). "8. Learning Internal Representations by Error Propagation". In: D. Rumelhart, J.L. McClelland (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. Cambridge: MIT Press.

In [ ]: