Backpropagation

(C) 2018-2019 by Damir Cavar

Version: 1.1, November 2019

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring 2018 at Indiana University.

This material is based on Andrew Trask's tutorial A Neural Network in 11 lines of Python (Part 1), a bare-bones neural network implementation that describes the inner workings of backpropagation. See his tutorial page for more details and explanations. All copyrights are his, except for a few small comments that I added. I also simplified the code and introduced more intuitive variable names, and most likely some errors as well.

Another good source of information is the ML Cheat Sheet. I would recommend working through both.

Another good article to read and understand is Andrej Karpathy's Yes you should understand backprop.

Introduction

In this example we will use a very simple network to start with. The network will only have one input and one output layer. We want to make the following predictions from the input:

Input | Output
0 0 1 | 0
1 1 1 | 1
1 0 1 | 1
0 1 1 | 0

We will use NumPy to compute the network parameters, weights, activations, and outputs:


In [21]:
import numpy as np

We will use the Sigmoid activation function:


In [22]:
def sigmoid(z):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

We could use the ReLU activation function instead:


In [23]:
def relu(z):
    """The ReLU activation function (vectorized with np.maximum so it works on NumPy arrays)."""
    return np.maximum(0, z)
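
As a quick check (not part of the original tutorial, and assuming the vectorized np.maximum version above), ReLU zeroes out negative inputs and passes positive ones through unchanged:


In [ ]:
# quick sanity check: negative values are clipped to 0, positive values pass through
print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# expected output: [0.  0.  0.  0.5 2. ]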

The Sigmoid activation function introduces non-linearity to the computation. It maps the input value to an output value between $0$ and $1$.

The derivative of the sigmoid function is maximal at $x=0$ and minimal for lower or higher values of $x$.

The sigmoid_prime function returns the derivative of the sigmoid. Note that it expects the sigmoid activation as its argument, i.e. $z = \sigma(x)$, so the derivative is $\sigma(x)(1 - \sigma(x)) = z (1 - z)$. This is basically the slope of the sigmoid function at the corresponding point:


In [24]:
def sigmoid_prime(z):
    """The derivative of sigmoid for z."""
    return z * (1 - z)
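
As a small sanity check (not part of the original tutorial), we can compare the analytic derivative with a finite-difference approximation; the two values should agree closely:


In [ ]:
# sanity check: sigmoid_prime expects the sigmoid *activation* as input,
# so sigmoid_prime(sigmoid(x)) should match a finite-difference estimate at x
x = 0.5
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_prime(sigmoid(x))
print(numeric, analytic)  # both approximately 0.235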

We define the inputs as rows in X. There are three input nodes (three columns per row vector in $X$). Each row is one training example:


In [25]:
X = np.array([ [ 0, 0, 1 ],
               [ 0, 1, 1 ],
               [ 1, 0, 1 ],
               [ 1, 1, 1 ] ])
print(X)


[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]

The outputs are stored in y, where each row represents the output for the corresponding input vector (row) in X. The vector is initialized as a single row vector with four columns and transposed (using the .T attribute) into a column vector with four rows:


In [26]:
y = np.array([[0,0,1,1]]).T
print(y)


[[0]
 [0]
 [1]
 [1]]

To make the results reproducible, we seed the random number generator with a constant. This guarantees that every time you run the code, you will get the same sequence of random numbers:


In [27]:
np.random.seed(1)

We create a weight matrix ($Wo$) with randomly initialized weights:


In [28]:
n_inputs = 3
n_outputs = 1
#Wo = 2 * np.random.random( (n_inputs, n_outputs) ) - 1
Wo = np.random.random( (n_inputs, n_outputs) ) * np.sqrt(2.0/n_inputs)
print(Wo)


[[3.40497041e-01]
 [5.88142486e-01]
 [9.33866473e-05]]

The output weight matrix ($Wo$) has 3 rows and 1 column because it represents the weights of the connections from the three input neurons to the single output neuron. Trask's original initialization (the commented-out line above) draws the weights uniformly from $[-1, 1)$, that is, with a mean of $0$; there is a good reason for choosing a zero-mean initialization. The active line instead draws non-negative uniform weights and scales them by $\sqrt{2/n}$, where $n$ is the number of inputs. See the section on Weight Initialization in the Stanford course CS231n on Convolutional Neural Networks for Visual Recognition for details.
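
To illustrate the difference between the two schemes (this comparison is not part of the original tutorial), we can draw a larger sample from each and compare the means:


In [ ]:
# illustration: compare the two initialization schemes on a larger sample
np.random.seed(1)
w_centered = 2 * np.random.random((1000, 1)) - 1            # uniform in [-1, 1): mean approx. 0
w_scaled = np.random.random((1000, 1)) * np.sqrt(2.0 / 3)   # uniform in [0, sqrt(2/3)): non-negative
print(w_centered.mean(), w_scaled.mean())  # approx. 0.0 vs. approx. 0.41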

The core representation of this network is basically the weight matrix Wo. The rest (input matrix, output vector, and so on) are components that we need for learning and evaluation. The learning result is stored in the Wo weight matrix.

We loop through the optimization and learning cycle 10,000 times. In the forward propagation line we process the entire input matrix for training; this is called full batch training. I do not use a separate variable name for the input layer; instead I use the input matrix $X$ directly here. Think of this as the different inputs to the input neurons computed at once. In principle the training data could contain many more training examples; the code would stay the same.


In [31]:
for n in range(10000):
    # forward propagation
    l1 = sigmoid(np.dot(X, Wo))
    
    # compute the error (difference between target and prediction)
    l1_error = y - l1
    #print("l1_error:\n", l1_error)
    
    # multiply the loss by the slope of the sigmoid at l1
    l1_delta = l1_error * sigmoid_prime(l1)
    #print("l1_delta:\n", l1_delta)
    
    #print("error:", l1_error, "\nderivative:", sigmoid(l1, True), "\ndelta:", l1_delta, "\n", "-"*10, "\n")
    # update weights
    Wo += np.dot(X.T, l1_delta)

print("l1:\n", l1)


l1:
 [[0.00961242]
 [0.00782087]
 [0.99362525]
 [0.99216235]]

The values in $l1$ correspond to points on the sigmoid curve, and the derivative gives the slope of the sigmoid at each of these points. The slope is highest at $x = 0$, rather shallow at $x = 2$, and somewhere in between at $x = -1$. All derivatives are positive and bounded: the maximal slope of the sigmoid is $0.25$ (at $x = 0$), and there is no negative slope anywhere on the sigmoid function.
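
The following short computation (not in the original tutorial) prints the slope of the sigmoid at the three $x$ values mentioned above:


In [ ]:
# slopes of the sigmoid at x = -1, 0, and 2
for x in (-1.0, 0.0, 2.0):
    print(x, sigmoid_prime(sigmoid(x)))
# approx.: -1.0 -> 0.197, 0.0 -> 0.25 (the maximum), 2.0 -> 0.105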

The matrix $l1\_error$ is a 4 by 1 matrix (4 rows, 1 column). The derivative matrix $sigmoid\_prime(l1)$ is also a 4 by 1 matrix, and so the element-wise product $l1\_delta$ is again a 4 by 1 matrix.
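
A quick shape check (not part of the original tutorial; it assumes the training loop above has been run, so that l1_error and l1_delta are still defined):


In [ ]:
# shape check: all three matrices from the last loop iteration are 4 by 1
print(l1_error.shape, sigmoid_prime(l1).shape, l1_delta.shape)  # (4, 1) (4, 1) (4, 1)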

The product of the error and the slope reduces the updates for high-confidence predictions. When the sigmoid slope is very shallow, the network produced a very high or a very low value, that is, it was rather confident. If the network guessed something close to $x = 0$ (i.e. $y = 0.5$), it was not very confident. Such low-confidence predictions are updated most significantly, while the confident, peripheral scores are multiplied by a number close to $0$ and change only slightly.

In the prediction line $l1 = sigmoid(np.dot(X, Wo))$ we compute the dot-product of the input vectors with the weights and apply the sigmoid to the resulting sums. The resulting matrix has as many rows as the first matrix ($X$) and as many columns as the second matrix ($Wo$), i.e. it is 4 by 1. The difference between the true (or gold) values in $y$ and the "guessed" values in $l1$ gives us an estimate of how far off the predictions are.

An example computation for the input $[ 1, 0, 1 ]$ with the weights $[ 9.5, 0.2, -0.1 ]$ and an output of roughly $0.99$: if $y = 1$, then $l1\_error = y - l1 = 0.01$ and $l1\_delta = 0.01 * tiny\_deriv$, so this confident and correct prediction changes the weights only minimally.
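
A sketch of this computation with the numbers from the example (the variable names are for illustration only; the values in the text are rounded, and running the code gives a prediction even closer to $1$):


In [ ]:
# worked example: a confident, correct prediction produces a tiny update
x_example = np.array([1, 0, 1])
w_example = np.array([9.5, 0.2, -0.1])
prediction = sigmoid(np.dot(x_example, w_example))  # very close to 1
error = 1 - prediction                              # tiny error for the target y = 1
delta = error * sigmoid_prime(prediction)           # tiny error times a small slope
print(prediction, error, delta)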

More Complex Example

Consider now a more complicated example where no column has a correlation with the output:

Input | Output
0 0 1 | 0
0 1 1 | 1
1 0 1 | 1
1 1 1 | 0

The pattern here is the classic XOR pattern or problem: if there is a $1$ in either column $1$ or column $2$, but not in both, the output is $1$ (XOR over columns $1$ and $2$).

From our discussion of the XOR problem we remember that this is a non-linear pattern: the output stands in a one-to-one relationship with a combination of the inputs, not with any single input column.

To cope with this problem, we need a network with another layer, that is, a layer that combines and transforms the input, plus an additional layer that maps the result to the output. We will add a hidden layer with randomized weights and then train those weights to optimize the output probabilities for the table above.
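
To see why the extra layer is needed (this experiment is not in the original tutorial), we can train the single-layer network from the first part on the XOR targets; because XOR is not linearly separable, at least one of the four predictions always stays far from its target:


In [ ]:
# illustration: a single-layer network cannot learn XOR
np.random.seed(1)
X_xor = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y_xor = np.array([[0, 1, 1, 0]]).T
W = 2 * np.random.random((3, 1)) - 1        # zero-mean initialization
for n in range(10000):
    out = sigmoid(np.dot(X_xor, W))         # forward pass
    err = y_xor - out                       # error against the XOR targets
    W += np.dot(X_xor.T, err * sigmoid_prime(out))
print(out)  # at least one of the four outputs remains far from its target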

We will define a new $X$ input matrix that reflects the above table:


In [15]:
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
print(X)


[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]

We also define a new output matrix $y$:


In [16]:
y = np.array([[ 0, 1, 1, 0]]).T
print(y)


[[0]
 [1]
 [1]
 [0]]

We initialize the random number generator with a constant again:


In [17]:
np.random.seed(1)

Assuming that our 3 inputs are mapped to 4 hidden-layer neurons, we have to initialize the hidden-layer weights ($Wh$) as a 3 by 4 matrix. The output layer is a single neuron connected to the hidden layer, so the output weight matrix ($Wo$) is a 4 by 1 matrix:


In [18]:
n_inputs = 3
n_hidden_neurons = 4
n_output_neurons = 1
Wh = np.random.random( (n_inputs, n_hidden_neurons) )  * np.sqrt(2.0/n_inputs)
Wo = np.random.random( (n_hidden_neurons, n_output_neurons) )  * np.sqrt(2.0/n_hidden_neurons)
print("Wh:\n", Wh)
print("Wo:\n", Wo)


Wh:
 [[3.40497041e-01 5.88142486e-01 9.33866473e-05 2.46853512e-01]
 [1.19825683e-01 7.53941469e-02 1.52080826e-01 2.82149152e-01]
 [3.23959286e-01 4.39942021e-01 3.42270888e-01 5.59479379e-01]]
Wo:
 [[0.14456957]
 [0.62092279]
 [0.01936595]
 [0.47409212]]

We will now loop 100,000 times to optimize the weights, printing the mean absolute error every 10,000 iterations:


In [19]:
for i in range(100000):
    l1 = sigmoid(np.dot(X, Wh))
    l2 = sigmoid(np.dot(l1, Wo))
    
    l2_error = y - l2
    
    if (i % 10000) == 0:
        print("Error:", np.mean(np.abs(l2_error)))
    
    # gradient, changing towards the target value
    l2_delta = l2_error * sigmoid_prime(l2)
    
    # compute the l1 contribution by value to the l2 error, given the output weights
    l1_error = l2_delta.dot(Wo.T)
    
    # direction of the l1 target:
    # in what direction is the target l1?
    l1_delta = l1_error * sigmoid_prime(l1)
    
    Wo += np.dot(l1.T, l2_delta)
    Wh += np.dot(X.T, l1_delta)

print("Wo:\n", Wo)
print("Wh:\n", Wh)


Error: 0.49962354308293877
Error: 0.010185068686370096
Error: 0.006740755437297382
Error: 0.005345052957883488
Error: 0.004547502999833714
Error: 0.004017450105072936
Error: 0.0036334495140870125
Error: 0.003339229754058864
Error: 0.0031047453709322735
Error: 0.0029123275501446943
Wo:
 [[-7.07355497]
 [14.20975817]
 [-7.01708617]
 [-7.15528258]]
Wh:
 [[ 6.66182717  6.80494448 -3.00320286  3.65642155]
 [-3.43419418  6.77518798  6.58168773  3.71366194]
 [ 0.59089835 -1.96151177  0.18071609 -5.62078292]]

The new computation in this loop is $l1\_error = l2\_delta.dot(Wo.T)$: a confidence-weighted error from $l2$ is used to compute an error for $l1$. The computation sends the error backwards across the weights from $l2$ to $l1$. The result is a contribution-weighted error, because we learn how much each node value in $l1$ contributed to the error in $l2$. This step is called backpropagation. We then update $Wh$ using the same steps we used in the 2-layer implementation.
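
Assuming the training loop above has been run, a final forward pass with the trained weights shows the four predictions close to the targets $[0, 1, 1, 0]$ (this check is not part of the original tutorial):


In [ ]:
# forward pass with the trained weights Wh and Wo
l1 = sigmoid(np.dot(X, Wh))   # hidden-layer activations
l2 = sigmoid(np.dot(l1, Wo))  # output predictions
print(l2)                     # values close to [0, 1, 1, 0]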

Follow Andrew Trask's tutorial and some more of the links and examples that he provides on his website and work through the ML Cheat Sheet.