**(C) 2018-2019 by Damir Cavar**

**Version:** 1.1, November 2019

Another good article to read and understand is Andrej Karpathy's *Yes you should understand backprop*.

In this example we will use a very simple network to start with. The network will only have one input and one output layer. We want to make the following predictions from the input:

Input | Output |
---|---|
0 0 1 | 0 |
1 1 1 | 1 |
1 0 1 | 1 |
0 1 1 | 0 |

We will use *Numpy* to compute the network parameters, weights, activation, and outputs:

In [21]:

```
import numpy as np
```

We will use the *Sigmoid* activation function:

In [22]:

```
def sigmoid(z):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

We could use the ReLU activation function instead:

In [23]:

```
def relu(z):
    """The ReLU activation function."""
    return np.maximum(0, z)

The *sigmoid_prime* function returns the derivative of the sigmoid for any given $z$. Since $z$ here already holds the sigmoid output, the derivative is $z * (1 - z)$. This is basically the slope of the sigmoid function at any given point:
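For completeness, the slope follows from applying the quotient rule to $\sigma(x) = 1 / (1 + e^{-x})$:

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,(1 - \sigma(x))$$

Because the code stores the sigmoid *output* in $z$, the derivative reduces to $z (1 - z)$.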

In [24]:

```
def sigmoid_prime(z):
    """The derivative of sigmoid for z."""
    return z * (1 - z)
```
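As a sanity check, we can compare this analytic slope against a numerical finite-difference estimate. This sketch is not part of the original notebook; it redefines the two functions so it runs on its own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    # z is the *output* of the sigmoid, not its input
    return z * (1 - z)

x = 0.5
eps = 1e-6
# central difference approximation of the slope at x
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_prime(sigmoid(x))
print(abs(numeric - analytic))  # effectively zero (well below 1e-8)
```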

We define the input matrix *X*. There are three input nodes (three columns per row in $X$), and each row is one training example:

In [25]:

```
X = np.array([ [ 0, 0, 1 ],
               [ 0, 1, 1 ],
               [ 1, 0, 1 ],
               [ 1, 1, 1 ] ])
print(X)
```


We define the output vector *y*, where each row represents the output for the corresponding input vector (row) in *X*. The vector is initialized as a single row vector with four columns and transposed (using the $.T$ method) into a column vector with four rows:

In [26]:

```
y = np.array([[0, 0, 1, 1]]).T
print(y)
```


We initialize the random number generator with a constant so that the results are reproducible:

In [27]:

```
np.random.seed(1)
```

We create a weight matrix ($Wo$) with randomly initialized weights:

In [28]:

```
n_inputs = 3
n_outputs = 1
# Wo = 2 * np.random.random( (n_inputs, n_outputs) ) - 1
Wo = np.random.random( (n_inputs, n_outputs) ) * np.sqrt(2.0 / n_inputs)
print(Wo)
```

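The $\sqrt{2.0 / n}$ factor only rescales the draws: since NumPy's np.random.random samples uniformly from $[0, 1)$, the initial weights fall in $[0, \sqrt{2/n\_inputs})$. A quick check of this observation (a sketch, not part of the original notebook):

```python
import numpy as np

np.random.seed(1)
n_inputs = 3
Wo = np.random.random((n_inputs, 1)) * np.sqrt(2.0 / n_inputs)

bound = np.sqrt(2.0 / n_inputs)  # about 0.816
# all weights lie in the half-open interval [0, bound)
print(bool((Wo >= 0).all() and (Wo < bound).all()))
```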

The only learned parameters of this network are the weights in *Wo*. The rest (input matrix, output vector, and so on) are components that we need for learning and evaluation. The learning result is stored in the *Wo* weight matrix.

In the *forward propagation* line we process the entire input matrix for training. This is called **full batch** training. I do not use an alternative variable name to represent the input layer; instead I use the input matrix $X$ directly here. Think of this as the different inputs to the input neurons computed at once. In principle the input or training data could have many more training examples; the code would stay the same.
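To see what one full-batch forward pass looks like in isolation, here is a minimal self-contained sketch (it repeats the seed and initialization from the cells above, so the weights match):

```python
import numpy as np

np.random.seed(1)
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
Wo = np.random.random((3, 1)) * np.sqrt(2.0 / 3)

# one forward pass over all four training examples at once
l1 = 1 / (1 + np.exp(-np.dot(X, Wo)))
print(X.shape, Wo.shape, l1.shape)  # (4, 3) (3, 1) (4, 1)
```

The dot product maps the four 3-dimensional inputs to four scalar activations in one matrix operation; more training rows would only grow the first dimension.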

In [31]:

```
for n in range(10000):
    # forward propagation
    l1 = sigmoid(np.dot(X, Wo))
    # compute the loss
    l1_error = y - l1
    #print("l1_error:\n", l1_error)
    # multiply the loss by the slope of the sigmoid at l1
    l1_delta = l1_error * sigmoid_prime(l1)
    #print("l1_delta:\n", l1_delta)
    # update weights
    Wo += np.dot(X.T, l1_delta)
print("l1:\n", l1)
```


This update rule **reduces the error of high confidence predictions**. When the sigmoid slope is very shallow, the network produced a very high or a very low value, that is, it was rather confident. If the network guessed something close to $x = 0, y = 0.5$, it was not very confident. Such low-confidence predictions are updated most significantly, while the confident peripheral scores are multiplied with a number closer to $0$ and barely change.
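The effect is easy to see by evaluating the slope at a few outputs (a small sketch reusing the sigmoid_prime definition from above):

```python
def sigmoid_prime(z):
    """Derivative of the sigmoid, where z is the sigmoid output."""
    return z * (1 - z)

print(sigmoid_prime(0.5))   # 0.25    -> unsure prediction, largest update
print(sigmoid_prime(0.99))  # ~0.0099 -> confident prediction, tiny update
print(sigmoid_prime(0.01))  # ~0.0099 -> confident prediction, tiny update
```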

Consider now a more complicated example where no column has a correlation with the output:

Input | Output |
---|---|
0 0 1 | 0 |
0 1 1 | 1 |
1 0 1 | 1 |
1 1 1 | 0 |

This is a *non-linear pattern*: the output is in a **one-to-one relationship with a combination of inputs**, not with any single input column.
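We can verify that no single input column predicts the output better than chance (a quick sketch over the table above):

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([0, 1, 1, 0])

# fraction of rows where a single column's value equals the output
for col in range(3):
    print(col, np.mean(X[:, col] == y))  # every column matches in 0.5 of the rows
```

Since each column alone is right only half the time, a single layer of weights cannot learn this table; the network needs to combine columns.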

We add a *hidden layer* with randomized weights and then train those weights to optimize the output probabilities for the table above.

We will define a new $X$ input matrix that reflects the above table:

In [15]:

```
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
print(X)
```


We also define a new output matrix $y$:

In [16]:

```
y = np.array([[0, 1, 1, 0]]).T
print(y)
```


We initialize the random number generator with a constant again:

In [17]:

```
np.random.seed(1)
```

We create two weight matrices: $Wh$ connects the input to the hidden layer, and $Wo$ connects the hidden layer to the output:

In [18]:

```
n_inputs = 3
n_hidden_neurons = 4
n_output_neurons = 1
Wh = np.random.random( (n_inputs, n_hidden_neurons) ) * np.sqrt(2.0 / n_inputs)
Wo = np.random.random( (n_hidden_neurons, n_output_neurons) ) * np.sqrt(2.0 / n_hidden_neurons)
print("Wh:\n", Wh)
print("Wo:\n", Wo)
```


We will loop now 100,000 times to optimize the weights:

In [19]:

```
for i in range(100000):
    l1 = sigmoid(np.dot(X, Wh))
    l2 = sigmoid(np.dot(l1, Wo))
    l2_error = y - l2
    if (i % 10000) == 0:
        print("Error:", np.mean(np.abs(l2_error)))
    # gradient, moving towards the target value
    l2_delta = l2_error * sigmoid_prime(l2)
    # how much each l1 value contributed to the l2 error, given the output weights
    l1_error = l2_delta.dot(Wo.T)
    # in what direction is the target l1?
    l1_delta = l1_error * sigmoid_prime(l1)
    Wo += np.dot(l1.T, l2_delta)
    Wh += np.dot(X.T, l1_delta)
print("Wo:\n", Wo)
print("Wh:\n", Wh)
```


We backpropagate the **confidence weighted error** from $l2$ to compute an error for $l1$. The computation sends the error across the weights from $l2$ to $l1$. The result is a **contribution weighted error**, because we learn how much each node value in $l1$ **contributed** to the error in $l2$. This step is called **backpropagation**. We then update $Wh$ using the same steps we used in the two-layer implementation.
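Putting the pieces together, here is a self-contained sketch of the same two-layer training loop that records the mean absolute error so we can confirm it shrinks over training (a sketch mirroring the cells above, not a new method):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    # derivative in terms of the sigmoid output z
    return z * (1 - z)

np.random.seed(1)
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T
Wh = np.random.random((3, 4)) * np.sqrt(2.0 / 3)
Wo = np.random.random((4, 1)) * np.sqrt(2.0 / 4)

errors = []
for i in range(100000):
    l1 = sigmoid(np.dot(X, Wh))              # hidden layer
    l2 = sigmoid(np.dot(l1, Wo))             # output layer
    l2_error = y - l2
    errors.append(np.mean(np.abs(l2_error)))
    l2_delta = l2_error * sigmoid_prime(l2)  # confidence weighted error
    l1_error = l2_delta.dot(Wo.T)            # contribution weighted error
    l1_delta = l1_error * sigmoid_prime(l1)
    Wo += np.dot(l1.T, l2_delta)
    Wh += np.dot(X.T, l1_delta)

print("first error:", errors[0])
print("final error:", errors[-1])
```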