1.2 The Activation Function

1.3 How do NNs work?

1.4 How do Neural Networks learn?

1.5 Gradient Descent

1.6 Stochastic Gradient Descent

1.7 Backpropagation

ANN Intuition

Credit: Deep Learning A-Z™: Hands-On Artificial Neural Networks

The Neuron

A biological neuron consists of:

- Cell body: the neuron itself
- Dendrites: receivers of signals for the neuron
- Axon: transmitter of the signal from the neuron

How can we represent a neuron in a machine?

Input signals: dendrites
Output signal: axon
The input values send their signals through synapses to the neuron, which then produces an output value.

Input layer:
- Receive all input values (independent variables). These independent variables are all for 1 single observation (1 row of all values: age, bank amount, ...)
- You should standardize these input values.
Output value: can be
- Continuous (e.g. price)
- Binary
- Categorical

Weights: are how neural networks learn.

By adjusting the weights, the neural network decides for every single case which signal is important, and which one is not.
Weights are adjusted through the process of learning.

What happens inside the neuron?

1st step: take the weighted sum of all the input values, $\sum_{i=1}^m w_i x_i$.
2nd step: apply the activation function ($\phi$) to the weighted sum.
3rd step: the neuron passes the resulting signal to the next neuron down the line.
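The three steps above can be sketched in a few lines of Python. This is only an illustration: the input values, the weights, and the choice of sigmoid as $\phi$ are made up for the example and are not from the course.

```python
import math


def sigmoid(z):
    """One possible activation function phi (chosen here only as an example)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, activation):
    """Compute the signal a single neuron passes on."""
    # 1st step: weighted sum of all the input values, sum_i w_i * x_i
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # 2nd step: apply the activation function to the weighted sum
    # 3rd step: the returned value is the signal sent to the next neuron
    return activation(weighted_sum)


# Hypothetical standardized inputs and weights, for illustration only
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, 0.1]
signal = neuron_output(x, w, sigmoid)
print(round(signal, 4))
```

Here the weighted sum is 0.2 - 0.3 + 0.2 = 0.1, and the sigmoid turns it into a signal slightly above 0.5.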

The Activation Function

There are many activation functions, but we are going to look at 4 of them.

Threshold function

If the value is less than 0, then the Threshold function passes on 0.
If the value is more than or equal to 0, then the Threshold function passes on 1.
basically yes/no type of function

Sigmoid function

a smooth function
very useful in the output layer when we try to predict probabilities.

Rectifier

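one of the most widely used activation functions: it passes on 0 for negative values and the value itself otherwise, $\max(0, x)$.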

Hyperbolic Tangent (tanh)

similar to sigmoid function, but goes below 0 to -1
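As a rough sketch, the four functions can be written as below. The notes describe them only in words, so the formulas here are the standard textbook definitions, and the function names are just chosen for this example.

```python
import math


def threshold(z):
    """Passes on 1 if z >= 0, otherwise 0 (yes/no)."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Smooth curve between 0 and 1: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def rectifier(z):
    """0 for negative values, z itself otherwise: max(0, z)."""
    return max(0.0, z)


def tanh(z):
    """Like sigmoid in shape, but ranges from -1 to 1."""
    return math.tanh(z)


# Compare the four functions on a few sample values
for z in (-2.0, 0.0, 2.0):
    print(z, threshold(z), round(sigmoid(z), 3), rectifier(z), round(tanh(z), 3))
```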

Example: Assuming the dependent variable is binary (y = 0 or 1), which activation function can we use? We have 2 options:

- Threshold function: fits perfectly when we need an output of exactly 0 or 1
- Sigmoid function: outputs a value between 0 and 1, which we can read as the probability that y = 1.

We have a very common combination where:
- Hidden layer: use rectifier function
- Output layer: use sigmoid function to give us the probabilities

How do NNs work?

Basic form of a neural network

Only contains an input layer and an output layer.
All of the input variables are weighted by the synapses, and the output (for example, the price of a property) is calculated as the weighted sum of all the inputs. You can apply any activation function to get a certain output.

However, neural networks have an extra advantage that increases accuracy: hidden layers.

For each neuron in the hidden layers:

- The weights of the input variables are not all equal. Some weights are non-zero and some are zero, because not all inputs are important for that neuron. For example, the first neuron may only care about 2 inputs, area and distance from the city; one explanation is that properties further from the city tend to have a larger area. That is why we don't need to draw the synapses that are not important.

No single neuron can predict the price on its own, but together they can do a proper job.
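A minimal sketch of this idea, assuming a rectifier in the hidden layer and a plain weighted sum at the output. All input values, weights, and the extra input names (bedrooms, age) are hypothetical; a weight of 0 stands for a synapse we would not draw.

```python
def rectifier(z):
    return max(0.0, z)


def layer(inputs, weight_rows, activation):
    """One output per neuron: activation(weighted sum of the inputs)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)))
        for row in weight_rows
    ]


# Hypothetical standardized inputs: area, bedrooms, distance to city, age
inputs = [1.2, 0.5, -0.3, 0.8]

hidden_weights = [
    [0.6, 0.0, 0.4, 0.0],   # neuron 1 only looks at area and distance
    [0.5, 0.7, 0.0, -0.2],  # neuron 2 ignores distance
    [0.0, 0.0, 0.0, -0.9],  # neuron 3 only looks at age
]
output_weights = [[1.0, 0.8, 0.5]]  # combines the three hidden neurons

hidden = layer(inputs, hidden_weights, rectifier)
price = layer(hidden, output_weights, lambda z: z)[0]  # no activation at output
print(hidden, price)
```

Each hidden neuron picks up one aspect of the inputs, and only the output neuron combines them into a single (standardized) price.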

How do Neural Networks learn?

Perceptron: a single layer feed forward neural network
Output value ($\hat y$): predicted value by the neural network
Actual value: y
We calculate the cost function, which measures the error between the predicted value and the actual value:
- Cost function: $\frac{1}{2}(\hat y - y)^2$
- Our goal is to minimize the cost function
Once we have the cost, we feed this information back into the neural network, and the weights get updated to minimize the cost function.
1 epoch: one pass through the whole dataset.
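A small sketch of the cost calculation. The predictions and actual values are made up, and summing the per-row costs over the dataset is an assumption for this example (the notes only give the single-row formula).

```python
def cost(y_hat, y):
    """Cost for one observation: 1/2 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2


def total_cost(predictions, targets):
    """Assumed aggregation: sum of the per-row costs over one epoch."""
    return sum(cost(p, t) for p, t in zip(predictions, targets))


# Hypothetical predicted and actual values for three rows
predictions = [0.9, 0.2, 0.6]
targets = [1.0, 0.0, 1.0]

epoch_cost = total_cost(predictions, targets)
print(round(epoch_cost, 3))
```

The third row contributes the most to the cost, because its prediction is furthest from the actual value.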

There are many cost functions; see "A list of cost functions used in neural networks, alongside applications".

Gradient Descent

How can we minimize the cost function?

One approach is brute force: we try out lots of weights and plot the resulting cost, which gives a graph with:
- y-axis: cost function
- x-axis: $\hat y$

Why should we not use this brute-force approach of trying out lots of values for the weights?

Because as you increase the number of weights (synapses), you run into the curse of dimensionality.
Example to understand the curse of dimensionality:
- We have a 1-layer neural network with 25 weights. If we try out 1000 values for each weight, we need to evaluate $1000^{25} = 10^{75}$ combinations.
- Sunway TaihuLight, once the world's fastest supercomputer, can do 93 PFLOPS ($93 \times 10^{15}$ floating-point operations per second). Even at one operation per combination, it would take $\frac{10^{75}}{93 \times 10^{15}} \approx 1.08 \times 10^{58} \text{ seconds} \approx 3.4 \times 10^{50} \text{ years}$.
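The arithmetic can be checked quickly in Python. Two simplifying assumptions are made here: one floating-point operation per combination (very optimistic) and a 365-day year.

```python
values_per_weight = 1000
num_weights = 25

# Every combination of 1000 candidate values across 25 weights
combinations = values_per_weight ** num_weights

flops = 93 * 10 ** 15  # 93 PFLOPS
seconds = combinations / flops
years = seconds / (365 * 24 * 3600)

print(f"{float(combinations):.2e} combinations")
print(f"{seconds:.2e} seconds")
print(f"{years:.2e} years")
```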

We need a different approach: Gradient Descent

Intuition:

Let's say we start at some point on the cost function. By differentiating, we find the slope of the cost function at that point. The sign of the slope tells us which direction is downhill, so we take a step in that direction and repeat until we reach the minimum.
It is called gradient descent because you are descending into the minimum of the cost function.
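A minimal sketch of this intuition for a single weight. It assumes a toy model $\hat y = w x$ with one made-up observation, and a fixed step size (called `learning_rate` here); none of these specifics come from the course.

```python
def cost(w, x, y):
    """Cost 1/2 * (y_hat - y)^2 with the toy model y_hat = w * x."""
    return 0.5 * (w * x - y) ** 2


def slope(w, x, y):
    """Derivative of the cost with respect to w."""
    return (w * x - y) * x


x, y = 2.0, 4.0        # hypothetical observation; the cost is minimal at w = 2
w = -1.0               # arbitrary starting weight
learning_rate = 0.1    # assumed step size

for step in range(50):
    # Negative slope: increase w. Positive slope: decrease w.
    w -= learning_rate * slope(w, x, y)

print(round(w, 6), cost(w, x, y))
```

Each step moves `w` against the sign of the slope, so the cost shrinks until `w` settles near 2.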

Stochastic Gradient Descent

Gradient Descent is only guaranteed to reach the global minimum when the cost function is convex. If our cost function is not convex, gradient descent can get stuck in a local minimum instead of the global minimum.

However, Stochastic Gradient Descent does not require our cost function to be convex.

Differences between Gradient Descent (also called Batch Gradient Descent) and Stochastic Gradient Descent

| Gradient Descent | Stochastic Gradient Descent |
| --- | --- |
| Calculates the cost function and adjusts the weights using all rows at once | Calculates the cost function and adjusts the weights one row at a time |
| Runs slower | Runs faster |
| Deterministic algorithm | Stochastic (random) algorithm |

The reason Stochastic Gradient Descent helps avoid getting stuck in a local minimum is that its updates fluctuate much more, so it is more likely to find the global minimum.
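To make the difference in the update schedule concrete, here is a rough sketch of both variants on a toy model $\hat y = w x$. The dataset, learning rate, and function names are assumptions for the example. The cost here is convex, so the sketch shows only when the weights are updated, not how local minima are escaped.

```python
import random


def gradient(w, x, y):
    """Slope of 1/2 * (w*x - y)^2 with respect to w, for one row."""
    return (w * x - y) * x


def batch_gradient_descent(data, w=0.0, lr=0.05, epochs=100):
    for _ in range(epochs):
        # Look at ALL rows, then adjust the weight once per epoch
        grad = sum(gradient(w, x, y) for x, y in data) / len(data)
        w -= lr * grad
    return w


def stochastic_gradient_descent(data, w=0.0, lr=0.05, epochs=100, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)      # visit rows in random order
        for x, y in rows:
            # Adjust the weight after EVERY single row
            w -= lr * gradient(w, x, y)
    return w


# Hypothetical rows generated from y = 3x
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

w_batch = batch_gradient_descent(data)
w_sgd = stochastic_gradient_descent(data)
print(round(w_batch, 4), round(w_sgd, 4))
```

Both variants should end up close to `w = 3` on this data; the stochastic version makes four weight updates per epoch instead of one.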

Additional Reading:

Backpropagation

Forward Propagation: Information is entered into the input layer and then it is propagated forward to get the output value $\hat y$. The output values are then compared to actual values to compute errors. The errors are then back-propagated through the network in the opposite direction to train the network by adjusting the weights.

Backpropagation allows us to adjust all the weights at the same time.
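A minimal sketch of one forward and backward pass, assuming a tiny network with 2 inputs, 2 sigmoid hidden neurons, 1 sigmoid output, and the cost $\frac{1}{2}(\hat y - y)^2$. The architecture, starting weights, and learning rate are made up for illustration and are not the course's example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Forward propagation: inputs -> hidden layer -> y_hat."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y_hat = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, y_hat


def train_step(x, y, w_hidden, w_out, lr=0.5):
    hidden, y_hat = forward(x, w_hidden, w_out)
    # Error at the output, times the sigmoid derivative y_hat * (1 - y_hat)
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    # Propagate the error back through the output weights to each hidden neuron
    delta_hidden = [
        delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    # Adjust all weights together, using gradients from the same pass
    new_w_out = [w_out[j] - lr * delta_out * hidden[j] for j in range(len(hidden))]
    new_w_hidden = [
        [w_hidden[j][i] - lr * delta_hidden[j] * x[i] for i in range(len(x))]
        for j in range(len(hidden))
    ]
    return new_w_hidden, new_w_out


# Hypothetical single observation and starting weights
x, y = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.2, -0.1]

_, before = forward(x, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = train_step(x, y, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
print(round(before, 3), round(after, 3))
```

After repeated steps the prediction moves from about 0.5 toward the actual value of 1, which means the cost has gone down.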

Additional Reading: Chapter 2 - How the backpropagation algorithm works

Step-by-step walkthrough of training an ANN

Table of Contents