Plan of attack

  • What are Convolutional Neural Networks?
  • Step 1: Convolution Operation
  • Step 1(b): ReLU Layer
  • Step 2: Pooling
  • Step 3: Flattening
  • Step 4: Full Connection
  • Extra: Softmax & Cross-Entropy

What are Convolutional Neural Networks?

  • The brain processes certain features of an image and classifies what it sees based on them. Depending on the features you perceive, you categorize things in certain ways.

    • For example:
      • By looking at the right side of the image above, you see a person looking to the right.
      • By looking at the left side of the image above, you see a person looking at you.
  • A CNN works in a similar way to this processing in our brain.

  • Our CNN will see:
    • Black-and-white image = 2D array (one value per pixel)
    • Colored image = 3D array (one layer per color channel: red, green, blue)

  • We can simplify an image as: white = 0 and black = 1.
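
A minimal NumPy sketch (the array shapes and values are illustrative) of how a CNN "sees" these two kinds of images:

```python
import numpy as np

# A black-and-white image is a 2D array: one value per pixel.
# Using the simplified convention above: white = 0, black = 1.
bw_image = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
])
print(bw_image.shape)  # (3, 3) -> height x width

# A colored image is a 3D array: one 2D layer per color channel (R, G, B).
color_image = np.zeros((3, 3, 3))
print(color_image.shape)  # (3, 3, 3) -> height x width x channels
```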

Step 1: Convolution

  • The feature detector here is a 3x3 matrix; some architectures use larger filters (AlexNet, for example, uses 11x11 filters in its first layer).
  • A feature detector can also be called a kernel or a filter.
  • The convolution operation is denoted by $\bigotimes$ (an "x" inside a circle).
  • What happens in the background is
    • First, we put the filter on top of the input image, multiply each pair of corresponding values (element-wise multiplication), and then add up the results.
    • Then we move the filter across the image; the step size by which we move it is called the "stride". A stride of 1 moves 1 pixel at a time, a stride of 2 moves 2 pixels.
    • After these two steps we have a feature map matrix (also called a convolved feature); see the sketch after this list.
  • Why do we do this?
    • It makes the image smaller, so it is easier to process. The bigger the stride, the smaller the resulting feature map.
    • The purpose of the feature detector is to detect certain features, parts of the image that are integral. The highest numbers in the feature map occur where the pattern in the image matches the detector.
    • Our brain works the same way: we do not look at every pixel, but rather at features like a nose, a hat, ...

  • We create many feature maps to obtain our first convolution layer, because each feature detector we apply gives us a different feature map.
  • The primary purpose of convolution is to find features in your image using the feature detector, put them into a feature map, and still preserve the spatial relationship between pixels.
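
A sketch of the multiply-and-sum mechanics described above, in plain NumPy (the 5x5 image and 3x3 detector are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the feature detector over the image; at each position,
    multiply element-wise and sum to get one feature-map value."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            # element-wise multiplication, then add up the results
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])
detector = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
])
print(convolve2d(image, detector))  # highest value where the pattern matches up
```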

Additional Reading

Step 1(b): ReLU Layer

  • ReLU: Rectified Linear Unit
  • By applying ReLU, we increase the non-linearity in our CNN, since images themselves are highly non-linear.

  • In the example above, the ReLU function removes all the black (the negative values) in the image.
    • There is a certain linearity when the image goes from white to black (white -> gray -> black). The ReLU function will break up that progression.
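
A minimal sketch of what ReLU does to a feature map: every negative value becomes zero, everything else passes through unchanged (the numbers are made up):

```python
import numpy as np

def relu(feature_map):
    # Rectified Linear Unit: element-wise max(0, x),
    # so all negative values (the "black") become zero.
    return np.maximum(0, feature_map)

feature_map = np.array([
    [ 2, -1,  3],
    [-4,  0,  1],
    [ 5, -2, -3],
])
print(relu(feature_map))
# [[2 0 3]
#  [0 0 1]
#  [5 0 0]]
```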

Additional Reading

Step 2: Max Pooling

  • The problem arises when we have multiple images of the same object, but the object faces a different direction or the texture is a bit different.
  • We have to make sure our neural network has a property called spatial invariance: it should not care exactly where in the image the features are located.

There are several types of pooling: mean pooling, max pooling, sum pooling...

Applying Max Pooling

  • We take a 2x2 box of pixels, place it in the top-left corner of the feature map, find the maximum value in that box, and record only that value. Then we move the box to the right by the stride.
  • A few things happen:
    • We are still able to preserve the features: the maximum value represents the closest match to the feature.
    • Pooling helps us get rid of 75% of the information that is not important (with a 2x2 box and stride 2, only 1 value in 4 is kept). We reduce the size and the number of parameters going into the final layers of the neural network, which helps prevent overfitting.
    • Taking the maximum of the pixels accounts for small distortions: the pooled features of slightly rotated or shifted versions of an image come out nearly the same.
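
A sketch of 2x2 max pooling with stride 2 on a made-up 4x4 feature map, keeping 1 value out of every 4:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Slide a size x size box over the feature map and
    record only the maximum value inside each box."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            box = feature_map[i * stride:i * stride + size,
                              j * stride:j * stride + size]
            pooled[i, j] = box.max()
    return pooled

feature_map = np.array([
    [1, 0, 2, 3],
    [4, 6, 6, 8],
    [3, 1, 1, 0],
    [1, 2, 2, 4],
])
print(max_pool(feature_map))  # [[6. 8.]
                              #  [3. 4.]] -> 75% of the values are discarded
```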

Additional Reading

Step 3: Flattening

  • Pooled Feature Map: result of these steps:
    • First, we apply the convolution operation to our image.
    • Then we apply pooling to the result of the first step.
  • Flattening: we take the Pooled Feature Map and flatten it into a column. We simply take the numbers row by row and put them into one long column.
    • The reason for this is because we want to put this into an artificial neural network for further processing.
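
Flattening in NumPy is a one-liner; continuing from the pooled 2x2 map above:

```python
import numpy as np

pooled_feature_map = np.array([
    [6, 8],
    [3, 4],
])
# Read the values row by row into one long vector that can be
# fed into the input layer of the artificial neural network.
flattened = pooled_feature_map.flatten()
print(flattened)  # [6 8 3 4]
```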

Summary of 3 steps

Step 4: Full Connection

  • After all 3 steps above (convolution, pooling, flattening), we then add a whole new ANN on top.
  • The fully connected layers are the hidden layers of this ANN.
    • Hidden layers don't have to be fully connected in general.
    • However, in a Convolutional Neural Network, we use a specific type of hidden layer that is fully connected.
  • The main purpose of the ANN is to combine our features into more attributes that predict the classes better.

In the example above, why do we have 2 outputs and not one?

  • One output is used when we are predicting a numerical value, i.e., a regression type of problem.
  • However, when we do classification, we need 1 output per class (or category).

The process in this ANN works as normal:

  • After the pooling step, information goes through the ANN and a prediction is made.
  • The error is calculated by the cost function (the loss function; here, the cross-entropy function). We try to minimize the loss function to optimize our network, and the error is then backpropagated through the network.
  • A couple of things are adjusted in the network to help optimize performance:
    • Weights (synapses) are adjusted.
    • Feature detectors (the 3x3 matrices) are adjusted.

How do two output neurons work?

  • Let's say, hypothetically, we have these numbers in the previous fully connected layer. Here they are between 0 and 1, but they can be anything. Each number represents the importance of a feature:
    • A value near 1: the neuron is confident that it found an important feature.
    • A value near 0: the neuron did not find that feature important.
  • In this case, the high numbers indicate that those features (fluffy ears, wet nose, ...) belong to a dog. Even though the signals are sent to both the Dog and the Cat neurons, some of them indicate that it is a dog.
  • Over many iterations, the Dog neuron learns which of these neurons indeed fire up when the features belong to a dog. The Cat neuron, on the other hand, learns to ignore those neurons, because they are not an indication of a cat.
  • That is how the final output neurons learn which neurons in the last fully connected layer to listen to, and how the features are propagated through the network and conveyed to the output.
  • Depending on the values of the neurons in the final fully connected layer, those neurons get to vote on the final prediction; the weights represent the importance of each vote.

Summary

  • Step 1: Starting with an input image, we apply multiple different feature detectors (filters); together, the resulting feature maps comprise the convolutional layer.
  • Step 1(b): On top of the convolutional layer, we apply the ReLU (Rectified Linear Unit) to break up linearity and increase the non-linearity in our images.
  • Step 2: Apply the pooling layer to our convolutional layer. The main purpose of the pooling layer is to make sure we have spatial invariance in our images, reduce the size, and reduce overfitting.
  • Step 3: Flatten all the pooled feature maps into 1 long vector so we can input it into an Artificial Neural Network.
  • Step 4: A fully connected ANN, where all the features are processed through the network; the final layer then performs the voting towards the classes. All of this is trained through forward propagation and backpropagation, over lots of iterations and epochs.
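
Putting all four steps together, here is a minimal Keras sketch of such a network (the layer sizes, the 64x64 input, and the two-class Dog/Cat output are illustrative, not a tuned architecture):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Step 1 + 1(b): convolution with 32 feature detectors (3x3), ReLU applied
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    # Step 2: 2x2 max pooling for spatial invariance and size reduction
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Step 3: flatten the pooled feature maps into one long vector
    layers.Flatten(),
    # Step 4: fully connected hidden layer, then one output per class
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),  # e.g., Dog vs. Cat
])

# Cross-entropy loss; training adjusts both the Dense weights (synapses)
# and the Conv2D feature detectors through backpropagation.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```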

Additional Reading

Softmax & Cross-Entropy

  • The output of a single prediction consists of 2 values, indicating the probability of a dog and of a cat. Why do these two values add up to 1?

    • Normally, the Dog and Cat neurons could have any real values.
    • If we apply the softmax function, it brings these two values between 0 and 1 and makes them add up to 1.
    • Softmax function: $\large f_j(z) = \large \frac{e^{z_j}}{\sum_k e^{z_k}}$
  • Cross-Entropy function: $\large L_i = - \log \left( \large \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right)$

    • A different representation of cross-entropy: $\large H(p,q) = - \sum_{x} p(x) \log q(x) $
    • After softmax, we use cross-entropy as a loss function. The loss function is something we want to minimize in order to maximize the performance of our network.
    • You can plug the numbers into the cross-entropy formula as shown below.
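
A quick numeric sketch (the raw scores are made up) of softmax followed by cross-entropy:

```python
import numpy as np

z = np.array([2.0, 1.0])                # raw scores for [Dog, Cat]
probs = np.exp(z) / np.sum(np.exp(z))   # softmax: values in (0, 1) that sum to 1
print(probs)                            # [0.731 0.269]

p = np.array([1.0, 0.0])                # ground truth: it is a dog
loss = -np.sum(p * np.log(probs))       # cross-entropy H(p, q)
print(loss)                             # 0.313...
```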

Comparing different loss functions when comparing 2 neural networks

  • The advantage of Cross-Entropy over Mean Squared Error:
    • If, at the start of backpropagation, your output value is very tiny, then the gradient in gradient descent will be very low, and it will be very hard for the neural network to start moving around and adjusting the weights. However, since Cross-Entropy contains a logarithm, it responds strongly even to very small values and helps the network improve, as the comparison below shows.
  • Cross-Entropy is only preferred for classification. If it is a regression problem, you should use Mean Squared Error.
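
A small comparison with made-up predictions, showing why the logarithm matters: for a very wrong prediction, cross-entropy produces a much larger loss (and hence a stronger gradient) than mean squared error:

```python
import numpy as np

label = np.array([1.0, 0.0])  # true class: Dog

for probs in (np.array([0.90, 0.10]),   # confident and correct
              np.array([0.01, 0.99])):  # confident and very wrong
    mse = np.mean((label - probs) ** 2)
    cross_entropy = -np.sum(label * np.log(probs))
    print(f"probs={probs}: MSE={mse:.3f}, cross-entropy={cross_entropy:.3f}")

# MSE stays bounded (0.010 vs 0.980) even for the very wrong prediction,
# while cross-entropy blows up (0.105 vs 4.605) because of the log of a
# tiny probability, giving the network a much stronger signal to adjust.
```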

Additional Reading