Convolutional Neural Networks

This notebook will summarize convolutional neural networks (CNN) as presented in the following fantastic resources:

Architecture Overview

A CNN is similar to a regular feed-forward NN, except that it explicitly assumes the inputs are images, which lets us build certain properties into the architecture.

Regular NNs have an input layer, 1 or more hidden layers, and an output layer. These layers are dense and fully-connected, which leads to many connections, many weights, and many calculations. They do well with small images (32 x 32 x 3 = 3,072 weights per hidden neuron) but do not scale to full-sized images (200 x 200 x 3 = 120,000 weights per neuron). Unlike a regular NN, a CNN arranges its neurons in 3 dimensions: width, height, and depth. Each neuron is only connected to a small region of the previous layer rather than being fully-connected, which saves on computation. The final output layer is fully-connected and reduces the volume to a small vector of class scores arranged along the depth dimension, for example a 1x1x10 vector for a 10-category classification problem.

Every CNN layer transforms a 3D input volume into a 3D output volume through some differentiable function that may or may not have parameters.

CNN Layers

The 3 most popular layers for CNNs are: Convolutional, Pooling, and Fully-Connected

Example architecture: [INPUT > CONV > RELU > POOL > FC]

  • INPUT [32x32x3] holds raw pixel values
  • CONV layer computes dot products between its filters' weights and small regions of the input, resulting in [32x32x12] if we use 12 filters
  • RELU layer applies an elementwise activation function such as max(0,x), leaving the volume size unchanged
  • POOL downsamples along spatial dimensions (height, width) resulting in [16x16x12]
  • FC computes class scores resulting in [1x1x10]
  • DROPOUT randomly deactivates a subset of nodes during training to prevent overfitting

CONV/FC layers have parameters and are trained with gradient descent to minimize overall error
RELU/POOL layers don't and implement a fixed function

CONV/FC/POOL have additional hyperparameters; RELU does not
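
As a concrete illustration, the example stack above can be written in a few lines of Keras. This is a minimal sketch: the 32x32x3 input, the 12 filters, and the 10 output classes come from the example; everything else (3x3 kernels, 'same' padding, softmax) is an illustrative choice.

In [ ]:
from tensorflow.keras import layers, models

# INPUT > CONV > RELU > POOL > FC, with DROPOUT before the classifier
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                 # INPUT [32x32x3]
    layers.Conv2D(12, (3, 3), padding='same',
                  activation='relu'),                # CONV + RELU -> [32x32x12]
    layers.MaxPooling2D((2, 2)),                     # POOL -> [16x16x12]
    layers.Flatten(),
    layers.Dropout(0.5),                             # DROPOUT (active only during training)
    layers.Dense(10, activation='softmax'),          # FC -> class scores [1x1x10]
])
model.summary()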

Convolutional Layer

Overview - CONV Layer

Resource-wise, this is the core computational block of a CNN and does most of the heavy lifting. It consists of a set of "learnable" filters which are small spatially (height x width) but extend through the full depth of the input volume. For example, a first-layer filter for an RGB image could be size [5x5x3], which is 5x5 pixels by 3 color channels. The convolution process:

  • slides the filter across and down (typewriter style) the input
  • computes a dot product between its entries and the input values it's "covering"
  • produces a 2-D activation map showing that filter's responses
  • repeats in parallel the same process for other filters in the CONV layer filter set
  • stacks the activation maps along the depth dimension to produce an output volume

The receptive field is the filter size: it determines how large a region of the previous layer each neuron connects to. Spatially (height x width) this is smaller than the input volume, but it ALWAYS covers the full depth of the input volume. If our input image had 12 color bands, then each neuron in the CONV layer would have weights for a 5x5x12 region of the input volume.

Hyperparameters - CONV Layer

These 4 values control the size of the output volume:

  1. Number of Filters (K), or depth of the output volume, corresponds to the number of filters (each looks for different features like edges, gradients, colors, etc.). 5 filters would mean 5 different neurons looking at the same input region (e.g. 5x5x3), and they would form a depth column. The weight matrices in a CNN belong to these filters. Unlike an FC layer, which applies one set of weights to the whole input and produces 1 output, each CNN filter applies its weights across the input and produces its own activation map, so 12 filters produce 12 output maps from that layer. During training, SGD determines the best features (filter weights) by itself, so we don't have to guess. Successive layers of a CNN have more filters, combining lower-level filters in differing ways to create new, more complex features. For example, an RGB input image of size (224x224x3) passed through a CONV layer with 12 filters (with padding to preserve the spatial size) produces a (224x224x12) output volume.
  2. Stride (S) is how we move the filter. S=1 means we slide from pixel to pixel, while S=2 means we are skipping every other pixel. Larger strides result in smaller output volumes spatially (HxW).
  3. Zero-Padding (P) adds zeros around the border to control the spatial size of the output volume. Used to preserve a spatial output equal to the input.
  4. Spatial Extent (F) is the height (same as width) of the "window"

Output volume spatial size = (W - F + 2P)/S + 1

  • W = input volume size
  • F = receptive field size of CONV layer neurons
  • S = stride
  • P = zero padding on the border

Ensure the input & output spatial sizes are equal by setting padding P = (F-1)/2 when the stride is S=1
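
We can sanity-check the formula in a couple of lines of Python (a minimal sketch; the function name is just for illustration):

In [ ]:
def conv_output_size(W, F, S, P):
    """Spatial output size of a CONV layer (or a POOL layer, with P=0)."""
    return (W - F + 2*P) // S + 1

print(conv_output_size(W=32, F=5, S=1, P=2))    # 32: P=(F-1)/2 with S=1 preserves the input size
print(conv_output_size(W=227, F=11, S=4, P=0))  # 55: the [227x227x3] example used below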

Parameter Sharing - CONV Layer

Parameter Sharing is used in CONV layers to control the number of parameters that must be kept in memory during computation. We assume that a feature found at one spatial position will be useful at another, so instead of each neuron within a depth slice (filter layer) keeping separate weights & bias, we set them all to be the same. Now every neuron in a depth slice shares the same weights, so we only have unique weights for each depth slice. Consider an example where the input image size is [227x227x3] with F=11, S=4, P=0, and K=96, giving an output volume of [55x55x96].

  • Without parameter sharing: the first CONV layer would have [55x55x96] = 290,400 neurons, each with its own weights. Each of these neurons connects to an [11x11x3] region of the input and has 1 bias unit, so 11x11x3 + 1 = 364 parameters per neuron. That leaves 290,400 x 364 = 105,705,600 parameters
  • With parameter sharing: the first CONV layer has only 96 unique sets of weights (one per depth slice/filter). Each connects to an [11x11x3] input region, so we have 96x11x11x3 = 34,848 weights, or 34,944 parameters including the 96 biases
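
The arithmetic above can be checked directly (numbers from the example; biases counted separately):

In [ ]:
F, depth, K = 11, 3, 96                  # filter size, input depth, number of filters
neurons = 55 * 55 * K                    # output volume: (227 - 11)/4 + 1 = 55 per spatial dimension

# without sharing: every neuron keeps its own 11x11x3 weights plus 1 bias
print(neurons * (F*F*depth + 1))         # 105,705,600

# with sharing: one set of 11x11x3 weights per filter, plus 96 biases
print(K * F*F*depth, '+', K, 'biases')   # 34,848 + 96 biases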

With all neurons in a depth slice using the same weights, the forward pass for each depth slice of the CONV layer can be computed as a convolution. This is why we refer to each set of weights as a filter, or kernel, that is convolved with the input.

Parameter sharing makes sense when the features detected are simple, like lines and gradients, but when spatial location matters (like looking for eyes in a portrait) parameter sharing can be relaxed and we call the layer a locally-connected layer.

Pooling Layer

Overview - POOL Layer

To progressively reduce the spatial size and reduce the number of parameters in a ConvNet, Pooling layers are inserted between CONV layers. This reduces the amount of computation required and helps control overfitting. The POOL layer operates independently on each depth slice of the input and resizes it spatially using the MAX operation.

Hyperparameters - POOL Layer

These 2 values determine how much we are going to reduce the ConvNet:

  1. Spatial Extent (F) is the size of the window
  2. Stride (S) is how many pixels we move the window

POOL layer accepts an input of [WxHxD] and produces an output of:

  • W' = (W-F)/S + 1
  • H' = (H-F)/S + 1
  • D' = D

This introduces no parameters (it actually reduces the number of activations passed forward), and we usually do not pad the POOL layer. In practice, there are only 2 types of MAX POOL layers used:

  1. F=2, S=2
  2. F=3, S=2 (overlapping pooling)

With F=2, S=2, a 2x2 window slides across the input with stride 2 (no overlap) and outputs the max value it sees, discarding 75% of the activations. Other pooling units can be used besides MAX, including average pooling or L2-norm pooling.
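
A minimal NumPy sketch of the MAX operation on a single input volume (the function name is illustrative):

In [ ]:
import numpy as np

def max_pool(x, F=2, S=2):
    """Max-pool each depth slice of an (H, W, D) volume independently."""
    H, W, D = x.shape
    H2, W2 = (H - F)//S + 1, (W - F)//S + 1
    out = np.zeros((H2, W2, D))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = x[i*S:i*S+F, j*S:j*S+F].max(axis=(0, 1))
    return out

print(max_pool(np.random.randn(32, 32, 12)).shape)   # (16, 16, 12): 75% of activations discarded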

Fully Connected Layer

Overview - FC Layer

Each neuron in a fully-connected layer has full connections to every activation in the previous layer. A matrix of weights is applied to the inputs to produce an output. Because we have labeled data, the NN learns (during the training phase) what those weights should be to transform the input into the desired output.

Resource-wise, this is where most of the parameters live (and hence most of the weight memory), since every node of an FC layer connects to every node of the previous layer.
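
In code, an FC layer is just a matrix-vector multiply plus a bias. A tiny NumPy sketch using the [16x16x12] -> [1x1x10] shapes from the earlier example (random values stand in for learned weights):

In [ ]:
import numpy as np

x = np.random.randn(16*16*12)        # flattened activations from the POOL layer
W = np.random.randn(10, 16*16*12)    # one row of learned weights per class score
b = np.random.randn(10)

scores = W @ x + b                   # every output neuron sees every input activation
print(scores.shape)                  # (10,)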

Dropout Layer

Overview - DROPOUT Layer

Dropout is used during training to disregard a random subset of nodes. The weights attached to these nodes are not used in calculations and do not contribute to forward propagation; they are also not updated during backprop (e.g. SGD). This reduces overfitting because what has been learned (the weight matrices) is partially dampened, or turned off, on each training pass. This actually gives worse performance on the training set than we could otherwise achieve, so why do it?

DROPOUT is used to prevent overfitting, thereby increasing our validation set accuracy. This generalizes our NN making it perform slightly worse on the training data, but much better on the validation or test sets where it really matters.

DROPOUT is only active during training; it is turned off at prediction time because we want all of the NN's capacity devoted to producing the correct output classification. As an example, if we drop 50% of the activations during training but then use all of them at prediction time without compensating, the contributions to the output would be too large! Keras handles this internally by scaling the surviving activations during training, so we don't have to do anything about it ourselves.
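
To see why the scaling matters, here is a small NumPy sketch of "inverted" dropout, the variant Keras uses: surviving activations are scaled up by 1/(1-p) during training so nothing needs to change at prediction time. This is an illustration, not the actual Keras internals:

In [ ]:
import numpy as np

p = 0.5                                 # dropout rate
x = np.random.randn(100000)             # activations from some layer

# training: zero out ~50% of activations and scale the survivors by 1/(1-p)
mask = (np.random.rand(x.size) >= p) / (1 - p)
x_train = x * mask

# prediction: use all activations, no extra scaling needed
x_test = x

print(x_train.mean(), x_test.mean())    # approximately equal in expectation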

DROPOUT layers are not used in earlier layers of the NN (like the CONV layers) because it would remove that information for all of the remaining layers. Doing it in later layers is throwing away "model information" more than "input information".

CNN Architectures

This section looks at how to combine CONV, RELU, POOL, and FC layers into a functioning convolutional neural network for a specific task. Commonly, there are a few stacks of CONV-RELU layers followed by POOL, repeated until the input image has been sufficiently downsized spatially. At some point we transition to FC layers, and the last FC layer holds the output, such as class scores:

INPUT > [[CONV > RELU] * N > POOL?] * M > [FC > RELU] * K > FC

  • usually, N <= 3, M >= 0, K < 3

INPUT > [CONV > RELU > POOL] * 2 > FC > RELU > FC

  • a single CONV layer between every POOL

INPUT > [CONV > RELU > CONV > RELU > POOL] * 3 > [FC > RELU] * 2 > FC

  • two CONV layers before each POOL, better for larger and deeper networks
  • multiple stacked CONV layers can develop more complex features before destructive POOLing
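
A minimal Keras sketch of the last pattern (the 224x224x3 input, filter counts, and FC widths are illustrative choices, not part of the pattern itself):

In [ ]:
from tensorflow.keras import layers, models

# INPUT > [CONV > RELU > CONV > RELU > POOL] * 3 > [FC > RELU] * 2 > FC
model = models.Sequential([layers.Input(shape=(224, 224, 3))])
for filters in (64, 128, 256):                        # 3 CONV-CONV-POOL blocks
    model.add(layers.Conv2D(filters, (3, 3), padding='same', activation='relu'))
    model.add(layers.Conv2D(filters, (3, 3), padding='same', activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
for units in (512, 512):                              # 2 FC-RELU blocks
    model.add(layers.Dense(units, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))     # final FC holding the class scores
model.summary()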

Practical Considerations

We prefer stacking several small CONV layers over using one large one. In case 1, let's stack three 3x3 layers on top of each other. Neurons in the first CONV layer see a 3x3 view of the input. Neurons in the second CONV layer see a 3x3 window of those, so effectively a 5x5 view of the input. At the third layer, they effectively see a 7x7 view of the input. In case 2, we use a single 7x7 filter, which has the same effective receptive field over the input. What is different?

  • case 1 contains non-linearities between the layers which makes features more expressive. It also uses fewer parameters than a 7x7 which saves on computation
  • case 2 has neurons that are computing a linear function over the input (not good) and has more parameters. However, our system would require less memory because we don't need to hold on to intermediate layer results for backpropagation
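
The parameter comparison for C input and output channels (biases ignored) is easy to verify:

In [ ]:
C = 64                        # channels in and out of each layer (illustrative)
print(3 * (3*3*C) * C)        # three stacked 3x3 CONV layers: 3 * 9 * C^2 = 110,592
print((7*7*C) * C)            # one 7x7 CONV layer:            49 * C^2    = 200,704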

State-of-the-art research shows that a linear list of layers may not be the best. More intricate connectivities between layers can outperform this basic flow.

It is usually better to just download a pretrained model that worked well on ImageNet data and then fine-tune it to fit the new data.

Layer Sizing Patterns

INPUT Layer
  • Should be divisible by 2 many times. Common numbers are 32 (CIFAR-10), 64, 96 (STL-10), 224 (ImageNet), 384, 512
CONV Layer
  • Use small filters of 3x3 or at most 5x5
  • Use a stride of S=1 (this means downsampling will fall on POOL)
  • Pad the input volume with zeros so the CONV layer does not alter its spatial dimensions
    • 3x3 CONV layer has F=3, P=1
    • 5x5 CONV layer has F=5, P=2
    • in general, P = (F-1)/2
POOL Layer
  • Use 2x2 Max Pooling with F=2, S=2 which will discard 75% of activations
  • Do not go above F=3, S=2 because it is too destructive
DROPOUT Layer
  • Use 0.5 as a starting point. It will disregard 50% of the nodes during training to help reduce overfitting and make the model more general.
  • If Validation Accuracy is higher than Training Accuracy, reduce dropout or even turn it off (set to 0.0)

Case Studies

LeNet (1990s)
  • first successful application of CNNs; used to read digits such as zip codes
AlexNet (2012)
  • 1st in ImageNet 2012. Popularized stacking multiple CONV layers directly on top of each other (previously a single CONV layer was always immediately followed by POOL)
ZF Net (2013)
  • 1st in ImageNet 2013. Similar to AlexNet, but with smaller first CONV layer and bigger internal CONV layers
GoogLeNet (2014)
  • 1st in ImageNet 2014. Introduced the Inception Module, which drastically reduced parameters from 60M (AlexNet) down to 4M. Uses AVG POOL instead of FC layers at the top
VGGNet (2014)
  • 2nd in ImageNet 2014. Showed that depth in a CNN mattered. Ended up using 16 CONV/FC layers with 3x3 convolutions and 2x2 pooling throughout. Expensive to evaluate and uses a lot of memory
ResNet (2015)
  • 1st in ImageNet 2015. Features skip connections, heavy use of batch normalization, doesn't use FC layers at the end. Currently state-of-the-art

VGG Detailed Sizing

A rough estimate of the memory required to run VGG16 can be calculated, as was done in the Stanford CS231n CNN course. At each layer, we can list the size of the activations (memory) and the number of weights. Notice that most of the memory (and compute time) is used in the first layers, while most of the parameters are in the last FC layers. Notice also that the POOL layers halve each spatial dimension (they don't affect depth) and do not introduce any new parameters.

Layer Size/Memory Weights
INPUT 224x224x3 = 150K 0
CONV3-64 224x224x64 = 3.2M (3x3x3)x64 = 1,728
CONV3-64 224x224x64 = 3.2M (3x3x64)x64 = 36,864
POOL2 112x112x64 = 800K 0
CONV3-128 112x112x128 = 1.6M (3x3x64)x128 = 73,728
CONV3-128 112x112x128 = 1.6M (3x3x128)x128 = 147,456
POOL2 56x56x128 = 400K 0
CONV3-256 56x56x256 = 800K (3x3x128)x256 = 294,912
CONV3-256 56x56x256 = 800K (3x3x256)x256 = 589,824
CONV3-256 56x56x256 = 800K (3x3x256)x256 = 589,824
POOL2 28x28x256 = 200K 0
CONV3-512 28x28x512 = 400K (3x3x256)x512 = 1,179,648
CONV3-512 28x28x512 = 400K (3x3x512)x512 = 2,359,296
CONV3-512 28x28x512 = 400K (3x3x512)x512 = 2,359,296
POOL2 14x14x512 = 100K 0
CONV3-512 14x14x512 = 100K (3x3x512)x512 = 2,359,296
CONV3-512 14x14x512 = 100K (3x3x512)x512 = 2,359,296
CONV3-512 14x14x512 = 100K (3x3x512)x512 = 2,359,296
POOL2 7x7x512 = 25K 0
FC 1x1x4096 = 4K 7x7x512x4096 = 102,760,448
FC 1x1x4096 = 4K 4096x4096 = 16,777,216
FC 1x1x1000 = 1K 4096x1000 = 4,096,000

TOTAL MEMORY ≈ (LayerSizes * 2 (fwd and bkwd activations) * images/batch + 3 * Weights (weights, gradients, update cache)) * 4 Bytes


In [2]:
#GBs required for 16 image mini-batch
#activations are held twice (fwd + bkwd) per image; weights x3 for weights, gradients, and update cache
activations, weights, batch = 15184000, 138344128, 16   # approximate sums of the Size and Weights columns above
size = (activations * 2 * batch + weights * 3) * 4 / (1024**3)
print(str(round(size,2)) + 'GB')


3.36GB

This makes sense when tested with my 6GB GTX980ti. A mini-batch size of 32 ran out of VRAM. The GPU has to run other stuff too and has a normal load of around 0.7GB.

Memory

There are 3 major sources of memory to track

  1. Intermediate volume sizes: the raw number of activations at every layer and their gradients (same amount). Needed for backpropagation, they are more concentrated towards the front of the CNN.
  2. Parameter sizes: the numbers that hold the network parameters (weights), their gradients during backprop, and usually a step cache if the optimizer uses momentum. Multiply the raw parameter count by ~3 to get the actual memory.
  3. Miscellaneous memory such as image data batches
