This notebook summarizes convolutional neural networks (CNNs) as presented in the following fantastic resources:
A CNN is similar to a regular feed-forward NN, except that it assumes the inputs are images, which allows us to encode certain assumptions into the architecture of the network.
Regular NNs have an input layer, one or more hidden layers, and an output layer. These layers are dense and fully-connected, which leads to many connections, many weights, and many calculations. They do well with small images (32 x 32 x 3 = 3,072 weights per fully-connected neuron) but do not scale well to full-sized images (200 x 200 x 3 = 120,000 weights per neuron). Unlike a regular NN, a CNN arranges its neurons in 3 dimensions: width, height, and depth. Each neuron is only connected to a small region of the previous layer rather than being fully-connected, which saves on computation. The final output layer is fully-connected and reduces the volume to a small vector of class scores arranged along the depth dimension, for example a 1x1x10 vector for a 10-category classification problem.
Every CNN layer transforms a 3D input volume into a 3D output volume through some differentiable function that may or may not have parameters.
The 3 most popular layers for CNNs are: Convolutional, Pooling, and Fully-Connected
Example architecture: [INPUT > CONV > RELU > POOL > FC]
CONV/FC layers have parameters and are trained with gradient descent to minimize overall error
RELU/POOL layers don't and implement a fixed function
CONV/FC/POOL have additional hyperparameters; RELU does not
Resource-wise, the CONV layer is the core computational block of a CNN and does most of the computational heavy lifting. It consists of a set of learnable filters that are small spatially (height x width) but extend through the full depth of the input volume. For example, a first-layer ConvNet filter for an RGB image could be of size [5x5x3], i.e., 5x5 pixels by 3 color channels. The convolution process:
The receptive field is the filter size, and it determines how many neurons in the previous layer we connect to. Spatially (height x width) this will be smaller than the input volume, but it ALWAYS covers the full depth of the input volume. If our input image had 12 color bands, then each neuron in the CONV layer would have weights for a 5x5x12 region of the input volume.
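Below is a minimal numpy sketch of how a single filter is convolved with an input volume; the sizes and arrays are purely illustrative.

In [ ]:
import numpy as np

# Minimal sketch: convolve one [FxFxD] filter with an [HxWxD] input volume.
# A real CONV layer would have K such filters, producing K activation maps.
H, W, D = 32, 32, 3          # input volume (height x width x depth)
F, S, P = 5, 1, 2            # filter size, stride, zero-padding

x = np.random.randn(H, W, D)             # input volume
w = np.random.randn(F, F, D)             # one learnable filter (full depth)
b = np.random.randn()                    # one bias per filter

x_pad = np.pad(x, ((P, P), (P, P), (0, 0)))   # zero-pad height/width only
H_out = (H - F + 2 * P) // S + 1
W_out = (W - F + 2 * P) // S + 1

out = np.zeros((H_out, W_out))           # one depth slice (activation map)
for i in range(H_out):
    for j in range(W_out):
        patch = x_pad[i * S:i * S + F, j * S:j * S + F, :]  # receptive field
        out[i, j] = np.sum(patch * w) + b                   # dot product + bias

print(out.shape)   # (32, 32) because P = (F-1)/2 keeps the spatial size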
These 4 hyperparameters control the size of the output volume: the number of filters K (the output depth), the filter size (receptive field) F, the stride S, and the amount of zero-padding P.
Output volume spatial size = (W - F + 2P)/S + 1
Ensure the input and output volumes have the same spatial size by setting the padding to P = (F-1)/2 when the stride is S = 1.
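A quick check of the formula in code (the 224/3/1/1 case matches the 3x3 CONV layers in the VGG16 table later; the 227/11/4/0 case is the example worked below):

In [ ]:
def conv_output_size(W, F, S, P):
    # Output volume spatial size = (W - F + 2P)/S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=224, F=3, S=1, P=1))   # 224: P = (F-1)/2 with S=1 preserves size
print(conv_output_size(W=227, F=11, S=4, P=0))  # 55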
Parameter sharing is used in CONV layers to control the number of parameters that must be kept in memory when doing computations. We assume that a feature found useful at one spatial position will also be useful at another, so instead of each neuron within a depth slice (filter layer) keeping separate weights and a separate bias, we set them all to be the same. Now every neuron in a depth slice shares the same weights, so we only have one unique set of weights per depth slice. As an example, take an input image of size [227x227x3] with F=11, S=4, P=0, and K=96 (worked through in the sketch below).
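Working the arithmetic for that example:

In [ ]:
# Example from above: input [227x227x3], F=11, S=4, P=0, K=96
W_in, D_in = 227, 3
F, S, P, K = 11, 4, 0, 96

W_out = (W_in - F + 2 * P) // S + 1      # 55, so the output volume is 55x55x96
neurons = W_out * W_out * K              # 55*55*96 = 290,400 neurons
weights_per_neuron = F * F * D_in + 1    # 11*11*3 weights + 1 bias = 364

print(neurons * weights_per_neuron)      # without sharing: 105,705,600 parameters
print(K * weights_per_neuron)            # with sharing: 96 x 364 = 34,944 parameters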
With all neurons in a depth slice using the same weights, then the forward pass of each depth slice within the CONV layer is computed as a convolution. Then, we refer to the sets of weights as a filter, or kernel, that is convolved with the input.
Parameter sharing makes sense when the features detected are simple, like lines and gradients, but when spatial location matters (like looking for eyes in a portrait) parameter sharing can be relaxed, and we then call the layer a locally-connected layer.
To progressively reduce the spatial size and reduce the number of parameters in a ConvNet, Pooling layers are inserted between CONV layers. This reduces the amount of computation required and helps control overfitting. The POOL layer operates independently on each depth slice of the input and resizes it spatially using the MAX operation.
These 2 hyperparameters determine how much we are going to reduce the ConvNet: the spatial extent (window size) F and the stride S.
The POOL layer accepts an input of [W x H x D] and produces an output of [(W - F)/S + 1 x (H - F)/S + 1 x D]; the depth is unchanged.
This does not introduce any parameters, and it shrinks the activations that later layers must process; we usually do not pad the POOL layer. In practice, only 2 types of MAX POOL layers are used: F=2 with S=2 (the most common), and F=3 with S=2 (overlapping pooling).
For example, a 2x2 window sliding across the input with stride 2 (no overlap) outputs the max value it sees, reducing the number of activations by 75%. Other pooling functions can be used besides MAX, including average pooling and L2-norm pooling.
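A minimal sketch of 2x2 MAX pooling with stride 2 on one depth slice (the array contents are arbitrary):

In [ ]:
import numpy as np

# 2x2 MAX pooling with stride 2, applied independently to one depth slice
x = np.arange(16, dtype=float).reshape(4, 4)   # one 4x4 depth slice
F, S = 2, 2
n_out = (x.shape[0] - F) // S + 1              # (4 - 2)/2 + 1 = 2

out = np.zeros((n_out, n_out))
for i in range(n_out):
    for j in range(n_out):
        out[i, j] = x[i * S:i * S + F, j * S:j * S + F].max()

print(out)   # 2x2 output; 75% of the activations are discarded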
Each neuron in a fully-connected layer has full connections to every activation in the previous layer. A matrix of weights is applied to the inputs to produce an output. Because we have labeled data, the NN learns (during the training phase) what those weights should be to transform the input into the desired output.
Resource-wise, this is where most of the memory is taken up, holding all of the parameters connecting the fully-connected layers and their nodes.
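The FC forward pass is just a matrix multiply plus a bias; a minimal numpy sketch with illustrative sizes:

In [ ]:
import numpy as np

# Fully-connected forward pass: every output neuron sees every input activation
n_in, n_out = 256, 10                    # illustrative sizes (e.g. 10 class scores)
x = np.random.randn(n_in)                # flattened input activations
W = np.random.randn(n_out, n_in) * 0.01  # weight matrix learned during training
b = np.zeros(n_out)                      # biases

scores = W @ x + b                       # one weight per (input, output) pair
print(scores.shape)                      # (10,)

# For VGG16's first FC layer this matrix alone holds 7*7*512 * 4096 = 102,760,448 weights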
Dropout is used during training to disregard a random subset of nodes. The activations of those nodes are set to zero, so they do not contribute to the forward propagation, and their weights are not updated during backprop (e.g., SGD). This has the effect of reducing overfitting because the network cannot lean too heavily on any particular piece of what has been learned (the matrix weights); part of it is dampened, or turned off, on each pass. This actually gives us worse performance on our training set than what we just spent all our time learning, so why do it?
DROPOUT is used to prevent overfitting, thereby increasing our validation set accuracy. This generalizes our NN making it perform slightly worse on the training data, but much better on the validation or test sets where it really matters.
DROPOUT is not active during the evaluation or prediction phase because we want all of our NN's power devoted to giving the correct output classification. As an example, if we drop 50% of our activations during training but then use all of them at prediction time, there will be a higher contribution to the output than the network saw while learning! Keras handles this internally (it uses inverted dropout, scaling the retained activations up by 1/(1 - rate) during training), so we don't have to do anything about it.
DROPOUT layers are not used in the earlier layers of the NN (like the CONV layers) because dropping information there removes it for all of the remaining layers. Applying dropout in the later layers throws away "model information" rather than "input information".
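A sketch of how this typically looks in Keras (the layer sizes and 10-class output are arbitrary choices); Keras' Dropout layer is only active during training, so nothing has to be changed for prediction:

In [ ]:
from tensorflow.keras import layers, models

# Dropout sits in the later, fully-connected part of the network, not in the CONV layers
head = models.Sequential([
    layers.Flatten(input_shape=(7, 7, 128)),   # e.g. the last POOL volume, flattened
    layers.Dense(512, activation='relu'),      # FC + RELU
    layers.Dropout(0.5),                       # zero 50% of activations (training only)
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),    # class scores
])
head.summary()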
This section looks at how to best combine various CONV, RELU, POOL, and FC layers into a functioning convolutional neural network to accomplish a specific task. Commonly, a few stacks of CONV-RELU layers are followed by POOL, and this is repeated until the input image has been sufficiently downsized spatially. At some point the network transitions to FC layers, with the last FC layer holding the output, such as class scores (a Keras sketch of the second pattern follows the list):
INPUT > [[CONV > RELU] * N > POOL?] * M > [FC > RELU] * K > FC
INPUT > [CONV > RELU > POOL] * 2 > FC > RELU > FC
INPUT > [CONV > RELU > CONV > RELU > POOL] * 3 > [FC > RELU] * 2 > FC
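As a concrete sketch, the second pattern could look like this in Keras (the input size, filter counts, and class count are arbitrary choices):

In [ ]:
from tensorflow.keras import layers, models

# INPUT -> [CONV -> RELU -> POOL] * 2 -> FC -> RELU -> FC
model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                  input_shape=(32, 32, 3)),          # CONV + RELU
    layers.MaxPooling2D((2, 2)),                     # POOL
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),            # FC + RELU
    layers.Dense(10),                                # final FC: class scores
])
model.summary()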
We prefer to stack several smaller CONV layers rather than use one large one. In case 1, we stack three 3x3 CONV layers on top of each other (with non-linearities in between). Neurons in the first CONV layer see a 3x3 view of the input. Neurons in the second CONV layer see a 3x3 view of those, so effectively a 5x5 view of the input. At the third layer, they effectively see a 7x7 view of the input. In case 2, we use a single 7x7 filter, which has the same effective receptive field on the input. What is different? The stacked version contains non-linearities between the layers, which makes its features more expressive, and it also requires fewer parameters, as the quick check below shows.
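Assuming all volumes have C channels (biases ignored):

In [ ]:
# Weights needed to cover a 7x7 effective receptive field, for C-channel volumes
C = 64
stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 CONV layers: 27*C^2 = 110,592
single_7x7 = 7 * 7 * C * C          # one 7x7 CONV layer:    49*C^2 = 200,704
print(stacked_3x3, single_7x7)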
State-of-the-art research shows that a linear list of layers may not be the best. More intricate connectivities between layers can outperform this basic flow.
It is usually better to just download a pretrained model that worked well on ImageNet data and then fine-tune it to fit the new data.
A rough estimate of the memory required to run VGG16 can be made, as was done in the Stanford CS231n CNN course. At each layer, we can find the memory required for the activations and the number of weights. Notice that most of the memory (and compute time) is used in the early layers, while most of the parameters are in the last FC layers. Notice also that the POOL layers reduce the spatial dimensions by 50% (they do not affect depth) and introduce no new parameters.
Layer | Size/Memory | Weights |
---|---|---|
INPUT | 224x224x3 = 150K | 0 |
CONV3-64 | 224x224x64 = 3.2M | (3x3x3)x64 = 1,728 |
CONV3-64 | 224x224x64 = 3.2M | (3x3x64)x64 = 36,864 |
POOL2 | 112x112x64 = 800K | 0 |
CONV3-128 | 112x112x128 = 1.6M | (3x3x64)x128 = 73,728 |
CONV3-128 | 112x112x128 = 1.6M | (3x3x128)x128 = 147,456 |
POOL2 | 56x56x128 = 400K | 0 |
CONV3-256 | 56x56x256 = 800K | (3x3x128)x256 = 294,912 |
CONV3-256 | 56x56x256 = 800K | (3x3x256)x256 = 589,824 |
CONV3-256 | 56x56x256 = 800K | (3x3x256)x256 = 589,824 |
POOL2 | 28x28x256 = 200K | 0 |
CONV3-512 | 28x28x512 = 400K | (3x3x256)x512 = 1,179,648 |
CONV3-512 | 28x28x512 = 400K | (3x3x512)x512 = 2,359,296 |
CONV3-512 | 28x28x512 = 400K | (3x3x512)x512 = 2,359,296 |
POOL2 | 14x14x512 = 100K | 0 |
CONV3-512 | 14x14x512 = 100K | (3x3x512)x512 = 2,359,296 |
CONV3-512 | 14x14x512 = 100K | (3x3x512)x512 = 2,359,296 |
CONV3-512 | 14x14x512 = 100K | (3x3x512)x512 = 2,359,296 |
POOL2 | 7x7x512 = 25K | 0 |
FC | 1x1x4096 = 4K | 7x7x512x4096 = 102,760,448 |
FC | 1x1x4096 = 4K | 4096x4096 = 16,777,216 |
FC | 1x1x1000 = 1K | 4096x1000 = 4,096,000 |
TOTAL MEMORY ~= (ActivationSizes * 2 (fwd and bkwd passes) * images/batch + 3 * Weights (params, gradients, optimizer cache)) * 4 Bytes
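Summing the table as a quick sanity check (the activation total uses the rounded K/M sizes, the weight total the exact counts):

In [ ]:
# Activation sizes per image, in thousands (the Size/Memory column above)
sizes_k = [150, 3200, 3200, 800, 1600, 1600, 400, 800, 800, 800, 200,
           400, 400, 400, 100, 100, 100, 100, 25, 4, 4, 1]
activations = sum(sizes_k) * 1000   # ~15.18M activations per image

# Weight counts (the Weights column above)
weights = sum([1728, 36864, 73728, 147456, 294912, 589824, 589824,
               1179648, 2359296, 2359296, 2359296, 2359296, 2359296,
               102760448, 16777216, 4096000])   # ~138M parameters

print(activations, weights)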
In [2]:
#GBs required for a 16 image mini-batch: activations (x2 for fwd and bkwd)
#scale with the batch size; weights (x3 for params, gradients, optimizer cache) do not
activations, weights, batch = 15184000, 138344128, 16
size = (activations * 2 * batch + 3 * weights) * 4 / (1024**3)
print(str(round(size, 2)) + 'GB')
This is consistent with testing on my 6GB GTX 980 Ti: a mini-batch size of 32 ran out of VRAM. The GPU also has to run other things and carries a normal load of around 0.7GB.
There are 3 major sources of memory to track: the intermediate volume sizes (activations and their gradients), the parameter sizes (the weights, their gradients, and any optimizer cache), and miscellaneous memory such as the image data batches themselves.