Efficient Convolution

Receptive Field

Consider image classification: cat vs. dog?

Ideally, we want to look at all the pixels at once, but this requires too many weights.
In practice, we break the image into patches and use local connectivity to reduce the number of weights.
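To make the saving concrete, the sketch below counts the weights needed for a single output unit in both cases. The image size ($224\times224$ with 3 channels) and the patch size ($3\times3$) are illustrative assumptions, not values from these notes, and the function names are hypothetical.

```python
def dense_weights(height, width, channels):
    """Weights for ONE output unit connected to every pixel of the image."""
    return height * width * channels


def local_weights(patch, channels):
    """Weights for ONE output unit connected to a patch x patch window only."""
    return patch * patch * channels


# Assumed sizes for illustration: a 224x224 RGB image and a 3x3 patch.
full = dense_weights(224, 224, 3)
local = local_weights(3, 3)
print(full, local)  # 150528 27
```

Because the same patch weights are reused at every location, the count for the local case does not grow with the image size.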

The window

Intuition: how many elements of the original input does one element of the output correspond to?

Example: $x_{n\times n}\rightarrow [\text{filter}]_{m\times m} \rightarrow \text{ReLU}_{m\times m} \rightarrow \text{pooling layer}(2\times 2)$

Convolution
Non-linear activation: ReLU
Pooling

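Then one value in the final output corresponds to $2m\times 2m$ elements of $x$, assuming the $m\times m$ patches do not overlap (filter stride $m$). So the receptive field is $2m\times 2m$. With a stride-1 filter the patches overlap and the receptive field is $(m+1)\times(m+1)$.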
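The receptive field can be computed layer by layer. The sketch below uses the standard recurrence in which each layer widens the field by (kernel size minus 1) times the product of the strides of the layers before it. The function name, the `(kernel, stride)` layer description, and the value $m=5$ are illustrative assumptions.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a single output value.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # a single element sees only itself
    jump = 1  # spacing, in input pixels, between neighbouring elements of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


m = 5  # assumed filter size, for illustration only

# m x m filter on non-overlapping patches (stride m), then 2x2 pooling with stride 2.
print(receptive_field([(m, m), (2, 2)]))  # 10, i.e. 2m

# Same pipeline with a stride-1 filter: neighbouring patches overlap.
print(receptive_field([(m, 1), (2, 2)]))  # 6, i.e. m + 1
```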

AlexNet:

The very first paper to show that deep convolutional networks work well for image classification: "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012

5 convolutional layers, 3 fully connected
60 million parameters.
GPU-based training (massive amount of parallelism)
- 60 million parameters won't fit in GPU memory $\Rightarrow$ use 2 GPUs
- split the model in half: one part of the model is on the first GPU, the other on the second GPU

VGG

"Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556

19 layers deep: 16 convolutional layers and 3 fully connected layers
144 million parameters
$3\times 3$ convolutional filters with stride $1$ (they showed that smaller filters work better)
$2\times 2$ max-pooling layers with stride $2$
Established that smaller filters (which are parameter efficient) and deeper networks work better
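The parameter count can be checked with a short calculation. The sketch below assumes the 19-layer configuration from the VGG paper (configuration E) with a $224\times224$ RGB input, padding that preserves the spatial size in every convolution, and 1000 output classes. The layer list comes from the paper rather than from these notes, and the function name is hypothetical.

```python
# Assumed: VGG configuration E, 224x224x3 input, 1000 classes.
# Integers are output channels of 3x3 convolutions; "M" is a 2x2 max-pool with stride 2.
VGG19_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"]
VGG19_FC = [4096, 4096, 1000]


def count_vgg_parameters(conv_cfg, fc_cfg, in_channels=3, in_size=224):
    """Count weights and biases of the convolutional and fully connected layers."""
    total = 0
    channels, size = in_channels, in_size
    for item in conv_cfg:
        if item == "M":
            size //= 2  # pooling halves the spatial size and has no weights
        else:
            total += 3 * 3 * channels * item + item  # 3x3 weights plus biases
            channels = item
    features = channels * size * size  # flattened input to the first FC layer
    for out_features in fc_cfg:
        total += features * out_features + out_features
        features = out_features
    return total


total = count_vgg_parameters(VGG19_CONV, VGG19_FC)
print(total)  # 143667240, about 144 million
```

Most of these parameters sit in the first fully connected layer, not in the $3\times3$ filters.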

If we stack two stride-1 filters of size $n$ and $m$, the receptive field will be $(n+m-1)\times (n+m-1)$

Consider two cases below:
- Two stacked filters of size $3\times3$ and $3\times 3$ have a $5\times5$ receptive field: $[\text{filter}]_{3\times 3} \longrightarrow [\text{ReLU}] \longrightarrow [\text{filter}]_{3\times3}$
- A single filter of size $5\times5$ has the same receptive field
- But the first option has more non-linearity, because of the ReLU between the two filters
  - the more non-linearity we have, the more
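The two cases can be compared numerically. The sketch below checks the $(n+m-1)$ rule for stride-1 filters and counts weights for a single input and output channel; the function names are illustrative, and biases are ignored.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 filters applied one after another."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def weights_per_channel(kernel_sizes):
    """Weight count, assuming one input and one output channel per filter."""
    return sum(k * k for k in kernel_sizes)


two_small = [3, 3]  # two 3x3 filters with a ReLU in between
one_large = [5]     # a single 5x5 filter

print(stacked_receptive_field(two_small), weights_per_channel(two_small))  # 5 18
print(stacked_receptive_field(one_large), weights_per_channel(one_large))  # 5 25
```

Both options see a $5\times5$ window, but the stacked $3\times3$ filters use 18 weights instead of 25.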

Take-home message: do not use filters larger than $3\times 3$

GoogLeNet

"Going Deeper with Convolutions", CVPR 2015

Multiple networks together
22 layers, introduced the "inception" module
Efficient architecture

Inception module

A $1\times 1$ convolution is a scalar multiplication. With $[w]_{1\times1}=\alpha$: $$y[i] = \sum_k x[i-k]\, w[k] = \alpha\, x[i]\ \ \ \ \ \Longrightarrow \text{scalar multiplication}$$
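A quick check in one dimension: the sketch below implements the convolution sum directly and applies it with a one-element filter. The function name and the sample values are illustrative assumptions.

```python
def convolve_1d(x, w):
    """Full discrete convolution: y[i] = sum_k x[i - k] * w[k]."""
    n, m = len(x), len(w)
    y = [0.0] * (n + m - 1)
    for i in range(len(y)):
        for k in range(m):
            if 0 <= i - k < n:  # skip terms that fall outside the input
                y[i] += x[i - k] * w[k]
    return y


alpha = 0.5
x = [2.0, 4.0, 6.0]
print(convolve_1d(x, [alpha]))  # [1.0, 2.0, 3.0]
```

With a filter of length 1 the sum has a single term, so every output is just `alpha * x[i]` and the output has the same length as the input.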

Input image: $C\times H\times W$

$N$ filters of size $1\times1$: $N\times C\times 1\times1$

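Output: $N\times H\times W$ (the spatial size is unchanged: an input of length $n$ and a filter of length $m$ give an output of length $n+m-1$, which equals $n$ for $m=1$)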
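The shapes above can be verified with a small pure-Python sketch. Each of the $N$ filters takes a weighted sum over the $C$ channels at every pixel, so only the channel dimension changes. The function name and the toy sizes ($C=2$, $H=2$, $W=3$, $N=3$) are assumptions for illustration, and biases are omitted.

```python
def conv_1x1(image, weights):
    """Apply N filters of size 1x1 to an image.

    image:   nested list of shape C x H x W
    weights: nested list of shape N x C (the trailing 1 x 1 dimensions are dropped)
    returns: nested list of shape N x H x W
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    return [
        [
            [sum(w[c] * image[c][i][j] for c in range(channels)) for j in range(width)]
            for i in range(height)
        ]
        for w in weights
    ]


# Toy input: C = 2 channels, H = 2, W = 3.
image = [[[1, 2, 3], [4, 5, 6]],
         [[10, 20, 30], [40, 50, 60]]]
# N = 3 filters: copy channel 0, copy channel 1, add both channels.
weights = [[1, 0], [0, 1], [1, 1]]

out = conv_1x1(image, weights)
print(len(out), len(out[0]), len(out[0][0]))  # 3 2 3
print(out[2])  # [[11, 22, 33], [44, 55, 66]]
```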


