Efficient Convolution

Receptive Field

Consider image classification: cat vs. dog?

  • Ideally, we want to look at all the pixels. But this results in too many weights.
  • Practically, we break the image into patches and use the local-connectivity concept to reduce the number of weights.

The window

Intuition: one element of the output corresponds to how many elements of the original input?

Example: $x_{n\times n}\rightarrow [\text{conv}]_{m\times m} \rightarrow [\text{ReLU}]_{m\times m} \rightarrow \text{pooling layer}\ (2\times 2)$

  • Convolution
  • Non-linear activation: ReLU
  • Pooling

Then, one value in the final output above corresponds to $2m\times 2m$ elements of $x$ (the $m\times m$ convolution is applied to non-overlapping patches, i.e. with stride $m$, and each pooled value covers $2\times 2$ of its outputs). So, the receptive field is $2m\times 2m$.
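
To make this concrete, here is a minimal receptive-field calculator (an illustrative sketch, not part of the original notes). It uses the standard recurrence: a layer with kernel size $k$ grows the receptive field by $(k-1)$ times the product of all previous strides.

```python
# Minimal receptive-field calculator (illustrative sketch).
# Each layer is described by (kernel_size, stride).

def receptive_field(layers):
    rf, jump = 1, 1                      # jump = product of strides so far
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

m = 4  # example patch / filter size
# m x m convolution over non-overlapping patches (stride m), then 2x2 pooling (stride 2)
print(receptive_field([(m, m), (2, 2)]))   # -> 2m = 8
# for comparison: the same filter with stride 1 followed by 2x2 pooling
print(receptive_field([(m, 1), (2, 2)]))   # -> m + 1 = 5
```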

AlexNet:

The very first paper that showed that deep convolutional networks work well on image classification: "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012

  • 5 convolutional layers, 3 fully connected layers
  • 60 million parameters (see the check below).
  • GPU-based training (massive amount of parallelism)
    • 60 million parameters won't fit in GPU memory $\Rightarrow$ use 2 GPUs
    • split the model in half: one part of the model on the first GPU, the other on the second GPU
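
As a quick sanity check of the parameter count (a sketch, assuming `torch` and `torchvision` are installed; torchvision ships a slightly simplified single-GPU variant of AlexNet):

```python
# Count AlexNet's parameters (illustrative sketch; assumes torchvision is available).
from torchvision.models import alexnet

model = alexnet()  # build the architecture only, no pretrained weights by default
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f} million parameters")  # roughly 61 million
```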

VGG

"Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv 1409.1556

  • 19 layers deep, 3 fully connected layers
  • 144 million parameters
  • $3\times 3$ convolutional filters with stride $1$ (they showed that smaller filters work better)
  • $2\times 2$ max-pooling layers with stride $2$
  • Established that smaller filters (which are more parameter-efficient) and deeper networks are better
  • If we stack two filters of size $n$ and $m$ (with stride $1$), the receptive field will be $(n+m-1)\times (n+m-1)$

    Consider the two cases below (see the code sketch after the take-home message):

    • Two stacked filters of size $3\times3$ and $3\times 3$ have a $5\times5$ receptive field: $[\text{filter}]_{3\times 3} \longrightarrow [\text{ReLU}] \longrightarrow [\text{filter}]_{3\times3}$
    • Using a single filter of size $5\times5$ gives the same receptive field
    • But the first option has more non-linearity (a ReLU between the two filters)

      • the more non-linearity we have, the more expressive the network

Take-home message: do not use filters larger than size $3$
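
A minimal sketch of the comparison above (assumes PyTorch; the channel count $C=16$ is just an illustrative choice): two stacked $3\times3$ convolutions reduce a $5\times5$ input to a single output position, exactly like one $5\times5$ filter, but with fewer parameters and an extra ReLU in between.

```python
# Two stacked 3x3 convolutions vs. one 5x5 convolution (illustrative sketch).
import torch
import torch.nn as nn

C = 16  # illustrative channel count

two_3x3 = nn.Sequential(                       # conv -> ReLU -> conv, VGG style
    nn.Conv2d(C, C, kernel_size=3, bias=False),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, bias=False),
)
one_5x5 = nn.Conv2d(C, C, kernel_size=5, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(two_3x3), count(one_5x5))          # 2*9*C^2 = 4608  vs  25*C^2 = 6400

# Same receptive field: a 5x5 input is collapsed to a single spatial position by both.
x = torch.randn(1, C, 5, 5)
print(two_3x3(x).shape, one_5x5(x).shape)      # both: torch.Size([1, 16, 1, 1])
```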

GoogLeNet

"Going Deeper with Convolutions", CVPR 2015

  • Multiple networks together
  • 22 layers, introduced the "inception" module
  • Efficient architecture

Inception module

  • $1\times 1$ convolution is a scalar multiplication: $[w]_{1\times1}=\alpha$ $$y[i] = \sum_k x[i-k]\, w[k] = \alpha\, x[i]\ \ \ \ \ \Longrightarrow \text{scalar multiplication}$$

Input image: $C\times H\times W$

$N$ filters of $1\times1$ convolution: weights of shape $N\times C\times 1\times1$

Output: $N\times H\times W$ (the spatial size is unchanged: with a $1\times 1$ filter, $H + 1 - 1 = H$ and $W + 1 - 1 = W$)
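
A quick shape check (a sketch, assuming PyTorch; the values $C=192$, $N=64$, $H=W=28$ are illustrative, not from the lecture):

```python
# Shape check for a 1x1 convolution (illustrative sketch).
import torch
import torch.nn as nn

C, N, H, W = 192, 64, 28, 28

x = torch.randn(1, C, H, W)                # input: C x H x W (plus a batch dimension)
conv_1x1 = nn.Conv2d(C, N, kernel_size=1)  # N filters, each of shape C x 1 x 1

print(conv_1x1.weight.shape)               # torch.Size([64, 192, 1, 1]) = N x C x 1 x 1
print(conv_1x1(x).shape)                   # torch.Size([1, 64, 28, 28]) = N x H x W
# Spatial size is preserved while the channel count changes from C to N,
# which is how the Inception module uses 1x1 convolutions to stay efficient.
```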

