Consider image classification: cat vs. dog?
Receptive field: the window of the input that a single output element depends on.
Intuition: how many elements of the original input does one element in the patch correspond to?
Example: $x_{n\times n}\rightarrow []_{m\times m} \rightarrow \text{ReLU}_{m\times m} \rightarrow \text{pooling layer}(2\times 2)$
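Then, one value in the final output above corresponds to $2m\times 2m$ elements in $x$ when the $m\times m$ windows do not overlap (stride $m$), so the receptive field is $2m\times 2m$. With a stride-1 filter the windows overlap and the receptive field is $(m+1)\times(m+1)$.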
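A minimal sketch of this calculation, assuming $[]_{m\times m}$ denotes an $m\times m$ filter and that the pooling layer uses stride 2. The helper `receptive_field` and its `(kernel_size, stride)` layer description are hypothetical names introduced for illustration. It suggests that the $2m\times 2m$ figure corresponds to non-overlapping filter windows (stride $m$), while a stride-1 filter gives $m+1$ per side.

```python
def receptive_field(layers):
    """Side length of the input window seen by one output element.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    Element-wise layers such as ReLU do not change the result and are omitted.
    """
    rf = 1    # receptive field so far, in input pixels
    jump = 1  # spacing, in input pixels, between neighbouring elements of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


m = 3  # illustrative filter size (an assumption, not from the notes)
overlapping = receptive_field([(m, 1), (2, 2)])      # stride-1 filter, then 2x2 pooling
non_overlapping = receptive_field([(m, m), (2, 2)])  # stride-m filter, then 2x2 pooling
print(overlapping, non_overlapping)
```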
The very first paper that showed deep convolutional networks work well in image classification: "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012
"Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556
If we stack two filters of size $n$ and $m$ (both with stride 1), the receptive field will be $(n+m-1)\times (n+m-1)$
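The $n+m-1$ rule can be checked numerically. The sketch below works in 1D for brevity (the same argument applies along each image axis) and assumes purely linear filters with no activation in between. The sizes and the random data are illustrative assumptions. Applying a length-$n$ filter and then a length-$m$ filter matches applying one combined filter with $n+m-1$ taps.

```python
import numpy as np

n, m = 3, 4  # illustrative filter lengths
rng = np.random.default_rng(0)
f = rng.standard_normal(n)   # first filter
g = rng.standard_normal(m)   # second filter
x = rng.standard_normal(20)  # 1D input signal

# Apply the two filters one after the other, keeping only fully overlapping positions.
two_step = np.convolve(np.convolve(x, f, mode="valid"), g, mode="valid")

# Merge the two filters into a single one, then apply it once.
combined = np.convolve(f, g)  # default "full" mode: length n + m - 1
one_step = np.convolve(x, combined, mode="valid")

print(combined.shape, np.allclose(two_step, one_step))
```

Without a non-linearity the stacked pair collapses into a single larger filter, which is why the activation between the layers matters.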
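Consider the two cases below: two stacked $3\times3$ filters, or a single $5\times5$ filter.
Using a single filter of size $5\times5$ gives the same receptive field as the stacked pair, since $3+3-1=5$.
But the first option has more non-linearity, because an activation is applied between the two filters.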
Take-home message: do not use filters larger than $3\times3$; stack small filters instead.
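As a rough illustration of the trade-off, the sketch below counts weights for both options. It assumes every layer has the same number of input and output channels `C` and ignores biases. The helper `conv_weights` and the value `C = 64` are hypothetical choices made for this example.

```python
def conv_weights(kernel, channels_in, channels_out):
    """Number of weights in one square convolution layer, biases ignored."""
    return kernel * kernel * channels_in * channels_out


C = 64  # illustrative channel count (an assumption)
two_3x3 = 2 * conv_weights(3, C, C)  # two stacked 3x3 layers: 18 * C^2 weights
one_5x5 = conv_weights(5, C, C)      # one 5x5 layer: 25 * C^2 weights
print(two_3x3, one_5x5)
```

Under these assumptions the stacked $3\times3$ layers cover the same $5\times5$ receptive field with fewer weights.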
Input image: $C\times H\times W$
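$N$ filters of $1\times1$ convolution: weights of shape $N\times C\times 1\times 1$
Output: $N\times H\times W$ (the spatial size is unchanged because a $1\times1$ filter covers a single pixel; in the $n+m-1$ rule, $m=1$ adds nothing)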
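A small NumPy sketch of the shapes involved, with illustrative sizes and random values as assumptions. A $1\times1$ convolution amounts to applying the same $N\times C$ matrix to the channel vector at every pixel, so only the channel dimension changes.

```python
import numpy as np

C, H, W, N = 3, 4, 5, 8  # illustrative sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))     # input image
w = rng.standard_normal((N, C, 1, 1))  # N filters of size 1x1 over C channels

# Drop the two singleton kernel axes and mix channels independently at each pixel.
y = np.einsum("nc,chw->nhw", w[:, :, 0, 0], x)
print(y.shape)
```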