Convolutional Neural Network (CNN)

Consider an image recognition problem. We want a neural network with the following requirements:

  • must deal with very high-dimensional data
  • can exploit the 2D topology of pixels. Neighboring pixels are not completely random; rather, they are somehow related.
  • can build invariance to certain variations
    • translations
    • illuminations
    • noise

Convolutional neural networks leverage these ideas:

  • local connectivity
  • parameter sharing
  • pooling / subsampling hidden units

Local connectivity:

  • Use local connectivity of hidden units (each hidden unit is connected to part of the input instead of to all units)

    • Each hidden unit is connected to only a subregion (patch) of the image
    • connected to all channels: 1 for grayscale, 3 for RGB, ...
  • Reduces the number of weights (see the sketch below)
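
To get a feel for the savings, here is a tiny back-of-the-envelope sketch in Python; the image, patch, and layer sizes are made-up assumptions, not from the notes.

```python
# Number of weights feeding a layer of hidden units, fully connected vs. local.
# Assumed sizes (not from the notes): a 256x256 grayscale image, 1000 hidden
# units, and 11x11 local patches.
image_pixels = 256 * 256
hidden_units = 1000
patch_pixels = 11 * 11

fully_connected = image_pixels * hidden_units    # every unit sees every pixel
locally_connected = patch_pixels * hidden_units  # every unit sees one patch

print(fully_connected)    # 65536000
print(locally_connected)  # 121000
```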

Parameter sharing

  • Reduces the number of parameters
  • extracts the same features at every position (feature detection is "equivariant" to translation)

Feature map: $\displaystyle \mathbf{y}_j = \sigma\left(\sum_i \mathbf{w}_{ij} * \mathbf{x}_i \right)$, where $\mathbf{x}_i$ is the $i$-th input channel and $\mathbf{w}_{ij}$ the corresponding filter.
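
A minimal sketch of this formula with numpy/scipy, assuming a 3-channel input, a single output feature map, and made-up sizes:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))     # input: 3 channels, each 8x8 (made-up sizes)
w = rng.normal(size=(3, 3, 3))     # one 3x3 filter per input channel

# y_j = sigma(sum_i w_ij * x_i): convolve every channel with its filter,
# sum the results over channels, then apply the nonlinearity.
pre = sum(convolve2d(x[i], w[i], mode="valid") for i in range(3))
y = 1.0 / (1.0 + np.exp(-pre))     # sigma: logistic sigmoid
print(y.shape)                     # (6, 6)
```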

Receptive Field

  • What is the purpose? Imagine an image in which a man occupies a $128\times 64$ region. When you design a network, you want to make sure that, by the final layer, some output unit looks at the whole region where the man is in the picture. But if you find out that your receptive field is only $32\times 32$, then no output unit looks at all of the pixels associated with the man.
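
A small sketch of how one can track the receptive field through a stack of layers, using the standard recurrence $r_l = r_{l-1} + (k_l - 1)\, j_{l-1}$, $j_l = j_{l-1} s_l$; the architecture below is a made-up example:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, applied in order."""
    r, j = 1, 1                    # receptive field size and jump (cumulative stride)
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# Assumed architecture: three 3x3 convs (stride 1), each followed by 2x2 pooling (stride 2).
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(layers))     # 22: one output unit sees a 22x22 patch of the input
```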

Convolution Operation

  • Continuous convolution
$$s(t) = (f * w)(t) = \int f(x) w(t-x) dx$$
  • Discrete convolution
$$s(t) = (f*w)(t) = \sum_{a=-\infty}^{+\infty} f(a) w(t-a)$$
  • The convolution of $\mathbf{x}$ with weights $\mathbf{w}$ is computed as follows

    $\displaystyle y[i] = \sum_k x[i-k] w[k]$
    i.e., flip $\mathbf{x}$ and slide it over $\mathbf{w}$; equivalently,

    $\displaystyle y[i] = \sum_k x[k] w[i-k]$
    i.e., flip $\mathbf{w}$ and slide it over $\mathbf{x}$.

If we convolve a $2\times2$ input with a $2\times 2$ kernel (full convolution), we get a $3\times 3$ output: the size in each dimension is $m + n - 1$.

  • (related concept): the correlation of $\mathbf{x}$ with weights $\mathbf{w}$ is

    $\displaystyle y[i] = \sum_k x[i+k] w[k]$
    $\displaystyle y[i] = \sum_k x[k] w[k-i]$
    with correlation, we do not flip either vector
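
A quick 1D check of the two definitions with numpy (the signals are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0])

conv = np.convolve(x, w, mode="full")    # convolution: the kernel is flipped before sliding
corr = np.correlate(x, w, mode="full")   # correlation: no flipping

print(conv)   # [1, 2, 2, 2, -3, -4]
print(corr)   # [-1, -2, -2, -2, 3, 4]

# Correlation is just convolution with a flipped kernel:
print(np.allclose(corr, np.convolve(x, w[::-1], mode="full")))   # True
```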

Flipping the kernel vs. not flipping

  • Commutative property: convolution is commutative ($\mathbf{x} * \mathbf{w} = \mathbf{w} * \mathbf{x}$), while correlation is not

Example:

  • Convolution:
$$\displaystyle \left[\begin{array}{ccc}0&20&40\\80&20&0\\0&0&40 \end{array}\right] * \left[\begin{array}{cc}0&0.5 \\ 0.25 & 1 \end{array}\right] = ?$$

Flip $\left[\begin{array}{cc}0&0.5 \\ 0.25 & 1 \end{array}\right]$ on both axes: $\left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right]$

Then, zero-pad the input, slide the flipped kernel over it, and at each position use simple element-wise multiplications followed by summing over the products (a full convolution, so the output is $4\times 4$):

First row of the output:

$$\left[\begin{array}{cc}0&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&0\\0 & 20 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&0\\20 & 40 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 10 \qquad \left[\begin{array}{cc}0&0\\40 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20$$

Second row:

$$\left[\begin{array}{cc}0&0\\0 & 80 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&20\\80 & 20 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 45 \qquad \left[\begin{array}{cc}20&40\\20 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 40 \qquad \left[\begin{array}{cc}40&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 40$$

Third row:

$$\left[\begin{array}{cc}0&80\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20 \qquad \left[\begin{array}{cc}80&20\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 85 \qquad \left[\begin{array}{cc}20&0\\0 & 40 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20 \qquad \left[\begin{array}{cc}0&0\\40 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20$$

Fourth row:

$$\left[\begin{array}{cc}0&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&40\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 10 \qquad \left[\begin{array}{cc}40&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 40$$

Collecting the sums gives the convolution result:

$$\left[\begin{array}{cccc}0&0&10&20\\0&45&40&40\\20&85&20&20\\0&0&10&40 \end{array}\right]$$

  • Correlation:
$$\displaystyle \left[\begin{array}{ccc}0&20&40\\80&20&0\\0&0&40 \end{array}\right] \star \left[\begin{array}{cc}0&0.5 \\ 0.25 & 1 \end{array}\right] = ?$$
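
The worked example can be verified with scipy (a sketch; the "full" mode corresponds to the zero-padded sliding used above):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.array([[0, 20, 40],
              [80, 20, 0],
              [0,  0, 40]], dtype=float)
w = np.array([[0,    0.5],
              [0.25, 1.0]])

print(convolve2d(x, w, mode="full"))    # flips the kernel; matches the 4x4 result above
print(correlate2d(x, w, mode="full"))   # no flipping
```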

Pooling

  • Pool responses of hidden units in the same neighborhood
  • Pooling is performed in non-overlapping neighborhoods (sub-sampling)

Advantages:

  • invariance to local translations

Max-pooling

Example:

$$\displaystyle \left[\begin{array}{cccc}0.3&0.2&1&0.1\\3&0.8&0.8&2\\0.4&0.7&-0.8&-1\\-1.7&0.7&2.1&0.4 \end{array}\right] \xrightarrow{\text{max-pooling}} \left[\begin{array}{cc}3&2\\0.7&2.1 \end{array}\right]$$
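
A small numpy sketch of $2\times 2$ non-overlapping max-pooling on this example:

```python
import numpy as np

x = np.array([[0.3, 0.2, 1.0, 0.1],
              [3.0, 0.8, 0.8, 2.0],
              [0.4, 0.7, -0.8, -1.0],
              [-1.7, 0.7, 2.1, 0.4]])

# Reshape into non-overlapping 2x2 blocks and take the max within each block.
h, w = x.shape
pooled = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)   # [[3.  2. ]
                #  [0.7 2.1]]
```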

Example: Local invariance from pooling

$$\displaystyle \left[\begin{array}{ccccc} 0&0&255&0&0\\0&0&255&0&0\\0&0&255&0&0\\0&255&0&0&0\\255&0&0&0&0 \end{array}\right] \longrightarrow \left[\begin{array}{cc}?&?\\ ?&?\end{array}\right] \longleftarrow \left[\begin{array}{ccccc} 0&0&255&0&0\\0&0&0&0&0\\0&255&255&0&0\\255&0&0&0&0\\0&0&0&0&0\end{array}\right]$$

Putting everything together

  • A convolutional neural network alternates between convolutional and pooling layers
  • The output is a fully connected layer (see the sketch below)
  • ..
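
A minimal sketch of this pattern; PyTorch and all layer sizes here are assumptions for illustration, not from the notes:

```python
import torch
import torch.nn as nn

# conv -> pool -> conv -> pool -> fully connected output (assumed sizes).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),   # convolution layer
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling / subsampling
    nn.Conv2d(8, 16, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 10),        # fully connected output layer (for 28x28 inputs)
)

x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
print(model(x).shape)                 # torch.Size([1, 10])
```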

Convolution Layer

  • Input dimension: $(h, w, channels=10)$
  • Output dimension: $(h', w', channels=16)$

To get from the input to the output, we need a weight (filter) tensor with dimensionality $(16, f_x, f_y, 10)$, where $f_x, f_y$ is the spatial filter size, for example $5\times 5$.

$$y_i = \sum_{k=1}^{10} X_k * W_{i,.,.,k}$$
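
A numpy sketch of this layer; the 10 input and 16 output channels and $5\times 5$ filters are from the notes, the spatial size and random values are made up:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32, 32))            # input: 10 channels, each 32x32
W = rng.normal(size=(16, 5, 5, 10))          # 16 filters, each 5x5x10

# y_i = sum_k X_k * W[i, :, :, k] for i = 1..16
Y = np.stack([
    sum(convolve2d(X[k], W[i, :, :, k], mode="valid") for k in range(10))
    for i in range(16)
])
print(Y.shape)                               # (16, 28, 28): h' = h - 5 + 1
```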

Backpropagation for Convolution

  • Let $l$ be our loss function, and $\mathbf{y}_j = \sum_i \mathbf{x}_i * \mathbf{w}_{ij}$

    • Gradient of input $\mathbf{x}_i$

      $$\frac{\partial l}{\partial \mathbf{x}_i} = \sum_j \left(\frac{\partial l}{\partial \mathbf{y}_j}\right) \star \mathbf{w}_{ij}$$

      $\star$ represents correlation

    • Gradient of weights $\mathbf{w}_{ij}$

      $$\frac{\partial l}{\partial \mathbf{w}_{ij}} = \left(\frac{\partial l}{\partial \mathbf{y}_j}\right) \star \mathbf{x}_{i}$$

      (no sum over $j$ here: each $\mathbf{w}_{ij}$ contributes only to $\mathbf{y}_j$)
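
A 1D numpy sanity check of both formulas, using a full convolution in the forward pass and an arbitrary upstream gradient; all values are a made-up toy example:

```python
import numpy as np

# Forward pass y = x * w (full convolution), loss l = <g, y> with upstream gradient g.
rng = np.random.default_rng(0)
x = rng.normal(size=8)                      # input signal x_i
w = rng.normal(size=3)                      # filter w_ij
g = rng.normal(size=8 + 3 - 1)              # upstream gradient dl/dy, same length as y

def loss(x, w):
    return np.dot(g, np.convolve(x, w, mode="full"))

# Analytic gradients from the formulas above (star = correlation, no flipping):
grad_x = np.correlate(g, w, mode="valid")   # dl/dx = (dl/dy) star w
grad_w = np.correlate(g, x, mode="valid")   # dl/dw = (dl/dy) star x

def numeric_grad(f, v, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    out = np.zeros_like(v)
    for i in range(len(v)):
        vp, vm = v.copy(), v.copy()
        vp[i] += eps
        vm[i] -= eps
        out[i] = (f(vp) - f(vm)) / (2 * eps)
    return out

print(np.allclose(grad_x, numeric_grad(lambda v: loss(v, w), x)))  # True
print(np.allclose(grad_w, numeric_grad(lambda v: loss(x, v), w)))  # True
```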

Backpropagation for pooling

Gradient of max-pooling

Consider a single $2\times 2$ pooling region $\left[\begin{array}{cc}x_1&x_2\\ x_3&x_4\end{array}\right]$.

Let's assume that the maximum occurs at $x_2$ for this example, i.e. $y = \max\{x_1,x_2,x_3,x_4\} = x_2$.

Therefore, the gradient with respect to the inputs is $\left[\begin{array}{cc}0&1\\0&0\end{array}\right]$: the upstream gradient flows only to the unit that attained the maximum.

Gradient of mean-pooling

$\displaystyle \left[\begin{array}{cc}\frac{1}{4}&\frac{1}{4}\\ \frac{1}{4}&\frac{1}{4}\end{array}\right]$: each input in the region receives an equal share of the gradient.
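
A numpy sketch of how the gradient is routed through one max-pooling region (toy values):

```python
import numpy as np

x = np.array([[0.3, 0.2],
              [1.0, 0.1]])             # one 2x2 pooling region
g = 5.0                                # upstream gradient dl/dy for this region

y = x.max()                            # forward: max-pooling
mask = (x == y).astype(float)          # 1 at the arg-max, 0 elsewhere
grad_x = g * mask                      # all of the gradient goes to the max unit

print(grad_x)   # [[0. 0.]
                #  [5. 0.]]
```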

Regularizing CNNs

  • Training ensembles
  • Model averaging
  • Dropout: randomly drop some of the hidden units ($a_i$)
  • DropConnect: randomly drop some of the weights ($\mathbf{W}_{ij}$)
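
A minimal numpy sketch of the two masking schemes; the layer sizes and drop probability are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                 # drop probability (assumed)
a = rng.normal(size=4)                  # hidden activations a_i
W = rng.normal(size=(3, 4))             # weights W_ij of the next layer

# Dropout: zero out individual hidden units during training.
dropout_mask = rng.random(a.shape) > p
a_dropped = a * dropout_mask

# DropConnect: zero out individual weights instead.
dropconnect_mask = rng.random(W.shape) > p
W_dropped = W * dropconnect_mask

print(W @ a_dropped)    # forward pass with dropout
print(W_dropped @ a)    # forward pass with DropConnect
```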
