Convolutional Neural Network (CNN)

Consider an image recognition problem. We want a neural network with the following requirements:

  • must deal with very high-dimensional data
  • can exploit the 2D topology of pixels. Neighboring pixels are not completely random; rather, they are somehow related.
  • can build invariance to certain variations
    • translations
    • illuminations
    • noise

Convolutional neural networks leverage these ideas:

  • local connectivity
  • parameter sharing
  • pooling / subsampling hidden units

Local connectivity:

  • Use local connectivity of hidden units (each hidden unit is connected to part of the input instead of to all units)

    • Each hidden unit is connected to only a subregion (patch) of the image
    • connected to all channels: 1 for grayscale, 3 for RGB, ...
  • Reduces the number of weights (see the sketch below)
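
To get a feel for the savings, here is a tiny back-of-the-envelope sketch in Python; the image, patch, and layer sizes are made-up assumptions, not from the notes.

```python
# Number of weights feeding a layer of hidden units, fully connected vs. local.
# Assumed sizes (not from the notes): a 256x256 grayscale image, 1000 hidden
# units, and 11x11 local patches.
image_pixels = 256 * 256
hidden_units = 1000
patch_pixels = 11 * 11

fully_connected = image_pixels * hidden_units    # every unit sees every pixel
locally_connected = patch_pixels * hidden_units  # every unit sees one patch

print(fully_connected)    # 65536000
print(locally_connected)  # 121000
```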

Parameter sharing

  • Reduces the number of parameters
  • extracts the same features at every position (feature detection is "equivariant" to translation)

Feature map: $\displaystyle \mathbf{y}_j = \sigma\left(\sum_i \mathbf{w}_{ij} * \mathbf{x}_i \right)$, where $\mathbf{x}_i$ is the $i$-th input channel and $\mathbf{w}_{ij}$ the corresponding filter.
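
A minimal sketch of this formula with numpy/scipy, assuming a 3-channel input, a single output feature map, and made-up sizes:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))     # input: 3 channels, each 8x8 (made-up sizes)
w = rng.normal(size=(3, 3, 3))     # one 3x3 filter per input channel

# y_j = sigma(sum_i w_ij * x_i): convolve every channel with its filter,
# sum the results over channels, then apply the nonlinearity.
pre = sum(convolve2d(x[i], w[i], mode="valid") for i in range(3))
y = 1.0 / (1.0 + np.exp(-pre))     # sigma: logistic sigmoid
print(y.shape)                     # (6, 6)
```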

Receptive Field

  • What is the purpose? Imagine an image in which a man occupies a $128\times 64$ region. When you design a network, you want to make sure that, by the final layer, some output unit looks at the whole region where the man is in the picture. But if you find out that your receptive field is only $32\times 32$, then no output unit looks at all of the pixels associated with the man.
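
A small sketch of how one can track the receptive field through a stack of layers, using the standard recurrence $r_l = r_{l-1} + (k_l - 1)\, j_{l-1}$, $j_l = j_{l-1} s_l$; the architecture below is a made-up example:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, applied in order."""
    r, j = 1, 1                    # receptive field size and jump (cumulative stride)
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# Assumed architecture: three 3x3 convs (stride 1), each followed by 2x2 pooling (stride 2).
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(layers))     # 22: one output unit sees a 22x22 patch of the input
```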

Convolution Operation

  • Continuous convolution
$$s(t) = (f * w)(t) = \int f(x) w(t-x) dx$$
  • Discrete convolution
$$s(t) = (f*w)(t) = \sum_{a=-\infty}^{+\infty} f(a) w(t-a)$$
  • The convolution of $\mathbf{x}$ with weights $\mathbf{w}$ is computed as follows

    $\displaystyle y[i] = \sum_k x[i-k] w[k]$
    i.e., flip $\mathbf{x}$ and slide it over $\mathbf{w}$; equivalently,

    $\displaystyle y[i] = \sum_k x[k] w[i-k]$
    i.e., flip $\mathbf{w}$ and slide it over $\mathbf{x}$.

If we convolve a $2\times2$ input with a $2\times 2$ kernel (full convolution), we get a $3\times 3$ output: the size in each dimension is $m + n - 1$.

  • (related concept): the correlation of $\mathbf{x}$ with weights $\mathbf{w}$ is

    $\displaystyle y[i] = \sum_k x[i+k] w[k]$
    $\displaystyle y[i] = \sum_k x[k] w[k-i]$
    with correlation, we do not flip either vector
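
A quick 1D check of the two definitions with numpy (the signals are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0])

conv = np.convolve(x, w, mode="full")    # convolution: the kernel is flipped before sliding
corr = np.correlate(x, w, mode="full")   # correlation: no flipping

print(conv)   # [1, 2, 2, 2, -3, -4]
print(corr)   # [-1, -2, -2, -2, 3, 4]

# Correlation is just convolution with a flipped kernel:
print(np.allclose(corr, np.convolve(x, w[::-1], mode="full")))   # True
```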

Flipping the kernel vs. not flipping

  • Commutative property: convolution is commutative ($\mathbf{x} * \mathbf{w} = \mathbf{w} * \mathbf{x}$), while correlation is not

Example:

  • Convolution:
$$\displaystyle \left[\begin{array}{ccc}0&20&40\\80&20&0\\0&0&40 \end{array}\right] * \left[\begin{array}{cc}0&0.5 \\ 0.25 & 1 \end{array}\right] = ?$$

Flip $\left[\begin{array}{cc}0&0.5 \\ 0.25 & 1 \end{array}\right]$ on both axes: $\left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right]$

Then, zero-pad the input, slide the flipped kernel over it, and at each position use simple element-wise multiplications followed by summing over the products (a full convolution, so the output is $4\times 4$):

First row of the output:

$$\left[\begin{array}{cc}0&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&0\\0 & 20 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&0\\20 & 40 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 10 \qquad \left[\begin{array}{cc}0&0\\40 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20$$

Second row:

$$\left[\begin{array}{cc}0&0\\0 & 80 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&20\\80 & 20 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 45 \qquad \left[\begin{array}{cc}20&40\\20 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 40 \qquad \left[\begin{array}{cc}40&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 40$$

Third row:

$$\left[\begin{array}{cc}0&80\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20 \qquad \left[\begin{array}{cc}80&20\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 85 \qquad \left[\begin{array}{cc}20&0\\0 & 40 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20 \qquad \left[\begin{array}{cc}0&0\\40 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 20$$

Fourth row:

$$\left[\begin{array}{cc}0&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 0 \qquad \left[\begin{array}{cc}0&40\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 10 \qquad \left[\begin{array}{cc}40&0\\0 & 0 \end{array}\right] \odot \left[\begin{array}{cc}1 &0.25 \\ 0.5 & 0 \end{array}\right] \to 40$$

Collecting the sums gives the convolution result:

$$\left[\begin{array}{cccc}0&0&10&20\\0&45&40&40\\20&85&20&20\\0&0&10&40 \end{array}\right]$$

  • Correlation:
$$\displaystyle \left[\begin{array}{ccc}0&20&40\\80&20&0\\0&0&40 \end{array}\right] \star \left[\begin{array}{cc}0&0.5 \\ 0.25 & 1 \end{array}\right] = ?$$
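
The worked example can be verified with scipy (a sketch; the "full" mode corresponds to the zero-padded sliding used above):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.array([[0, 20, 40],
              [80, 20, 0],
              [0,  0, 40]], dtype=float)
w = np.array([[0,    0.5],
              [0.25, 1.0]])

print(convolve2d(x, w, mode="full"))    # flips the kernel; matches the 4x4 result above
print(correlate2d(x, w, mode="full"))   # no flipping
```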

Pooling

  • Pool responses of hidden units in the same neighborhood
  • Pooling is performed in non-overlapping neighborhoods (sub-sampling)

Advantages:

  • invariance to local translations

Max-pooling

Example:

$$\displaystyle \left[\begin{array}{cccc}0.3&0.2&1&0.1\\3&0.8&0.8&2\\0.4&0.7&-0.8&-1\\-1.7&0.7&2.1&0.4 \end{array}\right] \xrightarrow{\text{max-pooling}} \left[\begin{array}{cc}3&2\\0.7&2.1 \end{array}\right]$$
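
A small numpy sketch of $2\times 2$ non-overlapping max-pooling on this example:

```python
import numpy as np

x = np.array([[0.3, 0.2, 1.0, 0.1],
              [3.0, 0.8, 0.8, 2.0],
              [0.4, 0.7, -0.8, -1.0],
              [-1.7, 0.7, 2.1, 0.4]])

# Reshape into non-overlapping 2x2 blocks and take the max within each block.
h, w = x.shape
pooled = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)   # [[3.  2. ]
                #  [0.7 2.1]]
```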

Example: Local invariance from pooling

$$\displaystyle \left[\begin{array}{ccccc} 0&0&255&0&0\\0&0&255&0&0\\0&0&255&0&0\\0&255&0&0&0\\255&0&0&0&0 \end{array}\right] \longrightarrow \left[\begin{array}{cc}?&?\\ ?&?\end{array}\right] \longleftarrow \left[\begin{array}{ccccc} 0&0&255&0&0\\0&0&0&0&0\\0&255&255&0&0\\255&0&0&0&0\\0&0&0&0&0\end{array}\right]$$

Putting everything together

  • A convolutional neural network alternates between convolutional and pooling layers
  • The output is a fully connected layer (see the sketch below)
  • ..
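
A minimal sketch of this pattern; PyTorch and all layer sizes here are assumptions for illustration, not from the notes:

```python
import torch
import torch.nn as nn

# conv -> pool -> conv -> pool -> fully connected output (assumed sizes).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),   # convolution layer
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling / subsampling
    nn.Conv2d(8, 16, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 10),        # fully connected output layer (for 28x28 inputs)
)

x = torch.randn(1, 1, 28, 28)         # one 28x28 grayscale image
print(model(x).shape)                 # torch.Size([1, 10])
```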

Convolution Layer

  • Input dimension: $(h, w, channels=10)$
  • Output dimension: $(h', w', channels=16)$

To get from the input to the output, we need a weight (filter) tensor with dimensionality $(16, f_x, f_y, 10)$, where $f_x, f_y$ is the spatial filter size, for example $5\times 5$.

$$y_i = \sum_{k=1}^{10} X_k * W_{i,.,.,k}$$
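
A numpy sketch of this layer; the 10 input and 16 output channels and $5\times 5$ filters are from the notes, the spatial size and random values are made up:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32, 32))            # input: 10 channels, each 32x32
W = rng.normal(size=(16, 5, 5, 10))          # 16 filters, each 5x5x10

# y_i = sum_k X_k * W[i, :, :, k] for i = 1..16
Y = np.stack([
    sum(convolve2d(X[k], W[i, :, :, k], mode="valid") for k in range(10))
    for i in range(16)
])
print(Y.shape)                               # (16, 28, 28): h' = h - 5 + 1
```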

Backpropagation for Convolution

  • Let $l$ be our loss function, and $\mathbf{y}_j = \sum_i \mathbf{x}_i * \mathbf{w}_{ij}$

    • Gradient of input $\mathbf{x}_i$

      $$\frac{\partial l}{\partial \mathbf{x}_i} = \sum_j \left(\frac{\partial l}{\partial \mathbf{y}_j}\right) \star \mathbf{w}_{ij}$$

      $\star$ represents correlation

    • Gradient of weights $\mathbf{w}_{ij}$

      $$\frac{\partial l}{\partial \mathbf{w}_{ij}} = \left(\frac{\partial l}{\partial \mathbf{y}_j}\right) \star \mathbf{x}_{i}$$

      (no sum over $j$ here: each $\mathbf{w}_{ij}$ contributes only to $\mathbf{y}_j$)
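
A 1D numpy sanity check of both formulas, using a full convolution in the forward pass and an arbitrary upstream gradient; all values are a made-up toy example:

```python
import numpy as np

# Forward pass y = x * w (full convolution), loss l = <g, y> with upstream gradient g.
rng = np.random.default_rng(0)
x = rng.normal(size=8)                      # input signal x_i
w = rng.normal(size=3)                      # filter w_ij
g = rng.normal(size=8 + 3 - 1)              # upstream gradient dl/dy, same length as y

def loss(x, w):
    return np.dot(g, np.convolve(x, w, mode="full"))

# Analytic gradients from the formulas above (star = correlation, no flipping):
grad_x = np.correlate(g, w, mode="valid")   # dl/dx = (dl/dy) star w
grad_w = np.correlate(g, x, mode="valid")   # dl/dw = (dl/dy) star x

def numeric_grad(f, v, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    out = np.zeros_like(v)
    for i in range(len(v)):
        vp, vm = v.copy(), v.copy()
        vp[i] += eps
        vm[i] -= eps
        out[i] = (f(vp) - f(vm)) / (2 * eps)
    return out

print(np.allclose(grad_x, numeric_grad(lambda v: loss(v, w), x)))  # True
print(np.allclose(grad_w, numeric_grad(lambda v: loss(x, v), w)))  # True
```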

Backpropagation for pooling

Gradient of max-pooling

Consider a single $2\times 2$ pooling region $\left[\begin{array}{cc}x_1&x_2\\ x_3&x_4\end{array}\right]$.

Let's assume that the maximum occurs at $x_2$ for this example, i.e. $y = \max\{x_1,x_2,x_3,x_4\} = x_2$.

Therefore, the gradient with respect to the inputs is $\left[\begin{array}{cc}0&1\\0&0\end{array}\right]$: the upstream gradient flows only to the unit that attained the maximum.

Gradient of mean-pooling

$\displaystyle \left[\begin{array}{cc}\frac{1}{4}&\frac{1}{4}\\ \frac{1}{4}&\frac{1}{4}\end{array}\right]$: each input in the region receives an equal share of the gradient.
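
A numpy sketch of how the gradient is routed through one max-pooling region (toy values):

```python
import numpy as np

x = np.array([[0.3, 0.2],
              [1.0, 0.1]])             # one 2x2 pooling region
g = 5.0                                # upstream gradient dl/dy for this region

y = x.max()                            # forward: max-pooling
mask = (x == y).astype(float)          # 1 at the arg-max, 0 elsewhere
grad_x = g * mask                      # all of the gradient goes to the max unit

print(grad_x)   # [[0. 0.]
                #  [5. 0.]]
```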

Regularizing CNNs

  • Training ensembles
  • Model averaging
  • Dropout: randomly drop some of the hidden units ($a_i$)
  • DropConnect: randomly drop some of the weights ($\mathbf{W}_{ij}$)
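
A minimal numpy sketch of the two masking schemes; the layer sizes and drop probability are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                 # drop probability (assumed)
a = rng.normal(size=4)                  # hidden activations a_i
W = rng.normal(size=(3, 4))             # weights W_ij of the next layer

# Dropout: zero out individual hidden units during training.
dropout_mask = rng.random(a.shape) > p
a_dropped = a * dropout_mask

# DropConnect: zero out individual weights instead.
dropconnect_mask = rng.random(W.shape) > p
W_dropped = W * dropconnect_mask

print(W @ a_dropped)    # forward pass with dropout
print(W_dropped @ a)    # forward pass with DropConnect
```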
