3d max pooling using 2d operators

Support for 3d pooling is spotty across convolutional neural network libraries, but it can be implemented fairly easily as a series of 2d pooling and matrix reshaping operations. In this overview I provide examples using the Theano Python library.


3d max pooling overview

Suppose we have the following 3d matrix:


In [2]:
import numpy
original = numpy.array([[[3, 8, 6, 6], [1, 6, 4, 1], [7, 9, 7, 9], [5, 11, 3, 2]], [[4, 5, 8, 1], [6, 7, 4, 2], [0, 3, 5, 9], [9, 10, 10, 2]], [[10, 3, 7, 4], [2, 1, 2, 9], [8, 8, 1, 1], [6, 3, 0, 4]]])
original


Out[2]:
array([[[ 3,  8,  6,  6],
        [ 1,  6,  4,  1],
        [ 7,  9,  7,  9],
        [ 5, 11,  3,  2]],

       [[ 4,  5,  8,  1],
        [ 6,  7,  4,  2],
        [ 0,  3,  5,  9],
        [ 9, 10, 10,  2]],

       [[10,  3,  7,  4],
        [ 2,  1,  2,  9],
        [ 8,  8,  1,  1],
        [ 6,  3,  0,  4]]])

In [5]:
original.shape


Out[5]:
(3, 4, 4)

We can think of this as 3 separate 4x4 images. In 3d max pooling, the goal is to create a new matrix by selecting the maximum value from each (usually non-overlapping) cube of the original matrix. The size of the cube is called the pool size, and the step from one cube to the next is called the stride. Most of the time you want the pool size to equal the stride, so the cubes neither overlap nor leave gaps. In this example, the result of a 3d max pool with a pool size and stride of (2, 2, 2) is:


In [6]:
a = numpy.array([[[8, 8], [11, 10]], [[10, 9], [8, 4]]])
a


Out[6]:
array([[[ 8,  8],
        [11, 10]],

       [[10,  9],
        [ 8,  4]]])

In [7]:
a.shape


Out[7]:
(2, 2, 2)

Notice that the size of the pooled matrix is ceil(dim/2) in every dimension: (3, 4, 4) -> (2, 2, 2). This assumes that any stride window containing only a single row, column, or image is implicitly padded with 0's. Without padding, the truncated windows are dropped and the output shape is floor(dim/2) instead.
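To make the pooling rule concrete, here is a brute-force numpy sketch (the function name max_pool_3d_naive is mine, not a library API) that takes the max over each 2x2x2 cube, handling ragged edges by simply taking the max of whatever fits in the window:

```python
import numpy

original = numpy.array([[[3, 8, 6, 6], [1, 6, 4, 1], [7, 9, 7, 9], [5, 11, 3, 2]],
                        [[4, 5, 8, 1], [6, 7, 4, 2], [0, 3, 5, 9], [9, 10, 10, 2]],
                        [[10, 3, 7, 4], [2, 1, 2, 9], [8, 8, 1, 1], [6, 3, 0, 4]]])

def max_pool_3d_naive(m, stride=2):
    """Brute-force 3d max pooling: max over each stride^3 cube.
    A ragged trailing window just takes the max of what is there,
    which matches 0-padding for non-negative inputs."""
    z, x, y = m.shape
    out = numpy.zeros(((z + stride - 1) // stride,
                       (x + stride - 1) // stride,
                       (y + stride - 1) // stride), dtype=m.dtype)
    for i in range(0, z, stride):
        for j in range(0, x, stride):
            for k in range(0, y, stride):
                out[i // stride, j // stride, k // stride] = \
                    m[i:i + stride, j:j + stride, k:k + stride].max()
    return out

max_pool_3d_naive(original)
```

This triple loop is far too slow for real use, but it defines the answer we want the 2d-operator version below to reproduce.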

3d max pooling using 2d operators

Highly efficient, GPU-optimized versions of max pool 2d exist in most libraries, but max pool 3d is trickier to come by, so we'd like to build our 3d max pooling out of max pool 2d operations alone.

To do this, we can take advantage of the associativity and commutativity properties of the max operator, first finding the maximum values of each (X, Y) plane (image), and then finding the maximum values along the Z axis (along all images). The rough algorithm is:

  1. Assume the matrix is of shape (Z, X, Y)
  2. Max pool 2d along (X, Y) with stride (2, 2) for each Z slice
  3. Shuffle the dimensions so Z becomes the last axis: (X, Y, Z)
  4. Max pool 2d along (Y, Z), but with stride (1, 2), for each X slice
  5. Shuffle back to the original (Z, X, Y) layout
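The five steps above can be sketched in plain numpy, using a hypothetical max_pool_2d stand-in (my own helper, not the Theano operator) that pools over the last two axes:

```python
import numpy

def max_pool_2d(m, stride=(2, 2)):
    """Hypothetical numpy stand-in for a library's max pool 2d:
    max over non-overlapping stride windows on the last two axes.
    A ragged trailing window takes the max of what is there,
    equivalent to 0-padding for non-negative inputs."""
    sr, sc = stride
    rows, cols = m.shape[-2], m.shape[-1]
    out_r = (rows + sr - 1) // sr
    out_c = (cols + sc - 1) // sc
    out = numpy.zeros(m.shape[:-2] + (out_r, out_c), dtype=m.dtype)
    for i in range(out_r):
        for j in range(out_c):
            window = m[..., i * sr:(i + 1) * sr, j * sc:(j + 1) * sc]
            out[..., i, j] = window.max(axis=(-2, -1))
    return out

def max_pool_3d(m):
    xy = max_pool_2d(m, (2, 2))            # step 2: pool each (X, Y) plane
    shuffled = xy.transpose(1, 2, 0)       # step 3: move Z to the back
    zpool = max_pool_2d(shuffled, (1, 2))  # step 4: pool pairs along Z
    return zpool.transpose(2, 0, 1)        # step 5: restore (Z, X, Y)
```

The numpy transpose calls here play the role of Theano's dimshuffle in the walkthrough that follows.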

Example

Referring back to our original example, if we have a 3x4x4 matrix (3 images of size 4x4), we first perform a max pool 2d in the XY plane of each image. Again, our stride is (2, 2, 2), effectively halving the matrix in every dimension. Theano ships with a very nice operator for 2-dimensional downsampling, so we can reuse that.


In [9]:
import theano
from theano.tensor.signal import downsample

x = theano.shared(original)
xypool = downsample.max_pool_2d(x, (2, 2))
xypool.eval()


Out[9]:
array([[[ 8,  6],
        [11,  9]],

       [[ 7,  8],
        [10, 10]],

       [[10,  9],
        [ 8,  4]]])

This creates a new matrix of size (Z, X/2, Y/2). Each image has now been max pooled, but we still need to pool along the Z axis. We can do this by shuffling the dimensions so that Z becomes the columns and (X, Y) become the row indices, giving a 3d matrix of size (2, 2, 3). I've annotated the indices below to better illustrate the transformation from the previous matrix.

[x=0 [y=0 [8, 7, 10], y=1 [6, 8, 9]], x=1 [y=0 [11, 10, 8], y=1 [9, 10, 4]]]


In [10]:
shufl = [1, 2, 0]
shuffled = xypool.dimshuffle(shufl)
shuffled.eval()


Out[10]:
array([[[ 8,  7, 10],
        [ 6,  8,  9]],

       [[11, 10,  8],
        [ 9, 10,  4]]])

The dimshuffle operator moves the 0 axis (Z) to the back and shifts X and Y over by one, so now (X, Y, Z) = [0, 1, 2]. This is conceptually a little tricky: each row here holds the values for one distinct (X, Y) index pair ([x=0, y=0], [x=0, y=1], etc.) across all Z (images), so we need to max pool along each row's Z values, meaning our stride in this case is (1, 2). The implicit assumption, as before, is that the odd trailing window is padded with 0's to prevent dimensionality reduction.
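To see exactly what the (1, 2) stride does here, we can reproduce it in plain numpy by splitting the Z columns into even/odd pairs and taking an elementwise max (the variable names are mine, just for illustration):

```python
import numpy

# The (X, Y, Z)-ordered matrix from the dimshuffle step above.
shuffled = numpy.array([[[8, 7, 10], [6, 8, 9]],
                        [[11, 10, 8], [9, 10, 4]]])

# A (1, 2) stride pools adjacent Z entries within each (X, Y) row.
left = shuffled[:, :, 0::2]   # Z indices 0, 2, ...
right = shuffled[:, :, 1::2]  # Z indices 1, 3, ...

# When Z is odd, pad the right half with 0's so the lone
# trailing column survives on its own.
pad = left.shape[2] - right.shape[2]
right = numpy.pad(right, ((0, 0), (0, 0), (0, pad)), mode='constant')

pooled = numpy.maximum(left, right)
```

For the row [8, 7, 10], this pairs up (8, 7) -> 8 and leaves 10 alone, which is exactly the pooling we want per (X, Y) combination.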

Now we just max_pool_2d again.


In [11]:
pooled = downsample.max_pool_2d(shuffled, (1, 2))
pooled.eval()


Out[11]:
array([[[ 8, 10],
        [ 8,  9]],

       [[11,  8],
        [10,  4]]])

This is starting to look pretty good! At least the dimensions match what we outlined earlier. But is this correct? Not quite yet: the axes of this downsampled matrix are still in (X, Y, Z) order, with size (X/2, Y/2, Z/2), so we need one more shuffle to move Z back to the front.


In [12]:
shufl = [2, 0, 1]
normal = pooled.dimshuffle(shufl)
normal.eval()


Out[12]:
array([[[ 8,  8],
        [11, 10]],

       [[10,  9],
        [ 8,  4]]])

Voila! This is the exact result we expected.

Complete Code


In [14]:
import numpy
import theano
from theano.tensor.signal import downsample
original = numpy.array([[[3, 8, 6, 6], [1, 6, 4, 1], [7, 9, 7, 9], [5, 11, 3, 2]], [[4, 5, 8, 1], [6, 7, 4, 2], [0, 3, 5, 9], [9, 10, 10, 2]], [[10, 3, 7, 4], [2, 1, 2, 9], [8, 8, 1, 1], [6, 3, 0, 4]]])

x = theano.shared(original)

#Downsample across X,Y plane
xypool = downsample.max_pool_2d(x, (2, 2))

#Shuffle so Z is in the back
shufl = [1, 2, 0]
shuffled = xypool.dimshuffle(shufl)

#Downsample across Z columns
pooled = downsample.max_pool_2d(shuffled, (1, 2))

#Reshuffle to original shape
shufl = [2, 0, 1]
normal = pooled.dimshuffle(shufl)
normal.eval()


Out[14]:
array([[[ 8,  8],
        [11, 10]],

       [[10,  9],
        [ 8,  4]]])
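As a sanity check, the whole pipeline can also be reproduced in plain numpy without Theano. This is a compact sketch of my own, assuming the X and Y dimensions are even (so a reshape cleanly splits them into 2x2 blocks), rather than the general method above:

```python
import numpy

original = numpy.array([[[3, 8, 6, 6], [1, 6, 4, 1], [7, 9, 7, 9], [5, 11, 3, 2]],
                        [[4, 5, 8, 1], [6, 7, 4, 2], [0, 3, 5, 9], [9, 10, 10, 2]],
                        [[10, 3, 7, 4], [2, 1, 2, 9], [8, 8, 1, 1], [6, 3, 0, 4]]])

# Pool each 4x4 image down to 2x2: the reshape splits X and Y into
# (block, within-block) pairs, then we max over the within-block axes.
xy = original.reshape(3, 2, 2, 2, 2).max(axis=(2, 4))

# Pool pairs of images along Z: reduceat's segment starts [0, 2]
# group Z indices {0, 1} together and leave the trailing {2} alone.
pooled = numpy.maximum.reduceat(xy, [0, 2], axis=0)
```

Because max is taken directly over each segment, no explicit padding or axis shuffling is needed in this version.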
