Net Surgery

Caffe models can be transformed to your particular needs by editing the network parameters. In this example, we translate the inner product classifier layers of the Caffe Reference ImageNet model into convolutional layers. This yields a fully-convolutional model that generates a classification map for any given input size instead of a single classification. In particular, a classification will be made for every 6 $\times$ 6 region of the pool5 layer, giving an 8 $\times$ 8 classification map for our example 454 $\times$ 454 input dimensions.
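
To see where the 8 $\times$ 8 map comes from, here is a minimal sketch of the spatial size arithmetic, assuming the reference model's layer geometry (Caffe floors convolution output sizes and ceils pooling output sizes):

import math

def conv_out(size, kernel, stride=1, pad=0):
    # Caffe floors convolution output sizes
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=1, pad=0):
    # ...and ceils pooling output sizes
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

size = 454
size = conv_out(size, 11, stride=4)  # conv1 -> 111
size = pool_out(size, 3, stride=2)   # pool1 -> 55
size = conv_out(size, 5, pad=2)      # conv2 -> 55
size = pool_out(size, 3, stride=2)   # pool2 -> 27
size = conv_out(size, 3, pad=1)      # conv3 -> 27
size = conv_out(size, 3, pad=1)      # conv4 -> 27
size = conv_out(size, 3, pad=1)      # conv5 -> 27
size = pool_out(size, 3, stride=2)   # pool5 -> 13
size = conv_out(size, 6)             # fc6-conv -> 8
print size                           # 8, hence the 8 x 8 classification map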

Note that this model isn't totally appropriate for sliding-window detection since it was trained for whole-image classification. Sliding-window training and finetuning can be done by defining a sliding-window ground truth and loss such that a loss map is made for every location and solving as usual. (While planned, this is currently an exercise for the reader.)

Roll up your sleeves for net surgery with pycaffe!


In [1]:
!diff imagenet/imagenet_full_conv.prototxt imagenet/imagenet_deploy.prototxt


1c1
< name: "CaffeNetConv"
---
> name: "CaffeNet"
3c3
< input_dim: 1
---
> input_dim: 10
5,6c5,6
< input_dim: 454
< input_dim: 454
---
> input_dim: 227
> input_dim: 227
151,152c151,152
<   name: "fc6-conv"
<   type: CONVOLUTION
---
>   name: "fc6"
>   type: INNER_PRODUCT
154,155c154,155
<   top: "fc6-conv"
<   convolution_param {
---
>   top: "fc6"
>   inner_product_param {
157d156
<     kernel_size: 6
163,164c162,163
<   bottom: "fc6-conv"
<   top: "fc6-conv"
---
>   bottom: "fc6"
>   top: "fc6"
169,170c168,169
<   bottom: "fc6-conv"
<   top: "fc6-conv"
---
>   bottom: "fc6"
>   top: "fc6"
176,180c175,179
<   name: "fc7-conv"
<   type: CONVOLUTION
<   bottom: "fc6-conv"
<   top: "fc7-conv"
<   convolution_param {
---
>   name: "fc7"
>   type: INNER_PRODUCT
>   bottom: "fc6"
>   top: "fc7"
>   inner_product_param {
182d180
<     kernel_size: 1
188,189c186,187
<   bottom: "fc7-conv"
<   top: "fc7-conv"
---
>   bottom: "fc7"
>   top: "fc7"
194,195c192,193
<   bottom: "fc7-conv"
<   top: "fc7-conv"
---
>   bottom: "fc7"
>   top: "fc7"
201,205c199,203
<   name: "fc8-conv"
<   type: CONVOLUTION
<   bottom: "fc7-conv"
<   top: "fc8-conv"
<   convolution_param {
---
>   name: "fc8"
>   type: INNER_PRODUCT
>   bottom: "fc7"
>   top: "fc8"
>   inner_product_param {
207d204
<     kernel_size: 1
213c210
<   bottom: "fc8-conv"
---
>   bottom: "fc8"

The only differences needed in the architecture are to change the fully-connected classifier inner product layers into convolutional layers with the right filter size -- 6 $\times$ 6, since the reference model classifiers take the 6 $\times$ 6 spatial output of pool5 as input (9216 = 256 $\times$ 6 $\times$ 6 inputs in all) -- and stride 1 for dense classification. Note that the layers are renamed so that Caffe does not try to blindly load the old parameters when it maps layer names to the pretrained model.
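
To see why this change preserves the computation, here is a toy check (with made-up shapes, not the real net's) that an inner product over a $c \times h \times w$ blob computes the same values as a convolution whose single filter application covers the whole blob:

import numpy as np

c, h, w, num_output = 2, 3, 3, 4               # toy shapes, not the real net's
x = np.random.randn(c, h, w)                   # a pool5-like blob
W_fc = np.random.randn(num_output, c * h * w)  # inner product weights
W_conv = W_fc.reshape(num_output, c, h, w)     # the same weights as filters

ip = W_fc.dot(x.flatten())                        # inner product output
conv = np.array([(f * x).sum() for f in W_conv])  # one filter application each
print np.allclose(ip, conv)                       # True: identical outputs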


In [2]:
import caffe

# Load the original network and extract the fully-connected layers' parameters.
net = caffe.Net('imagenet/imagenet_deploy.prototxt', 'imagenet/caffe_reference_imagenet_model')
params = ['fc6', 'fc7', 'fc8']
# fc_params = {name: (weights, biases)}
fc_params = {pr: (net.params[pr][0].data, net.params[pr][1].data) for pr in params}

for fc in params:
    print '{} weights are {} dimensional and biases are {} dimensional'.format(fc, fc_params[fc][0].shape, fc_params[fc][1].shape)


fc6 weights are (1, 1, 4096, 9216) dimensional and biases are (1, 1, 1, 4096) dimensional
fc7 weights are (1, 1, 4096, 4096) dimensional and biases are (1, 1, 1, 4096) dimensional
fc8 weights are (1, 1, 1000, 4096) dimensional and biases are (1, 1, 1, 1000) dimensional

Consider the shapes of the inner product parameters. For both weights and biases the zeroth and first dimensions are 1. The second and third weight dimensions are the output and input sizes -- for fc6, 4096 outputs from the 9216 = 256 $\times$ 6 $\times$ 6 pool5 inputs -- while the last bias dimension is the output size.


In [3]:
# Load the fully-convolutional network to transplant the parameters.
net_full_conv = caffe.Net('imagenet/imagenet_full_conv.prototxt', 'imagenet/caffe_reference_imagenet_model')
params_full_conv = ['fc6-conv', 'fc7-conv', 'fc8-conv']
# conv_params = {name: (weights, biases)}
conv_params = {pr: (net_full_conv.params[pr][0].data, net_full_conv.params[pr][1].data) for pr in params_full_conv}

for conv in params_full_conv:
    print '{} weights are {} dimensional and biases are {} dimensional'.format(conv, conv_params[conv][0].shape, conv_params[conv][1].shape)


fc6-conv weights are (4096, 256, 6, 6) dimensional and biases are (1, 1, 1, 4096) dimensional
fc7-conv weights are (4096, 4096, 1, 1) dimensional and biases are (1, 1, 1, 4096) dimensional
fc8-conv weights are (1000, 4096, 1, 1) dimensional and biases are (1, 1, 1, 1000) dimensional

The convolution weights are arranged in output $\times$ input $\times$ height $\times$ width dimensions. To map the inner product weights to convolution filters, we need to roll the flat inner product vectors into channel $\times$ height $\times$ width filter matrices -- for fc6, reshaping the (1, 1, 4096, 9216) weights into (4096, 256, 6, 6) filters.

The biases are identical to those of the inner product -- let's transplant these first since no reshaping is needed.


In [4]:
for pr, pr_conv in zip(params, params_full_conv):
    conv_params[pr_conv][1][...] = fc_params[pr][1]

The output channels are the leading dimension of both the inner product and convolution weights, so the parameters are translated by reshaping each flat input-dimension parameter vector from the inner product into the channel $\times$ height $\times$ width filter shape.


In [5]:
for pr, pr_conv in zip(params, params_full_conv):
    out, in_, h, w = conv_params[pr_conv][0].shape
    W = fc_params[pr][0].reshape((out, in_, h, w))
    conv_params[pr_conv][0][...] = W
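
As an optional sanity check, the rolled filters should contain exactly the same values as the flat inner product weights (a sketch; assumes numpy is imported as np):

import numpy as np

for pr, pr_conv in zip(params, params_full_conv):
    # reshape does not reorder values, so the flattened weights must match
    assert np.allclose(conv_params[pr_conv][0].flatten(),
                       fc_params[pr][0].flatten())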

Next, save the new model weights.


In [6]:
net_full_conv.save('imagenet/caffe_imagenet_full_conv')

To conclude, let's make a classification map from the example cat image. This gives an 8 $\times$ 8 labeling of overlapping image regions.


In [7]:
import numpy as np
import matplotlib.pyplot as plt

# load input and configure preprocessing
im = caffe.io.load_image('images/cat.jpg')
plt.imshow(im)
net_full_conv.set_mean('data', '../python/caffe/imagenet/ilsvrc_2012_mean.npy')
net_full_conv.set_channel_swap('data', (2,1,0))
net_full_conv.set_input_scale('data', 255.0)
# make classification map by forward pass and show top prediction index per location
out = net_full_conv.forward_all(data=np.asarray([net_full_conv.preprocess('data', im)]))
out['prob'][0].argmax(axis=0)


Out[7]:
array([[278, 151, 259, 281, 282, 259, 282, 282],
       [283, 259, 283, 282, 283, 281, 259, 277],
       [283, 283, 283, 287, 287, 287, 287, 282],
       [283, 283, 283, 281, 281, 259, 259, 333],
       [283, 283, 283, 283, 283, 283, 283, 283],
       [283, 283, 283, 283, 283, 259, 283, 333],
       [283, 356, 359, 371, 368, 368, 259, 852],
       [356, 335, 358, 151, 283, 263, 277, 744]])

The classifications include various cats -- 281 = tabby cat, 282 = tiger cat, 283 = Persian cat -- and foxes and other mammals.

In this way the fully-connected layers can be extracted as dense features across an image (see net_full_conv.blobs['fc6-conv'].data for instance), which is perhaps more useful than the classification map itself.
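
For instance, a minimal sketch (assuming the forward pass above has already run) of pulling the dense fc6 features at every location of the map:

# the fc6-conv blob holds 4096 features at each of the 8 x 8 map locations
feat = net_full_conv.blobs['fc6-conv'].data[0]  # shape (4096, 8, 8)
print '{} features per location over a {} x {} grid'.format(
    feat.shape[0], feat.shape[1], feat.shape[2])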

A thank you to Rowland Depp for first suggesting this trick.