R-CNN is a state-of-the-art detector that classifies region proposals with a fine-tuned Caffe model. For the full details of the R-CNN system and model, refer to its project site and the paper:

Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. CVPR 2014. Arxiv 2013.

In this example, we run detection with a pure Caffe edition of the R-CNN model for ImageNet. The R-CNN detector outputs class scores for the 200 detection classes of ILSVRC13. Keep in mind that these are raw one-vs-all SVM scores, so they are neither probabilistically calibrated nor exactly comparable across classes. Note that this off-the-shelf model is provided for convenience and is not the full R-CNN model.

Let's run detection on an image of a bicyclist riding a fish bike in the desert (from the ImageNet challenge—no joke).

First, we'll need region proposals and the Caffe R-CNN ImageNet model:

  • Selective Search is the region proposer used by R-CNN. The selective_search_ijcv_with_python Python module takes care of extracting proposals through the selective search MATLAB implementation. To install it, download the module and name its directory selective_search_ijcv_with_python, run the demo in MATLAB to compile the necessary functions, then add it to your PYTHONPATH for importing. (If you have your own region proposals prepared, or would rather not bother with this step, detect.py accepts a list of images and bounding boxes as CSV.)

  • Follow the model instructions to get the Caffe R-CNN ImageNet model.
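
For the CSV route mentioned in the first item, a window list can be written with Python's csv module. This is only a sketch: the header names (filename, ymin, xmin, ymax, xmax), the helper name write_windows_csv, and the example coordinates are assumptions for illustration, so check the expected layout against ./detect.py --help before relying on it.

```python
import csv

# Assumed column layout; verify against ./detect.py --help.
COLUMNS = ['filename', 'ymin', 'xmin', 'ymax', 'xmax']


def write_windows_csv(path, windows):
    """Write (filename, ymin, xmin, ymax, xmax) rows to a CSV file with a header."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for row in windows:
            writer.writerow(row)


# Hypothetical windows, made up for illustration only.
example_windows = [
    ('images/fish-bike.jpg', 0, 0, 100, 200),
    ('images/fish-bike.jpg', 50, 80, 240, 330),
]
```

Calling write_windows_csv('det_windows.csv', example_windows) would produce a three-line file: the header followed by one row per window.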

With that done, we'll call the bundled detect.py to generate the region proposals and run the network. For an explanation of the arguments, run ./detect.py --help.


In [1]:
!mkdir -p _temp
!echo `pwd`/images/fish-bike.jpg > _temp/det_input.txt
!../python/detect.py --crop_mode=selective_search --pretrained_model=imagenet/caffe_rcnn_imagenet_model --model_def=imagenet/rcnn_imagenet_deploy.prototxt --gpu _temp/det_input.txt _temp/det_output.h5


WARNING: Logging before InitGoogleLogging() is written to STDERR
I0610 10:12:49.299607 25530 net.cpp:36] Initializing net from parameters: 
name: "R-CNN-ilsvrc13"
layers {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: CONVOLUTION
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
  }
}
layers {
  bottom: "conv1"
  top: "conv1"
  name: "relu1"
  type: RELU
}
layers {
  bottom: "conv1"
  top: "pool1"
  name: "pool1"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layers {
  bottom: "pool1"
  top: "norm1"
  name: "norm1"
  type: LRN
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layers {
  bottom: "norm1"
  top: "conv2"
  name: "conv2"
  type: CONVOLUTION
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
  }
}
layers {
  bottom: "conv2"
  top: "conv2"
  name: "relu2"
  type: RELU
}
layers {
  bottom: "conv2"
  top: "pool2"
  name: "pool2"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layers {
  bottom: "pool2"
  top: "norm2"
  name: "norm2"
  type: LRN
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layers {
  bottom: "norm2"
  top: "conv3"
  name: "conv3"
  type: CONVOLUTION
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
  }
}
layers {
  bottom: "conv3"
  top: "conv3"
  name: "relu3"
  type: RELU
}
layers {
  bottom: "conv3"
  top: "conv4"
  name: "conv4"
  type: CONVOLUTION
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    group: 2
  }
}
layers {
  bottom: "conv4"
  top: "conv4"
  name: "relu4"
  type: RELU
}
layers {
  bottom: "conv4"
  top: "conv5"
  name: "conv5"
  type: CONVOLUTION
  convolution_param {
    num_output: 256
    pad: 1
    kernel_size: 3
    group: 2
  }
}
layers {
  bottom: "conv5"
  top: "conv5"
  name: "relu5"
  type: RELU
}
layers {
  bottom: "conv5"
  top: "pool5"
  name: "pool5"
  type: POOLING
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layers {
  bottom: "pool5"
  top: "fc6"
  name: "fc6"
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 4096
  }
}
layers {
  bottom: "fc6"
  top: "fc6"
  name: "relu6"
  type: RELU
}
layers {
  bottom: "fc6"
  top: "fc6"
  name: "drop6"
  type: DROPOUT
  dropout_param {
    dropout_ratio: 0.5
  }
}
layers {
  bottom: "fc6"
  top: "fc7"
  name: "fc7"
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 4096
  }
}
layers {
  bottom: "fc7"
  top: "fc7"
  name: "relu7"
  type: RELU
}
layers {
  bottom: "fc7"
  top: "fc7"
  name: "drop7"
  type: DROPOUT
  dropout_param {
    dropout_ratio: 0.5
  }
}
layers {
  bottom: "fc7"
  top: "fc-rcnn"
  name: "fc-rcnn"
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 200
  }
}
input: "data"
input_dim: 10
input_dim: 3
input_dim: 227
input_dim: 227
I0610 10:12:49.300204 25530 net.cpp:77] Creating Layer conv1
I0610 10:12:49.300214 25530 net.cpp:87] conv1 <- data
I0610 10:12:49.300220 25530 net.cpp:113] conv1 -> conv1
I0610 10:12:49.300283 25530 net.cpp:128] Top shape: 10 96 55 55 (2904000)
I0610 10:12:49.300294 25530 net.cpp:154] conv1 needs backward computation.
I0610 10:12:49.300302 25530 net.cpp:77] Creating Layer relu1
I0610 10:12:49.300308 25530 net.cpp:87] relu1 <- conv1
I0610 10:12:49.300314 25530 net.cpp:101] relu1 -> conv1 (in-place)
I0610 10:12:49.300323 25530 net.cpp:128] Top shape: 10 96 55 55 (2904000)
I0610 10:12:49.300328 25530 net.cpp:154] relu1 needs backward computation.
I0610 10:12:49.300335 25530 net.cpp:77] Creating Layer pool1
I0610 10:12:49.300341 25530 net.cpp:87] pool1 <- conv1
I0610 10:12:49.300348 25530 net.cpp:113] pool1 -> pool1
I0610 10:12:49.300357 25530 net.cpp:128] Top shape: 10 96 27 27 (699840)
I0610 10:12:49.300365 25530 net.cpp:154] pool1 needs backward computation.
I0610 10:12:49.300372 25530 net.cpp:77] Creating Layer norm1
I0610 10:12:49.300379 25530 net.cpp:87] norm1 <- pool1
I0610 10:12:49.300384 25530 net.cpp:113] norm1 -> norm1
I0610 10:12:49.300393 25530 net.cpp:128] Top shape: 10 96 27 27 (699840)
I0610 10:12:49.300400 25530 net.cpp:154] norm1 needs backward computation.
I0610 10:12:49.300406 25530 net.cpp:77] Creating Layer conv2
I0610 10:12:49.300412 25530 net.cpp:87] conv2 <- norm1
I0610 10:12:49.300420 25530 net.cpp:113] conv2 -> conv2
I0610 10:12:49.300925 25530 net.cpp:128] Top shape: 10 256 27 27 (1866240)
I0610 10:12:49.300935 25530 net.cpp:154] conv2 needs backward computation.
I0610 10:12:49.300941 25530 net.cpp:77] Creating Layer relu2
I0610 10:12:49.300947 25530 net.cpp:87] relu2 <- conv2
I0610 10:12:49.300954 25530 net.cpp:101] relu2 -> conv2 (in-place)
I0610 10:12:49.300961 25530 net.cpp:128] Top shape: 10 256 27 27 (1866240)
I0610 10:12:49.300967 25530 net.cpp:154] relu2 needs backward computation.
I0610 10:12:49.300974 25530 net.cpp:77] Creating Layer pool2
I0610 10:12:49.300981 25530 net.cpp:87] pool2 <- conv2
I0610 10:12:49.300987 25530 net.cpp:113] pool2 -> pool2
I0610 10:12:49.300994 25530 net.cpp:128] Top shape: 10 256 13 13 (432640)
I0610 10:12:49.301000 25530 net.cpp:154] pool2 needs backward computation.
I0610 10:12:49.301007 25530 net.cpp:77] Creating Layer norm2
I0610 10:12:49.301013 25530 net.cpp:87] norm2 <- pool2
I0610 10:12:49.301019 25530 net.cpp:113] norm2 -> norm2
I0610 10:12:49.301026 25530 net.cpp:128] Top shape: 10 256 13 13 (432640)
I0610 10:12:49.301033 25530 net.cpp:154] norm2 needs backward computation.
I0610 10:12:49.301041 25530 net.cpp:77] Creating Layer conv3
I0610 10:12:49.301048 25530 net.cpp:87] conv3 <- norm2
I0610 10:12:49.301054 25530 net.cpp:113] conv3 -> conv3
I0610 10:12:49.302455 25530 net.cpp:128] Top shape: 10 384 13 13 (648960)
I0610 10:12:49.302467 25530 net.cpp:154] conv3 needs backward computation.
I0610 10:12:49.302477 25530 net.cpp:77] Creating Layer relu3
I0610 10:12:49.302484 25530 net.cpp:87] relu3 <- conv3
I0610 10:12:49.302490 25530 net.cpp:101] relu3 -> conv3 (in-place)
I0610 10:12:49.302496 25530 net.cpp:128] Top shape: 10 384 13 13 (648960)
I0610 10:12:49.302503 25530 net.cpp:154] relu3 needs backward computation.
I0610 10:12:49.302510 25530 net.cpp:77] Creating Layer conv4
I0610 10:12:49.302515 25530 net.cpp:87] conv4 <- conv3
I0610 10:12:49.302521 25530 net.cpp:113] conv4 -> conv4
I0610 10:12:49.303639 25530 net.cpp:128] Top shape: 10 384 13 13 (648960)
I0610 10:12:49.303650 25530 net.cpp:154] conv4 needs backward computation.
I0610 10:12:49.303658 25530 net.cpp:77] Creating Layer relu4
I0610 10:12:49.303663 25530 net.cpp:87] relu4 <- conv4
I0610 10:12:49.303670 25530 net.cpp:101] relu4 -> conv4 (in-place)
I0610 10:12:49.303676 25530 net.cpp:128] Top shape: 10 384 13 13 (648960)
I0610 10:12:49.303683 25530 net.cpp:154] relu4 needs backward computation.
I0610 10:12:49.303691 25530 net.cpp:77] Creating Layer conv5
I0610 10:12:49.303697 25530 net.cpp:87] conv5 <- conv4
I0610 10:12:49.303704 25530 net.cpp:113] conv5 -> conv5
I0610 10:12:49.304410 25530 net.cpp:128] Top shape: 10 256 13 13 (432640)
I0610 10:12:49.304420 25530 net.cpp:154] conv5 needs backward computation.
I0610 10:12:49.304427 25530 net.cpp:77] Creating Layer relu5
I0610 10:12:49.304433 25530 net.cpp:87] relu5 <- conv5
I0610 10:12:49.304440 25530 net.cpp:101] relu5 -> conv5 (in-place)
I0610 10:12:49.304446 25530 net.cpp:128] Top shape: 10 256 13 13 (432640)
I0610 10:12:49.304471 25530 net.cpp:154] relu5 needs backward computation.
I0610 10:12:49.304478 25530 net.cpp:77] Creating Layer pool5
I0610 10:12:49.304484 25530 net.cpp:87] pool5 <- conv5
I0610 10:12:49.304491 25530 net.cpp:113] pool5 -> pool5
I0610 10:12:49.304498 25530 net.cpp:128] Top shape: 10 256 6 6 (92160)
I0610 10:12:49.304504 25530 net.cpp:154] pool5 needs backward computation.
I0610 10:12:49.304512 25530 net.cpp:77] Creating Layer fc6
I0610 10:12:49.304517 25530 net.cpp:87] fc6 <- pool5
I0610 10:12:49.304523 25530 net.cpp:113] fc6 -> fc6
I0610 10:12:49.364333 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960)
I0610 10:12:49.364372 25530 net.cpp:154] fc6 needs backward computation.
I0610 10:12:49.364387 25530 net.cpp:77] Creating Layer relu6
I0610 10:12:49.364420 25530 net.cpp:87] relu6 <- fc6
I0610 10:12:49.364429 25530 net.cpp:101] relu6 -> fc6 (in-place)
I0610 10:12:49.364437 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960)
I0610 10:12:49.364444 25530 net.cpp:154] relu6 needs backward computation.
I0610 10:12:49.364455 25530 net.cpp:77] Creating Layer drop6
I0610 10:12:49.364461 25530 net.cpp:87] drop6 <- fc6
I0610 10:12:49.364467 25530 net.cpp:101] drop6 -> fc6 (in-place)
I0610 10:12:49.364480 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960)
I0610 10:12:49.364487 25530 net.cpp:154] drop6 needs backward computation.
I0610 10:12:49.364495 25530 net.cpp:77] Creating Layer fc7
I0610 10:12:49.364501 25530 net.cpp:87] fc7 <- fc6
I0610 10:12:49.364507 25530 net.cpp:113] fc7 -> fc7
I0610 10:12:49.391316 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960)
I0610 10:12:49.391350 25530 net.cpp:154] fc7 needs backward computation.
I0610 10:12:49.391361 25530 net.cpp:77] Creating Layer relu7
I0610 10:12:49.391369 25530 net.cpp:87] relu7 <- fc7
I0610 10:12:49.391377 25530 net.cpp:101] relu7 -> fc7 (in-place)
I0610 10:12:49.391384 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960)
I0610 10:12:49.391391 25530 net.cpp:154] relu7 needs backward computation.
I0610 10:12:49.391398 25530 net.cpp:77] Creating Layer drop7
I0610 10:12:49.391427 25530 net.cpp:87] drop7 <- fc7
I0610 10:12:49.391433 25530 net.cpp:101] drop7 -> fc7 (in-place)
I0610 10:12:49.391440 25530 net.cpp:128] Top shape: 10 4096 1 1 (40960)
I0610 10:12:49.391446 25530 net.cpp:154] drop7 needs backward computation.
I0610 10:12:49.391454 25530 net.cpp:77] Creating Layer fc-rcnn
I0610 10:12:49.391459 25530 net.cpp:87] fc-rcnn <- fc7
I0610 10:12:49.391466 25530 net.cpp:113] fc-rcnn -> fc-rcnn
I0610 10:12:49.392812 25530 net.cpp:128] Top shape: 10 200 1 1 (2000)
I0610 10:12:49.392823 25530 net.cpp:154] fc-rcnn needs backward computation.
I0610 10:12:49.392829 25530 net.cpp:165] This network produces output fc-rcnn
I0610 10:12:49.392850 25530 net.cpp:183] Collecting Learning Rate and Weight Decay.
I0610 10:12:49.392868 25530 net.cpp:176] Network initialization done.
I0610 10:12:49.392875 25530 net.cpp:177] Memory required for Data 41950840
GPU mode
Loading input...
selective_search_rcnn({'/home/shelhamer/caffe/examples/images/fish-bike.jpg'}, '/tmp/tmpo7yOum.mat')
Processed 1570 windows in 35.012 s.
/home/shelhamer/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['prediction']]

  warnings.warn(ws, PerformanceWarning)
Saved to _temp/det_output.h5 in 0.035 s.
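
The "Top shape" lines in the log can be checked by hand. As a sketch, the usual output-size formula floor((size + 2*pad - kernel) / stride) + 1 reproduces the spatial sizes reported above. The helper name out_size is hypothetical, and the sketch applies floor to pooling as well, which agrees with the log here because every pooling division happens to be exact.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


# (name, kernel_size, stride, pad) copied from the net definition in the log.
# The LRN, ReLU, and dropout layers keep the spatial size, so they are omitted.
layers = [
    ('conv1', 11, 4, 0), ('pool1', 3, 2, 0),
    ('conv2', 5, 1, 2), ('pool2', 3, 2, 0),
    ('conv3', 3, 1, 1), ('conv4', 3, 1, 1),
    ('conv5', 3, 1, 1), ('pool5', 3, 2, 0),
]

size = 227  # input_dim from the net definition
shapes = {}
for name, kernel, stride, pad in layers:
    size = out_size(size, kernel, stride, pad)
    shapes[name] = size

print(shapes)
```

For conv1 this gives (227 - 11) / 4 + 1 = 55, and 10 * 96 * 55 * 55 = 2904000 matches the element count printed next to its top shape.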

This run was in GPU mode. For CPU mode detection, call detect.py without the --gpu argument.

Running this writes a DataFrame with the filenames, selected windows, and their detection scores to an HDF5 file. (We only ran on one image, so the filenames will all be the same.)


In [2]:
import pandas as pd

df = pd.read_hdf('_temp/det_output.h5', 'df')
print(df.shape)
print(df.iloc[0])


(1570, 5)
prediction    [-2.64547, -2.88455, -2.85903, -3.17038, -1.92...
ymin                                                     79.846
xmin                                                       9.62
ymax                                                     246.31
xmax                                                    339.624
Name: /home/shelhamer/caffe/examples/images/fish-bike.jpg, dtype: object

1570 regions were proposed with the R-CNN configuration of selective search. The number of proposals varies from image to image with its contents and size, because selective search isn't scale invariant.

In general, detect.py is most efficient when running on a lot of images: it first extracts window proposals for all of them, batches the windows for efficient GPU processing, and then outputs the results. Simply list an image per line in the images_file, and it will process all of them.
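
As a sketch of preparing such an images_file, the snippet below writes one absolute image path per line, mirroring the echo command used earlier for the single image. The helper name write_images_file and the '*.jpg' pattern are hypothetical choices for illustration.

```python
import glob
import os


def write_images_file(image_dir, out_path, pattern='*.jpg'):
    """Write the absolute path of every matching image, one per line.

    Returns the list of paths that were written, in sorted order.
    """
    paths = sorted(glob.glob(os.path.join(os.path.abspath(image_dir), pattern)))
    with open(out_path, 'w') as f:
        for p in paths:
            f.write(p + '\n')
    return paths
```

For example, write_images_file('images', '_temp/det_input.txt') would list every JPEG under images/, and the resulting file can be passed to detect.py in place of the single-image list.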

Although this guide gives an example of R-CNN ImageNet detection, detect.py is clever enough to adapt to different Caffe models’ input dimensions, batch size, and output categories. You can switch the model definition and pretrained model as desired. Refer to python detect.py --help for the parameters to describe your data set. There's no need for hardcoding.

Anyway, let's now load the ILSVRC13 detection class names and make a DataFrame of the predictions. Note you'll need the auxiliary ilsvrc2012 data fetched by data/ilsvrc12/get_ilsvrc12_aux.sh.


In [3]:
import numpy as np

# Each line of det_synset_words.txt is "<synset_id> <name>[, <alias>, ...]".
with open('../data/ilsvrc12/det_synset_words.txt') as f:
    labels_df = pd.DataFrame([
        {
            'synset_id': l.strip().split(' ')[0],
            'name': ' '.join(l.strip().split(' ')[1:]).split(',')[0]
        }
        for l in f.readlines()
    ])
# Rows stay in file order, which is the order used to label the 200 score columns.
predictions_df = pd.DataFrame(np.vstack(df.prediction.values), columns=labels_df['name'])
print(predictions_df.iloc[0])


name
accordion      -2.645470
airplane       -2.884554
ant            -2.859026
antelope       -3.170383
apple          -1.924201
armadillo      -2.493925
artichoke      -2.235427
axe            -2.378177
baby bed       -2.757855
backpack       -2.160120
bagel          -2.715738
balance beam   -2.716172
banana         -2.418939
band aid       -1.604563
banjo          -2.329196
...
trombone        -2.531519
trumpet         -2.382109
turtle          -2.378510
tv or monitor   -2.777433
unicycle        -2.263807
vacuum          -1.894700
violin          -2.797967
volleyball      -2.807812
waffle iron     -2.418155
washer          -2.429423
water bottle    -2.163465
watercraft      -2.803971
whale           -3.094172
wine bottle     -2.830827
zebra           -2.791829
Name: 0, Length: 200, dtype: float32

Let's look at the activations.


In [4]:
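import matplotlib.pyplot as plt

# Rows are windows, columns are classes; brighter cells are higher scores.
plt.gray()
plt.matshow(predictions_df.values)
plt.xlabel('Classes')
plt.ylabel('Windows')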


Out[4]:
<matplotlib.text.Text at 0x4e2c090>
<matplotlib.figure.Figure at 0x4d008d0>

Now let's take max across all windows and plot the top classes.


In [5]:
# Best score per class over all windows, highest first.
max_s = predictions_df.max(0).sort_values(ascending=False)
print(max_s[:10])


name
person          1.883164
bicycle         0.936994
unicycle        0.016907
banjo           0.013019
motorcycle     -0.024704
electric fan   -0.193420
turtle         -0.243857
cart           -0.289637
lizard         -0.307945
baby bed       -0.582180
dtype: float32

The top detections are in fact a person and a bicycle. Picking good localizations is a work in progress; for now we simply pick the top-scoring person and bicycle detections.
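
Because these are raw one-vs-all SVM scores, one simple way to read them is by sign: a score above zero lies on the positive side of that class's decision boundary. The sketch below counts such windows per class. The helper name positive_counts and the toy scores are made up for illustration, and a zero threshold is only a rough reading, not a calibrated decision rule.

```python
import numpy as np


def positive_counts(scores, class_names):
    """Count, per class, the windows whose raw SVM score is above zero.

    scores is a (windows x classes) array; classes with no positive
    window are left out of the result.
    """
    counts = (np.asarray(scores) > 0).sum(axis=0)
    return {name: int(c) for name, c in zip(class_names, counts) if c > 0}


# Toy scores (3 windows x 3 classes), made up for illustration.
toy_scores = [[1.9, -1.1, -2.0],
              [-0.5, 0.9, -1.3],
              [0.2, -0.4, -0.6]]
print(positive_counts(toy_scores, ['person', 'bicycle', 'turtle']))
```

On the real data the same helper could be called as positive_counts(predictions_df.values, predictions_df.columns).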


In [6]:
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Find, print, and display the top detections: person and bicycle.
# .values.argmax() gives the row position, which is what .iloc expects.
i = predictions_df['person'].values.argmax()
j = predictions_df['bicycle'].values.argmax()

# Show the top predictions for the top detection.
f = pd.Series(df['prediction'].iloc[i], index=labels_df['name'])
print('Top detection:')
print(f.sort_values(ascending=False)[:5])
print('')

# Show the top predictions for the second-best detection.
f = pd.Series(df['prediction'].iloc[j], index=labels_df['name'])
print('Second-best detection:')
print(f.sort_values(ascending=False)[:5])

# Show the top detection in red and the second-best detection in blue.
im = plt.imread('images/fish-bike.jpg')
plt.imshow(im)
currentAxis = plt.gca()

det = df.iloc[i]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(Rectangle(*coords, fill=False, edgecolor='r', linewidth=5))

det = df.iloc[j]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(Rectangle(*coords, fill=False, edgecolor='b', linewidth=5))


Top detection:
name
person             1.883164
swimming trunks   -1.136701
rubber eraser     -1.251888
plastic bag       -1.286928
snowmobile        -1.304962
dtype: float32

Second-best detection:
name
bicycle     0.936994
unicycle   -0.372841
scorpion   -0.812350
lobster    -1.041506
lamp       -1.118889
dtype: float32
Out[6]:
<matplotlib.patches.Rectangle at 0x4f59f10>

That's cool. Let's take all 'bicycle' detections and NMS them to get rid of overlapping windows.


In [7]:
def nms_detections(dets, overlap=0.3):
    """
    Non-maximum suppression: Greedily select high-scoring detections and
    skip detections that are significantly covered by a previously
    selected detection.

    This version is translated from Matlab code by Tomasz Malisiewicz,
    who sped up Pedro Felzenszwalb's code.

    Parameters
    ----------
    dets: ndarray
        each row is ['xmin', 'ymin', 'xmax', 'ymax', 'score']
    overlap: float
        overlap ratio above which a lower-scoring detection is
        suppressed (0.3 default)

    Output
    ------
    dets: ndarray
        remaining after suppression.
    """
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    ind = np.argsort(dets[:, 4])

    w = x2 - x1
    h = y2 - y1
    area = (w * h).astype(float)

    pick = []
    while len(ind) > 0:
        i = ind[-1]
        pick.append(i)
        ind = ind[:-1]

        xx1 = np.maximum(x1[i], x1[ind])
        yy1 = np.maximum(y1[i], y1[ind])
        xx2 = np.minimum(x2[i], x2[ind])
        yy2 = np.minimum(y2[i], y2[ind])

        w = np.maximum(0., xx2 - xx1)
        h = np.maximum(0., yy2 - yy1)

        wh = w * h
        o = wh / (area[i] + area[ind] - wh)

        ind = ind[np.nonzero(o <= overlap)[0]]

    return dets[pick, :]
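
The ratio o computed inside the loop is the intersection over union of two boxes. A minimal standalone sketch for a single pair, using the same (xmin, ymin, xmax, ymax) layout, makes the arithmetic concrete. The function name iou and the example boxes are illustrative only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the intersection, clamped at zero for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# Two 10x10 boxes offset by 5 in x: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
# Disjoint boxes: no intersection.
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))
```

With the default overlap of 0.3, the first pair (ratio 1/3) would count as overlapping, so the lower-scoring box would be suppressed, while the second pair would both be kept.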

In [8]:
scores = predictions_df['bicycle']
windows = df[['xmin', 'ymin', 'xmax', 'ymax']].values
# Stack the window coordinates and scores into rows of [xmin, ymin, xmax, ymax, score].
dets = np.hstack((windows, scores.values[:, np.newaxis]))
nms_dets = nms_detections(dets)

Show top 3 NMS'd detections for 'bicycle' in the image and note the gap between the top scoring box (red) and the remaining boxes.


In [9]:
imshow(im)
currentAxis = plt.gca()
colors = ['r', 'b', 'y']
for c, det in zip(colors, nms_dets[:3]):
    currentAxis.add_patch(
        Rectangle((det[0], det[1]), det[2]-det[0], det[3]-det[1],
        fill=False, edgecolor=c, linewidth=5)
    )
print('scores: {}'.format(nms_dets[:3, 4]))


scores: [ 0.93699419 -0.65612102 -1.32907355]

This was an easy instance for bicycle, since the image was in that class's training set. The person result, however, is a true detection, since the image was not in the training set for that class.

You should try out detection on an image of your own next!

(Remove the temp directory to clean up, and we're done.)


In [10]:
!rm -rf _temp