This approach follows the ideas described in Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv, 2013.

First of all, we'll need a little Python script to run the Matlab Selective Search code.

Let's run detection on an image of a couple of cats frolicking (one of the ImageNet detection challenge pictures), which we will download from the web.

Before you get started with this notebook, make sure to follow the instructions for getting the pretrained ImageNet model.
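
At a high level, the pipeline is: propose candidate windows with selective search, crop each window out of the image, and score every crop with the CNN. Here is a minimal conceptual sketch of that flow, not the actual detector.py code (propose_windows and net.predict are hypothetical stand-ins; detector.py additionally warps crops to the network's input size and batches them):

import numpy as np

def detect(image, net, propose_windows):
    # Hypothetical sketch: score each proposed window with the network.
    windows = propose_windows(image)  # e.g. selective search (xmin, ymin, xmax, ymax) boxes
    scores = []
    for (xmin, ymin, xmax, ymax) in windows:
        crop = image[ymin:ymax, xmin:xmax]  # cut out the proposed region
        scores.append(net.predict(crop))    # per-class scores for this window
    return windows, np.vstack(scores)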


In [1]:
!mkdir _temp
!curl http://farm1.static.flickr.com/220/512450093_7717fb8ce8.jpg > _temp/cat.jpg
!echo `pwd`/_temp/cat.jpg > _temp/cat.txt
!python ../python/caffe/detection/detector.py --crop_mode=selective_search --pretrained_model=../examples/imagenet/caffe_reference_imagenet_model --model_def=../examples/imagenet/imagenet_deploy.prototxt _temp/cat.txt _temp/cat.h5


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  212k  100  212k    0     0   263k      0 --:--:-- --:--:-- --:--:--  519k
Loading Caffe model.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0318 11:15:21.671466 2104947072 net.cpp:74] Creating Layer conv1
I0318 11:15:21.671494 2104947072 net.cpp:84] conv1 <- data
I0318 11:15:21.671500 2104947072 net.cpp:110] conv1 -> conv1
I0318 11:15:21.993130 2104947072 net.cpp:125] Top shape: 10 96 55 55 (2904000)
I0318 11:15:21.993155 2104947072 net.cpp:151] conv1 needs backward computation.
I0318 11:15:21.993165 2104947072 net.cpp:74] Creating Layer relu1
I0318 11:15:21.993170 2104947072 net.cpp:84] relu1 <- conv1
I0318 11:15:21.993175 2104947072 net.cpp:98] relu1 -> conv1 (in-place)
I0318 11:15:21.993182 2104947072 net.cpp:125] Top shape: 10 96 55 55 (2904000)
I0318 11:15:21.993187 2104947072 net.cpp:151] relu1 needs backward computation.
I0318 11:15:21.993192 2104947072 net.cpp:74] Creating Layer pool1
I0318 11:15:21.993197 2104947072 net.cpp:84] pool1 <- conv1
I0318 11:15:21.993201 2104947072 net.cpp:110] pool1 -> pool1
I0318 11:15:21.993208 2104947072 net.cpp:125] Top shape: 10 96 27 27 (699840)
I0318 11:15:21.993212 2104947072 net.cpp:151] pool1 needs backward computation.
I0318 11:15:21.993217 2104947072 net.cpp:74] Creating Layer norm1
I0318 11:15:21.993221 2104947072 net.cpp:84] norm1 <- pool1
I0318 11:15:21.993227 2104947072 net.cpp:110] norm1 -> norm1
I0318 11:15:21.993233 2104947072 net.cpp:125] Top shape: 10 96 27 27 (699840)
I0318 11:15:21.993238 2104947072 net.cpp:151] norm1 needs backward computation.
I0318 11:15:21.993244 2104947072 net.cpp:74] Creating Layer conv2
I0318 11:15:21.993248 2104947072 net.cpp:84] conv2 <- norm1
I0318 11:15:21.993252 2104947072 net.cpp:110] conv2 -> conv2
I0318 11:15:21.995401 2104947072 net.cpp:125] Top shape: 10 256 27 27 (1866240)
I0318 11:15:21.995414 2104947072 net.cpp:151] conv2 needs backward computation.
I0318 11:15:21.995419 2104947072 net.cpp:74] Creating Layer relu2
I0318 11:15:21.995424 2104947072 net.cpp:84] relu2 <- conv2
I0318 11:15:21.995429 2104947072 net.cpp:98] relu2 -> conv2 (in-place)
I0318 11:15:21.995432 2104947072 net.cpp:125] Top shape: 10 256 27 27 (1866240)
I0318 11:15:21.995437 2104947072 net.cpp:151] relu2 needs backward computation.
I0318 11:15:21.995441 2104947072 net.cpp:74] Creating Layer pool2
I0318 11:15:21.995445 2104947072 net.cpp:84] pool2 <- conv2
I0318 11:15:21.995450 2104947072 net.cpp:110] pool2 -> pool2
I0318 11:15:21.995455 2104947072 net.cpp:125] Top shape: 10 256 13 13 (432640)
I0318 11:15:21.995460 2104947072 net.cpp:151] pool2 needs backward computation.
I0318 11:15:21.995463 2104947072 net.cpp:74] Creating Layer norm2
I0318 11:15:21.995467 2104947072 net.cpp:84] norm2 <- pool2
I0318 11:15:21.995471 2104947072 net.cpp:110] norm2 -> norm2
I0318 11:15:21.995477 2104947072 net.cpp:125] Top shape: 10 256 13 13 (432640)
I0318 11:15:21.995481 2104947072 net.cpp:151] norm2 needs backward computation.
I0318 11:15:21.995487 2104947072 net.cpp:74] Creating Layer conv3
I0318 11:15:21.995491 2104947072 net.cpp:84] conv3 <- norm2
I0318 11:15:21.995496 2104947072 net.cpp:110] conv3 -> conv3
I0318 11:15:22.001526 2104947072 net.cpp:125] Top shape: 10 384 13 13 (648960)
I0318 11:15:22.001549 2104947072 net.cpp:151] conv3 needs backward computation.
I0318 11:15:22.001555 2104947072 net.cpp:74] Creating Layer relu3
I0318 11:15:22.001560 2104947072 net.cpp:84] relu3 <- conv3
I0318 11:15:22.001565 2104947072 net.cpp:98] relu3 -> conv3 (in-place)
I0318 11:15:22.001570 2104947072 net.cpp:125] Top shape: 10 384 13 13 (648960)
I0318 11:15:22.001574 2104947072 net.cpp:151] relu3 needs backward computation.
I0318 11:15:22.001580 2104947072 net.cpp:74] Creating Layer conv4
I0318 11:15:22.001585 2104947072 net.cpp:84] conv4 <- conv3
I0318 11:15:22.001588 2104947072 net.cpp:110] conv4 -> conv4
I0318 11:15:22.005995 2104947072 net.cpp:125] Top shape: 10 384 13 13 (648960)
I0318 11:15:22.006008 2104947072 net.cpp:151] conv4 needs backward computation.
I0318 11:15:22.006014 2104947072 net.cpp:74] Creating Layer relu4
I0318 11:15:22.006018 2104947072 net.cpp:84] relu4 <- conv4
I0318 11:15:22.006022 2104947072 net.cpp:98] relu4 -> conv4 (in-place)
I0318 11:15:22.006027 2104947072 net.cpp:125] Top shape: 10 384 13 13 (648960)
I0318 11:15:22.006031 2104947072 net.cpp:151] relu4 needs backward computation.
I0318 11:15:22.006037 2104947072 net.cpp:74] Creating Layer conv5
I0318 11:15:22.006042 2104947072 net.cpp:84] conv5 <- conv4
I0318 11:15:22.006045 2104947072 net.cpp:110] conv5 -> conv5
I0318 11:15:22.009027 2104947072 net.cpp:125] Top shape: 10 256 13 13 (432640)
I0318 11:15:22.009048 2104947072 net.cpp:151] conv5 needs backward computation.
I0318 11:15:22.009057 2104947072 net.cpp:74] Creating Layer relu5
I0318 11:15:22.009062 2104947072 net.cpp:84] relu5 <- conv5
I0318 11:15:22.009065 2104947072 net.cpp:98] relu5 -> conv5 (in-place)
I0318 11:15:22.009071 2104947072 net.cpp:125] Top shape: 10 256 13 13 (432640)
I0318 11:15:22.009075 2104947072 net.cpp:151] relu5 needs backward computation.
I0318 11:15:22.009080 2104947072 net.cpp:74] Creating Layer pool5
I0318 11:15:22.009084 2104947072 net.cpp:84] pool5 <- conv5
I0318 11:15:22.009088 2104947072 net.cpp:110] pool5 -> pool5
I0318 11:15:22.009093 2104947072 net.cpp:125] Top shape: 10 256 6 6 (92160)
I0318 11:15:22.009099 2104947072 net.cpp:151] pool5 needs backward computation.
I0318 11:15:22.009104 2104947072 net.cpp:74] Creating Layer fc6
I0318 11:15:22.009107 2104947072 net.cpp:84] fc6 <- pool5
I0318 11:15:22.009111 2104947072 net.cpp:110] fc6 -> fc6
I0318 11:15:22.271282 2104947072 net.cpp:125] Top shape: 10 4096 1 1 (40960)
I0318 11:15:22.271308 2104947072 net.cpp:151] fc6 needs backward computation.
I0318 11:15:22.271320 2104947072 net.cpp:74] Creating Layer relu6
I0318 11:15:22.271327 2104947072 net.cpp:84] relu6 <- fc6
I0318 11:15:22.271332 2104947072 net.cpp:98] relu6 -> fc6 (in-place)
I0318 11:15:22.271337 2104947072 net.cpp:125] Top shape: 10 4096 1 1 (40960)
I0318 11:15:22.271340 2104947072 net.cpp:151] relu6 needs backward computation.
I0318 11:15:22.271345 2104947072 net.cpp:74] Creating Layer drop6
I0318 11:15:22.271349 2104947072 net.cpp:84] drop6 <- fc6
I0318 11:15:22.271353 2104947072 net.cpp:98] drop6 -> fc6 (in-place)
I0318 11:15:22.271369 2104947072 net.cpp:125] Top shape: 10 4096 1 1 (40960)
I0318 11:15:22.271374 2104947072 net.cpp:151] drop6 needs backward computation.
I0318 11:15:22.271380 2104947072 net.cpp:74] Creating Layer fc7
I0318 11:15:22.271384 2104947072 net.cpp:84] fc7 <- fc6
I0318 11:15:22.271389 2104947072 net.cpp:110] fc7 -> fc7
I0318 11:15:22.389216 2104947072 net.cpp:125] Top shape: 10 4096 1 1 (40960)
I0318 11:15:22.389250 2104947072 net.cpp:151] fc7 needs backward computation.
I0318 11:15:22.389258 2104947072 net.cpp:74] Creating Layer relu7
I0318 11:15:22.389264 2104947072 net.cpp:84] relu7 <- fc7
I0318 11:15:22.389271 2104947072 net.cpp:98] relu7 -> fc7 (in-place)
I0318 11:15:22.389276 2104947072 net.cpp:125] Top shape: 10 4096 1 1 (40960)
I0318 11:15:22.389279 2104947072 net.cpp:151] relu7 needs backward computation.
I0318 11:15:22.389284 2104947072 net.cpp:74] Creating Layer drop7
I0318 11:15:22.389289 2104947072 net.cpp:84] drop7 <- fc7
I0318 11:15:22.389293 2104947072 net.cpp:98] drop7 -> fc7 (in-place)
I0318 11:15:22.389298 2104947072 net.cpp:125] Top shape: 10 4096 1 1 (40960)
I0318 11:15:22.389302 2104947072 net.cpp:151] drop7 needs backward computation.
I0318 11:15:22.389308 2104947072 net.cpp:74] Creating Layer fc8
I0318 11:15:22.389312 2104947072 net.cpp:84] fc8 <- fc7
I0318 11:15:22.389317 2104947072 net.cpp:110] fc8 -> fc8
I0318 11:15:22.417853 2104947072 net.cpp:125] Top shape: 10 1000 1 1 (10000)
I0318 11:15:22.417879 2104947072 net.cpp:151] fc8 needs backward computation.
I0318 11:15:22.417887 2104947072 net.cpp:74] Creating Layer prob
I0318 11:15:22.417892 2104947072 net.cpp:84] prob <- fc8
I0318 11:15:22.417898 2104947072 net.cpp:110] prob -> prob
I0318 11:15:22.417917 2104947072 net.cpp:125] Top shape: 10 1000 1 1 (10000)
I0318 11:15:22.417920 2104947072 net.cpp:151] prob needs backward computation.
I0318 11:15:22.417924 2104947072 net.cpp:162] This network produces output prob
I0318 11:15:22.417928 2104947072 net.cpp:173] Collecting Learning Rate and Weight Decay.
I0318 11:15:22.417944 2104947072 net.cpp:166] Network initialization done.
I0318 11:15:22.417948 2104947072 net.cpp:167] Memory required for Data 42022840
Caffe model loaded in 1.621 s
Loading input and assembling batches...
selective_search({'/Users/karayev/work/caffe-bvlc/examples/_temp/cat.jpg'}, '/var/folders/4q/vm1lt3t91p9gl06nz6s1dzzw0000gn/T/tmpOcszAc.mat')
23 batches assembled in 5.225 s
Processing 1 files in 23 batches
...on batch 0/23, elapsed time: 0.000 s
...on batch 10/23, elapsed time: 3.819 s
...on batch 20/23, elapsed time: 7.571 s
Processing complete after 8.818 s.
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['feat']]

  warnings.warn(ws, PerformanceWarning)
Done. Saving to _temp/cat.h5 took 0.160 s.

Running this writes a DataFrame of the filenames, selected windows, and their ImageNet scores to an HDF5 file. (We ran on only one image, so all the filenames will be the same.)


In [2]:
import pandas as pd

df = pd.read_hdf('_temp/cat.h5', 'df')
print(df.shape)
print(df.iloc[0])


(223, 5)
feat    [6.90396e-06, 1.27811e-06, 1.82159e-06, 1.1020...
ymin                                                    0
xmin                                                    0
ymax                                                  500
xmax                                                  496
Name: /Users/karayev/work/caffe-bvlc/examples/_temp/cat.jpg, dtype: object

In general, detector.py is most efficient when run on many images at once: it first extracts window proposals for all of them, batches the windows for efficient GPU processing, and then outputs the results. Simply list one image per line in the images_file, and it will process all of them.

Although this guide gives an example of ImageNet detection, detector.py adapts to different Caffe models' input dimensions, batch size, and output categories. Run python detector.py --help and use the images_dim and images_mean_file parameters to describe your dataset; there is no need for hardcoding.
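
For example, a run over many images might look like the following sketch (the list file, output name, and flag values marked ... are illustrative placeholders; check python detector.py --help for the exact flag syntax):

!python ../python/caffe/detection/detector.py \
    --crop_mode=selective_search \
    --pretrained_model=../examples/imagenet/caffe_reference_imagenet_model \
    --model_def=../examples/imagenet/imagenet_deploy.prototxt \
    --images_dim=... \
    --images_mean_file=... \
    my_images.txt my_results.h5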

Now let's load the ImageNet class names and make a DataFrame of the features. Note that you'll need the auxiliary ilsvrc2012 data fetched by data/ilsvrc12/get_ilsvrc12_aux.sh.


In [3]:
import numpy as np

with open('../data/ilsvrc12/synset_words.txt') as f:
    labels_df = pd.DataFrame([
        {
            'synset_id': l.strip().split(' ')[0],
            'name': ' '.join(l.strip().split(' ')[1:]).split(',')[0]
        }
        for l in f.readlines()
    ])
labels_df = labels_df.sort('synset_id')  # sort() returns a copy, so assign it
feats_df = pd.DataFrame(np.vstack(df.feat.values), columns=labels_df['name'])
print(feats_df.iloc[0])


name
tench                0.000007
goldfish             0.000001
great white shark    0.000002
tiger shark          0.000001
hammerhead           0.000007
electric ray         0.000004
stingray             0.000007
cock                 0.000060
hen                  0.003055
ostrich              0.000010
brambling            0.000004
goldfinch            0.000001
house finch          0.000004
junco                0.000002
indigo bunting       0.000001
...
daisy                    0.000002
yellow lady's slipper    0.000002
corn                     0.000020
acorn                    0.000011
hip                      0.000003
buckeye                  0.000010
coral fungus             0.000005
agaric                   0.000019
gyromitra                0.000039
stinkhorn                0.000002
earthstar                0.000025
hen-of-the-woods         0.000035
bolete                   0.000037
ear                      0.000008
toilet tissue            0.000019
Name: 0, Length: 1000, dtype: float32

Let's look at the activations.


In [4]:
# gray(), matshow(), etc. are pylab functions (this notebook assumes pylab inline mode).
gray()
matshow(feats_df.values)
xlabel('Classes')
ylabel('Windows')


Out[4]:
<matplotlib.text.Text at 0x107290150>
<matplotlib.figure.Figure at 0x106877510>

Now let's take the max across all windows and print the top-scoring classes.


In [5]:
max_s = feats_df.max(0)
max_s.sort(ascending=False)
print(max_s[:10])


name
proboscis monkey       0.923392
tiger cat              0.918685
milk can               0.783663
American black bear    0.637560
broccoli               0.612832
tiger                  0.515798
platypus               0.514660
dhole                  0.509583
lion                   0.496187
dingo                  0.482885
dtype: float32

Okay, there are indeed cats in there (and some nonsense). Picking good localizations is a work in progress; inspecting manually, we see that the windows ranked 3 and 13 by score (zero-indexed) correspond to the two cats.


In [6]:
# Rank windows by their max class score; pick the two cat detections.
window_order = pd.Series(feats_df.values.max(1)).order(ascending=False)

i = window_order.index[3]
j = window_order.index[13]

# Show top predictions for the detection ranked 3.
f = pd.Series(df['feat'].iloc[i], index=labels_df['name'])
print('Detection 3:')
print(f.order(ascending=False)[:5])
print('')

# Show top predictions for the detection ranked 13.
f = pd.Series(df['feat'].iloc[j], index=labels_df['name'])
print('Detection 13:')
print(f.order(ascending=False)[:5])

# Show detection 3 in red and detection 13 in blue.
im = imread('_temp/cat.jpg')
imshow(im)
currentAxis = plt.gca()

det = df.iloc[i]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(Rectangle(*coords, fill=False, edgecolor='r', linewidth=5))

det = df.iloc[j]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(Rectangle(*coords, fill=False, edgecolor='b', linewidth=5))


Detection 3:
name
tiger cat       0.882021
tiger           0.075015
tabby           0.024404
lynx            0.012947
Egyptian cat    0.004409
dtype: float32

Detection 13:
name
tiger cat           0.681169
Pembroke            0.063924
dingo               0.050501
golden retriever    0.027614
tabby               0.021413
dtype: float32
Out[6]:
<matplotlib.patches.Rectangle at 0x108516c90>

That's cool. Both of these detections are tiger cats. Let's take all the 'tiger cat' detections and apply non-maximum suppression (NMS) to get rid of overlapping windows.


In [7]:
def nms_detections(dets, overlap=0.5):
    """
    Non-maximum suppression: Greedily select high-scoring detections and
    skip detections that are significantly covered by a previously
    selected detection.

    This version is translated from Matlab code by Tomasz Malisiewicz,
    who sped up Pedro Felzenszwalb's code.

    Parameters
    ----------
    dets: ndarray
        each row is ['xmin', 'ymin', 'xmax', 'ymax', 'score']
    overlap: float
        suppression threshold: a window is discarded if its overlap with a
        higher-scoring selected window exceeds this fraction (0.5 default)

    Output
    ------
    dets: ndarray
        remaining after suppression.
    """
    if np.shape(dets)[0] < 1:
        return dets

    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]

    # Box sizes use the same inclusive-pixel convention (+ 1) as the
    # intersection computed below.
    w = x2 - x1 + 1
    h = y2 - y1 + 1
    area = w * h

    s = dets[:, 4]
    ind = np.argsort(s)

    pick = []
    while len(ind) > 0:
        # Select the highest-scoring remaining detection.
        last = len(ind) - 1
        i = ind[last]
        pick.append(i)

        xx1 = np.maximum(x1[i], x1[ind[:last]])
        yy1 = np.maximum(y1[i], y1[ind[:last]])
        xx2 = np.minimum(x2[i], x2[ind[:last]])
        yy2 = np.minimum(y2[i], y2[ind[:last]])

        w = np.maximum(0., xx2 - xx1 + 1)
        h = np.maximum(0., yy2 - yy1 + 1)

        # Overlap is intersection area over the lower-scoring box's area
        # (following the original Matlab code), not intersection-over-union.
        o = w * h / area[ind[:last]]

        to_delete = np.concatenate(
            (np.nonzero(o > overlap)[0], np.array([last])))
        ind = np.delete(ind, to_delete)

    return dets[pick, :]
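
A quick sanity check on synthetic boxes (the coordinates and scores below are made up for illustration): the two heavily overlapping windows collapse to the higher-scoring one, while the distant window survives.

toy = np.array([
    [0,   0,   100, 100, 0.9],   # high-scoring box: kept
    [5,   5,   105, 105, 0.8],   # overlaps the first heavily: suppressed
    [200, 200, 300, 300, 0.7],   # far away: kept
])
print(nms_detections(toy))       # expect the first and last rows to remain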

In [8]:
scores = feats_df['tiger cat']
windows = df[['xmin', 'ymin', 'xmax', 'ymax']].values
dets = np.hstack((windows, scores.values[:, np.newaxis]))
nms_dets = nms_detections(dets)

Show top 3 NMS'd detections for 'tiger cat' in the image.


In [9]:
imshow(im)
currentAxis = plt.gca()
colors = ['r', 'b', 'y']
for c, det in zip(colors, nms_dets[:3]):
    # Rectangle takes (xy, width, height), so convert from corner coordinates.
    currentAxis.add_patch(
        Rectangle((det[0], det[1]), det[2] - det[0], det[3] - det[1],
                  fill=False, edgecolor=c, linewidth=5)
    )


Remove the temp directory to clean up.


In [10]:
import shutil
shutil.rmtree('_temp')