This approach follows the ideas described in Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," arXiv, 2013.

First of all, we'll need a little Python script to run the Matlab Selective Search code.

Let's run detection on an image of a couple of cats frolicking (one of the ImageNet detection challenge pictures), which we will download from the web.

Before you get started with this notebook, make sure to follow instructions for getting the pretrained ImageNet model.



In [1]:

    
!mkdir _temp
!curl http://farm1.static.flickr.com/220/512450093_7717fb8ce8.jpg > _temp/cat.jpg
!echo `pwd`/_temp/cat.jpg > _temp/cat.txt
!../python/detect.py --crop_mode=selective_search --pretrained_model=imagenet/caffe_reference_imagenet_model --model_def=imagenet/imagenet_deploy.prototxt _temp/cat.txt _temp/cat.h5









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  212k  100  212k    0     0   852k      0 --:--:-- --:--:-- --:--:--  858k
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0520 12:14:46.505362  2522 net.cpp:75] Creating Layer conv1
I0520 12:14:46.505406  2522 net.cpp:85] conv1 <- data
I0520 12:14:46.505462  2522 net.cpp:111] conv1 -> conv1
I0520 12:14:46.505530  2522 net.cpp:126] Top shape: 10 96 55 55 (2904000)
I0520 12:14:46.505542  2522 net.cpp:152] conv1 needs backward computation.
I0520 12:14:46.505550  2522 net.cpp:75] Creating Layer relu1
I0520 12:14:46.505556  2522 net.cpp:85] relu1 <- conv1
I0520 12:14:46.505563  2522 net.cpp:99] relu1 -> conv1 (in-place)
I0520 12:14:46.505570  2522 net.cpp:126] Top shape: 10 96 55 55 (2904000)
I0520 12:14:46.505578  2522 net.cpp:152] relu1 needs backward computation.
I0520 12:14:46.505584  2522 net.cpp:75] Creating Layer pool1
I0520 12:14:46.505590  2522 net.cpp:85] pool1 <- conv1
I0520 12:14:46.505596  2522 net.cpp:111] pool1 -> pool1
I0520 12:14:46.505606  2522 net.cpp:126] Top shape: 10 96 27 27 (699840)
I0520 12:14:46.505612  2522 net.cpp:152] pool1 needs backward computation.
I0520 12:14:46.505620  2522 net.cpp:75] Creating Layer norm1
I0520 12:14:46.505626  2522 net.cpp:85] norm1 <- pool1
I0520 12:14:46.505632  2522 net.cpp:111] norm1 -> norm1
I0520 12:14:46.505640  2522 net.cpp:126] Top shape: 10 96 27 27 (699840)
I0520 12:14:46.505646  2522 net.cpp:152] norm1 needs backward computation.
I0520 12:14:46.505656  2522 net.cpp:75] Creating Layer conv2
I0520 12:14:46.505661  2522 net.cpp:85] conv2 <- norm1
I0520 12:14:46.505668  2522 net.cpp:111] conv2 -> conv2
I0520 12:14:46.506363  2522 net.cpp:126] Top shape: 10 256 27 27 (1866240)
I0520 12:14:46.506383  2522 net.cpp:152] conv2 needs backward computation.
I0520 12:14:46.506392  2522 net.cpp:75] Creating Layer relu2
I0520 12:14:46.506398  2522 net.cpp:85] relu2 <- conv2
I0520 12:14:46.506409  2522 net.cpp:99] relu2 -> conv2 (in-place)
I0520 12:14:46.506417  2522 net.cpp:126] Top shape: 10 256 27 27 (1866240)
I0520 12:14:46.506422  2522 net.cpp:152] relu2 needs backward computation.
I0520 12:14:46.506429  2522 net.cpp:75] Creating Layer pool2
I0520 12:14:46.506435  2522 net.cpp:85] pool2 <- conv2
I0520 12:14:46.506441  2522 net.cpp:111] pool2 -> pool2
I0520 12:14:46.506448  2522 net.cpp:126] Top shape: 10 256 13 13 (432640)
I0520 12:14:46.506454  2522 net.cpp:152] pool2 needs backward computation.
I0520 12:14:46.506463  2522 net.cpp:75] Creating Layer norm2
I0520 12:14:46.506469  2522 net.cpp:85] norm2 <- pool2
I0520 12:14:46.506475  2522 net.cpp:111] norm2 -> norm2
I0520 12:14:46.506482  2522 net.cpp:126] Top shape: 10 256 13 13 (432640)
I0520 12:14:46.506489  2522 net.cpp:152] norm2 needs backward computation.
I0520 12:14:46.506496  2522 net.cpp:75] Creating Layer conv3
I0520 12:14:46.506502  2522 net.cpp:85] conv3 <- norm2
I0520 12:14:46.506508  2522 net.cpp:111] conv3 -> conv3
I0520 12:14:46.508342  2522 net.cpp:126] Top shape: 10 384 13 13 (648960)
I0520 12:14:46.508359  2522 net.cpp:152] conv3 needs backward computation.
I0520 12:14:46.508369  2522 net.cpp:75] Creating Layer relu3
I0520 12:14:46.508375  2522 net.cpp:85] relu3 <- conv3
I0520 12:14:46.508383  2522 net.cpp:99] relu3 -> conv3 (in-place)
I0520 12:14:46.508389  2522 net.cpp:126] Top shape: 10 384 13 13 (648960)
I0520 12:14:46.508395  2522 net.cpp:152] relu3 needs backward computation.
I0520 12:14:46.508402  2522 net.cpp:75] Creating Layer conv4
I0520 12:14:46.508409  2522 net.cpp:85] conv4 <- conv3
I0520 12:14:46.508415  2522 net.cpp:111] conv4 -> conv4
I0520 12:14:46.509848  2522 net.cpp:126] Top shape: 10 384 13 13 (648960)
I0520 12:14:46.509870  2522 net.cpp:152] conv4 needs backward computation.
I0520 12:14:46.509877  2522 net.cpp:75] Creating Layer relu4
I0520 12:14:46.509884  2522 net.cpp:85] relu4 <- conv4
I0520 12:14:46.509891  2522 net.cpp:99] relu4 -> conv4 (in-place)
I0520 12:14:46.509897  2522 net.cpp:126] Top shape: 10 384 13 13 (648960)
I0520 12:14:46.509903  2522 net.cpp:152] relu4 needs backward computation.
I0520 12:14:46.509912  2522 net.cpp:75] Creating Layer conv5
I0520 12:14:46.509917  2522 net.cpp:85] conv5 <- conv4
I0520 12:14:46.509923  2522 net.cpp:111] conv5 -> conv5
I0520 12:14:46.510815  2522 net.cpp:126] Top shape: 10 256 13 13 (432640)
I0520 12:14:46.510850  2522 net.cpp:152] conv5 needs backward computation.
I0520 12:14:46.510860  2522 net.cpp:75] Creating Layer relu5
I0520 12:14:46.510867  2522 net.cpp:85] relu5 <- conv5
I0520 12:14:46.510875  2522 net.cpp:99] relu5 -> conv5 (in-place)
I0520 12:14:46.510884  2522 net.cpp:126] Top shape: 10 256 13 13 (432640)
I0520 12:14:46.510890  2522 net.cpp:152] relu5 needs backward computation.
I0520 12:14:46.510897  2522 net.cpp:75] Creating Layer pool5
I0520 12:14:46.510903  2522 net.cpp:85] pool5 <- conv5
I0520 12:14:46.510910  2522 net.cpp:111] pool5 -> pool5
I0520 12:14:46.510920  2522 net.cpp:126] Top shape: 10 256 6 6 (92160)
I0520 12:14:46.510926  2522 net.cpp:152] pool5 needs backward computation.
I0520 12:14:46.510936  2522 net.cpp:75] Creating Layer fc6
I0520 12:14:46.510942  2522 net.cpp:85] fc6 <- pool5
I0520 12:14:46.510949  2522 net.cpp:111] fc6 -> fc6
I0520 12:14:46.566017  2522 net.cpp:126] Top shape: 10 4096 1 1 (40960)
I0520 12:14:46.566061  2522 net.cpp:152] fc6 needs backward computation.
I0520 12:14:46.566076  2522 net.cpp:75] Creating Layer relu6
I0520 12:14:46.566084  2522 net.cpp:85] relu6 <- fc6
I0520 12:14:46.566092  2522 net.cpp:99] relu6 -> fc6 (in-place)
I0520 12:14:46.566100  2522 net.cpp:126] Top shape: 10 4096 1 1 (40960)
I0520 12:14:46.566140  2522 net.cpp:152] relu6 needs backward computation.
I0520 12:14:46.566149  2522 net.cpp:75] Creating Layer drop6
I0520 12:14:46.566155  2522 net.cpp:85] drop6 <- fc6
I0520 12:14:46.566161  2522 net.cpp:99] drop6 -> fc6 (in-place)
I0520 12:14:46.566174  2522 net.cpp:126] Top shape: 10 4096 1 1 (40960)
I0520 12:14:46.566179  2522 net.cpp:152] drop6 needs backward computation.
I0520 12:14:46.566187  2522 net.cpp:75] Creating Layer fc7
I0520 12:14:46.566193  2522 net.cpp:85] fc7 <- fc6
I0520 12:14:46.566200  2522 net.cpp:111] fc7 -> fc7
I0520 12:14:46.600733  2522 net.cpp:126] Top shape: 10 4096 1 1 (40960)
I0520 12:14:46.600765  2522 net.cpp:152] fc7 needs backward computation.
I0520 12:14:46.600777  2522 net.cpp:75] Creating Layer relu7
I0520 12:14:46.600785  2522 net.cpp:85] relu7 <- fc7
I0520 12:14:46.600793  2522 net.cpp:99] relu7 -> fc7 (in-place)
I0520 12:14:46.600802  2522 net.cpp:126] Top shape: 10 4096 1 1 (40960)
I0520 12:14:46.600808  2522 net.cpp:152] relu7 needs backward computation.
I0520 12:14:46.600816  2522 net.cpp:75] Creating Layer drop7
I0520 12:14:46.600823  2522 net.cpp:85] drop7 <- fc7
I0520 12:14:46.600829  2522 net.cpp:99] drop7 -> fc7 (in-place)
I0520 12:14:46.600836  2522 net.cpp:126] Top shape: 10 4096 1 1 (40960)
I0520 12:14:46.600843  2522 net.cpp:152] drop7 needs backward computation.
I0520 12:14:46.600850  2522 net.cpp:75] Creating Layer fc8
I0520 12:14:46.600857  2522 net.cpp:85] fc8 <- fc7
I0520 12:14:46.600864  2522 net.cpp:111] fc8 -> fc8
I0520 12:14:46.615557  2522 net.cpp:126] Top shape: 10 1000 1 1 (10000)
I0520 12:14:46.615602  2522 net.cpp:152] fc8 needs backward computation.
I0520 12:14:46.615614  2522 net.cpp:75] Creating Layer prob
I0520 12:14:46.615623  2522 net.cpp:85] prob <- fc8
I0520 12:14:46.615631  2522 net.cpp:111] prob -> prob
I0520 12:14:46.615649  2522 net.cpp:126] Top shape: 10 1000 1 1 (10000)
I0520 12:14:46.615656  2522 net.cpp:152] prob needs backward computation.
I0520 12:14:46.615664  2522 net.cpp:163] This network produces output prob
I0520 12:14:46.615682  2522 net.cpp:181] Collecting Learning Rate and Weight Decay.
I0520 12:14:46.615696  2522 net.cpp:174] Network initialization done.
I0520 12:14:46.615702  2522 net.cpp:175] Memory required for Data 42022840
Loading input...
selective_search({'/home/shelhamer/caffe/examples/_temp/cat.jpg'}, '/tmp/tmplkH92s.mat')
Processed 223 windows in 16.525 s.
/home/shelhamer/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:2446: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['prediction']]

  warnings.warn(ws, PerformanceWarning)
Saved to _temp/cat.h5 in 0.353 s.

Running this outputs a DataFrame with the filenames, selected windows, and their ImageNet scores to an HDF5 file. (We only ran on one image, so the filenames will all be the same.)



In [2]:

    
import pandas as pd

df = pd.read_hdf('_temp/cat.h5', 'df')
print(df.shape)
print(df.iloc[0])









    



(223, 5)
prediction    [6.67012e-06, 1.26349e-06, 1.86075e-06, 1.0960...
ymin                                                          0
xmin                                                          0
ymax                                                        500
xmax                                                        496
Name: /home/shelhamer/caffe/examples/_temp/cat.jpg, dtype: object

In general, detect.py is most efficient when running on a lot of images: it first extracts window proposals for all of them, batches the windows for efficient GPU processing, and then outputs the results. Simply list an image per line in the images_file, and it will process all of them.
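
As a rough sketch of preparing such a list for many images, the snippet below writes one absolute path per line. The helper name `write_images_file`, the extension filter, and the demo file names are assumptions made for illustration; they are not part of detect.py.

```python
import os
import tempfile


def write_images_file(image_dir, list_path, extensions=('.jpg', '.jpeg', '.png')):
    """Write one absolute image path per line and return the paths written."""
    paths = sorted(
        os.path.abspath(os.path.join(image_dir, name))
        for name in os.listdir(image_dir)
        if name.lower().endswith(extensions)
    )
    with open(list_path, 'w') as f:
        for path in paths:
            f.write(path + '\n')
    return paths


# Demonstrate on a throwaway directory holding empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ('cat.jpg', 'dog.jpg', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()

listed = write_images_file(demo_dir, os.path.join(demo_dir, 'images.txt'))
print(len(listed))  # only the two .jpg files are listed
```

The resulting text file plays the same role as _temp/cat.txt in the first cell, just with more than one line.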

Although this guide gives an example of ImageNet detection, detect.py is clever enough to adapt to different Caffe models’ input dimensions, batch size, and output categories. Refer to python detect.py --help for the parameters to describe your data set. No need for hardcoding.

Anyway, let's now load ImageNet class names and make a DataFrame of the predictions. Note you'll need the auxiliary ilsvrc2012 data fetched by data/ilsvrc12/get_ilsvrc12_aux.sh.
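
The label file is assumed to hold one class per line: a synset ID, a space, then one or more comma-separated names. A minimal sketch of parsing a single line in that format follows; the helper name `parse_synset_line` is hypothetical and the sample line only illustrates the assumed layout.

```python
def parse_synset_line(line):
    """Split '<synset_id> <name>[, <synonym>...]' into (synset_id, first name)."""
    synset_id, _, names = line.strip().partition(' ')
    return synset_id, names.split(',')[0]


# Illustrative line in the assumed format.
sample = 'n01440764 tench, Tinca tinca\n'
print(parse_synset_line(sample))
```

Only the first name is kept, so each class gets a single short column label.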



In [3]:

    
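import numpy as np

# Each line of synset_words.txt is '<synset_id> <name>[, <synonym>, ...]'.
with open('../data/ilsvrc12/synset_words.txt') as f:
    labels_df = pd.DataFrame([
        {
            'synset_id': l.strip().split(' ')[0],
            'name': ' '.join(l.strip().split(' ')[1:]).split(',')[0]
        }
        for l in f.readlines()
    ])
# Keep the labels ordered by synset ID (the file is expected to be in this order already).
labels_df = labels_df.sort_values('synset_id')
# Stack the per-window score vectors into a (windows x classes) matrix.
predictions_df = pd.DataFrame(np.vstack(df.prediction.values), columns=labels_df['name'])
print(predictions_df.iloc[0])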









    



name
tench                0.000007
goldfish             0.000001
great white shark    0.000002
tiger shark          0.000001
hammerhead           0.000007
electric ray         0.000004
stingray             0.000007
cock                 0.000057
hen                  0.002985
ostrich              0.000010
brambling            0.000004
goldfinch            0.000001
house finch          0.000004
junco                0.000002
indigo bunting       0.000001
...
daisy                    0.000002
yellow lady's slipper    0.000002
corn                     0.000019
acorn                    0.000011
hip                      0.000003
buckeye                  0.000010
coral fungus             0.000005
agaric                   0.000019
gyromitra                0.000039
stinkhorn                0.000002
earthstar                0.000025
hen-of-the-woods         0.000035
bolete                   0.000036
ear                      0.000008
toilet tissue            0.000019
Name: 0, Length: 1000, dtype: float32
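
The stacking step can be illustrated on toy data. This sketch assumes three hypothetical windows and four hypothetical classes; a plain list of arrays stands in for `df.prediction.values`.

```python
import numpy as np

# Three hypothetical windows, each with scores for four hypothetical classes.
per_window_scores = [
    np.array([0.1, 0.7, 0.1, 0.1], dtype=np.float32),
    np.array([0.6, 0.2, 0.1, 0.1], dtype=np.float32),
    np.array([0.2, 0.2, 0.5, 0.1], dtype=np.float32),
]

# Stack the 1-D score vectors row by row: one row per window, one column per class.
score_matrix = np.vstack(per_window_scores)
print(score_matrix.shape)

# For each class, the row index of the window that scores it highest.
best_window_per_class = score_matrix.argmax(axis=0)
print(best_window_per_class)
```

Reducing along axis 0 answers "which window is best for this class", while reducing along axis 1 answers "which class is best for this window".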

Let's look at the activations.



In [4]:

    
import matplotlib.pyplot as plt

plt.gray()
plt.matshow(predictions_df.values)
plt.xlabel('Classes')
plt.ylabel('Windows')









    Out[4]:





<matplotlib.text.Text at 0x4798650>






    





<matplotlib.figure.Figure at 0x4668990>

Now let's take max across all windows and plot the top classes.



In [5]:

    
# Highest score each class reaches over all windows, best classes first.
max_s = predictions_df.max(0).sort_values(ascending=False)
print(max_s[:10])









    



name
proboscis monkey       0.920136
tiger cat              0.916973
milk can               0.791307
American black bear    0.625850
broccoli               0.609467
dhole                  0.513998
platypus               0.507829
tiger                  0.497029
lion                   0.481180
dingo                  0.474689
dtype: float32

Okay, there are indeed cats in there (and some nonsense). Picking good localizations is a work in progress; inspecting the results manually, we see that the windows at positions 3 and 13 of the score-ranked list (counting from zero) correspond to the two cats.



In [6]:

    
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Rank windows by their best class score, highest first.
window_order = pd.Series(predictions_df.values.max(1)).sort_values(ascending=False)

# Pick the two windows found by manual inspection (zero-based ranks 3 and 13).
i = window_order.index[3]
j = window_order.index[13]

# Show the top five class predictions for the first selected window.
f = pd.Series(df['prediction'].iloc[i], index=labels_df['name'])
print('Top detection:')
print(f.sort_values(ascending=False)[:5])
print('')

# Show the top five class predictions for the second selected window.
f = pd.Series(df['prediction'].iloc[j], index=labels_df['name'])
print('10th detection:')
print(f.sort_values(ascending=False)[:5])

# Draw the first selected window in red and the second in blue.
im = plt.imread('_temp/cat.jpg')
plt.imshow(im)
currentAxis = plt.gca()

# Rectangle takes the (xmin, ymin) corner followed by width and height.
det = df.iloc[i]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(Rectangle(*coords, fill=False, edgecolor='r', linewidth=5))

det = df.iloc[j]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
currentAxis.add_patch(Rectangle(*coords, fill=False, edgecolor='b', linewidth=5))









    



Top detection:
name
tiger cat       0.882972
tiger           0.073158
tabby           0.025290
lynx            0.012881
Egyptian cat    0.004481
dtype: float32

10th detection:
name
tiger cat           0.677493
Pembroke            0.064214
dingo               0.050635
golden retriever    0.028331
tabby               0.021945
dtype: float32






    Out[6]:





<matplotlib.patches.Rectangle at 0x4ab3510>

That's cool. Both of these detections are tiger cats. Let's take all 'tiger cat' detections and apply non-maximum suppression (NMS) to get rid of overlapping windows.



In [7]:

    
def nms_detections(dets, overlap=0.5):
    """
    Non-maximum suppression: Greedily select high-scoring detections and
    skip detections that are significantly covered by a previously
    selected detection.

    This version is translated from Matlab code by Tomasz Malisiewicz,
    who sped up Pedro Felzenszwalb's code.

    Parameters
    ----------
    dets: ndarray
        each row is ['xmin', 'ymin', 'xmax', 'ymax', 'score']
    overlap: float
        overlap ratio above which a lower-scoring detection is
        suppressed (0.5 default)

    Returns
    -------
    dets: ndarray
        remaining after suppression.
    """
    if np.shape(dets)[0] < 1:
        return dets

    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]

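    # Box areas use the same inclusive-pixel (+1) convention as the
    # intersection computed in the loop below.
    w = x2 - x1 + 1
    h = y2 - y1 + 1
    area = w * h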

    s = dets[:, 4]
    ind = np.argsort(s)

    pick = []
    while len(ind) > 0:
        # The highest-scoring remaining detection is last in sorted order.
        last = len(ind) - 1
        i = ind[last]
        pick.append(i)

        xx1 = np.maximum(x1[i], x1[ind[:last]])
        yy1 = np.maximum(y1[i], y1[ind[:last]])
        xx2 = np.minimum(x2[i], x2[ind[:last]])
        yy2 = np.minimum(y2[i], y2[ind[:last]])

        w = np.maximum(0., xx2 - xx1 + 1)
        h = np.maximum(0., yy2 - yy1 + 1)

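        # Fraction of each remaining box's area covered by the picked box.
        o = w * h / area[ind[:last]]

        # Drop the picked box and every box it covers by more than `overlap`.
        to_delete = np.concatenate(
            (np.nonzero(o > overlap)[0], np.array([last])))
        ind = np.delete(ind, to_delete)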

    return dets[pick, :]
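
The overlap test at the heart of the loop can be checked by hand on toy boxes. This is a minimal sketch, assuming inclusive pixel coordinates (the +1 convention) for both the intersection and the box area; the helper name `overlap_ratio` and the example boxes are hypothetical. Note that this ratio is the intersection divided by the candidate box's own area, not intersection over union.

```python
def overlap_ratio(kept, candidate):
    """Intersection area divided by the candidate box's own area.

    Boxes are (xmin, ymin, xmax, ymax) with inclusive pixel coordinates,
    so a box spanning x = 0..9 is 10 pixels wide.
    """
    ix1 = max(kept[0], candidate[0])
    iy1 = max(kept[1], candidate[1])
    ix2 = min(kept[2], candidate[2])
    iy2 = min(kept[3], candidate[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter_w = max(0.0, ix2 - ix1 + 1)
    inter_h = max(0.0, iy2 - iy1 + 1)
    cand_area = (candidate[2] - candidate[0] + 1) * (candidate[3] - candidate[1] + 1)
    return inter_w * inter_h / cand_area


kept = (0, 0, 9, 9)        # already-selected, higher-scoring box
shifted = (5, 5, 14, 14)   # shares a 5x5 corner with `kept`
nested = (1, 1, 8, 8)      # lies entirely inside `kept`

print(overlap_ratio(kept, shifted))  # 0.25, below 0.5: would survive
print(overlap_ratio(kept, nested))   # 1.0, above 0.5: would be suppressed
```

Because the denominator is the candidate's area, a small box nested inside a selected box is always suppressed, however large the selected box is.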



In [8]:

    
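# Use plain NumPy arrays: one score per window for the 'tiger cat' class.
scores = predictions_df['tiger cat'].values
windows = df[['xmin', 'ymin', 'xmax', 'ymax']].values
# Append the score as a fifth column: [xmin, ymin, xmax, ymax, score].
dets = np.hstack((windows, scores[:, np.newaxis]))
nms_dets = nms_detections(dets)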

Show top 3 NMS'd detections for 'tiger cat' in the image.



In [9]:

    
plt.imshow(im)
currentAxis = plt.gca()
colors = ['r', 'b', 'y']
for c, det in zip(colors, nms_dets[:3]):
    # det is [xmin, ymin, xmax, ymax, score]; Rectangle takes width and height.
    currentAxis.add_patch(
        Rectangle((det[0], det[1]), det[2] - det[0], det[3] - det[1],
                  fill=False, edgecolor=c, linewidth=5)
    )
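
matplotlib's Rectangle is defined by an anchor corner plus a width and a height, whereas the detection windows are stored as two corners. A small sketch of the conversion, with a hypothetical helper name and made-up coordinates:

```python
def window_to_rect_args(xmin, ymin, xmax, ymax):
    """Return ((x, y), width, height) for a window given by its corners."""
    return (xmin, ymin), xmax - xmin, ymax - ymin


# A made-up window with corners (40, 60) and (240, 360).
print(window_to_rect_args(40, 60, 240, 360))
```

The returned tuple can be unpacked directly into the patch constructor, for example `Rectangle(*window_to_rect_args(*det[:4]), fill=False)`.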

Remove the temp directory to clean up.



In [10]:

    
import shutil
shutil.rmtree('_temp')