Seminar 6: Hacking R-CNN

WARNING: This seminar was not thoroughly tested, so expect weird bugs. Please report any that you find!

In this assignment, you will hack the existing code for object detection ( in order to make it usable with Theano-based models.

Ross Girshick's original code uses Caffe as its deep learning backend. This is inconvenient for us because Caffe is substantially different from Theano, and one cannot simply swap one framework for the other. Luckily, the training and testing procedures are built on top of the Python interface (instead of Caffe's native CLI+protobufs combo), which makes it possible to locate and exterminate the Caffe-contaminated parts and fill the gaps with appropriate wrappers around Theano machinery.

While it sounds like an easy task, this surgery still requires good familiarity with R-CNN's internals, so you will have to go through Girshick's slides and the code and make sure that you understand what is happening in each of the modules.

In [ ]:
# Useful imports that we are going to be needing throughout the seminar.
import os

R-CNN introduction

In a nutshell, R-CNN is just a common image recognition neural network applied to patches that are obtained by rescaling rectangular regions of a bigger scene. Those regions are usually computed externally using, for example, Selective Search. The overall pipeline is best described by the following slide from Girshick's awesome presentation:

You are encouraged to read the relevant paper to get deeper understanding of how the system works.
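The crop-and-rescale step described above can be sketched in plain NumPy. This is only an illustration and not the code used in the repository: the function name `warp_region`, the nearest-neighbour resampling, the inclusive (left, top, right, bottom) box convention and the 227x227 default output size are all assumptions made for the example.

```python
import numpy as np

def warp_region(image, box, out_size=(227, 227)):
    """Crop `box` from an (H, W, 3) image and rescale the crop to `out_size`.

    `box` is (left, top, right, bottom) in inclusive pixel coordinates
    (an assumption of this sketch). Resampling is nearest-neighbour.
    """
    left, top, right, bottom = [int(round(v)) for v in box]
    height, width = image.shape[:2]
    # Clip the box to the image so that the crop is never empty.
    left, right = max(left, 0), min(right, width - 1)
    top, bottom = max(top, 0), min(bottom, height - 1)
    crop = image[top:bottom + 1, left:right + 1]
    out_h, out_w = out_size
    # For every output pixel pick the source row/column it maps to.
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows[:, None], cols[None, :]]
```

Every region proposal of a scene would be passed through such a function, and the resulting stack of equally sized patches is what the recognition network sees.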

Set up the code

Cython modules

First, we need to build Cython modules that come with the original Faster R-CNN package.

In [ ]:
!cd ./lib; make


In our experiments, we are going to use PASCAL VOC 2007 so we download training, validation and test data as well as VOCdevkit (this may take a while!). We are assuming you have the wget utility installed on your computer.

In [ ]:
# If you already have the dataset on your machine just put the absolute path (root path - w/o VOCdevkit) 
# into the following variable:

if not PASCALVOC2007_PATH:
    # You can change the download target path.
    !mkdir -p {PASCALVOC2007_PATH}
    !wget {PASCALVOC2007_PATH}
    !wget {PASCALVOC2007_PATH}
    !wget {PASCALVOC2007_PATH}
    !cd {PASCALVOC2007_PATH}; tar xvf VOCtrainval_06-Nov-2007.tar
    !cd {PASCALVOC2007_PATH}; tar xvf VOCtest_06-Nov-2007.tar
    !cd {PASCALVOC2007_PATH}; tar xvf VOCdevkit_08-Jun-2007.tar
# We symlink the dataset path to the folder that is expected by the R-CNN code.
VOCDEVKIT_PATH = os.path.join(PASCALVOC2007_PATH, 'VOCdevkit')
!ln -f -s {VOCDEVKIT_PATH} ./data/VOCdevkit2007

Region proposals

Next, you need to fetch some external region proposals. The original code uses Selective Search by default. You can download and set up this data by running the cell below. It is very time-consuming, so go have some coffee.

In [ ]:
# Alternatively, you can ask people who have already downloaded the data to share it. 
# Then you can just put it in the ./data folder and skip this step.

Another option is to use precomputed RPN proposals. Fetch them from here and unpack them to the ./data folder. The advantage of using them is that you only need 300 proposals per image (instead of 2000) to get good performance.

Pretrained base models

Finally, you need to download one of the pretrained NNs that we will use as a starting point for our object detection pipeline. As we are using Lasagne, it's a good idea to check out Lasagne's ModelZoo. We suggest not going crazy: use a moderately sized model like caffe_reference (or even a shallower one). We all want to set the new state of the art for object detection, but let's start small.

Write your modules

The most interesting part of the assignment is, of course, the custom model injection. As we already mentioned, the original code uses Caffe. We (the TAs) did our best to get rid of this nasty dependency. Now, spend some time comparing the code provided by us and the Faster R-CNN repo. The code makes heavy use of the configuration files available by calling from fast_rcnn.config import cfg. The cfg contains various parameters that govern the behaviour of the whole system. The standard configuration is defined in ./lib/fast_rcnn/ In order to override it use ./experiments/cfgs/rcnn.yml.
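To get a feeling for how an override file interacts with the standard configuration, here is a minimal sketch of a recursive merge over nested dictionaries. It is not the repository's own merging code, which may behave differently; the helper `merge_config` is hypothetical and the numeric values are placeholders. Only the key names `TRAIN`, `IMS_PER_BATCH` and `BATCH_SIZE` come from the `cfg` object used later in this seminar.

```python
import copy

def merge_config(override, base):
    """Return a copy of `base` with the values from `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if key not in merged:
            # Unknown keys are most likely typos in the override file.
            raise KeyError('{} is not a valid config key'.format(key))
        if isinstance(value, dict) and isinstance(merged[key], dict):
            merged[key] = merge_config(value, merged[key])
        else:
            merged[key] = value
    return merged

# Placeholder values, for illustration only.
defaults = {'TRAIN': {'IMS_PER_BATCH': 2, 'BATCH_SIZE': 128}}
overrides = {'TRAIN': {'BATCH_SIZE': 64}}
cfg_sketch = merge_config(overrides, defaults)
```

In this sketch only the keys listed in the override change; everything else keeps its standard value.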

You'll have to

  1. Add missing pieces to ./lib/fast_rcnn/ (this is where the main training loop is).
  2. Implement the network (./custom/ and the solver (./custom/ class respecting the interface described in the source files. You are NOT required to code up the bounding-box regression.
  3. Repeat the steps above for two kinds of training regimes:
    • Fine-tuning of the top layer.
    • Fine-tuning of the whole network.

Custom classes

The main purpose of the Net class is to hold a symbolic representation of your neural network. This is where you load a pretrained NN optionally modifying its architecture or weights.

The Solver is used to define the actual training procedure:

  1. Create an instance of Net
  2. Define a loss function
  3. Define an optimization algorithm and its parameters
  4. Gather all of the above into the training step function
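The four steps above can be illustrated with a framework-free toy solver. This is a sketch under strong assumptions and not the interface you are asked to implement: the class name `TopLayerSolver` is hypothetical, the "net" is a single softmax layer on top of fixed features (roughly the "fine-tuning of the top layer" regime), and plain NumPy with hand-written gradients stands in for Theano.

```python
import numpy as np

class TopLayerSolver(object):
    """Toy solver: trains only a softmax layer on top of fixed features."""

    def __init__(self, num_features, num_classes, learning_rate=0.1, seed=0):
        rng = np.random.RandomState(seed)
        # Step 1: the "net" is a single linear layer in this sketch.
        self.W = 0.01 * rng.randn(num_features, num_classes)
        self.b = np.zeros(num_classes)
        # Step 3: plain SGD with a fixed learning rate.
        self.learning_rate = learning_rate

    def _probs(self, features):
        scores = features.dot(self.W) + self.b
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(scores)
        return exp / exp.sum(axis=1, keepdims=True)

    def train_step(self, features, labels):
        """Step 4: one SGD update; returns the loss computed before the update."""
        labels = labels.astype(np.int64)  # labels may arrive as float32
        n = features.shape[0]
        probs = self._probs(features)
        # Step 2: mean softmax cross-entropy loss.
        loss = -np.log(probs[np.arange(n), labels]).mean()
        # Gradient of the loss with respect to the scores.
        grad_scores = probs.copy()
        grad_scores[np.arange(n), labels] -= 1.0
        grad_scores /= n
        self.W -= self.learning_rate * features.T.dot(grad_scores)
        self.b -= self.learning_rate * grad_scores.sum(axis=0)
        return loss
```

With Theano you would build the same loss symbolically and let the framework derive the updates; the structure of the training step stays the same.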


Your solver is likely to require some kind of training data source. In Girshick's original code this data is obtained via RoIDataLayer. The RoIDataLayer takes care of the dataset management and produces tensors that are ready for processing by the neural net. We are providing you with an adapted version of this layer suitable for incorporating into your custom modules. The typical usage is outlined below:

In [ ]:
# Do NOT run this cell. This serves just as an illustration.

from roi_data_layer.layer import RoIDataLayer

# Create an instance and run initial setup.
roi_data_layer = RoIDataLayer()

# This is important! You need to supply the roidb. This one is available in ./lib/fast_rcnn/ via SolverWrapper (look at the __init__).
# By the way, SolverWrapper.__init__ is the place where you should instantiate your solver.

# This is what you'd call at every training iteration.
# This populates a list with the tensors that you might need for conducting a step of your solver.
# The first three tensors are:
#   1. data (ndarray): (cfg.TRAIN.IMS_PER_BATCH, 3, image_height, image_width) tensor containing the whole scenes
#   2. rois (ndarray): (cfg.TRAIN.BATCH_SIZE, 5) tensor containing ROIs; rois[:, 0] are indices of scenes in data, the rest
#                      are (left, top, right, bottom) coordinates
#   3. labels (ndarray): (cfg.TRAIN.BATCH_SIZE,) tensor containing the correct labels (those are float32, convert if needed)
data, rois, labels =[: 3]
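To make the tensor layout described in the cell above concrete, the following sketch cuts every ROI out of its scene and converts the labels to integers. The helper `crop_rois` is hypothetical, and the sketch assumes that the ROI coordinates are inclusive pixel indices; check the data layer source before relying on that.

```python
import numpy as np

def crop_rois(data, rois, labels):
    """Cut every ROI out of its scene.

    data:   (num_images, 3, height, width) array with whole scenes
    rois:   (num_rois, 5) array, rows are (image_index, left, top, right, bottom)
    labels: (num_rois,) float32 array
    Returns a list of (3, h, w) crops and the labels converted to int32.
    """
    crops = []
    for roi in rois:
        index = int(roi[0])  # which scene in `data` this ROI belongs to
        left, top, right, bottom = [int(round(v)) for v in roi[1:]]
        crops.append(data[index, :, top:bottom + 1, left:right + 1])
    return crops, labels.astype(np.int32)
```

The crops still have different sizes; for the plain R-CNN pipeline each of them has to be rescaled to the fixed input size of your network before being stacked into a batch.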

Some tips

  1. Please be careful with the input format. Make sure that your network receives what it expects, i.e. check the order of color channels (RGB/BGR), range of values ($ [0, 1] $/ $ [0, 255] $), etc.
  2. Try Dropout to reduce overfitting.
  3. Try different optimization algorithms. Adam seems to require less tuning but may give slightly worse results than SGD.
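As an illustration of the first tip, here is a small sketch of an input conversion helper. The function `convert_image` and its parameters are hypothetical; which channel order, value range and mean your pretrained model expects depends on the model, so check its documentation rather than trusting the defaults used here.

```python
import numpy as np

def convert_image(image, swap_channels=False, scale=1.0, mean=None):
    """Convert an (H, W, 3) image into a float32 (3, H, W) network input.

    swap_channels: reverse the channel order (RGB <-> BGR)
    scale:         multiplier for the values, e.g. 1 / 255. maps [0, 255] to [0, 1]
    mean:          optional per-channel mean, subtracted after swapping and scaling
    """
    out = image.astype(np.float32)
    if swap_channels:
        out = out[:, :, ::-1]
    out = out * scale
    if mean is not None:
        out = out - np.asarray(mean, dtype=np.float32)
    # Move the channel axis to the front: (H, W, 3) -> (3, H, W).
    return np.ascontiguousarray(out.transpose(2, 0, 1))
```

A quick sanity check is to feed a pure-red image through the helper and verify that the non-zero values end up in the channel your network treats as red.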

Train model

Now it's time to train your freshly written model. First, edit ./experiments/cfgs/rcnn.yml to override the standard settings.

The training procedure is launched by invoking the following shell-command:

In [ ]:
# Change this to adjust the number of training iterations.

!./experiments/scripts/ {NUM_ITERS}

Test model

Let's assess the quality of the trained model on the test set of PASCAL VOC 2007. In order to do this, you have to implement the Tester encapsulating a trained network. The Tester's sole purpose is to produce confidence scores for bounding boxes at, well, test time.

After the coding is done, the evaluation can be run by invoking:

In [ ]:
# Put the path to the network snapshot here:

!./experiments/scripts/ {SNAPSHOT}

The test set is rather large (4952 images) so it may take some time for the evaluation to finish. Just like in any serious ML research, it's a good idea to leave this thing running overnight.

Write a short report

Describe the model that you use for this assignment. How do you train it? What worked and what didn't? Most importantly, does fine-tuning of the whole network work any better than just adapting the last layer?

If things are slow for you (advanced stuff)

Then you are welcome to replace the older R-CNN pipeline with the newer Fast R-CNN approach (see this paper and, again, this presentation). The only missing piece is the so-called ROI Pooling. Fortunately, you can find a Theano implementation here (it's still somewhat untested). Roughly speaking, you need to do the following:

  1. Wrap the Theano Op into a custom Lasagne layer (see
  2. Change your network to receive two inputs (the whole image(s) and a set of ROIs, which are available through RoIDataLayer) instead of a stack of fixed-size patches.
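To see what ROI Pooling computes, here is a slow NumPy reference sketch for a single ROI. It is not the Theano implementation mentioned above and is only meant for building intuition. The function name `roi_max_pool` is hypothetical, and the sketch assumes that the ROI is already given in feature-map coordinates with inclusive bounds (a real layer also rescales image coordinates to the feature-map resolution).

```python
import numpy as np

def roi_max_pool(feature_map, roi, pooled_size=(2, 2)):
    """Max-pool one ROI of a (C, H, W) feature map into a fixed-size grid.

    roi: (left, top, right, bottom), inclusive, in feature-map coordinates.
    Returns an array of shape (C, pooled_h, pooled_w).
    """
    left, top, right, bottom = [int(v) for v in roi]
    region = feature_map[:, top:bottom + 1, left:right + 1]
    channels, height, width = region.shape
    pooled_h, pooled_w = pooled_size
    out = np.empty((channels, pooled_h, pooled_w), dtype=feature_map.dtype)
    for i in range(pooled_h):
        # Bin i covers rows [floor(i * h / ph), ceil((i + 1) * h / ph)).
        r0 = (i * height) // pooled_h
        r1 = max(-((-(i + 1) * height) // pooled_h), r0 + 1)
        for j in range(pooled_w):
            c0 = (j * width) // pooled_w
            c1 = max(-((-(j + 1) * width) // pooled_w), c0 + 1)
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Whatever the size of the ROI, the output has the same shape, which is what allows a single forward pass over the whole image to feed the fully connected layers for every region.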