WARNING: This seminar was not thoroughly tested, so expect weird bugs. Please report if you find one!
In this assignment, you will hack the existing code for object detection (https://github.com/rbgirshick/py-faster-rcnn) in order to make it usable with Theano-based models.
Originally, Ross Girshick uses Caffe as the deep learning backend. This is not very good for us: Caffe is substantially different from Theano, and one cannot simply swap one framework for the other. Luckily, the training and testing procedures are built on top of the Python interface (instead of Caffe's native CLI+protobufs combo), which makes it possible to locate and exterminate the Caffe-contaminated parts and fill the gaps with appropriate wrappers around the Theano machinery.
While it sounds like an easy task, this surgery still requires good familiarity with the R-CNN's internals, so you will have to go through Girshick's slides and the code and make sure that you understand what is happening in each of the modules.
In [ ]:
# Useful imports that we are going to be needing throughout the seminar.
import os
In a nutshell, R-CNN is just a common image recognition neural network applied to patches that are obtained by rescaling rectangular regions of a bigger scene. Those regions are usually computed externally using, for example, Selective Search. The overall pipeline is best described by the following slide from Girshick's awesome presentation:
You are encouraged to read the relevant paper to get deeper understanding of how the system works.
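The paragraph above can be condensed into a few lines of code. This is a minimal sketch (not the actual repo code): `warp` and `classify` are hypothetical stand-ins for the fixed-size rescaling step and the trained network.

```python
import numpy as np

def rcnn_scores(image, regions, warp, classify):
    """R-CNN in a nutshell: every proposed region is warped to the CNN's
    fixed input size and classified independently; detection then reduces
    to ranking the regions by their class scores."""
    return np.array([classify(warp(image, region)) for region in regions])
```

Note that nothing here is specific to images: the pipeline is just "crop, normalize, classify" repeated over a few thousand proposals per scene, which is exactly why R-CNN is slow at test time.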
In [ ]:
!cd ./lib; make
In our experiments, we are going to use PASCAL VOC 2007, so we download the training, validation and test data as well as the VOCdevkit (this may take a while!). We assume you have the wget utility installed on your computer.
In [ ]:
# If you already have the dataset on your machine just put the absolute path (root path - w/o VOCdevkit)
# into the following variable:
PASCALVOC2007_PATH=''
if not PASCALVOC2007_PATH:
    # You can change the download target path.
    PASCALVOC2007_PATH = './downloads'
    !mkdir -p {PASCALVOC2007_PATH}
    # -P tells wget which directory to save the files into.
    !wget -P {PASCALVOC2007_PATH} http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
    !wget -P {PASCALVOC2007_PATH} http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
    !wget -P {PASCALVOC2007_PATH} http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar
    !cd {PASCALVOC2007_PATH}; tar xvf VOCtrainval_06-Nov-2007.tar
    !cd {PASCALVOC2007_PATH}; tar xvf VOCtest_06-Nov-2007.tar
    !cd {PASCALVOC2007_PATH}; tar xvf VOCdevkit_08-Jun-2007.tar
# We symlink the dataset path to the folder that is expected by the R-CNN code.
VOCDEVKIT_PATH = os.path.join(PASCALVOC2007_PATH, 'VOCdevkit')
!ln -f -s {VOCDEVKIT_PATH} ./data/VOCdevkit2007
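A quick sanity check can save debugging time later. This helper (not part of the repo) assumes the standard VOCdevkit directory layout and reports which expected subdirectories are missing:

```python
import os

def check_voc2007(devkit_path):
    """Return the expected VOCdevkit subdirectories that are missing.

    An empty result means the layout looks sane; anything else means the
    download or the symlink above probably went wrong.
    """
    expected = ['VOC2007',
                os.path.join('VOC2007', 'JPEGImages'),
                os.path.join('VOC2007', 'Annotations'),
                os.path.join('VOC2007', 'ImageSets')]
    return [d for d in expected
            if not os.path.isdir(os.path.join(devkit_path, d))]
```

For example, `check_voc2007('./data/VOCdevkit2007')` should return an empty list once the cell above has finished.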
In [ ]:
# Alternatively, you can ask people who have already downloaded the data to share it.
# Then you can just put it in the ./data folder and skip this step.
!data/scripts/fetch_selective_search_data.sh
Another option is to use precomputed RPN proposals. Fetch them from here and unpack them into the ./data
folder. The advantage of using them is that you only need 300 proposals per image (instead of 2000) to get good performance.
Finally, you need to download one of the pretrained NNs that we will use as a starting point for our object detection pipeline. As we are using Lasagne, it's a good idea to check out Lasagne's ModelZoo. We suggest not going crazy: use a moderately-sized model like caffe_reference
(or even shallower). We all want to set the new state-of-the-art for Object Detection, but let's start small.
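ModelZoo snapshots are typically pickled dicts. The sketch below assumes the 'param values' key used by several Lasagne Recipes models; check the one you actually downloaded, as the key names vary between snapshots.

```python
import pickle

def load_model_zoo_pickle(path):
    """Load a ModelZoo-style snapshot: a dict whose 'param values' entry
    (an assumption -- inspect your pickle's keys) holds a list of weight
    arrays in layer order."""
    with open(path, 'rb') as f:
        # encoding='latin1' lets Python 3 read pickles written by Python 2.
        model = pickle.load(f, encoding='latin1')
    return model['param values']
```

Once you have rebuilt the matching architecture in Lasagne, the returned list can be installed with lasagne.layers.set_all_param_values.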
The most interesting part of the assignment is, of course, the custom model injection. As we already mentioned, the original code uses Caffe. We (the TAs) did our best to get rid of this nasty dependency. Now, spend some time comparing the code provided by us with the Faster R-CNN repo. The code makes heavy use of the configuration available via from fast_rcnn.config import cfg. The cfg object contains various parameters that govern the behaviour of the whole system. The standard configuration is defined in ./lib/fast_rcnn/config.py. To override it, use ./experiments/cfgs/rcnn.yml.
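For reference, an override file might look like this (these keys exist in the stock py-faster-rcnn configuration; the values here are purely illustrative):

```yaml
EXP_DIR: rcnn
TRAIN:
  IMS_PER_BATCH: 2
  BATCH_SIZE: 128
```

Only the keys you specify are overridden; everything else keeps the defaults from ./lib/fast_rcnn/config.py.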
You'll have to:
- study ./lib/fast_rcnn/train.py (this is where the main training loop is);
- implement the network (./custom/net.py) and the solver (./custom/solver.py) classes, respecting the interface described in the source files.

You are NOT required to code up the bounding-box regression. The main purpose of the Net class is to hold a symbolic representation of your neural network. This is where you load a pretrained NN, optionally modifying its architecture or weights.
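The contract can be sketched like this. The method names below are hypothetical (the authoritative interface is the one declared in ./custom/net.py), and a real implementation would build a Theano/Lasagne graph instead of holding plain arrays:

```python
class Net(object):
    """Sketch of the Net interface: owns the model and its parameters."""

    def __init__(self, pretrained_params, num_classes=21):
        # A real implementation would build a Lasagne/Theano graph here,
        # load the pretrained weights, and replace the final layer with a
        # num_classes-way classifier (20 PASCAL classes + background).
        self.params = list(pretrained_params)
        self.num_classes = num_classes

    def get_params(self):
        """Return the current parameter values (for the solver to update)."""
        return self.params

    def set_params(self, values):
        """Install new parameter values produced by the solver."""
        self.params = list(values)
```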
The Solver is used to define the actual training procedure for the Net.
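A minimal sketch of what a solver does, again with hypothetical method names (respect the interface in ./custom/solver.py instead). It drives plain SGD over a list of parameter arrays; a Theano-based version would compile the update step with theano.function rather than looping in numpy:

```python
import numpy as np

class Solver(object):
    """Sketch of the Solver interface: applies updates to the parameters."""

    def __init__(self, params, learning_rate=1e-3):
        self.params = [np.asarray(p, dtype=np.float64) for p in params]
        self.lr = learning_rate

    def step(self, gradients):
        # One vanilla SGD update: p <- p - lr * g for every parameter.
        for p, g in zip(self.params, gradients):
            p -= self.lr * g
```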
Your solver is likely to require some kind of training data source. In the original Girshick's code this data is obtained via the ROIDataLayer. The ROIDataLayer takes care of the dataset management and produces tensors that are ready for neural net processing. We are providing you with an adapted version of this layer suitable for incorporating into your custom modules. The typical usage is outlined below:
In [ ]:
# Do NOT run this cell. This serves just as an illustration.
from roi_data_layer.layer import RoIDataLayer
# Create an instance and run initial setup.
roi_data_layer = RoIDataLayer()
roi_data_layer.setup()
# This is important! You need to supply the roidb. It is available in ./lib/fast_rcnn/train.py via SolverWrapper (look at the __init__).
# By the way, SolverWrapper.__init__ is the place where you should instantiate your solver.
roi_data_layer.set_roidb(roidb)
# This is what you'd call at every training iteration.
# It populates the roi_data_layer.top list with tensors that you might need for conducting a step of your solver.
roi_data_layer.forward()
# The first three tensors are:
# 1. data (ndarray): (cfg.TRAIN.IMS_PER_BATCH, 3, image_height, image_width) tensor containing the whole scenes
# 2. rois (ndarray): (cfg.TRAIN.BATCH_SIZE, 5) tensor containing ROIs; rois[:, 0] are indices of scenes in data,
#    the rest are (left, top, right, bottom) coordinates
# 3. labels (ndarray): (cfg.TRAIN.BATCH_SIZE,) tensor containing the correct labels (those are float32, convert if needed)
data, rois, labels = roi_data_layer.top[:3]
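For the plain R-CNN setup you still need to turn those (batch_index, left, top, right, bottom) rows into fixed-size patches for the classifier. A dependency-free sketch (not repo code; nearest-neighbour rescaling stands in for proper warping):

```python
import numpy as np

def rois_to_patches(data, rois, out_size=(227, 227)):
    """Crop each ROI out of its scene and rescale it to out_size.

    data: (n_images, channels, height, width) scenes from the layer.
    rois: (n_rois, 5) rows of (scene_index, left, top, right, bottom).
    """
    patches = np.empty((len(rois), data.shape[1]) + out_size, dtype=data.dtype)
    for i, (idx, l, t, r, b) in enumerate(rois.astype(int)):
        patch = data[idx, :, t:b, l:r]
        # Nearest-neighbour resampling: pick evenly spaced rows/columns.
        rows = np.linspace(0, patch.shape[1] - 1, out_size[0]).astype(int)
        cols = np.linspace(0, patch.shape[2] - 1, out_size[1]).astype(int)
        patches[i] = patch[:, rows][:, :, cols]
    return patches
```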
Some things to keep in mind:
- Mind the input format expected by the pretrained network: channel order (RGB/BGR), range of values ($[0, 1]$ / $[0, 255]$), etc.
- Adam seems to require less tuning but may give slightly worse results than SGD.
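The channel-order and value-range point can be made concrete. Caffe-pretrained nets commonly expect BGR, $[0, 255]$, mean-subtracted input; the mean values below are the usual ImageNet ones, so treat them as an assumption and check your model's documentation:

```python
import numpy as np

def to_caffe_format(image_rgb01, mean_bgr=(104.0, 117.0, 123.0)):
    """Convert an (H, W, 3) RGB image with values in [0, 1] into the
    BGR, [0, 255], mean-subtracted format many Caffe-trained nets expect."""
    bgr = image_rgb01[..., ::-1] * 255.0   # RGB -> BGR, [0, 1] -> [0, 255]
    return bgr - np.asarray(mean_bgr)      # subtract the per-channel mean
```

Getting this preprocessing wrong usually does not crash anything; it just silently ruins the pretrained features, so verify it first when the loss refuses to go down.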
In [ ]:
# Change this to adjust the number of training iterations.
NUM_ITERS=40000
!./experiments/scripts/train_rcnn.sh {NUM_ITERS}
Let's assess the quality of the trained model on the test set of PASCAL VOC 2007. In order to do this, you have to implement the Tester encapsulating a trained network. The Tester's sole purpose is to produce confidence scores for bounding boxes at, well, test time.
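In outline, the Tester is just a thin wrapper around a compiled scoring function. The sketch below uses hypothetical names (match the interface the evaluation script actually expects) and a numpy softmax to turn raw scores into confidences:

```python
import numpy as np

class Tester(object):
    """Sketch of the Tester: score ROIs with a trained network."""

    def __init__(self, score_fn):
        # score_fn: e.g. a compiled theano.function mapping
        # (data, rois) -> (n_rois, n_classes) raw scores.
        self.score_fn = score_fn

    def score_rois(self, data, rois):
        logits = self.score_fn(data, rois)
        # Numerically stable softmax over the class axis.
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
```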
After the coding is done, the evaluation can be run by invoking:
In [ ]:
# Put the path to the network snapshot here:
SNAPSHOT=''
!./experiments/scripts/test_rcnn.sh {SNAPSHOT}
The test set is rather large (4952 images) so it may take some time for the evaluation to finish. Just like in any serious ML research, it's a good idea to leave this thing running overnight.
Then you are welcome to replace the older R-CNN pipeline with the newer Fast R-CNN approach (see this paper and, again, this presentation). The only missing piece is the so-called ROI Pooling. Fortunately, you can find a Theano implementation here (it's still somewhat untested). Roughly speaking, you need to do the following:
- insert the ROI Pooling layer into your network;
- feed the network whole scenes together with the ROIs (coming from the ROIDataLayer) instead of a stack of patches of a fixed size.
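To make the missing piece concrete, here is a naive numpy version of ROI Pooling (the linked Theano implementation does the same thing symbolically): the ROI's window of a (C, H, W) feature map is split into a fixed grid and each cell is max-pooled, so every ROI yields the same output shape regardless of its size.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(6, 6)):
    """Max-pool the (left, top, right, bottom) window of a (C, H, W)
    feature map into a fixed out_size grid."""
    l, t, r, b = roi
    region = feature_map[:, t:b, l:r]
    c, h, w = region.shape
    # Cell boundaries; each cell is forced to be at least 1 pixel wide.
    ys = np.linspace(0, h, out_size[0] + 1).astype(int)
    xs = np.linspace(0, w, out_size[1] + 1).astype(int)
    out = np.empty((c,) + out_size, dtype=feature_map.dtype)
    for i in range(out_size[0]):
        for j in range(out_size[1]):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out
```

Because the pooled output has a fixed shape, the expensive convolutional layers run once per scene instead of once per region, which is the whole point of Fast R-CNN.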