You Only Look Once: Unified, Real-Time Object Detection

J. Redmon, S. Divvala, R. Girshick, A. Farhadi

YOLO homepage

Introduction

Unlike existing object detectors, YOLO doesn't use any region proposal method to localize objects. It recasts object detection as a single regression problem and hence is remarkably easy to train and use.

A single CNN predicts the bounding boxes and the corresponding class scores.

Why YOLO?

  1. YOLO is fast. The authors claim 45 FPS without batch processing on an NVIDIA Titan X.
  2. YOLO looks at the whole image in one go, so it implicitly learns contextual information about classes and their appearance.
  3. YOLO generalizes well. When trained on natural images and tested on artwork, it outperforms other detection methods.
  4. But its accuracy is not state-of-the-art (YOLOv2 improves on it).

Unified Detection

YOLO divides an input image into an $S \times S$ grid. The grid cell which contains the center of an object is responsible for detecting that object.
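
As a minimal sketch of this assignment rule, assuming a $448\times448$ input and $S = 7$ as in the paper (the function and argument names are illustrative):

```python
# A sketch of the grid-cell assignment rule. The 448x448 input size and
# S = 7 match the paper; the function and argument names are illustrative.
def responsible_cell(center_x, center_y, img_w=448, img_h=448, S=7):
    """Return the (row, col) of the grid cell containing an object's center."""
    col = int(center_x / img_w * S)
    row = int(center_y / img_h * S)
    # clamp in case the center lies exactly on the right/bottom image edge
    return min(row, S - 1), min(col, S - 1)

# An object centered at (300, 150) falls in cell (2, 4):
print(responsible_cell(300, 150))  # -> (2, 4)
```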

YOLO predicts $B$ bounding boxes per cell, along with a confidence score for each of these boxes. The confidence score is defined as follows:

$$ Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} $$

If a grid cell doesn't contain any object's center, then the above score should be zero. Otherwise, it is the IOU of the predicted bounding box with the ground truth.
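
For reference, a minimal sketch of IOU (intersection over union); representing boxes by their corner coordinates is an assumption for illustration:

```python
# A sketch of IOU; boxes are (x1, y1, x2, y2) corners, an assumed representation.
def iou(box_a, box_b):
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```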

Each bounding box is a 5-D vector $(x, y, w, h, c)$: $(x, y)$ is the center of the box relative to the bounds of the grid cell, $(w, h)$ are the dimensions of the box relative to the dimensions of the image, and $c$ is the confidence score.
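
A sketch of decoding such a prediction back to absolute pixel coordinates, under the same parameterization (names are illustrative):

```python
# A sketch of decoding a predicted box to absolute pixel coordinates.
# (row, col) index the grid cell that made the prediction.
def decode_box(x, y, w, h, row, col, img_w=448, img_h=448, S=7):
    cell_w, cell_h = img_w / S, img_h / S
    center_x = (col + x) * cell_w   # (x, y) are offsets within the cell, in [0, 1]
    center_y = (row + y) * cell_h
    box_w, box_h = w * img_w, h * img_h  # (w, h) are fractions of the image
    return center_x, center_y, box_w, box_h
```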

Each cell also predicts class probabilities ($Pr(\text{Class}_i|\text{Object})$), conditioned on an object's center lying in that grid cell. YOLO predicts only one set of class probabilities per cell, regardless of the number of bounding boxes $B$.

During inference, conditional class probabilities and individual confidences are multiplied to get class-specific scores for each bounding box.

$$ Pr(\text{Class}_i|\text{Object}) \cdot Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} $$

$S, B$ are hyperparameters. The authors used $S = 7, B = 2$ in their experiments.
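
Putting this together, a sketch of the inference-time score computation in NumPy, assuming the paper's setup of $S = 7$, $B = 2$ and the 20 PASCAL VOC classes (array names are illustrative):

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

class_probs = np.random.rand(S, S, C)  # Pr(Class_i | Object), one set per cell
confidences = np.random.rand(S, S, B)  # Pr(Object) * IOU, one per box

# Class-specific score for every (cell, box, class) triple:
# Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
scores = confidences[..., :, None] * class_probs[..., None, :]  # (S, S, B, C)
```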

FAQ

  • Why use a grid instead of, say, letting the network predict some $n$ bounding boxes?

    Such an architecture doesn't enforce spatial variation in the bounding boxes. Each bounding box is predicted by a different set of neurons (at least in the last layer), and if no set is made solely responsible for boxes in a particular region of the input image, all of the bounding box predictions may collapse onto the most prominent object in the image.

  • Why predict $B$ bounding boxes?

    This lets the network detect multiple objects whose centers fall close together. During training, only the predictor with the highest IOU with the ground truth is made responsible for an object, so the $B$ predictors tend to specialize in certain sizes and aspect ratios.

  • Why output a confidence score for each bounding box?

    The network always outputs $B$ boxes per cell, whether or not an object is present. The confidence score tells us how confident the network is that a box actually contains an object; in other words, it tells us whether the bounding box is legitimate.

Network Design

YOLO uses 24 convolutional layers followed by 2 fully connected layers. The architecture is inspired by GoogLeNet. For the chosen values of $S, B$ and the $C = 20$ PASCAL VOC classes, YOLO outputs an $S \times S \times (B \cdot 5 + C) = 7\times7\times30$ tensor.
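
A sketch of how that output tensor might be sliced apart, assuming boxes come before class probabilities in the last dimension (the actual layout in the original implementation may differ):

```python
import numpy as np

S, B, C = 7, 2, 20
out = np.random.rand(S, S, B * 5 + C)  # stand-in for the 7x7x30 network output

boxes = out[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, c) for each box
class_probs = out[..., B * 5:]                # C conditional class probabilities
```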

Training

The first 20 convolutional layers are pretrained on ImageNet (a 1000-class dataset) with an average-pooling layer and a fully connected layer attached. The remaining 4 convolutional layers and the 2 fully connected layers are randomly initialized. Though the pretraining uses $224\times224$ ImageNet images, YOLO is trained for detection on $448\times448$ images, since detection needs fine-grained visual information.

Loss Function

$$ \lambda_{\text{coord}}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{\text{obj}}_{ij}\left[\left(x_i-\hat{x}_i\right)^2 + \left(y_i-\hat{y}_i\right)^2 + \left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\ + \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{\text{obj}}_{ij}\left(C_i-\hat{C}_i\right)^2 + \lambda_{\text{noobj}}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}^{\text{noobj}}_{ij}\left(C_i-\hat{C}_i\right)^2 + \sum_{i=0}^{S^2}\mathbb{1}^{\text{obj}}_{i}\sum_{c \in \text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2 $$

Breaking down the loss function

  • It's a weighted squared loss function.
  • $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$ are chosen to give more weight to the localization loss and to care less about confidence scores in grid cells with no objects. Most grid cells don't contain any object, and without down-weighting them, their confidence gradients would overpower those of the cells that do contain objects.
  • The loss uses differences between the square roots of $w, h$ so that the same absolute error counts for more in a small bounding box than in a large one.
  • Note that class probabilities are included in the loss only if there is an object which is why they are conditional probabilities (conditioned on the presence of an object).
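
A minimal NumPy sketch of this loss, assuming predictions and targets have already been arranged per cell and box; all names, shapes, and the precomputed indicator masks are illustrative, not the paper's implementation:

```python
import numpy as np

l_coord, l_noobj = 5.0, 0.5
S, B, C = 7, 2, 20

def yolo_loss(pred_box, true_box, pred_conf, true_conf, pred_cls, true_cls,
              obj_ij, obj_i):
    """
    pred_box, true_box:   (S, S, B, 4) as (x, y, w, h), with w, h >= 0
    pred_conf, true_conf: (S, S, B)
    pred_cls, true_cls:   (S, S, C)
    obj_ij: (S, S, B) -- 1 where box j of cell i is responsible for an object
    obj_i:  (S, S)    -- 1 where cell i contains an object's center
    """
    noobj_ij = 1.0 - obj_ij

    # localization loss on (x, y) and on square roots of (w, h)
    xy_err = np.sum((pred_box[..., :2] - true_box[..., :2]) ** 2, axis=-1)
    wh_err = np.sum((np.sqrt(pred_box[..., 2:]) - np.sqrt(true_box[..., 2:])) ** 2,
                    axis=-1)
    coord_loss = l_coord * np.sum(obj_ij * (xy_err + wh_err))

    # confidence loss, with no-object cells down-weighted by l_noobj
    conf_loss = (np.sum(obj_ij * (pred_conf - true_conf) ** 2)
                 + l_noobj * np.sum(noobj_ij * (pred_conf - true_conf) ** 2))

    # class loss, only for cells that contain an object's center
    class_loss = np.sum(obj_i[..., None] * (pred_cls - true_cls) ** 2)

    return coord_loss + conf_loss + class_loss
```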

Things to try

  • YOLO makes a lot of localization errors, as described in the paper. Try using IOU directly in the loss function instead of the squared loss on coordinates.
  • Try Inception-ResNet-v2 instead of the GoogLeNet-inspired backbone.