Face detection by finding facial keypoints

In this tutorial, we will train a network that segments the approximate locations of three facial landmarks on human faces: the two eyes and the mouth. This is illustrated here:

[Figure: input image and target]

The output tensor has the shape $H\times W\times 3$ for an input image of $H\times W$ pixels. Each channel serves as a heat map for a different facial landmark. If a facial landmark is located at coordinates $(x^*, y^*)$, then its heat map has the following form:

$$ M(x, y)= \exp\left(-\frac{(x^* - x)^2 + (y^* - y)^2}{2\cdot s\cdot\sigma}\right) $$

where $s$ is the scale of the face and $\sigma$ is a constant.
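To make the heat map formula concrete, here is a minimal NumPy sketch that builds a target tensor from landmark coordinates. The function names, the landmark coordinates, and the values of `s` and `sigma` are hypothetical and chosen only for illustration. The sketch also assumes that $x$ indexes columns and $y$ indexes rows; the tutorial does not state the framework or the axis convention it uses.

```python
import numpy as np


def landmark_heatmap(height, width, x_star, y_star, s, sigma):
    """Return a (height, width) heat map M(x, y) peaking at (x_star, y_star)."""
    # ys holds the row index (y) and xs the column index (x) of every pixel.
    ys, xs = np.mgrid[0:height, 0:width]
    sq_dist = (x_star - xs) ** 2 + (y_star - ys) ** 2
    # Denominator follows the formula in the text: 2 * s * sigma.
    return np.exp(-sq_dist / (2.0 * s * sigma))


def target_tensor(height, width, landmarks, s, sigma):
    """Stack one heat map per landmark into a (height, width, n) tensor."""
    maps = [landmark_heatmap(height, width, x, y, s, sigma) for x, y in landmarks]
    return np.stack(maps, axis=-1)


# Hypothetical (x, y) pixel coordinates for the two eyes and the mouth.
landmarks = [(20, 24), (44, 24), (32, 48)]
target = target_tensor(64, 64, landmarks, s=10.0, sigma=2.0)
print(target.shape)  # (64, 64, 3)
```

Each channel equals 1 exactly at its landmark and decays smoothly with distance, so a larger face scale `s` produces a wider blob.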

The L2 hinge loss is defined as follows for a value $\delta\in\mathbb{R}$:

$$ \text{loss}(\delta)= \begin{cases} \frac{1}{2}\delta^2 & \text{if $|\delta|\geq\epsilon$}\\ 0 & \text{else} \end{cases} $$

where $\epsilon$ is a small positive number (e.g., $0.1$). This loss is applied element-wise to the tensor

$$ \Delta= \mathbf{X} - \mathbf{Y}, $$

where $\mathbf{X}$ is the output of the network and $\mathbf{Y}$ is the ground truth. Our learning criterion is obtained by averaging the per-element losses over all elements of $\Delta$.
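The criterion can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the tutorial's own implementation: the function name `l2_hinge_loss` and the example arrays are hypothetical, and a real training setup would express the same computation in its deep learning framework so that gradients are available.

```python
import numpy as np


def l2_hinge_loss(x, y, eps=0.1):
    """Mean of the per-element L2 hinge loss between output x and ground truth y."""
    delta = x - y
    # Elements with |delta| below eps contribute nothing; the rest contribute delta^2 / 2.
    per_element = np.where(np.abs(delta) >= eps, 0.5 * delta ** 2, 0.0)
    return per_element.mean()


# Hypothetical 2x2 example: two differences exceed eps, two do not.
x = np.array([[0.5, 0.05], [1.0, 0.0]])
y = np.array([[0.0, 0.0], [1.0, 0.3]])
loss = l2_hinge_loss(x, y)
print(round(float(loss), 4))  # 0.0425
```

Only the differences $0.5$ and $-0.3$ exceed $\epsilon$, giving $(0.125 + 0.045)/4 = 0.0425$. The small differences are ignored entirely, so the network is not penalized for errors below the tolerance.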