Video compression fundamentals

Memory requirements of PCM video

  • In RGB (PCM) video, each color pixel needs at least 24 bpp (bits/pixel).

  • The memory requirements of RGB video are enormous. For example, one hour of $640\times 480\times 25$ Hz true-color PCM video needs:

\begin{equation} 25\frac{\text{images}}{\text{second}}\times 640\cdot 480\frac{\text{pixels}}{\text{image}}\times 24\frac{\text{bits}}{\text{pixel}}= 184{.}320{.}000\frac{\text{bits}}{\text{second}} \end{equation}

\begin{equation} 184{.}320{.}000\frac{\text{bits}}{\text{second}} \times 3{.}600\frac{\text{seconds}}{\text{hour}} \times \frac{1~\text{byte}}{8~\text{bits}} \times \frac{1~\text{Gbyte}}{1{.}024^3~\text{bytes}} \approx 77~\text{Gbytes} \end{equation}
  • Video coding techniques must be used to compress this data. Most of these techniques are based on block-based Motion Estimation.
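
The arithmetic above can be checked with a few lines of Python (variable names are ours, chosen for readability):

```python
# Sketch: verify the PCM video storage figures computed above.
fps = 25
width, height = 640, 480
bpp = 24  # bits per pixel (true color)

bits_per_second = fps * width * height * bpp
print(bits_per_second)  # 184_320_000 bits/second

bytes_per_hour = bits_per_second * 3600 // 8
gbytes = bytes_per_hour / 1024**3
print(round(gbytes))    # ~77 Gbytes/hour
```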

Sources of redundancy

  1. Spatial redundancy: Neighboring pixels are very similar, and textures tend to repeat.
  2. Temporal redundancy: Temporally adjacent images are typically very alike.
  3. Visual redundancy: Humans hardly perceive high spatial and temporal frequencies (we are more sensitive to low frequencies).

Block-based ME (Motion Estimation)

  • Usually, only performed by the encoder.
  • ME removes temporal redundancy. A predicted image can be encoded as the difference between it and a prediction image, which is a motion-compensated projection of one or more reference images. ME tries to generate residue images that are as close as possible to null images.
  • Usually, the reference image(s) is/are divided into blocks of $16\times 16$ pixels called macroblocks.
  • Each reference block is searched for in the predicted image, and the best match is indicated by means of a motion vector.
  • Depending on the success of the search and the number of reference images, the macroblocks are classified into:
    • I (intra): When compressing the residue block would generate more bits than compressing the original (predicted) one.
    • P (predicted): When it is better to compress the residue block and there is only one reference macroblock.
    • B (bidirectionally predicted): The same, but with two reference macroblocks.
    • S (skipped): When the energy of the residue block is smaller than a given threshold.
  • I-pictures are composed of I macroblocks only.
  • P-pictures do not have B macroblocks.
  • B-pictures can have any type of macroblock.

Sub-pixel accuracy

  • The motion estimation can be carried out using integer pixel accuracy or a fractional (sub-) pixel accuracy.
  • For example, in MPEG-1 the motion estimation can have up to 1/2-pixel accuracy; a bi-linear interpolator is used.
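
A minimal sketch of such a bilinear (averaging) half-pixel interpolator, using NumPy; the function name is ours, and real codecs additionally specify integer rounding rules that are omitted here:

```python
import numpy as np

def half_pel_interpolate(block):
    """Upsample a 2-D block by 2 in each dimension with bilinear
    (averaging) interpolation, as used for 1/2-pixel ME in MPEG-1.
    Illustrative sketch: no integer rounding, float arithmetic."""
    h, w = block.shape
    out = np.zeros((2*h - 1, 2*w - 1))
    out[::2, ::2] = block                                  # integer positions
    out[::2, 1::2] = (block[:, :-1] + block[:, 1:]) / 2    # horizontal half-pels
    out[1::2, ::2] = (block[:-1, :] + block[1:, :]) / 2    # vertical half-pels
    out[1::2, 1::2] = (block[:-1, :-1] + block[:-1, 1:]
                       + block[1:, :-1] + block[1:, 1:]) / 4  # diagonal half-pels
    return out

a = np.array([[0., 4.], [8., 12.]])
print(half_pel_interpolate(a))
# [[ 0.  2.  4.]
#  [ 4.  6.  8.]
#  [ 8. 10. 12.]]
```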

Matching criteria (similitude between macroblocks)

  • Let $a$ and $b$ be the macroblocks that we want to compare. Two main distortion metrics are commonly used:

    • Mean Square Error:

      \begin{equation} \frac{1}{16\times 16}\sum_{i=1}^{16}\sum_{j=1}^{16}(a_{ij}-b_{ij})^2 \end{equation}

    • Mean Absolute Error:

      \begin{equation} \frac{1}{16\times 16}\sum_{i=1}^{16}\sum_{j=1}^{16}|a_{ij}-b_{ij}| \end{equation}

  • These similitude measures are used only by the compressor. Therefore, any other metric with similar effects (such as the error variance or the error entropy) could also be used.
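
Both metrics can be written in a few lines of NumPy (`mse` and `mae` are illustrative names, not part of any codec API):

```python
import numpy as np

def mse(a, b):
    # Mean Square Error between two same-sized macroblocks.
    return np.mean((a.astype(float) - b.astype(float))**2)

def mae(a, b):
    # Mean Absolute Error between two same-sized macroblocks.
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

a = np.full((16, 16), 10)
b = np.full((16, 16), 12)
print(mse(a, b))  # 4.0
print(mae(a, b))  # 2.0
```

MAE is often preferred in practice because it avoids the multiplications of MSE.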

Searching strategies

  • Only performed by the compressor.

    • Full search: All the possibilities are checked. Advantage: the best compression. Disadvantage: CPU killer.

    • Logarithmic search: A version of the full-search algorithm in which the macroblocks and the search area are sub-sampled. After finding the best match, the resolution of the macroblock is increased by a power of 2 and the previous match is refined in a search area of $\pm 1$, until the maximal resolution (possibly using sub-pixel accuracy) is reached.

    • Telescopic search: Any of the previously described techniques can be sped up by reducing the search area. This can be done by assuming that the motion vectors of the same macroblock in two consecutive images are similar.
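
The full-search strategy can be sketched as follows (a didactic NumPy implementation with MAE as the matching criterion; `full_search` and its parameters are hypothetical names, and no speed optimizations are attempted):

```python
import numpy as np

def full_search(block, ref, center, radius):
    """Exhaustive (full) search: try every integer displacement within
    +/- radius around `center` in the reference image and return the
    motion vector with minimal MAE, together with that error."""
    h, w = block.shape
    cy, cx = center
    best_mv, best_err = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block falls outside the reference image
            err = np.mean(np.abs(ref[y:y+h, x:x+w] - block))
            if err < best_err:
                best_mv, best_err = (dy, dx), err
    return best_mv, best_err

ref = np.zeros((32, 32))
ref[10:14, 12:16] = 255           # a 4x4 "object" in the reference image
block = ref[10:14, 12:16].copy()  # the block we search for
mv, err = full_search(block, ref, center=(8, 8), radius=8)
print(mv, err)  # (2, 4) 0.0
```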

The GOP (Group Of Pictures) concept

  • The temporal redundancy is exploited inside blocks of images called GOPs; prediction does not cross GOP boundaries, so a GOP can be decoded independently of the rest of the GOPs. Here is an example:

Lossy predictive video coding

  • Let $V_i$ be the $i$-th image of the video sequence and $V^{[q]}_i$ an approximation of $V_i$ with quality $q$ (most video compressors are lossy). In this context, a hybrid video codec (t+2d, where "t" means ``temporal redundancy suppression'' and "2d" ``spatial redundancy suppression'') has the following structure:

MCTF (Motion Compensated Temporal Filtering)

  • This is a DWT where the input samples are the original video images and the output is a sequence of residue images.

t+2d vs. 2d+t vs. 2d+t+2d

  • t+2d: The sequence of images is first decorrelated along time (t), and the residue images are then compressed, exploiting the remaining spatial (2d) redundancy. Examples: the MPEG and H.26x codecs (except H.264/SVC).

  • 2d+t: The spatial (2d) redundancy is exploited first (typically using the DWT) and then the coefficients are decorrelated along time (t). To date this has only been an experimental setup, because most transform domains are not invariant to displacement.

  • 2d+t+2d: The first step creates a Laplacian Pyramid (2d), which is invariant to displacement. Next, each level of the pyramid is decorrelated along time (t) and, finally, the remaining spatial redundancy is removed (2d). Example: H.264/SVC.

Deblocking filtering

  • Block-based video encoders (those that use block-based temporal decorrelation) improve their performance if a deblocking filter is used when creating the (quantized) predictions.

*The low-pass filter is applied only on the block boundaries.

Bit-rate allocation

  • Under a constant quantization level (constant video quality), the number of bits that each compressed image needs depends on the image content. Example:

  • The encoder must decide how much information will be stored in each residue image, taking into account that this image can serve as a reference for other images.

Quality scalability

  • Ideal for remote visualization environments.

  • In reversible codecs, $V_i^{[0]}=V_i$.

Temporal scalability

  • It holds that: \begin{equation} V^{t}=\{V_{2^t\times i};~0\le i < \frac{\text{#}V}{2^t}\}=\{V_{2i}^{t-1};~0\le i < \frac{\text{#}V^{t-1}}{2}\}, \end{equation} where $\text{#}V$ is the number of pictures in $V$ and $t$ denotes the Temporal Resolution Level (TRL).

  • Notice that $V=V^{0}$.

  • Useful for fast random access.
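
The definition of $V^t$ above amounts to keeping every $2^t$-th picture, which can be sketched as (`temporal_layer` is an illustrative name, and frame indices stand in for the images themselves):

```python
def temporal_layer(video, t):
    """Return Temporal Resolution Level t of a frame sequence:
    every 2**t-th picture, so t=0 yields the full sequence."""
    return video[::2**t]

V = list(range(16))          # 16 pictures, identified by their indices
print(temporal_layer(V, 0))  # all 16 pictures (V = V^0)
print(temporal_layer(V, 2))  # [0, 4, 8, 12]
# V^t can also be obtained from V^(t-1) by keeping every other picture:
assert temporal_layer(V, 2) == temporal_layer(V, 1)[::2]
```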

Spatial scalability

  • Useful for low-resolution devices.

  • In reversible codecs, $V_i=V_i^{<0>}$ and $V_i^{<s>}$ has a $\frac{Y}{2^s}\times \frac{X}{2^s}$ resolution, where $X\times Y$ is the resolution of $V_i$.
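A tiny sketch of the resolution of $V_i^{<s>}$ (the helper name is ours; integer division mirrors the repeated halving of a dyadic decomposition):

```python
def spatial_resolution(X, Y, s):
    # Resolution of V_i^<s>: each dimension of V_i halved s times.
    return (X // 2**s, Y // 2**s)

print(spatial_resolution(640, 480, 0))  # (640, 480), i.e. V_i = V_i^<0>
print(spatial_resolution(640, 480, 2))  # (160, 120)
```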

Media encoding models

1. Single Layer Coding (SLC)

  • Most audio and image/video codecs generate non-scalable streams. In the case of video, only one quality, resolution and picture-rate are available at decoding time.

  • The decoding of a single-layer stream generates a reconstruction whose quality is linearly proportional to the amount of decoded data.

2. Multiple Layer Coding (MLC)

  • A media encoded in several layers can be decoded to provide (in the case of video) different picture-rates (time scalability), different resolutions (spatial scalability) and different qualities (quality scalability).

  • In some codecs (such as JPEG 2000), spatial random access is available through ROI (Region-Of-Interest) or WOI (Window-Of-Interest) scalability. ROI is used in medical imaging, such as mammography. WOI can be useful when retrieving high-resolution video sequences, as in JHelioviewer.

3. Multiple Description Coding (MDC)

  • Multiple description codecs provide a set of partially redundant streams, so that the quality of the reconstruction improves with the number of decoded descriptions.

  • An example of this type of encoding is the scene segmentation (video object coding) provided by MPEG-4.

4. Media simulcast

  • In transmission scenarios, a source can store several copies of the same media, varying the temporal resolution, spatial resolution and/or quality.

  • Obviously, this is quite redundant at the source side. However, as will be shown later, adaptive services (such as YouTube's) can be provided with this technique.
