In RGB (PCM) video, each color pixel needs at least 24 bpp (bits per pixel).
The memory requirements of RGB video are enormous. For example, one hour of $640\times 480\times 25$ Hz true-color PCM video needs about 83 GB.
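This figure can be checked with a quick computation (a minimal sketch in plain Python; the variable names are only illustrative):

```python
# Storage needed by one hour of 640x480, 25 Hz, 24 bpp (true-color) PCM video.
width, height = 640, 480   # spatial resolution
fps = 25                   # pictures per second
bytes_per_pixel = 3        # 24 bpp
seconds = 3600             # one hour

total_bytes = width * height * bytes_per_pixel * fps * seconds
print(f"{total_bytes:,} bytes = {total_bytes / 1e9:.1f} GB")
# 82,944,000,000 bytes = 82.9 GB
```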
Let $a$ and $b$ be the macroblocks that we want to compare. Two main distortion metrics are commonly used:
Mean Square Error:
\begin{equation} \frac{1}{16\times 16}\sum_{i=1}^{16}\sum_{j=1}^{16}(a_{ij}-b_{ij})^2 \end{equation}
Mean Absolute Error:
\begin{equation} \frac{1}{16\times 16}\sum_{i=1}^{16}\sum_{j=1}^{16}|a_{ij}-b_{ij}| \end{equation}
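Both metrics are easy to compute directly. The following sketch assumes NumPy and two randomly filled $16\times 16$ macroblocks, which are stand-ins for real luma blocks:

```python
import numpy as np

rng = np.random.default_rng(0)                            # reproducible toy data
a = rng.integers(0, 256, size=(16, 16)).astype(np.int64)  # macroblock a
b = rng.integers(0, 256, size=(16, 16)).astype(np.int64)  # macroblock b

mse = np.mean((a - b) ** 2)   # Mean Square Error
mae = np.mean(np.abs(a - b))  # Mean Absolute Error
print(mse, mae)
```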
These similarity measures are used only by the compressor. Therefore, any other metric with a similar behavior (such as the error variance or the error entropy) could also be used.
Motion estimation is performed only by the compressor; the decompressor simply applies the received motion vectors. Several searching strategies exist:
Full search: all the possible displacements are checked. Advantage: the best compression. Disadvantage: CPU killer. (A sketch of this strategy is given after this list.)
Logarithmic search: a version of the full search algorithm in which the macroblocks and the search area are sub-sampled. After finding the best match, the resolution of the macroblock is doubled and the previous match is refined in a search area of $\pm 1$, until the maximal resolution (possibly using subpixel accuracy) is reached.
Telescopic search: any of the previously described techniques can be sped up by reducing the search area. This can be done by assuming that the motion vector of the same macroblock in two consecutive images is similar, so the search is centered at the previously found vector.
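The sketch below illustrates the full search strategy; the function name `full_search`, the MAE criterion and the `search_range` parameter are illustrative choices, not part of any standard API:

```python
import numpy as np

def full_search(block, ref, center, search_range=7):
    """Exhaustive block matching: test every displacement within
    +/- search_range around `center` in the reference picture `ref`
    and return the motion vector that minimizes the MAE with `block`."""
    B = block.shape[0]                  # macroblock size (e.g., 16)
    cy, cx = center                     # top-left corner of the block
    best_mv, best_mae = (0, 0), np.inf
    block = block.astype(np.int64)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                continue                # candidate leaves the picture
            candidate = ref[y:y + B, x:x + B].astype(np.int64)
            mae = np.mean(np.abs(block - candidate))
            if mae < best_mae:
                best_mae, best_mv = mae, (dy, dx)
    return best_mv, best_mae
```

Logarithmic search would run the same loop on sub-sampled versions of `block` and `ref`, refining by $\pm 1$ at each resolution; telescopic search would simply center the loop at the motion vector found for the same macroblock in the previous picture.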
t+2d: the sequence of images is decorrelated first along time (t) and the residue images are then compressed, exploiting the remaining spatial (2d) redundancy. Examples: the MPEG and H.26x codecs (except H.264/SVC). (A toy sketch of this idea is given after this list.)
2d+t: the spatial (2d) redundancy is exploited first (typically using the DWT) and then the coefficients are decorrelated along time (t). To date, this has been used only in experimental setups because most transformed domains are not invariant to displacement.
2d+t+2d: the first step creates a Laplacian pyramid (2d), which is invariant to displacement. Next, each level of the pyramid is decorrelated along time (t) and, finally, the remaining spatial redundancy is removed (2d). Example: H.264/SVC.
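Here is the toy sketch of the t+2d approach, assuming NumPy and SciPy; a plain frame difference stands in for motion-compensated temporal prediction, and a 2D DCT stands in for the spatial stage:

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
frame0 = rng.integers(0, 256, size=(480, 640)).astype(np.int64)
frame1 = frame0 + rng.integers(-2, 3, size=(480, 640))  # slowly varying scene

residue = frame1 - frame0                    # t: temporal decorrelation
coefficients = dctn(residue, norm='ortho')   # 2d: spatial decorrelation
print(residue.var(), frame1.var())           # the residue has far less energy
```

The variance comparison shows why the temporal stage comes first: almost all of the energy of `frame1` is predicted away, and the spatial transform only has to compact what remains.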
It holds that: \begin{equation} V^{t}=\{V_{2^t\times i};~0\le i < \frac{\text{#}V}{2^t}\}=\{V_{2i}^{t-1};~0\le i < \frac{\text{#}V^{t-1}}{2}\}, \end{equation} where $\text{#}V$ is the number of pictures in $V$ and $t$ denotes the Temporal Resolution Level (TRL).
Notice that $V=V^{0}$.
Useful for fast random access.
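This definition is straightforward to mirror in code; the following sketch assumes a hypothetical sequence represented as a Python list of picture indices:

```python
V = list(range(16))   # hypothetical sequence: 16 pictures, indexed 0..15

def temporal_subsampling(V, t):
    """Return V^t: keep one picture out of every 2**t."""
    return V[::2 ** t]

print(temporal_subsampling(V, 0))  # V^0 = V
print(temporal_subsampling(V, 1))  # [0, 2, 4, ..., 14]
print(temporal_subsampling(V, 2))  # [0, 4, 8, 12]
```

Slicing with step $2^t$ also satisfies the recursion above: sub-sampling $V^{t-1}$ by 2 yields $V^{t}$.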
Most audio and image/video codecs generate non-scalable streams. In the case of video, only one quality, resolution and picture-rate are available at decoding time.
The decoding of a single-layered stream generates a reconstruction whose quality is linearly proportional to the amount of decoded data.
A media encoded in several layers can be decoded to provide (in the case of video) different picture rates (temporal scalability), different resolutions (spatial scalability) and different qualities (quality scalability).
In some codecs (such as JPEG 2000), spatial random access is available through ROI (Region-Of-Interest) or WOI (Window-Of-Interest) scalability. ROI is used in specialized imaging, such as mammography. WOI can be useful when retrieving high-resolution video sequences, as in JHelioviewer.
Multiple description codecs provide a set of partially redundant streams, so that the quality of the reconstruction improves with the number of decoded descriptions.
An example of this type of encoding is the scene segmentation (video object coding) provided by MPEG-4.
In transmission scenarios, a source can store several copies of the same media, although varying the temporal resolution, spatial resolution and/or quality.
Obviously, this is quite redundant on the source side. However, as will be shown later, adaptive services (such as those of YouTube) can be provided with this technique.