Audio compression fundamentals

Differential encoding

Motivation

  • It usually happens that the differences between consecutive (quantized or not) PCM (Pulse-Code Modulation) [A.V. Oppenheim, 1999] samples tend to have a smaller variance than the original samples: $$ \text{var}(e) < \text{var}(s), $$ where $e[n] = s[n] - s[n-1]$.
  • It also holds that: $$ H(e) < H(s), $$ which provides potentially better lossless compression ratios for $e$ than for $s$.
  • Finally, notice that, by definition, if quantization is not used in the encoding process, differential encoding is a fully reversible process: $$ s[n] = e[n] + s[n-1]. $$
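The claims above can be checked numerically. A minimal sketch, assuming NumPy and an arbitrary 440 Hz test tone:

```python
import numpy as np

# Demo: differences between consecutive samples of a smooth signal have a
# smaller variance than the samples themselves, and the scheme is reversible.
fs = 8000                              # assumed sampling rate (Hz)
n = np.arange(fs)
s = np.sin(2 * np.pi * 440 * n / fs)   # test signal: a 440 Hz tone

e = np.empty_like(s)
e[0] = s[0]                            # first sample is sent verbatim
e[1:] = s[1:] - s[:-1]                 # e[n] = s[n] - s[n-1]

assert np.var(e) < np.var(s)           # var(e) < var(s)

# Without quantization, differential encoding is fully reversible:
s_rec = np.cumsum(e)                   # s[n] = e[n] + s[n-1]
assert np.allclose(s_rec, s)
```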

DPCM (Differential PCM) [Gibson98]

  • Lossless.
  • A DPCM stream is a sequence of differences between consecutive samples: \begin{equation} o[n] = i[n] - P(i,n) \end{equation} where for the simplest case: \begin{equation} P(i,n) = i[n-1] \end{equation}
  • Modest compression ratio (depending on the predictor).
  • Used, for example, in the G.726 standard.
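A minimal DPCM codec with the simplest predictor $P(i,n) = i[n-1]$ can be sketched as follows (function names are illustrative, not taken from any standard):

```python
def dpcm_encode(i):
    """o[n] = i[n] - P(i, n), with P(i, 0) = 0."""
    prev = 0
    o = []
    for x in i:
        o.append(x - prev)   # transmit the prediction error
        prev = x             # predictor: the previous input sample
    return o

def dpcm_decode(o):
    """Invert the encoder: i[n] = o[n] + i[n-1]."""
    prev = 0
    i = []
    for d in o:
        prev = prev + d
        i.append(prev)
    return i

samples = [100, 102, 105, 104, 100]
stream = dpcm_encode(samples)            # [100, 2, 3, -1, -4]
assert dpcm_decode(stream) == samples    # lossless: fully reversible
```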


PAC (Perceptual Audio Coder)

Rather than minimizing analytic measures of distortion, such as signal-to-noise ratio, perceptual coders attempt to minimize perceived distortion.

PAC, the Perceptual Audio Coder [Johnston & Ferreira, 1992], employs source coding techniques to remove signal redundancy and perceptual coding techniques to remove signal irrelevancy.

The PAC algorithm has its roots in a study done by Johnston [Johnston, 1988a], [Johnston, 1988b] on the perceptual entropy (PE) versus the statistical entropy of music. Exploiting the fact that the perceptual entropy (the entropy of that portion of the music signal above the masking threshold) was less than the statistical entropy resulted in the perceptual transform coder (PXFM) [Johnston, 1988b] [Quackenbush et al., 1989]. This algorithm used a 2048-point real FFT with 1/16 overlap, which gave good frequency resolution (for redundancy removal) but had some coding loss due to the window overlap. The next-generation algorithm was ASPEC [Brandenburg & Johnston, 1991], which used the modified discrete cosine transform (MDCT) filterbank [Princen & Bradley, 1986] instead of the FFT, and a more elaborate bit allocation and buffer control mechanism as a means of generating constant-rate output. The MDCT is a critically sampled filterbank, and so does not suffer the 1/16 overlap loss that the PXFM coder did. In addition, ASPEC employed an adaptive window size of 1024 or 256 to control noise spreading resulting from quantization. However, its frequency resolution was half that of PXFM's, resulting in some loss in the coding efficiency (cf. Section 3).

PAC as first proposed in [Johnston & Ferreira, 1992] is a third-generation algorithm learning from ASPEC and PXFM-Stereo [Johnston, 1989]. In its current form, it uses a long transform window size of 2048 for better redundancy removal together with window switching for noise spreading control. It adds composite stereo coding in a flexible and easily controlled form, and introduces improvements in noiseless compression and threshold calculation methods as well. Additional threshold calculations are made for stereo signals to eliminate the problem of binaural noise unmasking. PAC supports encoders of varying complexity and quality. Broadly speaking, PAC consists of a core codec augmented by various enhancements. The full-capability algorithm is sometimes also referred to as Enhanced PAC (or EPAC). EPAC is easily configurable to (de)activate some or all of the enhancements depending upon the computational budget.

One of the major enhancements in the EPAC codec is geared towards improving the quality at lower bit rates of signals with sharp attacks (e.g., castanets, triangles, drums, etc.). Distortion of attacks is a particularly noticeable artifact at lower bit rates. In EPAC, a signal-adaptive switched filterbank which switches between an MDCT and a wavelet transform is employed for analysis and synthesis [Sinha & Johnston, 1996]. Wavelet transforms offer natural advantages for the encoding of transient signals, and the switched filterbank scheme allows EPAC to merge this advantage with the advantages of the MDCT for stationary audio segments.

Predictive Audio Coding

Analysis-by-synthesis coding (CELP)

Delta Modulation

  • Employs a first-order predictor $$ p(s,k) = ap(s,k-1) $$ so, $P(z) = az^{-1}$, where $z^{-1}$ represents a unit delay.
  • The quantizer has only 2 levels, and the quantizer step size $\Delta(k)$ is usually adaptive.
  • Prediction error signal: $$ e(k) = s(k) - ap(s,k-1) $$
  • Quantizer output: $$ e_q(k) = \Delta(k)\,\text{sgn}(e(k)) = \left\{
    \begin{array}{ll}
      +\Delta(k), & \mbox{$e(k) \geq 0$};\\
      -\Delta(k), & \mbox{$e(k) < 0$}.
    \end{array} \right. $$
  • Reconstructed signal: $$ \hat{s}(k) = a\hat{s}(k-1)+e_q(k) $$
  • Quantization error (noise): $$ n_q(k) = e_q(k) - e(k) = \hat{s}(k) - s(k) $$
  • Typically, the step size evolves according to the rule: $$ \Delta(k) = \beta\Delta(k-1)+(1-\beta)\Delta_\text{min}+f(k) $$ where $\Delta_\text{min}$ is the minimum quantizer step size, $0 < \beta < 1$ is a parameter to be selected, and $$ f(k) = \left\{
    \begin{array}{ll}
      (1-\beta)(\Delta_\text{max} - \Delta_\text{min}), & \mbox{if $b(k)=b(k-1)=b(k-2)$}\\
      0, & \mbox{otherwise},
    \end{array} \right. $$ where $$ b(k) = \left\{
    \begin{array}{ll}
      +1, & \mbox{if $e_q(k) = +\Delta(k)$}\\
      -1, & \mbox{if $e_q(k) = -\Delta(k)$}
    \end{array} \right. $$ is the transmitted bit sequence, and $\Delta_\text{max}\triangleq$ maximum step size.
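The delta-modulation equations above can be sketched as follows (the parameter values for $a$, $\beta$, $\Delta_\text{min}$ and $\Delta_\text{max}$ are arbitrary choices for the demo):

```python
def adm_encode(s, a=1.0, beta=0.9, d_min=0.01, d_max=1.0):
    """Adaptive delta modulator following the update rules above."""
    s_hat = 0.0     # \hat{s}(k-1): decoder state mirrored in the encoder
    delta = d_min   # \Delta(k-1)
    b1, b2 = 0, 0   # last two transmitted bits b(k-1), b(k-2)
    bits = []
    for x in s:
        e = x - a * s_hat                    # prediction error e(k)
        bk = 1 if e >= 0 else -1             # sign bit to transmit
        # f(k) grows the step only after three equal bits in a row
        f = (1 - beta) * (d_max - d_min) if bk == b1 == b2 else 0.0
        delta = beta * delta + (1 - beta) * d_min + f
        e_q = delta * bk                     # two-level quantizer output
        s_hat = a * s_hat + e_q              # reconstructed sample
        b1, b2 = bk, b1
        bits.append(bk)
    return bits

bits = adm_encode([0.5] * 20)   # constant input: the step adapts to track it
assert all(b in (1, -1) for b in bits)
```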

\section{ADPCM (Adaptive DPCM)~\cite{Gibson98}} %{{{

\begin{itemize} \item An adaptive quantizer $Q$ quantizes the prediction error (``the noise signal''): \begin{equation} o[n] = Q\big(i[n] - P(i,n)\big) \end{equation} \item Notice that: \begin{equation} Q^{-1}(Q(x)) \approx x, \end{equation} i.e., the codec is irreversible (lossy): it is not possible to recover $i[n]$ exactly from $o[n]$. \item Used, e.g., by the G.726 standard. \end{itemize}
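A sketch of the lossy step, using a hypothetical uniform quantizer with step size $q$ in place of the adaptive $Q$. Note that the predictor must run on the reconstructed samples so that encoder and decoder stay in sync:

```python
q = 4  # invented uniform quantizer step size

def Q(x):
    """Quantize to the index of the nearest level."""
    return round(x / q)

def Q_inv(idx):
    """Dequantize: Q^{-1}(Q(x)) is only approximately x."""
    return idx * q

def adpcm_like_encode(i):
    pred = 0                       # predictor state: reconstructed sample
    out = []
    for x in i:
        idx = Q(x - pred)          # o[n] = Q(i[n] - P(i, n))
        pred = pred + Q_inv(idx)   # mirror the decoder's reconstruction
        out.append(idx)
    return out

def adpcm_like_decode(o):
    pred = 0
    rec = []
    for idx in o:
        pred = pred + Q_inv(idx)
        rec.append(pred)
    return rec

x = [0, 3, 9, 14, 10]
# Lossy: the original i[n] is not recovered exactly.
assert adpcm_like_decode(adpcm_like_encode(x)) != x
```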

%}}}

\section{AAC (Advanced Audio Coding)~\cite{Bosi}} %{{{

\begin{itemize} \item Standardized by ISO and IEC, as part of the MPEG-2 and MPEG-4 specifications. \item 48 full-bandwidth (up to 96 kHz) audio channels in one stream plus 16 low frequency effects (LFE, limited to 120 Hz) channels, up to 16 ``coupling'' or dialog channels, and up to 16 data streams. \item Coding algorithm: \begin{enumerate} \item Transform the audio from the time domain to the frequency domain. \item Quantization of the frequencies using a psycho-acoustic model. \item (Lossless) Huffman coding of the remaining data. \end{enumerate} \end{itemize}

%}}}

\section{Dolby Digital (AC-3)~\cite{AC-3}} %{{{

\begin{itemize} \item Also known as ATSC (Advanced Television Systems Committee) AC-3, was developed by Dolby Labs in 1992 (Batman Returns). \item Used (together with DTS) extensively in cinema theaters, DVDs, Blu-rays, DVB, etc. \item Five channels for normal-range speakers (20 Hz - 20,000 Hz) (right, center, left, right surround, left surround) and one channel (20 Hz – 120 Hz allotted audio) for the sub-woofer. \item Coding algorithm: \begin{enumerate} \item Transform the audio from the time domain to the frequency domain. \item Quantization of the frequencies using a psycho-acoustic model. \item (Lossless) Huffman coding of the remaining data. \end{enumerate} \end{itemize}

%}}}

\section{Digital DTS (Digital Theater Systems) Surround~\cite{DTS, DTS-DolbyDigital}} %{{{

\begin{itemize} \item Developed by DTS in 1993 (Jurassic Park). \item Used (together with Dolby Digital) extensively in cinema theaters, DVDs, Blu-rays, etc. \item Lossy compression of up to five channels for normal-range speakers (20 Hz - 20,000 Hz) (right, center, left, right surround, left surround) and one channel (20 Hz – 120 Hz allotted audio) for the sub-woofer. \item Coherent Acoustics Algorithm: \begin{enumerate} \item Split the original PCM sequence into 32 PCM sequences, each one for a specific frequency subband. \item Determine the quantization values for each subband using a psycho-acoustic model. \item Lossy compress each subband PCM sequence using ADPCM. \end{enumerate} \end{itemize}

%}}}

\section{MP3 (MPEG-1 Audio Layer 3)~\cite{Rao}} %{{{

\begin{itemize} \item Developed by the MPEG Audio group (Fraunhofer IIS, University of Hannover, AT\&T-Bell Labs, Thomson-Brandt and CCETT (Centre Commun d'\'Etudes de T\'el\'evision et T\'el\'ecommunications)) in 1992. \item Used in MP3 players, DVDs, etc. \item Encoding algorithm: \begin{enumerate} \item Split the original PCM sequence into 32 PCM sequences, each one for a specific frequency subband. \item Transform each subband using the MDCT. \item Determine the quantization values for each subband using a psycho-acoustic model. \item Losslessly compress each MDCT subband sequence using Huffman coding. \end{enumerate} \end{itemize}

%}}}

\section{Vorbis} %{{{

\begin{itemize} \item Vorbis is a free and open-source software project headed by the Xiph.Org Foundation since 2000. \item Used, for example, in Spotify, Firefox and Chrome. \item Encoding algorithm: \begin{enumerate} \item Transform the PCM audio sequence using the MDCT. \item Divide the MDCT information into 20 barks and determine the quantization values for each bark using a psycho-acoustic model. \item Quantize each MDCT coefficient. \item Lossily compress the remaining MDCT information using VQ and Huffman coding. \end{enumerate} \end{itemize}

%}}}

\section{FLAC (Free Lossless Audio Codec)~\cite{Vorbis}} %{{{

\begin{itemize} \item FLAC was started in 2000. In 2003 it became a project of the Xiph.Org Foundation. \item Used when no loss of quality is desired. Typically compresses 2:1. \item Encoding algorithm: \begin{enumerate} \item Decorrelate the PCM samples using linear prediction. \item Encode the residuals using RLE and Golomb coding. \end{enumerate} \end{itemize}
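A toy sketch of the two stages: a fixed first-order predictor, then Rice coding (the power-of-two Golomb code FLAC uses). The parameter $k$ and the zig-zag mapping of signed residuals are illustrative choices, not FLAC's exact bitstream format:

```python
def zigzag(n):
    """Map signed residuals to non-negative: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * n if n >= 0 else -2 * n - 1

def rice_encode(n, k):
    """Rice code (Golomb with m = 2**k): unary quotient, k remainder bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, "0{}b".format(k))

def flac_like_encode(samples, k=2):
    prev = 0
    bits = ""
    for s in samples:
        residual = s - prev                 # fixed first-order prediction
        prev = s
        bits += rice_encode(zigzag(residual), k)
    return bits

stream = flac_like_encode([10, 12, 13, 13, 11, 10])
assert set(stream) <= {"0", "1"}    # small residuals -> short codewords
```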

%}}}

\section{CELP (Code-Excited Linear Prediction)~\cite{CELP}} %{{{

\begin{itemize} \item Proposed by M.R. Schroeder and B.S. Atal in 1985. \item Specifically designed for the human voice. \item Used mainly in VoIP applications, Speex and MPEG-4. \item Very low bit-rates (less than 4 Kbps). \item Encoding algorithm: \begin{enumerate} \item Split the PCM signal into small PCM chunks. \item For each chunk: \begin{enumerate} \item Search for a set of code-words (and their corresponding gains) that minimize the distortion in a decoder that uses the code-word indexes (and the gains) to synthesize a voice signal by means of a model of the human vocal tract. \end{enumerate} \item Encode the residuals using RLE and Golomb coding. \end{enumerate} \end{itemize}

Lossless/noiseless (source coding) and lossy

  • Perceptual coders are based on a psychoacoustic model and take advantage of the masking properties of the human auditory system.

  • Most audio compression algorithms typically segment the input signal into blocks of 2ms up to 50ms duration. A time-frequency analysis then decomposes each analysis block in the encoder. This transformation or subband filtering scheme compacts the energy into a few transform coefficients and therefore de-correlates successive samples. These coefficients, subband samples or parameters are quantized and encoded according to perceptual criteria. Depending on the system objectives, the time-frequency analysis section might contain:

    • Unitary transform (MDCT, FFT)
    • Polyphase filterbank with uniform bandpass filters
    • Time-varying, critically sampled bank of nonuniform bandpass filters
    • Hybrid transform/filterbank scheme
    • Harmonic/sinusoidal signal analyzer
    • Source system analysis (LPC)
  • The time-frequency analysis approach always involves a fundamental tradeoff between time and frequency resolution requirements. The choice of the time-frequency analysis method additionally determines the amount of coding delay introduced, a parameter which may become important in duplex broadcast and live-events applications.

  • Lossless audio compression is mostly realized by using a combination of linear prediction or a transformation followed by entropy coding. The linear predictor attempts to minimize the variance of the difference signal between the predicted sample value and its actual value. The entropy coder (Huffman, LZW) allocates short codewords to samples with a high probability of occurrence and longer codewords to samples with lower probability, and in this way reduces the average bit consumption.

  • In contrast to lossless audio coding, which is based on removing redundancy, lossy coding techniques remove both redundancy and irrelevancy based on perceptual criteria (http://staff.fh-hagenberg.at/schaffer/avt2/Docs/Perceptual_Audio_Coding_Tutorial.pdf). That is, a lossy compression scheme exploits not only the statistical redundancies but also the perceptual irrelevancies of the signal, as they result from the properties of the human auditory system.

Text compression = Entropy coding

In the context of signal encoding, entropy coding refers to assigning short codewords to highly probable (quantization) levels and longer codewords to less probable levels, yielding a short average codeword length.
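A minimal Huffman coder illustrating this principle (the input symbols here stand for hypothetical quantization levels):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a prefix code: frequent symbols get short codewords."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    tie = len(heap)                       # unique tiebreaker for the heap
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)   # two least probable subtrees
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

levels = [0, 0, 0, 0, 1, 1, 2, 3]      # level 0 is the most probable
code = huffman_code(levels)
assert len(code[0]) < len(code[3])     # more probable -> shorter codeword
```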

Quantization

How big is audio data?

  • Mobile: up to $13$ Kbps.
  • (Terrestrial) telephony: $64$ Kbps.
  • CD quality: $1.411$ Mbps.
  • AC-3 (Dolby Digital): up to $6.144$ Mbps.
  • DTS: up to $1509.75$ Kbps.

How to reduce the bit-rate?

  • Reducing the sampling rate (less bandwidth).
  • Reducing the number of channels.
  • Reducing the bits/sample (quantization).
  • Using audio compression.

What is an audio codec (COder/DECoder)?

PCM   +---------+        +---------+ PCM
----->| Encoder |------->| Decoder |----->
audio +---------+ stream +---------+ audio'

              audio != audio'
                (usually)

Typical encoder steps

  1. Overlapped subband analysis (usually with the MDCT). Goes from the time domain to a frequency domain.

  2. Quantization. Basically, removes pure signals of low amplitude, also taking into account the SAM (pSycho Acoustic Model) of the HAS (Human Auditory System). Noise tends to have low power!

  3. Entropy coding.

Overlapped processing

0              N-1            2N-1            3N-1
+---------------+---------------+---------------+ s[n]
<--------Transform Step--------->
                <---------Transform Step-------->
  • Each transform step inputs $2N$ samples and outputs $N$ MDCT coefficients.

  • $N$ can vary depending on the characteristics of the sound. For \emph{complex} sounds without clear harmonics (such as a plosive sound), shortened windows improve the performance. For \emph{simple} sounds (such as a musical instrument), long windows are better.

MDCT

  • Equivalent to applying a bank of $N$ filters.

  • Determines the correlation between a set of $2N$ numbers (samples) and $N$ orthogonal cosine functions (two functions/signals are orthogonal if it is impossible to obtain one of them by means of the other). Therefore, at the input of the MDCT there are $2N$ samples and at the output, $N$ coefficients.

  • MDCT coefficients $S[w]$ of the PCM samples $s[n]$ are defined as: \begin{equation} S[w] = \sum_{n=0}^{2N-1}s[n]\cos\Big[\frac{\pi}{N}\Big(n+\frac{1}{2}+\frac{N}{2}\Big)\Big(w+\frac{1}{2}\Big)\Big]. \label{eq:MDCT} \end{equation}
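The definition above can be evaluated directly; a naive $O(N^2)$ sketch, not the fast lapped implementation real codecs use:

```python
import numpy as np

def mdct(s):
    """Direct evaluation of the MDCT: 2N input samples -> N coefficients."""
    n_coef = len(s) // 2                   # N
    n = np.arange(len(s))                  # sample indices 0 .. 2N-1
    w = np.arange(n_coef).reshape(-1, 1)   # one row per output coefficient
    # cos[(pi/N) (n + 1/2 + N/2) (w + 1/2)]
    basis = np.cos(np.pi / n_coef * (n + 0.5 + n_coef / 2) * (w + 0.5))
    return basis @ s                       # S[w] = sum_n s[n] cos(...)

S = mdct(np.arange(8.0))   # a window of 2N = 8 samples...
assert S.shape == (4,)     # ...yields N = 4 coefficients
```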

SAM (pSycho Acoustic Model) of the HAS (Human Auditory System)

ATH (Absolute Threshold of Hearing) model [Terhardt, 1979]

  • Fletcher H., ``Auditory Patterns'', Rev. Mod. Phys., January 1940, pp. 47-65, reported on test results for a large range of listeners, showing their absolute threshold of hearing as a function of the stimulus frequency. This threshold in quiet is well approximated by the nonlinear function:

$$ T_q(f) = 3.64(f/1000)^{-0.8} - 6.5e^{-0.6(f/1000-3.3)^2} + 10^{-3}(f/1000)^4~\text{[dB SPL]}. $$

The absolute threshold of hearing is used in some coding schemes in order to scale the tolerable injected quantization noise.
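The threshold-in-quiet approximation can be evaluated directly; a sketch:

```python
import numpy as np

def ath(f):
    """Threshold in quiet T_q(f) in dB SPL, for frequency f in Hz."""
    khz = f / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * np.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The curve dips around 3-4 kHz, where the ear is most sensitive:
assert ath(3300.0) < ath(100.0) and ath(3300.0) < ath(16000.0)
```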


  • This means that humans hear best those sounds containing frequencies between 3 KHz and 4 KHz.

Frequency resolution and simultaneous masking

  • The HAS has a limited frequency resolution. Psychoacoustic experiments have demonstrated that the audible frequencies can be grouped into \href{../../../Perception/Auditive_perception/index.html#x1-50004}{barks}.

  • Each bark defines the group of frequencies that excite the same cochlear area, i.e., those frequencies that can be masked by the tone with the highest energy (in that bark).

  • In the cochlea, a frequency-to-place transformation takes place, which leads to the notion of critical bands. Critical bandwidth can be considered as the bandwidth at which subjective responses change abruptly. For example, the perceived loudness of a narrow-band noise at constant sound pressure level remains constant within a critical band and then begins to increase once the bandwidth of the stimulus is increased beyond one critical band (Zwicker E., Fastl H., ``Psychoacoustics: Facts and Models'', Springer Verlag, 1990).

(Table: the Bark scale, listing for each bark the lower edge, upper edge, bandwidth, and center frequency.)

Temporal masking

  • The human auditory system has inertia: \href{../../../Perception/Auditive_perception/index.html#x1-70006}{sounds are not perceived instantly and persist after they have disappeared}.

Channel coupling

  • Most of the time, similar sounds are transported in the channels of a non-mono audio signal. Channel coupling decreases inter-channel redundancy, usually, using prediction techniques.

Quantization

  • Depending on the desired output bit-rate and the frequency (see the ATH model), the SAM applies a different quantization step to each bark (see Section~\ref{sec:ATH}). Roughly speaking, the higher the compression ratio, the larger the quantization step and therefore the quantization noise; and the higher the frequency, the wider the bark. Notice also that the perception of a tone in a bark depends on the temporal masking.

  • At decoding time, those barks that suffered the biggest losses are usually filled with white noise in order to increase the perceived quality.
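A hypothetical illustration of per-bark quantization (the step sizes are invented for the demo; a real SAM derives them from the masking thresholds):

```python
import numpy as np

def quantize_bark(coefs_per_bark, steps):
    """Quantize each bark's coefficients with that bark's step size."""
    return [np.round(c / q) * q for c, q in zip(coefs_per_bark, steps)]

# Two barks with the same contents, but the higher bark gets a coarser step:
barks = [np.array([0.9, -0.4]), np.array([0.9, -0.4])]
steps = [0.1, 0.5]                       # invented per-bark step sizes
out = quantize_bark(barks, steps)
# The coarsely quantized bark carries the larger quantization error,
# i.e., more noise is injected where the HAS tolerates it.
assert np.abs(out[1] - barks[1]).sum() > np.abs(out[0] - barks[0]).sum()
```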

Entropy Coding

  • Usually, a variable bit-rate (VBR) lossless encoding algorithm assigns code-words of fewer bits to those code-vectors (one or more quantized MDCT coefficients) with a high probability, and vice versa, producing an effective reduction of the bit-rate.