How does HTK compute features?

This notebook demonstrates the MFCC_HTK class and explains how the features are computed in detail.

Algorithm breakdown

  1. split the signal into frames and perform the following steps on each frame
  2. perform pre-emphasis of the signal frame
  3. multiply the frame by a Hamming window
  4. perform FFT to get the spectrum of the signal - use only half of the FFT for further computation
  5. apply triangular filters to compress the spectrum into a set of filter-banks
  6. take log of filterbank
  7. apply DCT to the filterbank to get RAW MFCCs
  8. normalize the MFCCs
  9. apply liftering
  10. optionally compute energy (either from signal or from spectrum)
  11. compute deltas and acceleration coefficients
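The steps above can be sketched end-to-end in plain numpy (a simplified illustration on dummy data, not the library's exact implementation: liftering, normalization, energy and deltas are omitted, and the mel filterbank is replaced by crude even averaging):

```python
import numpy as np

samp_freq, win_shift, win_len = 16000, 160, 400
filter_num, mfcc_num, k = 26, 12, 0.97

signal = np.random.randn(samp_freq)          # one second of dummy audio
signal = signal - np.mean(signal)            # DC removal

# 1. split the signal into frames
win_num = (signal.size - win_len) // win_shift + 1
wins = np.stack([signal[w*win_shift : w*win_shift+win_len] for w in range(win_num)])

# 2-3. pre-emphasis (first sample duplicated), then a Hamming window
wins = wins - k * np.hstack((wins[:, :1], wins[:, :-1]))
wins *= np.hamming(win_len)

# 4. amplitude spectrum - first half of a 512-point FFT
ffts = np.abs(np.fft.rfft(wins, 512))[:, :-1]

# 5-6. crude stand-in for the mel filterbank: average 256 bins into 26 bands, take log
edges = np.linspace(0, 256, filter_num + 1).astype(int)
melspec = np.log(np.stack([ffts[:, a:b].mean(axis=1)
                           for a, b in zip(edges[:-1], edges[1:])], axis=1))

# 7. DCT down to 12 cepstral coefficients (k starts at 1, skipping the mean)
n = np.arange(filter_num)
dct = np.cos(np.outer(np.arange(1, mfcc_num + 1), np.pi / filter_num * (n + 0.5)))
mfccs = np.dot(melspec, dct.T)

print(mfccs.shape)  # (98, 12) - one 12-dimensional vector per frame
```

Each of these steps is unpacked in detail in the sections below.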

Importing the necessary classes

First we will import the necessary classes and libraries.

In [1]:
import sys

from HTKFeat import MFCC_HTK
from HTK import HCopy, HTKFile

import numpy as np
%matplotlib inline
import matplotlib.pyplot as P

Loading a sample audio

In the test folder, we have a sample audio file taken from the VoxForge project:

We can load it using the simple method load_raw_signal included in the class MFCC_HTK.

We will also plot the oscillogram and the spectrogram of the audio below.

In [2]:

mfcc=MFCC_HTK()

signal = mfcc.load_raw_signal('../python-test/file.raw')

sig_len=signal.size/16000 #in seconds

P.figure(figsize=(15,4))
P.subplot(211)
P.plot(np.arange(signal.size)/16000.0,signal)
P.subplot(212)
P.specgram(signal,Fs=16000)



DC removal

There are several things that can be done to an audio signal before it is analyzed by the system. One is making sure that all the data is centered around zero, which can easily be done by subtracting the mean of the signal, thus making the new mean zero. Often (as is the case here) a properly recorded signal will already have this property, but if it doesn't, it can mess up some of the computation below, including the energy and the spectrogram. This artefact, known as the DC offset, will be stored entirely in the 0'th frequency point of the FFT. Usually it will be ignored by the later steps, but it may cause issues if you forget about it. This library doesn't implement DC offset removal automatically (HTK does), but it's a trivial enough task to do yourself, if you need to:

In [3]:
print "Before: "+str(np.mean(signal))

signal = signal - np.mean(signal)

print "After: "+str(np.mean(signal))

Before: -1.07105
After: 2.73394107353e-14

Splitting the signal into frames

This operation takes two parameters: win_len and win_shift. The shift determines how many frames we get from the audio. With a shift of 10 ms, there are 100 frames per second - this is the most common setting for analyzing speech.

The length is also an important parameter. While you could set the length equal to the shift, making it longer lets you overlap the frames a little to better model the transitions between them and avoid losing any information at the edges of frames (as discussed in the windowing section below). Many researchers use "half-overlap" (meaning the frame length is double the frame shift), but there are other aspects worth noting about this parameter. The length of the window will first and foremost determine the amount of data we use for analysis. If the window is too short, some frequency components (especially the low ones) will not be modelled correctly - a very low frequency component would get split across several windows and wouldn't be represented accurately by any one of them. For speech we generally don't need to model anything lower than a couple of dozen Hz, so a window length of 25 ms is completely sufficient.

To compute the number of frames given the length of the signal, we need to account for the cases where the last few frames cannot be fully populated by the signal. We can either fill the rest of them with zeros (or other data) or discard them. In the case of this library, we discard them.

In [4]:
#these are a part of the configuration:
win_shift=160
win_len=400

#the values above are in samples
#so 10ms * 16000 samples per second and 25ms * 16000 samples per second respectively

#this is how we compute the window number while discarding the ones that don't fit the signal completely
win_num=int((signal.size-win_len)/win_shift)+1

wins=[]
for w in range(win_num):
    #these are the start and end of each window
    s=w*win_shift
    e=s+win_len
    #we need a copy of the data, because numpy array slicing gives us a view of the data
    #and we don't want to mess up the original signal when we start modifying the windows
    win=signal[s:e].copy()
    wins.append(win)
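As a sanity check on the frame-count formula, assuming a hypothetical 6-second signal at 16 kHz with the 10 ms / 25 ms settings above:

```python
sig_len_samples = 6 * 16000  # a hypothetical 6-second signal at 16 kHz
win_shift = 160              # 10 ms
win_len = 400                # 25 ms

# number of windows that fit entirely inside the signal
win_num = (sig_len_samples - win_len) // win_shift + 1
print(win_num)  # 598
```

Note that 600 windows would need 600*160+240 = 96240 samples, so the last two shifts are discarded rather than zero-padded.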


Pre-emphasis

Pre-emphasis is given by the formula:

\begin{equation} \hat{s}_t=s_t-k\cdot s_{t-1} \end{equation}

where $k$ is the pre-emphasis coefficient in the range $[0,1]$ given in the configuration. This operation is used to improve the SNR of the high frequency content of the signal, which is usually recorded at a lower quality. If we look at the process as a very simple FIR filter, its amplitude response would look something like this:

In [5]:
k = 0.97
h = [1,-k]

#amplitude response of the FIR filter h over 0-8kHz
w = np.linspace(0,np.pi,512)
P.plot(w/np.pi*8000,np.abs(h[0]+h[1]*np.exp(-1j*w)))
P.xlabel('Frequency (Hz)')
P.ylabel('Amplitude correction')


In Python we perform the pre-emphasis very easily, but we need to copy the first sample twice to match HTK's implementation of handling index -1 at the start of the operation:

In [6]:
for win in wins:
    #the right-hand side is evaluated before the assignment, so the original samples are used
    win[1:]=win[1:]-k*win[:-1]
    #HTK duplicates the first sample to handle index -1
    win[0]=win[0]*(1-k)
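On a toy signal, the effect of duplicating the first sample is easy to see (a standalone sketch, with k set to the common HTK default of 0.97):

```python
import numpy as np

k = 0.97
x = np.array([2.0, 2.0, 2.0, 2.0])  # a constant toy signal

# HTK-style pre-emphasis: s'[t] = s[t] - k*s[t-1], with s[-1] taken as s[0]
y = x - k * np.hstack((x[0], x[:-1]))
print(y)  # every sample becomes 2 - 0.97*2 = 0.06
```

Because the first sample is duplicated, a constant (DC) signal is attenuated uniformly instead of producing a spike at the frame start.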


Windowing

Windowing is a very important step in the Fourier analysis of the signal. Since we are extracting a very short window from a long signal, this is equivalent to multiplying the signal by a rectangular window of length win_len positioned at a specific location. From signal processing theory, we know that multiplying two signals in the time domain is the same as performing convolution in the frequency domain. That is why we would like the spectrum of the function we use to extract the window to be as close to a Dirac delta function as possible. While in practice this is impossible to achieve perfectly (apart from using a sinc function, which has infinite length and is therefore unusable in this case), there are many functions that try to approximate it as closely as possible. You can find an extensive list of such functions in the Wikipedia article on window functions.

Below you will see the rectangular function and its frequency response, followed by the Hamming function, as used in HTK. Note the values (on a logarithmic scale) of the frequency response of both functions.

In [7]:
rect = np.ones(400)

P.subplot(2,2,1)
P.plot(rect)
P.title('Square window')

P.subplot(2,2,2)
P.plot(np.log(np.abs(np.fft.rfft(rect,4096))))
P.ylabel('Amplitude (log)')

hamm = np.hamming(400)

P.subplot(2,2,3)
P.plot(hamm)
P.title('Hamming function')

P.subplot(2,2,4)
P.plot(np.log(np.abs(np.fft.rfft(hamm,4096))))
P.ylabel('Amplitude (log)')

/usr/local/lib/python2.7/dist-packages/ipykernel/ RuntimeWarning: divide by zero encountered in log
/usr/local/lib/python2.7/dist-packages/ipykernel/ RuntimeWarning: divide by zero encountered in log

We apply the windowing function to all the frames here:

In [8]:
for win in wins:
    win*=hamm


FFT

The Fourier Transform is used to compute the spectral content of the signal. The FFT is usually computed on vectors whose length is a power of two. Although not strictly necessary for every implementation, the algorithm is optimal in that case. To get this value, we find the smallest power of two that is greater than or equal to the window length. For a window length of 400, this value is 512.

Numpy's FFT implementation includes a set of functions prefixed with 'r' that compute only half of the transform, since for real input the result is symmetrical and we are only interested in the first half. The rfft method returns one extra value, so we discard it. Finally, we compute the absolute value to get the amplitude of the frequency response and discard the phase altogether.
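For example, a 512-point rfft returns N/2+1 = 257 values, so we drop the last one to keep exactly 256:

```python
import numpy as np

win = np.ones(400)            # a dummy 400-sample frame
spec = np.fft.rfft(win, 512)  # zero-padded 512-point transform

print(spec.shape)       # (257,) - one point more than 512/2
print(spec[:-1].shape)  # (256,) - after discarding the extra point
```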

In [9]:

fft_len=512

ffts=[]
for win in wins:
    #take the amplitude and discard the extra point returned by rfft
    ffts.append(np.abs(np.fft.rfft(win,n=fft_len)[:-1]))
ffts=np.asarray(ffts)

P.figure(figsize=(15,4))
P.imshow(np.log(ffts.T),origin='lower',aspect='auto')


Filterbanks

The spectrum has quite a lot of points and we want a fairly small number of values to represent the signal at each time step. That is why we reduce the number of points from 256 down to 26. We could do this by averaging neighboring values evenly throughout the frequency range, but research has shown that human hearing is more sensitive to certain frequencies than others. In fact, there are several empirically derived scales that describe our frequency sensitivity quite well.

One such scale is the Mel scale. It was devised by letting many people listen to tones and judge which tones sounded equally distant from one another. Using this dataset, a formula was devised that fit the data very well. There are some criticisms of this method, but it is still widely used by many people.

\begin{equation} m = 2595 \log_{10} \left(1 + \frac{f}{700}\right) \end{equation}

Or in a different base:

\begin{equation} m = 1127 \ln \left(1 + \frac{f}{700}\right) \end{equation}
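The two forms are equivalent, since $2595 \log_{10}(x) = (2595/\ln 10)\,\ln(x) \approx 1127\,\ln(x)$; a quick numeric check:

```python
import numpy as np

f = np.linspace(0, 8000, 100)
m10 = 2595 * np.log10(1 + f / 700.0)
mln = 1127 * np.log(1 + f / 700.0)

# the constants differ only by rounding: 2595/ln(10) = 1127.01...
print(np.max(np.abs(m10 - mln)))  # well below 0.1 mel over 0-8000 Hz
```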

In [10]:
freq2mel = lambda freq: 1127*(np.log(1+((freq)/700.0)))

f = np.linspace(0,8000,1000)
m = freq2mel(f)



Another similar concept is the Bark scale, which was used to determine the critical frequency bandwidths of hearing. One of the conclusions of that experiment was that there are 24 critical bands that play an important role in how humans perceive sound, which is why such a number is used for modeling speech. This is not to say that humans perceive only 24 different frequencies - while we do have difficulty perceiving sounds whose pitch is very similar, in real life the positions of these bands are far from static. The model used in MFCC computation is much more simplified than reality.

Even though the number of critical frequencies is 24, many researchers decide to use 26 filterbanks instead. It is not entirely clear why (probably for empirical reasons), but even the HTK Book mentions 26 filters in its tutorial section. That is also why many researchers copy this setup in their papers. For consistency, we will stick with 26; the difference is probably insignificant anyway.

As for the filters themselves, HTK uses a very crude model to generate the filterbanks. Filterbanks can generally be computed in one of two ways: by designing their coefficients in the time domain (which can be difficult), or by averaging the spectrum (which is what HTK does). Since the amplitude spectrum itself can be interpreted as a very fine-grained filterbank, averaging the points that correspond to the frequencies in the band of a filter can produce a very close approximation.

HTK goes one step further and uses triangular filters that are half-overlapping. To compute these filters, we need to determine the center frequency of each of them (on the mel scale) and set their bandwidths to reach the center frequencies of their neighbors. See below:
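A textbook construction of such half-overlapping triangular filters might look like the sketch below (a simplified illustration, not HTK's exact implementation, which, as noted later, differs slightly; the function name and defaults are made up for this example):

```python
import numpy as np

def mel_tri_filters(filter_num=26, fft_pts=256, samp_freq=16000,
                    lo_freq=0, hi_freq=8000):
    """Triangular mel-spaced filterbank as a (fft_pts x filter_num) matrix."""
    freq2mel = lambda f: 1127 * np.log(1 + f / 700.0)
    mel2freq = lambda m: 700 * (np.exp(m / 1127.0) - 1)

    # filter_num+2 equally spaced points on the mel axis: each filter spans
    # from the center of its left neighbor to the center of its right one
    mels = np.linspace(freq2mel(lo_freq), freq2mel(hi_freq), filter_num + 2)
    hz = mel2freq(mels)

    # frequency (in Hz) of each FFT bin
    bin_hz = np.arange(fft_pts) * samp_freq / 2.0 / fft_pts

    filt = np.zeros((fft_pts, filter_num))
    for i in range(filter_num):
        left, center, right = hz[i], hz[i + 1], hz[i + 2]
        up = (bin_hz - left) / (center - left)      # rising slope
        down = (right - bin_hz) / (right - center)  # falling slope
        filt[:, i] = np.clip(np.minimum(up, down), 0, None)
    return filt

filters = mel_tri_filters()
print(filters.shape)  # (256, 26)
```

Each column is one triangle peaking at its center frequency, narrow at low frequencies and wide at high ones.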

In [11]:
#this method creates the filters for our class
#normally it is run from the constructor

for f in mfcc.filter_mat.T:
    P.plot(f)

Also worth noting is that the start and the end of the filterbank are determined by the lo_freq and hi_freq parameters of the system.

Now here is where things get a bit strange. It is not too difficult to design filters that match the image above, and this is exactly what is displayed in the HTK Book chapter on the filterbank; however, the actual implementation differs slightly.

As the author of the Python library, I am not too certain why that is (is it a bug or a feature?), so I have generated a set of 26 filters from HTK's C code and saved them in a CSV file. The C program (called gen_filt.c) is also included in the repo. If you want features absolutely identical to HTK's, you should use that file. Otherwise, the differences are small enough not to matter much. This is what the HTK filters actually look like:

In [12]:

#here mfcc.filter_mat holds the filters loaded from the CSV file generated by HTK's C code
for f in mfcc.filter_mat.T:
    P.plot(f)

Once we have created a matrix with all the filters, it is enough to multiply the spectrum (calculated using the FFT) by this matrix in order to get the output of the filters:

In [13]:
melspec=[]
for f in ffts:
    m=np.dot(f,mfcc.filter_mat)
    melspec.append(m)
melspec=np.asarray(melspec)

Now this image is obviously very similar to the one above, but much less detailed. One problem with it is that it's mostly black - most of its values are very small. A common method to amplify the details in data while attenuating the highlights is to take a logarithm. In the image below, note how much information was hidden before - each word is clearly visible now.

In [14]:
melspec = np.log(melspec)




DCT

Once we have computed the filterbanks, we have a perfectly reasonable set of features to use in our speech recognition models or whatever else we may need, but we can still go a bit further. If we take another Fourier Transform of the data we have so far, we step out of the spectrum domain and into the cepstrum domain (note the play on words).

Since again we are only interested in the amplitude of the Fourier Transform, and there is far less data to compute (only 26 input points), we can considerably simplify this procedure by using a Discrete Cosine Transform instead of the FFT. Using it, we further compress our 26 spectral features into just 12 cepstral features. The DCT is given by the formula:

\begin{equation} X_k=\sum_{n=0}^{N-1} x_n \cdot \cos\left[\frac{\pi}{N} \left(n+\frac{1}{2}\right) k\right] \text{ , for }k\in[0,N-1] \end{equation}

The whole right part of the formula (beginning with the cosine) can be pre-computed and stored in a matrix:

In [15]:

mfcc_num=12
filter_num=26

#pre-compute the DCT base matrix; k starts at 1, since the 0'th
#coefficient (the overall mean) is usually replaced by energy
dct_base=np.zeros((filter_num,mfcc_num))
for m in range(mfcc_num):
    dct_base[:,m]=np.cos((m+1)*np.pi/filter_num*(np.arange(filter_num)+0.5))

P.imshow(dct_base.T,origin='lower',aspect='auto')
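The cosine columns of such a matrix are mutually orthogonal, which is what makes the DCT a well-behaved change of basis; a quick check with plain numpy:

```python
import numpy as np

N = 26
# DCT-II basis vectors for k = 1..12, following the formula above
dct_base = np.zeros((N, 12))
for m in range(12):
    dct_base[:, m] = np.cos((m + 1) * np.pi / N * (np.arange(N) + 0.5))

gram = np.dot(dct_base.T, dct_base)
# off-diagonal entries vanish, diagonal entries equal N/2
print(np.allclose(gram, np.eye(12) * N / 2.0))  # True
```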

It may seem a bit harsh to use only half as many MFCCs as there are filterbanks, but it is generally understood that the higher coefficients carry far less useful information than the lower ones. After a certain point, we just stop seeing any benefit:

In [16]:

#this time compute all 26 coefficients for comparison
mfcc_num=26
dct_full=np.zeros((26,mfcc_num))
for m in range(mfcc_num):
    dct_full[:,m]=np.cos((m+1)*np.pi/26*(np.arange(26)+0.5))

mfccs=np.dot(melspec,dct_full)

f=P.figure(figsize=(15,4))
f.suptitle('Lower coefficients',fontsize=14)
for i in range(12):
    P.plot(mfccs[:,i])

f=P.figure(figsize=(15,4))
f.suptitle('Higher coefficients',fontsize=14)
for i in range(12,26):
    P.plot(mfccs[:,i])