Preparing the TIMIT database

If you are lucky enough to own the TIMIT database, or are willing to buy it from here, you can use this simple script to prepare the HDF5 files, just as was done with the VoxForge dataset. For a detailed explanation of all the steps, refer to the VoxforgeDataPrep notebook.

To keep things brief, most of the functions were implemented as a library and stored in the timit.py script in the python directory. We will import these methods here.


In [1]:
import sys

sys.path.append('../python')

from timit import *

TIMIT alignments

We begin by loading a list of files and their alignments. The methods below load the PHN files and their corresponding audio and store them in a list, just like with the VoxForge database. Each method takes a list of files to load and their location. A list of files for each dataset is included in the repository. More about the actual counts is written in the sections below.


In [2]:
train_corp=prepare_corp_dir('../data/TIMIT_train.list','../TIMIT/train')
dev_corp=prepare_corp_dir('../data/TIMIT_dev.list','../TIMIT/test')
test_corp=prepare_corp_dir('../data/TIMIT_test.list','../TIMIT/core_test')
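
Under the hood, the loader parses TIMIT's PHN files, which are plain-text transcriptions with one segment per line: start sample, end sample and phoneme label. Below is a minimal sketch of such a parser, meant only to illustrate the file format (the actual implementation lives in timit.py):

In [ ]:
def read_phn(path):
    # Parse a TIMIT .phn file into (start, end, label) tuples.
    # Each line looks like "0 3050 h#", where the two offsets are
    # sample indices into the 16 kHz audio.
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                segments.append((int(parts[0]), int(parts[1]), parts[2]))
    return segments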



CNTK alignments

NOTE: use this section only if you need to do sequence decoding

Another source of alignments is CNTK, a deep learning library released by Microsoft. The files in question are located in their GitHub repository, in the CNTK/Examples/Speech/Miscellaneous/TIMIT/lib/mlf/ folder; I have copied them into my data folder. What we're interested in are the MLF files with state-level alignments, where each phoneme is modeled by a sequence of 3 states (their labels end in s2, s3 and s4; states 1 and 5 are non-emitting and not present in the output).

Many papers use this model of 183 so-called senones (3 states for each of the 61 phonemes) instead of the 61 phonemes present in the original corpus. If you are doing the sequence decoding task, you may want to use these alignments instead. For framewise phoneme classification use the alignments above, as the CNTK ones are not accurate enough for that task.
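
The MLF files follow the HTK master label file convention: a #!MLF!# header, a quoted utterance name, one "start end label" line per segment, and a lone dot closing each utterance. The sketch below parses that general format into a dictionary; load_mlf in timit.py is the real loader, so treat this only as an illustration of the structure:

In [ ]:
def read_mlf(path):
    # Parse an HTK-style MLF into a dict mapping utterance name to a
    # list of (start, end, state_label) tuples, e.g. (0, 400000, 'sil_s2').
    utts = {}
    cur = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line == '#!MLF!#':
                continue
            if line.startswith('"'):   # start of a new utterance
                cur = line.strip('"')
                utts[cur] = []
            elif line == '.':          # end of the current utterance
                cur = None
            else:
                start, end, label = line.split()[:3]
                utts[cur].append((int(start), int(end), label))
    return utts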


In [ ]:
train_mlf=load_mlf('../data/mlf/TIMIT.train.align_cistate.mlf.cntk')
dev_mlf=load_mlf('../data/mlf/TIMIT.dev.align_cistate.mlf.cntk')
test_mlf=load_mlf('../data/mlf/TIMIT.core.align_cistate.mlf.cntk')

The prepare_corp method loads all the audio and converts the senone lists from the MLF into the same format as presented in the VoxforgeDataPrep notebook. Again, run this only if you need to do sequence decoding:


In [ ]:
train_corp=prepare_corp(train_mlf,'../data/mlf/TIMIT.statelist','../TIMIT/train')
dev_corp=prepare_corp(dev_mlf,'../data/mlf/TIMIT.statelist','../TIMIT/test')
test_corp=prepare_corp(test_mlf,'../data/mlf/TIMIT.statelist','../TIMIT/core_test')
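
The statelist file passed above maps state names to integer classes. Assuming (as in the CNTK examples) it holds one state name per line, the zero-based line number is the class index, so inspecting it is straightforward:

In [ ]:
# Hypothetical inspection; assumes one state name per line.
states = [l.strip() for l in open('../data/mlf/TIMIT.statelist')]
print 'Number of states: {}'.format(len(states))  # expect 183 = 61 phonemes x 3 states
print states[:5]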

About the corpus

TIMIT was originally split into several parts. The largest is the training portion, with 3696 utterances spoken by 462 different speakers. The test set contains 1344 utterances, but most papers use a smaller portion of it, known as the "core test set", which has 192 files. Finally, there is also a portion in which many different speakers read the same two sentences, known as the "SA" dataset. This last one is of little use for studying ASR, but may be interesting for research on speaker variability and similar topics.

The Microsoft people use the standard 3696-utterance training set, as do most other researchers presenting their results on TIMIT. They also use the 192-file core test set, like everyone else. For the dev data, they use a collection of 400 sentences from the test set that aren't in the core set. Here we will use the same split:


In [3]:
print 'Train utterance num: {}'.format(len(train_corp))
print 'Dev utterance num: {}'.format(len(dev_corp))
print 'Test utterance num: {}'.format(len(test_corp))


Train utterance num: 3696
Dev utterance num: 400
Test utterance num: 192

The data structure describing the corpus is a list of objects containing the following information:


In [4]:
print train_corp[0].name
print train_corp[0].data
print train_corp[0].data.shape
print train_corp[0].phones
print train_corp[0].ph_lens


mfrm0_si1155
[ 5 -2  0 ...,  3  4  8]
(61748,)
[14, 36, 9, 59, 26, 51, 53, 0, 28, 39, 32, 46, 5, 49, 52, 18, 49, 59, 48, 7, 49, 37, 5, 10, 54, 40, 59, 48, 3, 38, 5, 10, 54, 40, 29, 59, 11, 50, 12, 22, 21, 44, 5, 48, 41, 33, 1, 0, 52, 5, 10, 54, 24, 59, 14]
[12, 4, 8, 5, 4, 5, 4, 7, 4, 6, 7, 3, 3, 6, 11, 13, 8, 8, 4, 11, 6, 2, 7, 4, 7, 16, 11, 6, 10, 2, 5, 7, 5, 14, 3, 6, 6, 8, 8, 4, 5, 2, 4, 6, 9, 9, 7, 15, 9, 5, 5, 3, 9, 15, 11]
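
To make the fields explicit: name is the speaker and utterance ID, data holds the raw audio samples, phones lists the phoneme class indices in order of occurrence, and ph_lens gives each phoneme's duration in feature frames (the lengths above sum to 378 frames, which is consistent with the roughly 3.9 seconds of audio at a 10 ms frame step). A minimal stand-in for the container, with field names taken from the output above (the real class lives in timit.py):

In [ ]:
class Utterance(object):
    # Minimal stand-in for the corpus item type from timit.py.
    def __init__(self, name, data, phones, ph_lens):
        self.name = name        # speaker + utterance ID, e.g. 'mfrm0_si1155'
        self.data = data        # raw audio samples as a 1-D numpy array
        self.phones = phones    # phoneme class indices, in order
        self.ph_lens = ph_lens  # per-phoneme duration in feature frames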

Feature extraction

Here we extract the simple 39-dimensional MFCC feature set. This is the same method as in the VoxforgeDataPrep notebook. It stores the processed data in an HDF5 file containing an IN (features) and an OUT (phoneme classes) array for each utterance:


In [5]:
extract_features(train_corp, '../data/TIMIT_train.hdf5')
extract_features(dev_corp, '../data/TIMIT_dev.hdf5')
extract_features(test_corp, '../data/TIMIT_test.hdf5')
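
For reference, this is roughly what happens inside extract_features: 13 MFCCs are computed per 10 ms frame, their deltas and accelerations are appended to give 39 dimensions, and the per-phoneme lengths are expanded into one class label per frame. The sketch below is only an approximation of the real method in timit.py and assumes the python_speech_features and h5py packages:

In [ ]:
import h5py
import numpy as np
from python_speech_features import mfcc, delta

def extract_features_sketch(corp, h5file, fs=16000):
    with h5py.File(h5file, 'w') as h5:
        for utt in corp:
            feat = mfcc(utt.data, samplerate=fs, numcep=13)  # 13 static MFCCs
            d1 = delta(feat, 2)                              # first derivatives
            d2 = delta(d1, 2)                                # second derivatives
            feats = np.hstack((feat, d1, d2))                # 39 dims per frame
            # one class label per frame; the real code handles any
            # off-by-a-frame mismatch between features and alignments
            labels = np.repeat(utt.phones, utt.ph_lens)
            grp = h5.create_group(utt.name)
            grp.create_dataset('IN', data=feats.astype(np.float32))
            grp.create_dataset('OUT', data=labels.astype(np.int16))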



Here we normalize the data, same as before. This simply augments the database with a NORM array, which is a per-utterance normalized version of the IN array:


In [6]:
normalize('../data/TIMIT_train.hdf5')
normalize('../data/TIMIT_dev.hdf5')
normalize('../data/TIMIT_test.hdf5')
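
For completeness, per-utterance normalization simply standardizes each feature dimension to zero mean and unit variance within a single utterance. A minimal sketch of the idea (an approximation, not the exact method from timit.py):

In [ ]:
import h5py

def normalize_sketch(h5file):
    with h5py.File(h5file, 'r+') as h5:
        for name in h5:                  # iterate over utterance groups
            feats = h5[name]['IN'][:]
            mean = feats.mean(axis=0)    # per-dimension mean
            std = feats.std(axis=0)      # per-dimension standard deviation
            h5[name].create_dataset('NORM', data=(feats - mean) / std)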