Preparing the Voxforge database

This notebook demonstrates how to prepare the free Voxforge database for training. It is a medium-sized (~80 hours) database, available online for free under the GPL license. A much more common database in research is TIMIT, but that one costs $250 and is also much smaller (~4 h - although much more professionally prepared than Voxforge). The best alternative today is the LibriSpeech database, but that is almost 1000 h and takes up a few dozen GB of data, which wouldn't be sensible for a simple demo. So Voxforge it is...

The first thing to do is to realize what a speech corpus actually is: in its simplest form it is a collection of audio files (containing, preferably, speech only) together with a set of transcripts of that speech. There are a few extensions to this that are worth noting:

  • phonemes - transcripts are usually presented as a list of words; although not a rule, it is often easier to start the recognition process with phonemes and go from there. Voxforge defines a list of 39 phonemes (+ silence) and provides a lexicon mapping words to phonemes (see the sketch right after this list, and more below)
  • aligned speech - the transcripts are usually just a sequence of words/phonemes, but they don't denote when each word/phoneme occurs. There are models that can learn from unaligned data (sequence learning, e.g. CTC or attention models), but having alignments is usually a big plus. TIMIT was hand-aligned by a group of professionals (which is why it's a popular resource for research), but Voxforge wasn't. Fortunately, we can use one of the many available tools to do this automatically (with a margin of error - more on that below)
  • meta-data - each recording session in the Voxforge database contains a readme file with useful information about the speaker and the environment the recording took place in. When building a serious speech recognizer, this information can be very useful (e.g. for speaker adaptation - taking into account the speaker id, gender, age, etc...)
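
To make the phoneme point concrete: a lexicon is essentially just a word-to-phoneme-sequence mapping. Below is a minimal sketch in Python (the entries are copied from the aligned sample shown later in this notebook; the real Voxforge lexicon is a plain-text file that the addPhonemesSpk method downloads and parses for us):

# A lexicon maps each word to its phoneme sequence. These example entries
# are copied from the sample output further below - the real lexicon file
# is downloaded and parsed automatically by addPhonemesSpk.
lexicon = {
    'LET':  ['l', 'eh', 't'],
    'THEM': ['dh', 'eh', 'm'],
    'GO':   ['g', 'ow'],
}

print lexicon['THEM']   # ['dh', 'eh', 'm']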

Downloading the corpus

To start working with the corpus, it needs to be downloaded first. All the files can be found in the download section of the Voxforge website under this URL:

http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/

There are 2 versions of the main corpus: one sampled at 16 kHz and one at 8 kHz. The 16 kHz one is of better quality and is known as "desktop quality speech". While the original recordings were made at an even higher quality (44.1 kHz), 16 kHz is completely sufficient for recognizing speech (higher quality doesn't help much). 8 kHz is known as telephony quality and is the standard value for the old (uncompressed, aka T0) digital telephone signal. If you are making a recognizer that has to work in a telephony environment, you should use that data instead.

To download the whole dataset, a small Python program is included in this demo. Be warned, this can take a long time (I think Voxforge is throttling the speed to save on costs) and restarts may be necessary. The Python method does check for failed downloads (it compares file sizes) and restarts whatever wasn't downloaded completely, so you can run it 2-3 times to make sure everything is ok.
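
The size check boils down to something like the sketch below (a hypothetical helper for illustration only - the actual logic lives in the downloadVoxforgeData method used later):

import os
import urllib2

def download_if_incomplete(url, dest):
    # Hypothetical sketch: compare the size reported by the server with the
    # size of the file on disk and (re)download only when they differ.
    remote = urllib2.urlopen(url)
    remote_size = int(remote.info().getheader('Content-Length'))
    if os.path.exists(dest) and os.path.getsize(dest) == remote_size:
        return  # this archive was already downloaded completely
    with open(dest, 'wb') as out:
        out.write(remote.read())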

Alternatively, you can use a program like wget and run this command (where "audio" is the directory to save the data to):

wget -P audio -l 1 -N -nd -c -e robots=off -A tgz -r -np http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit

First, let's import all the voxforge methods from the python directory. These need the following libraries installed on your system:

  • numpy - for working with data
  • random, urllib, os, tarfile, gzip, re, pickle, shutil - these are standard libraries that ship with Python; lxml is a third-party library and may need to be installed separately
  • scikits.audiolab - to load the audio files from the database (WAV and FLAC files)
  • tqdm - a simple library for progress bars that you can install using pip (see the install hint right after this list)
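
The third-party libraries can usually be installed with pip; the package names below are assumed to match the import names, and scikits.audiolab additionally needs the libsndfile library present on the system:

!pip install numpy lxml tqdm scikits.audiolab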

In [1]:
import sys

sys.path.append('../python')

from voxforge import *


/usr/local/lib/python2.7/dist-packages/scikits/audiolab/soundio/play.py:48: UserWarning: Could not import alsa backend; most probably, you did not have alsa headers when building audiolab
  warnings.warn("Could not import alsa backend; most probably, "

Ignore any warnings above (I couldn't be bothered to compile audiolab with ALSA). Below you will find the method to download the Voxforge database. You only need to do this once, so you can run it here, run it from a console, or use wget instead. Be warned that it takes a long time (as mentioned earlier), so it's a good idea to leave it running overnight.


In [ ]:
downloadVoxforgeData('../audio')

Loading the corpus

Once the data is downloaded and stored in the 'audio' subdir of the main project dir, we can start loading it into a Python data structure. There are several methods that can be used for that. The following method will load a single file and display its contents:


In [12]:
f=loadFile('../audio/Joel-20080716-qoz.tgz')
print f.props
print f.prompts
print f.data

%xdel f


{'Pronunciation dialect': 'American English', 'File type': 'wav', 'Age Range': 'Adult', 'Speaker Characteristics': '', 'Language': 'EN', 'File Info': '', 'Gender': 'Male', 'Audio Recording Software': 'VoxForge Speech Submission Application', 'Audio card type': 'unknown', 'User Name': 'Joel', 'Sample rate format': '16', 'Number of channels': '1', 'O/S': '', 'Microphone make': 'n/a', 'Path': 'Joel-20080716-qoz', 'Sampling Rate': '48000', 'Microphone type': 'USB Headset mic', 'Recording Information': '', 'Audio card make': 'unknown'}
{'b0076': ['BEFORE', 'PHILIP', 'COULD', 'RECOVER', 'HIMSELF', "JEANNE'S", 'STARTLED', 'GUARDS', 'WERE', 'UPON', 'HIM'], 'b0077': ['IT', 'IS', 'THE', 'NEAREST', 'REFUGE'], 'b0074': ['YET', 'BEHIND', 'THEM', 'THERE', 'WAS', 'ANOTHER', 'AND', 'MORE', 'POWERFUL', 'MOTIVE'], 'b0075': ['IN', 'THAT', 'CASE', 'HE', 'COULD', 'NOT', 'MISS', 'THEM', 'IF', 'HE', 'USED', 'CAUTION'], 'b0081': ['YOU', 'WERE', 'GOING', 'TO', 'LEAVE', 'AFTER', 'YOU', 'SAW', 'ME', 'ON', 'THE', 'ROCK'], 'b0080': ['TOMORROW', 'IT', 'WILL', 'BE', 'STRONG', 'ENOUGH', 'FOR', 'YOU', 'TO', 'STAND', 'UPON'], 'b0083': ['IN', 'IT', 'THERE', 'WAS', 'SOMETHING', 'THAT', 'WAS', 'ALMOST', 'TRAGEDY'], 'b0082': ['HE', 'BIT', 'HIS', 'TONGUE', 'AND', 'CURSED', 'HIMSELF', 'AT', 'THIS', 'FRESH', 'BREAK'], 'b0078': ['THERE', 'WAS', 'PRIDE', 'AND', 'STRENGTH', 'THE', 'RING', 'OF', 'TRIUMPH', 'IN', 'HIS', 'VOICE'], 'b0079': ['THE', 'TRUTH', 'OF', 'IT', 'SET', 'JEANNE', 'QUIVERING']}
{'b0076': array([-1, -1, -1, ..., -1, -1, -2], dtype=int16), 'b0077': array([-1, -1, -1, ...,  0,  0, -1], dtype=int16), 'b0074': array([-1, -1, -2, ..., -1, -2, -2], dtype=int16), 'b0075': array([-1, -2, -1, ..., -2, -2, -1], dtype=int16), 'b0078': array([-1, -1, -1, ..., -1, -1, -2], dtype=int16), 'b0079': array([ 0, -1, -1, ..., -1, -1, -1], dtype=int16), 'b0083': array([-1, -2, -2, ..., -1, -2, -2], dtype=int16), 'b0082': array([-1, -1, -1, ..., -2, -2, -2], dtype=int16), 'b0081': array([-1, -1, -2, ..., -1, -2, -1], dtype=int16), 'b0080': array([-1,  0, -1, ..., -2, -1, -1], dtype=int16)}

The loadBySpeaker method will load the whole folder and organize its contents by speaker (as a dictionary). Each utterance contains only the data and the prompts. For this demo, only 30 files are read, since this isn't the method we are ultimately going to use.


In [13]:
corp=loadBySpeaker('../audio', limit=30)



The corpus can also be extended with the phonetic transcriptions of the utterances using a lexicon file. Voxforge provides such a file on its website and it is downloaded automatically (if it doesn't already exist).

Note that a single word can have several transcriptions. In the lexicon, these alternatives have sequential number suffixes added to the word (word, word2, word3, etc.), but this particular function does nothing about that. Choosing the right pronunciation variant has to be done either manually, or by using a more sophisticated program (a pre-trained ASR system) to pick the right version automatically.
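
If you only care about the base word, the numeric suffixes can be stripped with a one-liner like the sketch below (a hypothetical helper, not part of the voxforge module - note that it simply discards the variant information rather than choosing the correct pronunciation):

import re

def strip_variant_suffix(word):
    # 'WORD2', 'WORD3', ... -> 'WORD'; a plain 'WORD' is left untouched.
    return re.sub(r'\d+$', '', word)

print strip_variant_suffix('WORD2')   # WORD
print strip_variant_suffix('WORD')    # WORD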


In [14]:
addPhonemesSpk(corp,'../data/lex.tgz')

In [15]:
print corp.keys()

spk=corp.keys()[0]

print corp[spk]

%xdel corp


['Bitbrit', 'Perygryne', 'dw96', 'bugsysservant', 'Krellis', 'nalbion', 'camdixon', 'anonymous_9', 'anonymous_8', 'anonymous_5', 'Splitlocked', 'anonymous_7', 'anonymous_6', 'anonymous_1', 'corno1979', 'anonymous_3', 'anonymous_2', 'ductapeguy', 'Dimon', 'poorsquinky', 'anonymous_4', 'anonymous_11', 'anonymous_10', 'rilomino', 'Robertcz', 'joel', 'snblitz', 'inequation', 'pcsnpny', 'Bahoke']
{'b0179': [array([ 0,  0,  0, ..., 29, 18,  3], dtype=int16), ['LET', 'THEM', 'GO', 'OUT', 'AND', 'EAT', 'WITH', 'MY', 'BOYS'], ['l', 'eh', 't', 'dh', 'eh', 'm', 'g', 'ow', 'aw', 't', 'ah', 'n', 'd', 'iy', 't', 'w', 'ih', 'dh', 'm', 'ay', 'b', 'oy', 'z']], 'b0178': [array([  0,   0,   0, ..., -27, -65, -57], dtype=int16), ['ALSO', 'I', 'WANT', 'INFORMATION'], ['ao', 'l', 's', 'ow', 'ay', 'w', 'aa', 'n', 't', 'ih', 'n', 'f', 'ao', 'r', 'm', 'ey', 'sh', 'ah', 'n']], 'b0184': [array([  0,   0,   0, ..., -13, -18,  -7], dtype=int16), ['SUCH', 'THINGS', 'IN', 'HER', 'BRAIN', 'WERE', 'LIKE', 'SO', 'MANY', 'OATHS', 'ON', 'HER', 'LIPS'], ['s', 'ah', 'ch', 'th', 'ih', 'ng', 'z', 'ih', 'n', 'hh', 'er', 'b', 'r', 'ey', 'n', 'w', 'er', 'l', 'ay', 'k', 's', 'ow', 'm', 'eh', 'n', 'iy', 'ow', 'dh', 'z', 'aa', 'n', 'hh', 'er', 'l', 'ih', 'p', 's']], 'b0185': [array([  0,   0,   0, ..., -14,  -7, -10], dtype=int16), ['YOUR', 'BEING', 'WRECKED', 'HERE', 'HAS', 'BEEN', 'A', 'GODSEND', 'TO', 'ME'], ['y', 'ao', 'r', 'b', 'iy', 'ih', 'ng', 'r', 'eh', 'k', 't', 'hh', 'iy', 'r', 'hh', 'ae', 'z', 'b', 'ih', 'n', 'ah', 'g', 'aa', 'd', 's', 'eh', 'n', 'd', 't', 'uw', 'm', 'iy']], 'b0182': [array([  0,   0,   0, ..., 141,  98,  56], dtype=int16), ['I', 'WAS', 'IN', 'NEW', 'YORK', 'WHEN', 'THE', 'CRASH', 'CAME'], ['ay', 'w', 'aa', 'z', 'ih', 'n', 'n', 'uw', 'y', 'ao', 'r', 'k', 'w', 'eh', 'n', 'dh', 'ah', 'k', 'r', 'ae', 'sh', 'k', 'ey', 'm']], 'b0183': [array([ 0,  0,  0, ..., -8, -1, 23], dtype=int16), ['NO', 'I', 'DID', 'NOT', 'FALL', 'AMONG', 'THIEVES'], ['n', 'ow', 'ay', 'd', 'ih', 'd', 'n', 'aa', 't', 'f', 'ao', 'l', 'ah', 'm', 'ah', 'ng', 'th', 'iy', 'v', 'z']], 'b0180': [array([  0,   0,   0, ..., -71,   2, -43], dtype=int16), ['I', 'I', 'BEG', 'PARDON', 'HE', 'DRAWLED'], ['ay', 'ay', 'b', 'eh', 'g', 'p', 'aa', 'r', 'd', 'ah', 'n', 'hh', 'iy', 'd', 'r', 'ao', 'l', 'd']], 'b0181': [array([  0,   0,   0, ...,  -5, -50, -32], dtype=int16), ['AND', 'YOU', 'PREFERRED', 'A', 'CANNIBAL', 'ISLE', 'AND', 'A', 'CARTRIDGE', 'BELT'], ['ah', 'n', 'd', 'y', 'uw', 'p', 'r', 'ah', 'f', 'er', 'd', 'ah', 'k', 'ae', 'n', 'ah', 'b', 'ah', 'l', 'ay', 'l', 'ah', 'n', 'd', 'ah', 'k', 'aa', 'r', 't', 'r', 'ah', 'jh', 'b', 'eh', 'l', 't']], 'b0177': [array([ 0,  0,  0, ...,  0, -3,  1], dtype=int16), ['HER', 'GRAY', 'EYES', 'WERE', 'FLASHING', 'AND', 'HER', 'LIPS', 'WERE', 'QUIVERING'], ['hh', 'er', 'g', 'r', 'ey', 'ay', 'z', 'w', 'er', 'f', 'l', 'ae', 'sh', 'ih', 'ng', 'ah', 'n', 'd', 'hh', 'er', 'l', 'ih', 'p', 's', 'w', 'er', 'k', 'w', 'ih', 'v', 'er', 'ih', 'ng']], 'b0176': [array([ 0,  0,  0, ...,  9, -1, -1], dtype=int16), ["I'LL", 'SEE', 'TO', 'POOR', 'HUGHIE'], ['ay', 'l', 's', 'iy', 't', 'uw', 'p', 'uw', 'r', 'hh', 'y', 'uw', 'iy']]}

Aligned corpus

As mentioned earlier, this sort of corpus has its downsides. For one, we don't know when each phoneme occurs, so we cannot train the system discriminatively. While it's still possible to train without alignments, it would be nice if we could start with a simpler example. Another problem is choosing the right pronunciation variant, as mentioned above.

To solve these issues, an automatic alignment was created using a different ASR system called Kaldi. Kaldi is a very good ASR solution that implements various types of models. It also ships with simple out-of-the-box scripts for training on the Voxforge data.

To create the alignments with Kaldi, a working system had to be trained first and, interestingly, the same Voxforge data was used to train it. How was this done? Well, Kaldi uses (among other things) a classic Gaussian Mixture Model trained with the EM algorithm. Initially the alignment is assumed to be spread evenly throughout the file, but as the system is trained iteratively, the model gets better and thus the alignment gets more accurate. The system is then retrained with gradually better models to achieve even more accurate results; the alignment provided here was generated using the "tri3b" model, as described in the scripts.

The alignments in Kaldi are stored in special binary files, but there are simple tools to convert them into something easier to use. The type of file chosen for this example is the CTM file: a text file containing a series of lines, each describing a single word or phoneme. Each line has 5 columns: encoded file name, an unused id (always 1), segment start, segment length and segment text (i.e. the word or phoneme name/value). This file was generated using Kaldi, compressed using gzip and stored as 'ali.ctm.gz' in the 'data' directory of this project.
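
To make the format concrete, here is a minimal sketch of reading the gzipped CTM file by hand (the convertCTMToAli method below does this parsing for us, plus all the bookkeeping we actually need):

import gzip

# Each CTM line has 5 columns:
#   <encoded file name> <unused id> <segment start> <segment length> <word or phoneme>
with gzip.open('../data/ali.ctm.gz') as ctm:
    for line in ctm:
        name, _, start, length, text = line.split()
        print name, float(start), float(length), text
        break  # print just the first entry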

Please note that the number of files in this aligned set is smaller than the actual count in the whole Voxforge dataset. This is because there is a small percentage of errors in the database (around 100 files or so) and some recordings are of such poor quality that Kaldi couldn't generate a reasonable alignment for them. We can simply ignore them here. This, however, doesn't mean that all the alignments present in the CTM are 100% accurate. There can still be mistakes, but hopefully they are rare enough not to cause any issues.

While this file contains everything we need, it'd be useful to convert it into a data structure that can be easily used in Python. The convertCTMToAli method is used for that:


In [ ]:
convertCTMToAli('../data/ali.ctm.gz','../data/phones.list','../audio','../data/ali.pklz')

We store the generated data structure in a gzipped and pickled file, so we don't need to perform this step more than once. This file is already included in the repository, so you can skip the step above.

We can read the file like this:


In [4]:
import gzip
import pickle
with gzip.open('../data/ali.pklz') as f:
    ali=pickle.load(f)
    
print 'Number of utterances: {}'.format(len(ali))


Number of utterances: 59274

Here is an example of the structure and its attributes loaded from that file:


In [5]:
print ali[100].spk
print ali[100].phones
print ali[100].ph_lens
print ali[100].archive
print ali[100].audiofile
print ali[100].data


Aaron
[0, 10, 11, 28, 38, 23, 1, 31, 3, 23, 6, 25, 31, 3, 3, 35, 31, 28, 34, 32, 17, 23, 17, 31, 0]
[480, 70, 70, 50, 70, 70, 90, 90, 30, 60, 170, 90, 90, 90, 80, 50, 120, 60, 80, 100, 40, 60, 80, 140, 720]
Aaron-20080318-ngh
b0346
None

Please note that the audio data is not yet loaded at this step (it's set to None).
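
If you want to inspect a single aligned utterance by hand, the archive and audiofile attributes are enough to locate its audio with the loadFile method shown earlier (a quick sketch; the loadAlignedCorpus method used below does this for the whole corpus):

utt = ali[100]
f = loadFile('../audio/{}.tgz'.format(utt.archive))   # Aaron-20080318-ngh.tgz
audio = f.data[utt.audiofile]                         # prompt 'b0346'
print audio.shape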

Test data

Before we go on, we need to prepare our test set. This needs to be completely independent from the training data and it needs to stay the same for all the experiments we want to do, if we want them to be comparable in any way. The test set also needs to be representative of the whole data we are working on (so it needs to be chosen randomly from the whole dataset).

This isn't the only way we could run our experiments - very often people use what is known as "k-fold cross-validation", but that would take a lot of time to do for all our experiments, so choosing a single representative evaluation set is the more convenient option.

Now, most corpora have a designated evaluation set: for example, TIMIT has a set of 192 files that is used by most papers on the subject. Voxforge doesn't have anything like that, and there aren't many papers out there using it as a resource anyway. One of the most advanced uses of Voxforge is in Kaldi, and there they simply shuffle the data and choose 20-30 random speakers for testing. To make our experiments at least somewhat comparable to TIMIT, we will do a similar thing here, but we will save the list of speakers (and their files) so anyone can use the same split when conducting their experiments.

WARNING If you want to compare the results of your own experiments to the ones from these notebooks, don't run the code below - use the files provided in the repo instead. If you run the code below, you will reset the test split and your experiments won't be strictly comparable to the ones from these notebooks.


In [26]:
import random

#make a list of speaker names
spk=set()
for utt in ali:
    spk.add(utt.spk)

print 'Number of speakers: {}'.format(len(spk))

#choose 20 random speakers
tst_spk=list(spk)
random.shuffle(tst_spk)
tst_spk=tst_spk[:20]


#save the list for reference - if anyone else wants to use our list (will be saved in the repo)
with open('../data/test_spk.list', 'w') as f:
    for s in tst_spk:
        f.write("{}\n".format(s))

ali_test=filter(lambda x: x.spk in tst_spk, ali)
ali_train=filter(lambda x: not x.spk in tst_spk, ali)

print 'Number of test utterances: {}'.format(len(ali_test))
print 'Number of train utterances: {}'.format(len(ali_train))

#shuffle the utterances, to make them more uniform
random.shuffle(ali_test)
random.shuffle(ali_train)

#save the data for future use
with gzip.open('../data/ali_test.pklz','wb') as f:
    pickle.dump(ali_test,f,pickle.HIGHEST_PROTOCOL)
    
with gzip.open('../data/ali_train.pklz','wb') as f:
    pickle.dump(ali_train,f,pickle.HIGHEST_PROTOCOL)


Number of speakers: 1953
Number of test utterances: 328
Number of train utterances: 58946

To make things more manageable for this demo, we will take 5% of the training set and work with that instead of the whole 80 hours. 5% should give us an amount of data similar to TIMIT. If you wish to re-run the experiments using the whole dataset, go to the bottom of this notebook for further instructions.


In [27]:
num=int(len(ali_train)*0.05)

ali_small=ali_train[:num]

with gzip.open('../data/ali_train_small.pklz','wb') as f:
    pickle.dump(ali_small,f,pickle.HIGHEST_PROTOCOL)

Here we load the actual data using the loadAlignedCorpus method. It loads the alignment and the corresponding audio for each utterance (this can take a while for larger corpora):


In [2]:
corp=loadAlignedCorpus('../data/ali_train_small.pklz','../audio')



We have to do the same for the test data:


In [3]:
corp_test=loadAlignedCorpus('../data/ali_test.pklz','../audio')



Now we can check that we have all the necessary data: phonemes, phoneme alignments and audio.


In [4]:
print 'Number of utterances: {}'.format(len(corp))

print 'List of phonemes:\n{}'.format(corp[0].phones)
print 'Lengths of phonemes:\n{}'.format(corp[0].ph_lens)
print 'Audio:\n{}'.format(corp[0].data)

samp_num=0
for utt in corp:
    samp_num+=utt.data.size

print 'Length of corpus: {} hours'.format(((samp_num/16000.0)/60.0)/60.0)


Number of utterances: 2947
List of phonemes:
[0, 16, 17, 38, 29, 21, 17, 22, 16, 2, 23, 38, 15, 28, 17, 27, 31, 10, 18, 11, 19, 17, 38, 3, 35, 10, 3, 31, 13, 7, 3, 21, 0]
Lengths of phonemes:
[930, 90, 70, 60, 180, 90, 70, 90, 120, 280, 90, 110, 70, 50, 40, 90, 30, 50, 120, 90, 120, 50, 30, 90, 50, 40, 60, 110, 140, 70, 40, 180, 510]
Audio:
[  6  16  28 ..., -31  16  21]
Length of corpus: 4.01091307292 hours

Feature extraction

To perform a simple test, we will use a standard set of audio features used in many, if not most, papers on speech recognition. The front-end first splits each file into a sequence of small chunks of equal size, giving about 100 such frames per second. Each chunk is then converted into a vector of 39 real values. Furthermore, each vector is assigned a phonetic class (a value from 0..39) thanks to the alignment created above. The problem can then be solved as a simple classification task that maps a real-valued vector to a phonetic class.
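
As a quick sanity check of the bookkeeping: with a 10 ms frame step (100 frames per second), the phoneme lengths stored in the alignment (which appear to be in milliseconds) translate directly into frame counts - this is the division by 10 used in extract_features below:

# First few phoneme lengths from ali[100] shown earlier, in milliseconds.
lens = [480, 70, 70, 50, 70]
frames = [l / 10 for l in lens]   # number of 10 ms frames per phoneme
print frames                      # [48, 7, 7, 5, 7]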

This particular set of features is calculated to match a specification developed for a classic toolkit known as HTK. All the details on this feature set can be found in the linked PyHTK repository. If you want to experiment with this feature set (highly encouraged), please read the description there.

In the code below, we extract the set of features for each utterance and store the results, together with the classification target for each frame. For performance reasons, we store everything in HDF5 format using the h5py library. This allows us to read data directly from the drive without wasting too much RAM. This isn't as important for small experiments, but it becomes relevant for the large ones.

The HDF5 file is organized by utterance: each utterance is stored as a group in the root of the file and has 2 datasets: inputs and outputs. Normalized inputs are added later as a third dataset.
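
With that layout, reading a single utterance back takes a few lines of h5py (a minimal sketch; the Corpus class used at the end of this notebook wraps this for us):

import h5py

# Assumes the file written by extract_features below.
with h5py.File('../data/mfcc_train_small.hdf5', 'r') as h5f:
    utt = h5f['utt00001']
    print utt['in'].shape    # (frames, 39) feature vectors
    print utt['out'].shape   # (frames,)   phoneme class per frame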

Since we intend to use this procedure more than once, we will encapsulate it into a function:


In [2]:
import sys

sys.path.append('../PyHTK/python')

import numpy as np
from HTKFeat import MFCC_HTK
import h5py

from tqdm import *

def extract_features(corpus, savefile):
    
    mfcc=MFCC_HTK()
    h5f=h5py.File(savefile,'w')

    uid=0
    for utt in tqdm(corpus):

        feat=mfcc.get_feats(utt.data)
        delta=mfcc.get_delta(feat)
        acc=mfcc.get_delta(delta)

        feat=np.hstack((feat,delta,acc))
        utt_len=feat.shape[0]

        o=[]
        for i in range(len(utt.phones)):
            num=utt.ph_lens[i]/10
            o.extend([utt.phones[i]]*num)

        # here we fix an off-by-one error that happens very infrequently
        if utt_len-len(o)==1:
            o.append(o[-1])

        assert len(o)==utt_len

        uid+=1
        #instead of a proper name, we simply use a unique identifier: utt00001, utt00002, ..., utt99999
        g=h5f.create_group('/utt{:05d}'.format(uid))
        
        g['in']=feat
        g['out']=o
        
        h5f.flush()
    
    h5f.close()

Now let's process the small training and test datasets:


In [6]:
extract_features(corp,'../data/mfcc_train_small.hdf5')
extract_features(corp_test,'../data/mfcc_test.hdf5')



Normalization

While usable as-is, many machine learning models will perform badly if the data isn't standardized. Standardization, or normalization, means making sure that all the samples are distributed on a reasonable scale - usually centered around 0 (with a mean of 0) and spread to a standard deviation of 1. The reason for this is that the data can come from various sources - some people are louder, some are quieter, some have higher pitched voices, some lower, some used more sensitive microphones than others, etc. That is why the audio from different sessions can have very different ranges of values. Normalization makes sure that all the recordings are brought to a similar scale before processing.

To perform normalization we simply compute the mean and standard deviation of a given signal, then subtract the mean and divide by the standard deviation (thus making the new mean 0 and the new standard deviation 1). A common question is which signal we use to compute these statistics. Do we calculate them once for the whole corpus, or once per utterance? Maybe once per speaker or once per session? Or maybe several times per utterance?

Generally, the longer the signal we compute the statistics on, the better (the statistics get more accurate), but doing it only once for the whole corpus doesn't make much sense for the reasons given above. The point of normalizing the data is to remove the differences between recording sessions, so at minimum we should normalize each session separately. In practice, it's easier to just normalize each utterance, as they are long enough on their own. This is sometimes called "batch normalization" (where each utterance is one batch).

But this assumes that the recording conditions don't change significantly throughout the utterance. In certain cases, it may actually be a good idea to split the utterance into several parts and normalize them separately, for example when the volume changes throughout the recording, or when there is more than one speaker in a single file. This is best solved with a technique known as "online normalization", which uses a sliding window to compute the statistics and can react to rapid changes in the values. This is, however, beyond the scope of this simple demo (and shouldn't really be necessary for this corpus anyway).
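
For completeness, a sliding-window version could look roughly like the sketch below (illustration only; it is not used anywhere else in these notebooks):

import numpy as np

def online_normalize(feat, win=100):
    # Normalize each frame using the mean/std of a window of `win` frames
    # around it, so the statistics can follow slow changes in volume or
    # channel within a single recording.
    out = np.empty(feat.shape)
    for t in range(feat.shape[0]):
        lo = max(0, t - win // 2)
        hi = min(feat.shape[0], t + win // 2 + 1)
        chunk = feat[lo:hi]
        out[t] = (feat[t] - chunk.mean()) / (chunk.std() + 1e-8)
    return out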


In [3]:
def normalize(corp_file):
    
    h5f=h5py.File(corp_file,'a')

    for utt in tqdm(h5f):
        
        f=h5f[utt]['in']
        n=f-np.mean(f)
        n/=np.std(n)        
        h5f[utt]['norm']=n
        
        h5f.flush()
        
    h5f.close()

In [14]:
normalize('../data/mfcc_train_small.hdf5')
normalize('../data/mfcc_test.hdf5')



To see what's inside, we can run the following command in a terminal (or directly from the notebook, as here):


In [15]:
!h5ls ../data/mfcc_test.hdf5/utt00001


in                       Dataset {486, 39}
norm                     Dataset {486, 39}
out                      Dataset {486}

Simple classification example

To finish, we will use a simple SGD classifier from the scikit-learn library to classify the phonemes in the database. We have all the datasets prepared above, so all we need to do is load the prepared arrays. We will use a special class called Corpus, included in the data.py file. In the constructor we provide the path to the file and say that we wish to load the normalized inputs. Next we use the get() method to load the lists of all the input and output values. This method returns a tuple - one element for inputs and one for outputs. Each of these is a list of arrays corresponding to individual utterances. We can then convert them into single contiguous arrays using the concatenate and vstack methods:


In [4]:
from data import Corpus
import numpy as np

train=Corpus('../data/mfcc_train_small.hdf5',load_normalized=True)
test=Corpus('../data/mfcc_test.hdf5',load_normalized=True)

g=train.get()
tr_in=np.vstack(g[0])
tr_out=np.concatenate(g[1])

print 'Training input shape: {}'.format(tr_in.shape)
print 'Training output shape: {}'.format(tr_out.shape)

g=test.get()
tst_in=np.vstack(g[0])
tst_out=np.concatenate(g[1])

print 'Test input shape: {}'.format(tst_in.shape)
print 'Test output shape: {}'.format(tst_out.shape)

train.close()
test.close()


Training input shape: (1438429, 39)
Training output shape: (1438429,)
Test input shape: (156679, 39)
Test output shape: (156679,)

Here we create the SGD classifier model. Please note that the settings below work with version 0.17 of scikit-learn, so it's recommended to upgrade to at least that version. If you can't, feel free to modify the settings to something that works for you. You may also turn on verbose to get more information on the training process; here it's off to save space in the notebook.
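
For reference, on a much newer scikit-learn release the equivalent settings would look roughly like this (in later versions the n_iter parameter was replaced by max_iter and the logistic loss is spelled 'log_loss'):

from sklearn.linear_model import SGDClassifier

# Rough equivalent for recent scikit-learn releases; the 0.17 call used
# below keeps loss='log' and n_iter=100 instead.
model = SGDClassifier(loss='log_loss', max_iter=100, n_jobs=-1, verbose=0)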


In [5]:
import sklearn
print sklearn.__version__

from sklearn.linear_model import SGDClassifier

model=SGDClassifier(loss='log',n_jobs=-1,verbose=0,n_iter=100)


0.17

Here we train the model. It took about five and a half minutes of wall time for me:


In [6]:
%time model.fit(tr_in,tr_out)


CPU times: user 56min 14s, sys: 7.4 s, total: 56min 21s
Wall time: 5min 31s
Out[6]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=100, n_jobs=-1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

Here we get about 52% accuracy, which is pretty bad for phoneme recognition. In other notebooks, we will try to improve on that.


In [7]:
acc=model.score(tst_in,tst_out)
print 'Accuracy: {:%}'.format(acc)


Accuracy: 51.881873%

Other data

Here we will also prepare the rest of the data for the other experiments. If you only wish to run the simple experiments and don't want to spend time preparing the large datasets, feel free to skip these steps. Be warned that the dataset for the full 80 hours of training data takes up to 10GB, so you will need that much free space on your drive, as well as that much RAM, to make it work with the code in this notebook.


In [ ]:
corp=loadAlignedCorpus('../data/ali_train.pklz','../audio')
extract_features(corp,'../data/mfcc_train.hdf5')
normalize('../data/mfcc_train.hdf5')


  0%|          | 136/58946 [00:20<53:25, 18.35it/s]

In [ ]: