Data preprocessing


In [ ]:
%load_ext autoreload
%autoreload 2
import numpy as np #numerical arrays
import scipy #scientific computing library
import scipy.io.wavfile #wav file reading and writing
import sys, os #file and directory handling routines
import matplotlib #plotting
import matplotlib.pyplot as plt

Audio file reading

We can use scipy.io to read and write audio files without any additional audio library. In this example we read a stereo file of shape (132299, 2), where the first dimension is time and the second holds the left and right channels.


In [ ]:
path_to_irmas = './IRMAS-Sample/'

#read audio file
audio_file = '001__[vio][nod][cou_fol]2194__1.wav'
sampleRate, audioObj = scipy.io.wavfile.read(os.path.join(path_to_irmas,'Training','vio',audio_file)) 

print audioObj.shape
plt.plot(audioObj)
plt.show()

As you can see in the plot above, the array is not normalized.

We included a function to read an audio file in util.py, which does this normalization step and also returns the bitrate.


In [ ]:
import util

audio, sampleRate, bitrate = util.readAudioScipy(os.path.join(path_to_irmas,'Training','vio',audio_file)) 

plt.plot(audio.sum(axis=1))
plt.show()

If single-channel audio is desired, we can sum the two channels: audio.sum(axis=1). Note that summing doubles the amplitude range; averaging the channels keeps the original range, as in the minimal sketch below.
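

In [ ]:
#alternative downmix (sketch): average instead of sum, keeping the original amplitude range
mono = audio.mean(axis=1)
print mono.shape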

Feature computation

We extract features from audio which can be used for training the neural network.

In transform.py you can find a function which computes the short-time Fourier transform (STFT) of a single-channel audio signal. The STFT spectrogram is computed over overlapping windows of nfft=1024 samples (about 0.023s at 44.1 kHz). Consecutive windows are hopsize=512 samples (about 0.012s) apart.
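
As a quick sanity check, here is a minimal cell computing these durations, assuming the IRMAS sample rate of 44100 Hz:


In [ ]:
fs = 44100.0 #sample rate assumed for the IRMAS files
print 'window duration: '+str(1024/fs)+' s' #~0.023 s
print 'hop duration: '+str(512/fs)+' s' #~0.012 s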


In [ ]:
import transform

spectrogram = transform.stft_norm(audio.sum(axis=1), window=np.hanning(1024), hopsize=512, nfft=1024, fs=float(sampleRate))

spectrogram.shape

The function returns 261 vectors, one for each overlapping frame. Each vector has a feature size of 513 (nfft/2+1 frequency bins).

Usually, we discard the phase, and we obtain the magnitude spectrogram (real numbers).

We also need to normalize by the size of the window.


In [ ]:
print spectrogram[0,0]
mag = np.abs(spectrogram)
print mag[0,0]

mag = mag / np.sqrt(1024) #normalize by the window size
mag.shape

Let's plot this spectrogram as an image with matplotlib.


In [ ]:
#print plt.style.available
plt.style.use('classic')
plt.imshow(np.log10(1+100*mag).T,interpolation='none', origin='lower')
plt.show()

One can easily distinguish evenly-spaced horizontal lines which correspond to the harmonic partials of a played instrument note: the partials appear at integer multiples of the fundamental frequency. Take a look at how different notes appear in the image and what the relation between the harmonic partials is. We will try to learn this information from spectrograms to discriminate between instruments.

There are many more time-frequency representations besides the STFT, some of them built on top of the STFT, such as the mel spectrogram, widely used in speech processing.

To obtain a mel spectrogram we first need a set of mel filters. We can use librosa for that; the mel function from this library is included in transform.py.


In [ ]:
mel_basis = transform.mel(sampleRate, n_fft=1024, n_mels=96, fmin=0, fmax=float(sampleRate)/2, htk=False,norm=1)
mel_basis.shape

Then we multiply the STFT magnitude spectrogram with the mel filters to obtain the mel spectrogram.

Be careful with the dimensions. Some libraries consider the first dimension to be time and the second frequency, while others use the opposite convention.


In [ ]:
melspec = np.dot(mel_basis, mag.T)
melspec.shape

Let's plot the mel spectrogram. As you can see, we reduced the number of features from 513 to 96 and put more emphasis on the lower frequencies, which are better discriminated by human hearing.


In [ ]:
plt.imshow(np.log10(1+100*melspec),interpolation='none', origin='lower')
plt.show()

Next, we save the mel spectrogram to a binary file. We also need to save the shape to a text file, so we can reshape the flat numpy array when we load it back. This way of saving data to disk is faster than using pickle or npz files.

The functions saveTensor and loadTensor, which can be found in util.py, take care of both steps.


In [ ]:
feature_dir_train = os.path.join(path_to_irmas,'features','Training')
if not os.path.exists(feature_dir_train):
    os.makedirs(feature_dir_train)

#save the features to file
melspec.tofile(os.path.join(feature_dir_train,audio_file.replace('.wav','.data')))
#load the features from file
melspecin = np.fromfile(os.path.join(feature_dir_train,audio_file.replace('.wav','.data')))
print 'input spectrogram shape '+str(melspecin.shape)
#we need the saved shape because np.fromfile returns a flat array
melspecin = melspecin.reshape(melspec.shape)
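
For reference, here is a minimal sketch of what such a save/load pair could look like (the actual implementations in util.py may differ, e.g. in suffix handling; this sketch assumes float64 data):


In [ ]:
def saveTensor_sketch(t, out_path):
    #write the raw values, plus the shape in a side file
    t.tofile(out_path)
    with open(out_path.replace('.data','.shape'),'w') as f:
        f.write(','.join([str(s) for s in t.shape]))

def loadTensor_sketch(out_path):
    #read the flat float64 array and restore its shape
    t = np.fromfile(out_path)
    with open(out_path.replace('.data','.shape')) as f:
        shape = tuple([int(s) for s in f.read().split(',')])
    return t.reshape(shape)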

In transform.py we provide a class that does this feature computation for you: transformMEL. This class has a method compute_transform which computes the mel spectrogram and saves it to a '.data' file.


In [ ]:
from transform import transformMEL
tr = transformMEL(bins=96, frameSize=1024, hopSize=512)

Let's compute the training features for all the audio files in the training dataset. We go through the instrument list and read all the wav files from the corresponding directories. Then we compute the features using compute_transform. We also need to save the label of each sound example.

The suffix argument of compute_transform is used to distinguish between various features. In this case we use '_mel_' for the mel spectrogram and '_label_' for the label associated with the instrument.


In [ ]:
d=os.path.join(path_to_irmas,'Training')
instruments = sorted(filter(lambda x: os.path.isdir(os.path.join(d, x)), os.listdir(d)))

for count,inst in enumerate(instruments):
    for f in os.listdir(os.path.join(d,inst)):
        if os.path.isfile(os.path.join(d,inst, f)) and f.endswith('.wav'):
            audio, sampleRate, bitrate = util.readAudioScipy(os.path.join(d,inst,f)) 
            tr.compute_transform(audio.sum(axis=1),out_path=os.path.join(feature_dir_train,f.replace('.wav','.data')),suffix='_mel_',sampleRate=sampleRate)
            util.saveTensor(np.array([count],dtype=float),out_path=os.path.join(feature_dir_train,f.replace('.wav','.data')),suffix='_label_')

Note that saveTensor and loadTensor work with float numbers, thus we have to convert our labels from int to float.

Batch generation

Batch generation is a pre-training stage in which we read the '.data' files from disk and group them into batches. Other steps, such as normalization or PCA whitening, can be implemented at this stage. We also need to shuffle the examples, as presenting them grouped by instrument can bias the training.

Training happens sequentially, batch by batch, until there is no data left. Then we shuffle the data again and repeat the process, as in the schematic sketch below.
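
Schematically, the training loop could look like the following sketch (db is the dataset object built later in this section; the actual optimization step is left out):


In [ ]:
#schematic training loop (sketch, not executable until db is defined below)
#for epoch in range(num_epochs):
#    for i in range(db.iteration_size):
#        features, labels = db() #one shuffled batch per call
#        #...run one training step on (features, labels)...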

Building a list of the .data files

Let's make a list of the '.data' files we computed previously. We use suffix_in and suffix_out to filter the feature files. In this case, we have mel spectrograms as input ('_mel_') and instrument labels as output ('_label_').


In [ ]:
#build a list with all the .data files in the training dataset 
feature_dir_train = os.path.join(path_to_irmas,'features','Training')
suffix_in='_mel_'
suffix_out='_label_'
file_list = [f for f in os.listdir(feature_dir_train) 
            if f.endswith(suffix_in+'.data') and 
            os.path.isfile(os.path.join(feature_dir_train,f.replace(suffix_in,suffix_out))) ]
print 'training file list: \n'+str(file_list)

We can plot the first spectrogram to confirm that the data is correct.


In [ ]:
#let's load the first file
melspec = util.loadTensor(out_path=os.path.join(feature_dir_train,file_list[0]))
plt.imshow(np.log10(1+100*melspec.T),interpolation='none', origin='lower')
plt.show()
print 'input spectrogram shape '+str(melspec.shape)
label = util.loadTensor(out_path=os.path.join(feature_dir_train,file_list[0].replace('mel','label')))
print 'label of the instrument '+str(label)+', representing '+instruments[int(label)]

Allocating memory for the data

Convolutional neural networks (CNNs) usually work with images of a specific size: if we change the resolution or the shape of the image, we usually need to train a new model. Therefore, we need to cut our spectrogram into blocks of time_context frames (x-axis). Similarly, if we increase the resolution of the spectrogram, we have more features on the y-axis and we need to re-train the network.

In this case, the CNN models a time_context of 128 frames (128 hops of 512 samples, about 1.5s at 44.1 kHz). We split each example into blocks of this size. We can create more data by overlapping these blocks; in this case we use a step of 50 time frames. In total, we generate 3 examples (instances, or training data points) from the previously loaded '.data' file.


In [ ]:
#parameters to generate blocks of data from spectrograms
time_context=128 #time context modeled by the network
step = 50 #step to generate more blocks

noinstances = np.maximum(1,int(np.ceil((float(melspec.shape[0])-time_context)/ step)))

print "number of instances: "+str(noinstances)

This is a function that computes the number of examples we can generate for a given file.

We can call it for all the features '.data' files to see how many training examples we have.


In [ ]:
def getNumInstances(infile,time_context=100,step=25):
    """
    For a single .data file, computes the number of examples of size 'time_context' that can be created
    """
    shape = util.get_shape(infile.replace('.data','.shape'))
    length_file = float(shape[0])
    return np.maximum(1,int(np.ceil((length_file-time_context)/ step))) 

noinstances=getNumInstances(infile=os.path.join(feature_dir_train,file_list[0]),time_context=time_context,step=step)
print "number of instances: "+str(noinstances)

total_noinstances = np.cumsum(np.array([0]+[ getNumInstances(os.path.join(feature_dir_train,infile),time_context=time_context,step=step) for infile in file_list], dtype=int))
print "cumulative sum of number of instances per .data file: "+str(total_noinstances)
total_points = total_noinstances[-1]
print "total number of instances "+str(total_points)

Similarly, we can determine the feature_size for the input features. We can read this from the '.shape' file which is associated with a '.data' file.


In [ ]:
def getFeatureSize(infile):
    """
    For a single .data file, returns the number of features, i.e. the number of spectrogram bins
    """
    shape = util.get_shape(infile.replace('.data','.shape'))
    return shape[1]

feature_size = getFeatureSize(infile=os.path.join(feature_dir_train,file_list[0]))
print "feature size "+str(feature_size)

We use this information to allocate arrays for the features and the labels. If you train on a GPU (e.g. a Titan X), allocate with np.float32, as it saves memory.


In [ ]:
floatX=np.float32
####features
features=np.zeros((total_points,time_context,feature_size),dtype=floatX)
####labels
labels = np.zeros((total_points,1),dtype=floatX)

Slicing data

Let's slice the first .data file with index id=0 and store it in the features array.


In [ ]:
id = 0
spec = util.loadTensor(out_path=os.path.join(feature_dir_train,file_list[id]))
print 'shape of spectrogram '+str(spec.shape)
lab = util.loadTensor(out_path=os.path.join(feature_dir_train,file_list[id].replace(suffix_in,suffix_out)))
#we put the features in the features array starting at this index
idx_start = total_noinstances[id]
#and stop at this index
idx_end = total_noinstances[id+1]

#copy each block of size (time_context,feature_size) into features
idx = 0 #index of the current block
start = 0 #starting frame of the current block
while idx<(idx_end-idx_start):
    print 'segment '+str(idx) + ' from '+str(start)+ ' to '+ str(start+time_context)
    features[idx_start+idx] = spec[start:start+time_context]
    plt.subplot(1,idx_end-idx_start,idx+1)
    plt.imshow(np.log10(1+100*features[idx_start+idx].T),interpolation='none', origin='lower')
    start = start + step
    idx = idx + 1
labels[idx_start:idx_end] = lab[0]
plt.show()

We populate these arrays with the examples generated from each file. Inside the dataset class, this is done by the fetchFile method:


In [ ]:
def fetchFile(self,id):
    #load the data files
    spec = util.loadTensor(out_path=os.path.join(self.feature_dir,self.file_list[id]))
    lab = util.loadTensor(out_path=os.path.join(self.feature_dir,self.file_list[id].replace(self.suffix_in,self.suffix_out)))
    #we put the features in the self.features array starting at this index
    idx_start = self.total_noinstances[id]
    #and stop at this index in self.features
    idx_end = self.total_noinstances[id+1]

    #copy each block of size (time_context,feature_size) into self.features
    idx = 0 #index of the current block
    start = 0 #starting frame of the current block
    while idx<(idx_end-idx_start):
        self.features[idx_start+idx] = spec[start:start+self.time_context]
        start = start + self.step
        idx = idx + 1
    self.labels[idx_start:idx_end] = lab[0]
    spec = None
    lab = None

Returning batches

Wrapping everything together, we can put all these routines in a class, which reads the data from the disk and returns batches for the training.

If we have a total of 36 training examples and keep 20% for validation (train_percent=0.8), we are left with 29 training examples. With batch_size=6, this means we will have 4 full batches of 6 examples each.
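
The arithmetic behind these numbers, assuming the training size is rounded up:


In [ ]:
total = 36
train_percent = 0.8
train = int(np.ceil(train_percent*total)) #29 training examples (assuming the split rounds up)
print train, train//6 #29 examples yield 4 full batches of 6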


In [ ]:
import dataset
from dataset import MyDataset

db=MyDataset(feature_dir=feature_dir_train, batch_size=6, time_context=128, step=50, 
             suffix_in='_mel_',suffix_out='_label_',floatX=np.float32,train_percent=0.8)
print "total number of instances: "+str(db.total_points)
print "batch_size: "+str(db.batch_size)
print "iteration size: "+str(db.iteration_size)
print "feature shape: "+str(db.features.shape)
print "labels shape: "+str(db.labels.shape)
print "feature validation shape: "+str(db.features_valid.shape)
print "labels validation shape: "+str(db.labels_valid.shape)

We can even implement a __call__ method which returns a training batch each time the object is called.

What do we return at each call? We keep track of the current position using iteration_step and shift by batch_size.

This is what the call method returns at each iteration step.


In [ ]:
#what we return at each call
feat=db.features[db.iteration_step*db.batch_size:(db.iteration_step+1)*db.batch_size]
#the same slice, with an extra channel axis (explained below)
feat=db.features[db.iteration_step*db.batch_size:(db.iteration_step+1)*db.batch_size,np.newaxis]
lab=db.labels[db.iteration_step*db.batch_size:(db.iteration_step+1)*db.batch_size]

When working with CNNs, we should add an axis for the channels: 3 (R, G, B) in image processing, and 1 in this case, since the spectrogram can be regarded as a monochromatic image. Thus, we use np.newaxis to add the new axis.

Different deep learning frameworks require a different order for the axes. You can change it with np.swapaxes.
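
A small illustration of both steps, assuming a batch of shape (batch, time, features):


In [ ]:
x = np.zeros((6,128,96),dtype=np.float32) #(batch, time, features)
x = x[:,np.newaxis] #(6, 1, 128, 96): channel axis added, channels-first
print np.swapaxes(x,1,3).shape #(6, 96, 128, 1): channels-last order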

We can plot one of the examples in the batch (index 2 below), which is different every time because of the batch shuffling.


In [ ]:
features,labels = db()
#we did np.swapaxes for tensorflow in self.iterate(), so here we swap back
features = np.swapaxes(features,1,3)

print "iteration step "+str(db.iteration_step)
print features.shape
print labels.shape
#we go back from categorical to numerical labels
label = np.nonzero(labels[2,:])[0]
print 'instrument: '+instruments[int(label)]

plt.imshow(np.log10(1+100*features[2,0,:,:].T),interpolation='none', origin='lower')
plt.show()

What if you don't need the call method and you want the object to return all the data at once?

We can do this with db.getData() and db.getValidation().


In [ ]:
features, labels, features_valid, labels_valid = db.getData()
print "training features shape "+str(features.shape)
print "validation features shape "+str(features_valid.shape)
features_valid, labels_valid = db.getValidation()

Open issues

  • Datasets that don't fit into memory

Make a list with all the files and the number of examples you can extract from each one, as we did previously. Then load only part of them into memory, according to some memory limit.


In [ ]:
memory_limit=np.minimum(30,total_points)
floatX=np.float32
####features
features=np.zeros((memory_limit,time_context,feature_size),dtype=floatX)
####labels
labels = np.zeros((memory_limit,1),dtype=floatX)

  • Parallel computation (one possible direction is sketched below)
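
One possible direction, as a sketch only: compute the features of each file in a pool of worker processes with Python's multiprocessing module. This assumes a fork-based platform (Linux/macOS), so that tr, util and the other notebook globals are visible to the workers:


In [ ]:
from multiprocessing import Pool

def process_file(args):
    d, inst, f = args
    audio, sampleRate, bitrate = util.readAudioScipy(os.path.join(d,inst,f))
    tr.compute_transform(audio.sum(axis=1),
        out_path=os.path.join(feature_dir_train,f.replace('.wav','.data')),
        suffix='_mel_',sampleRate=sampleRate)

tasks = [(d,inst,f) for inst in instruments
         for f in os.listdir(os.path.join(d,inst)) if f.endswith('.wav')]
pool = Pool(processes=4) #the number of workers is an arbitrary choice
pool.map(process_file,tasks)
pool.close()
pool.join()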