This is a notebook for a speech siamese network. The goal is to add a siamese network on top of the speech command network to make a one-shot speech command model. The model takes two pieces of audio as input and tells whether they contain the same speech command or not. If the accuracy is good enough, we may take it into a product for voice trigger or voice command, which would be useful for all kinds of products.

The trick is whether the siamese network can make one-shot recognition accurate enough.


In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import hashlib
import math, time, datetime
import os.path
import random
import re
import sys
import tarfile

import matplotlib.pyplot as plt
import numpy as np
import librosa as rosa
import librosa.display
from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Dropout, Flatten, Lambda, BatchNormalization, Activation
#from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio
#from tensorflow.python.ops import io_ops
#from tensorflow.python.platform import gfile
#from tensorflow.python.util import compat

default_number_of_mfcc=128
default_sample_rate=16000
default_hop_length=512 
default_wav_duration=1 # 1 second
default_train_samples=10000
default_test_samples=100
default_epochs=10
default_batch_size=32
default_wanted_words=["one", "two", "bed", "backward", "bird", "cat", "dog", "eight", "five", "follow", "forward", "four", "go", "happy", "house", "learn", "left", "marvin", "nine", "no", "off", "right", "seven", "sheila", "stop", "three", "tree", "visual", "wow", "zero","up"]
#for mac
#speech_data_dir="/Users/hermitwang/Downloads/speech_dataset"
#default_model_path="/Users/hermitwang/Downloads/pretrained/speech_siamese"
#for ubuntu
speech_data_dir="/home/hermitwang/TrainingData/datasets/speech_dataset"
default_model_path="/home/hermitwang/TrainingData/pretrained/speech_siamese"

One shot keyword trigger

Here is another implementation of one-shot learning for a keyword trigger with librosa MFCC. librosa cannot be put into the TensorFlow graph, so the MFCC computation has to be done before the conv network; that means load_wav_mfcc has to convert every wav file to an MFCC vector. The open questions and steps:
1. What is a good MFCC vector dimension? (20, 127) may not be the right input for a conv network.
2. Even though the MFCC output of librosa is not the same as TensorFlow's contrib decode_wav pipeline, it is enough as long as it carries all the audio features. Feeding the librosa MFCC output into the conv net should still learn a good feature abstraction.
3. The conv net does not need to be complicated: conv2d -> maxpooling -> conv2d -> flatten -> dense with softmax.
4. Build the training network with librosa and the conv net.
5. Take the dense vector output as the feature extractor.
6. Build the siamese network with the feature extractor.
7. Maybe add a couple of dense layers to learn the feature mapping and comparison for the siamese network.
8. If that works, we get one-shot learning for a keyword trigger.
9. In reality, we still have to work out how to split the audio stream into clips to feed the librosa MFCC (see the sketch after this list).
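
For step 9, here is a minimal sketch (not part of the training code below) of how a longer recording could be cut into fixed-length clips for the librosa MFCC front end; frame_stream and the 0.5-second hop are illustrative choices, not something used elsewhere in this notebook.

In [ ]:
# Sketch for step 9 (illustrative only): slice a long recording into
# overlapping fixed-length clips that can be fed to the librosa MFCC front end.
import numpy as np
import librosa as rosa

def frame_stream(wav_path, clip_seconds=1.0, hop_seconds=0.5, sr=16000):
    y, _ = rosa.load(wav_path, sr=sr)
    clip_len = int(clip_seconds * sr)
    hop_len = int(hop_seconds * sr)
    clips = []
    for start in range(0, max(1, len(y) - clip_len + 1), hop_len):
        clip = y[start:start + clip_len]
        if len(clip) < clip_len:                      # zero-pad a short tail
            clip = np.pad(clip, (0, clip_len - len(clip)))
        clips.append(rosa.feature.mfcc(y=clip, sr=sr, n_mfcc=128))
    return np.stack(clips)                            # (num_clips, 128, 32) with the defaults above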

MFCC

Extract MFCC from a wav file. What are the right wav parameters for the MFCC output?

tensorflow speech command parameter {'desired_samples': 16000, 'window_size_samples': 480, 'window_stride_samples': 160, 'spectrogram_length': 98, 'fingerprint_width': 40, 'fingerprint_size': 3920, 'label_count': 12, 'sample_rate': 16000, 'preprocess': 'mfcc', 'average_window_width': -1}

Mel-frequency cepstral coefficients (MFCCs). Parameters:
y: np.ndarray [shape=(n,)] or None - audio time series
sr: number > 0 [scalar] - sampling rate of y
S: np.ndarray [shape=(d, t)] or None - log-power Mel spectrogram
n_mfcc: int > 0 [scalar] - number of MFCCs to return
Returns:
M: np.ndarray [shape=(n_mfcc, t)] - MFCC sequence

The MFCC output needs more study.
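
As a rough cross-check (a sketch only, not the pipeline used below), the TensorFlow speech-command settings above map approximately to librosa arguments: fingerprint_width -> n_mfcc, window_size_samples -> n_fft, window_stride_samples -> hop_length. With center=False the frame count matches spectrogram_length=98, although the MFCC values themselves will still differ from TensorFlow's implementation.

In [ ]:
# Rough mapping of the TensorFlow speech-command settings to librosa (a sketch):
#   fingerprint_width=40      -> n_mfcc=40
#   window_size_samples=480   -> n_fft=480
#   window_stride_samples=160 -> hop_length=160
import numpy as np
import librosa as rosa

dummy = np.zeros(16000, dtype=np.float32)   # 1 second of silence at 16 kHz
mfcc = rosa.feature.mfcc(y=dummy, sr=16000, n_mfcc=40,
                         n_fft=480, hop_length=160, center=False)
# The shape matches spectrogram_length=98; the values do not match TensorFlow's
# MFCC implementation exactly.
print(mfcc.shape)   # expected: (40, 98)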

How to calculate the length of the MFCC vector

Short Answer

You can change the length by changing the parameters used in the STFT calculation. The following call will double the size of your output (20 x 113658):

data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1012, hop_length=256, n_mfcc=20)

Long Answer

Librosa's librosa.feature.mfcc() function really just acts as a wrapper around librosa's librosa.feature.melspectrogram() function (which in turn wraps the librosa.core.stft and librosa.filters.mel functions).

All of the parameters pertaining to segmentation of the audio signal, namely the frame and overlap values, are used in the Mel-scaled power spectrogram function (with other tunable parameters passed down to the nested core functions). You specify these parameters as keyword arguments in the librosa.feature.mfcc() function.

All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel()

By Default, the Mel-scaled power spectrogram window and hop length are the following:

n_fft=2048

hop_length=512

So assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:

output length = (seconds) * (sample rate) / (hop_length)

(1319) * (22050) / (512) = 56804 frames

For this notebook the MFCC matrix size is 128 x 32: with n_mfcc=128, sr=16000, hop_length=512 and 1 second of audio, the frame count is ceil(1 * 16000 / 512) = ceil(31.25) = 32.
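
A minimal sanity check of that frame count (assuming a silent 1-second dummy clip; with librosa's default centered framing the count is 1 + floor(16000/512) = 32, which agrees with the estimate above):

In [ ]:
# Sanity check: the MFCC frame count for a 1-second, 16 kHz clip with
# hop_length=512 should be 32, matching the 128 x 32 size quoted above.
import math
import numpy as np
import librosa as rosa

dummy = np.zeros(1 * 16000, dtype=np.float32)
mfcc = rosa.feature.mfcc(y=dummy, sr=16000, n_mfcc=128, hop_length=512)
print(mfcc.shape)                          # expected: (128, 32)
print(math.ceil(1 * 16000 / 512))          # expected: 32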


In [ ]:
def load_wav_mfcc(filename):
    wav_loader, sample_rate = rosa.load(filename, sr=default_sample_rate)
    #print(rosa.get_duration(wav_loader, sample_rate))
    wav_mfcc = rosa.feature.mfcc(y=wav_loader, sr=default_sample_rate, n_mfcc=default_number_of_mfcc)
    return wav_mfcc

def get_default_mfcc_length(default_wav_duration=1):
    length = int(math.ceil(default_wav_duration * default_sample_rate / default_hop_length))
    return length

def mfcc_display(mfccs):
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(mfccs, x_axis='time')
    plt.colorbar()
    plt.title('MFCC')
    plt.tight_layout()
    
wav_mfcc_data = load_wav_mfcc(speech_data_dir + "/six/ffd2ba2f_nohash_3.wav")
print(wav_mfcc_data.shape)
mfcc_display(wav_mfcc_data)

wav_mfcc_data = load_wav_mfcc(speech_data_dir + "/six/ffd2ba2f_nohash_3.wav") #""/five/56eab10e_nohash_0.wav")
print(wav_mfcc_data.shape)
mfcc_display(wav_mfcc_data)

Wav MFCC loader

Wav file loader and export mfcc sequence.

0. Go through all wav files and mix background noise into the command wav files (not implemented here; see the sketch after this list).
1. Go through all wav files and convert them to MFCC sequences.
2. Construct pairs of MFCC sequences and a target (0 or 1: 0 for different commands, 1 for the same command). For same-word pairs (e.g. 1000 of them), randomly pick one word index and two wav indices; for different-word pairs (e.g. 1000), randomly pick two word indices and one wav index each. The format is [mfcc 1, mfcc 2, 0/1 for different/same].
3. Prepare the pairs of MFCCs and targets according to the batch size.
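
Step 0 is not implemented in the loader class below. Here is a minimal sketch of what it could look like, assuming the dataset's standard _background_noise_ folder; mix_background and the 0.1 noise volume are illustrative, not part of the training pipeline.

In [ ]:
# Sketch for step 0 (not used by WavMFCCLoader): mix a random slice of
# background noise into a command clip before computing its MFCCs.
import os, random
import numpy as np
import librosa as rosa

def mix_background(command_wav, noise_dir, noise_volume=0.1, sr=16000):
    y, _ = rosa.load(command_wav, sr=sr)
    noise_files = [os.path.join(noise_dir, f)
                   for f in os.listdir(noise_dir) if f.lower().endswith('.wav')]
    noise, _ = rosa.load(random.choice(noise_files), sr=sr)
    start = random.randint(0, max(0, len(noise) - len(y)))
    noise = noise[start:start + len(y)]
    if len(noise) < len(y):                          # pad if the noise is shorter
        noise = np.pad(noise, (0, len(y) - len(noise)))
    return np.clip(y + noise_volume * noise, -1.0, 1.0)

# Example (the _background_noise_ folder name is the dataset convention):
#mixed = mix_background(speech_data_dir + "/six/ffd2ba2f_nohash_3.wav",
#                       speech_data_dir + "/_background_noise_")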


In [ ]:
class WavMFCCLoader(object):
    def __init__(self, data_dir, wanted, validation_percentage=0, testing_percentage=0):
        self.data_dir = data_dir
        self.wanted = wanted
        self.default_mfcc_length=get_default_mfcc_length(default_wav_duration)
        self.wav_files = dict()
        self.wav_file_index()
        
    def wav_file_index(self):
        for dirpath, dirnames, files in os.walk(self.data_dir):
            for name in files:
                if name.lower().endswith('.wav'):
                    word_name = dirpath.rsplit('/', 1)[1];
                    if word_name in self.wanted:
                        file_name = os.path.join(dirpath, name)
                        #print(file_name, dirpath, word_name)
    
                        if word_name in self.wav_files.keys():
                            self.wav_files[word_name].append(file_name)
                        else:
                            self.wav_files[word_name] = [file_name]
                    
        return self.wav_files


    def wavs_to_mfcc_pair(self):
        how_many_words = len(self.wanted)
        a_index = random.randint(0, how_many_words - 1)
        b_index = random.randint(0, how_many_words - 1)
        # label is 1 if both clips are the same word, 0 otherwise
        if (a_index > b_index):  # different words roughly half the time: negative pair
            a_wav_index = random.randint(0, len(self.wav_files[self.wanted[a_index]]) - 1)
            b_wav_index = random.randint(0, len(self.wav_files[self.wanted[b_index]]) - 1)
            mfcc_1 = load_wav_mfcc(self.wav_files[self.wanted[a_index]][a_wav_index])
            mfcc_2 = load_wav_mfcc(self.wav_files[self.wanted[b_index]][b_wav_index])
            mfcc_pair = 0            
        else:  # same word (a_index) for both clips: positive pair
            a_wav_index = random.randint(0, len(self.wav_files[self.wanted[a_index]]) - 1)
            b_wav_index = random.randint(0, len(self.wav_files[self.wanted[a_index]]) - 1)
            mfcc_1 = load_wav_mfcc(self.wav_files[self.wanted[a_index]][a_wav_index])
            mfcc_2 = load_wav_mfcc(self.wav_files[self.wanted[a_index]][b_wav_index])
            mfcc_pair = 1
            
        #print("aaa", mfcc_1.shape, mfcc_2.shape)    
        return mfcc_1, mfcc_2, mfcc_pair
        
    def get_mfcc_pairs(self, how_many):
        mfcc1_data = np.zeros((how_many, default_number_of_mfcc, self.default_mfcc_length))
        mfcc2_data = np.zeros((how_many, default_number_of_mfcc, self.default_mfcc_length))
        same_data = np.zeros(how_many)
        for i in range(how_many):
            
            mfcc1_data_, mfcc2_data_, same_data[i] = self.wavs_to_mfcc_pair()
            mfcc1_data[i, :, 0:mfcc1_data_.shape[1]] = mfcc1_data_
            mfcc2_data[i, :, 0:mfcc2_data_.shape[1]] = mfcc2_data_
            #np.append(mfcc1_data, mfcc1_)
            #np.append(mfcc2_data, mfcc2_)
            #np.append(same_data, same_)          
        #print(mfcc_pairs)
        return mfcc1_data, mfcc2_data, same_data
        
loader = WavMFCCLoader(speech_data_dir, wanted=["one", "two", "bed", "backward", "bird", "cat", "dog", "eight", "five", "follow", "forward", "four", "go", "happy", "house", "learn", "left", "marvin", "nine", "no", "off", "right", "seven", "sheila", "stop", "three", "tree", "visual", "wow", "zero"])
#wav_list = loader.wav_file_index()
mfcc1_data, mfcc2_data, same_pair = loader.get_mfcc_pairs(100)
print(same_pair)

Conv Network

Create a Keras conv network that takes the MFCC vector as input.

The speech command MFCC input shape is (?, mfcc_number, hop_number, 1).


In [ ]:
def create_keras_model(fingerprint_shape, is_training=True):
    model = Sequential()
    model.add(Conv2D(input_shape=fingerprint_shape, filters=64, kernel_size=3, use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation("relu"))
    model.add(MaxPooling2D())
    #if (is_training):
    #    model.add(Dropout(0.5))
    model.add(Conv2D(filters=64, kernel_size=3, use_bias=False)) 
    model.add(BatchNormalization())
    model.add(Activation("relu"))

    model.add(MaxPooling2D())
    #if (is_training):
    #    model.add(Dropout(0.5))
    model.add(Conv2D(filters=64, kernel_size=3, use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation("relu"))

    model.add(MaxPooling2D())
    
    model.add(Flatten())
    model.add(Dense(4096))
    model.add(BatchNormalization())
    model.add(Activation("sigmoid"))    
    if (is_training):
        model.add(Dropout(0.5))
    #model.add(Dense(labels_count, activation="softmax"))
    
    return model
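
A quick shape check of the encoder (a sketch; the 128 x 32 x 1 input is the MFCC shape used in this notebook, and the encoder should end in a 4096-dimensional embedding):

In [ ]:
# Quick sanity check (a sketch): build the encoder for the notebook's MFCC
# shape and confirm it outputs a 4096-dimensional embedding.
encoder = create_keras_model((default_number_of_mfcc, get_default_mfcc_length(), 1))
print(encoder.output_shape)   # expected: (None, 4096)
encoder.summary()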

In [ ]:
def create_siamese_model(input_shape, siamese_mode = 'concat'):
    right_input = Input(input_shape)
    left_input = Input(input_shape)
    keras_model = create_keras_model(input_shape)
    
    right_encoder = keras_model(right_input)
    left_encoder = keras_model(left_input)
    if (siamese_mode == 'minus'):
        concatenated_layer = Lambda(lambda x: x[0]-x[1], output_shape=lambda x: x[0])([right_encoder, left_encoder])
    elif (siamese_mode == 'abs'):
        concatenated_layer = Lambda(lambda x: tf.abs(x[0]-x[1]), output_shape=lambda x: x[0])([right_encoder, left_encoder])
    #elif (siamese_mode == "eu"):
    #    concatenated_layer = Lambda(lambda x: tf.sqrt(tf.reduce_sum(tf.square(x[0]-x[1]), 2)), output_shape=lambda x: x[0])([right_encoder, left_encoder])
    else:
        raise ValueError("unknown siamese_mode")
        
    output_layer = Dense(1, activation='sigmoid')(concatenated_layer)
    siamese_model = Model([right_input, left_input], output_layer)
    return siamese_model
    
def siamese_train(local_siamese_mode='abs', train_samples=default_train_samples, wanted_words=default_wanted_words):
    default_mfcc_length = get_default_mfcc_length(default_wav_duration)
    siamese_model = create_siamese_model((default_number_of_mfcc,default_mfcc_length,1), siamese_mode=local_siamese_mode)

    siamese_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    loader = WavMFCCLoader(speech_data_dir, wanted=wanted_words)
    mfcc1_data, mfcc2_data, pairs = loader.get_mfcc_pairs(train_samples)
    x1_train = mfcc1_data.reshape((train_samples, default_number_of_mfcc, default_mfcc_length, 1)) #np.random.random((1000, 98, 40, 1))
    x2_train = mfcc2_data.reshape((train_samples, default_number_of_mfcc, default_mfcc_length, 1)) #np.random.random((1000, 98, 40, 1))
    y_train = pairs  #keras.utils.to_categorical(pairs, num_classes=1)
    
    
    siamese_model.fit([x1_train, x2_train], y_train, epochs=default_epochs, batch_size=default_batch_size)
    
    
    mfcc1_test, mfcc2_test, pairs_test = loader.get_mfcc_pairs(default_test_samples)
    x1_test = mfcc1_test.reshape((default_test_samples, default_number_of_mfcc,default_mfcc_length, 1))
    x2_test = mfcc2_test.reshape((default_test_samples, default_number_of_mfcc,default_mfcc_length, 1))
    y_test = pairs_test 
    
    loss, accuracy = siamese_model.evaluate([x1_test, x2_test], y_test)    
    
    siamese_model.save(default_model_path+"/speech_siamese"+str(datetime.date.today())+".h5")

    print(loss)
    return accuracy

def siamese_test(test_samples=default_test_samples, wanted_words=default_wanted_words):
    default_mfcc_length = get_default_mfcc_length(default_wav_duration)
    loader = WavMFCCLoader(speech_data_dir, wanted=wanted_words)
    siamese_model = keras.models.load_model(default_model_path+"/speech_siamese"+str(datetime.date.today())+".h5")    
    mfcc1_test, mfcc2_test, pairs_test = loader.get_mfcc_pairs(test_samples)
    x1_test = mfcc1_test.reshape((test_samples, default_number_of_mfcc,default_mfcc_length, 1))
    x2_test = mfcc2_test.reshape((test_samples, default_number_of_mfcc,default_mfcc_length, 1))
    y_test = pairs_test 
    
    loss, accuracy = siamese_model.test_on_batch(x=[x1_test, x2_test], y=y_test)
    print(loss)
    return accuracy

Siamese Network

main


In [ ]:
#wav_mfcc = load_wav_mfcc("/Users/hermitwang/Downloads/speech_dataset/backward/0a2b400e_nohash_0.wav")
#print(wav_mfcc.shape) 
score=siamese_train(local_siamese_mode='abs', train_samples=1000, wanted_words=["one", "two", "cat", "dog", "bed", "backward", "eight", "five", "follow", "forward", "four", "go", "happy", "house", "learn", "left", "marvin", "nine", "no", "off", "right", "seven", "sheila", "stop", "three", "tree", "visual", "wow", "zero","up"])
print(score)
score=siamese_test(wanted_words=["five", "follow", "bird"])

print(score)
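
As a usage sketch of the one-shot trigger idea (illustrative; it assumes a model file saved today by siamese_train and reuses the notebook's save-path convention), a single pair of clips can be scored and thresholded:

In [ ]:
# Sketch of one-shot inference: score one pair of clips with the saved siamese
# model and decide "same command" with a 0.5 threshold (threshold is illustrative).
def same_command(wav_a, wav_b, threshold=0.5):
    mfcc_length = get_default_mfcc_length(default_wav_duration)
    model = keras.models.load_model(
        default_model_path + "/speech_siamese" + str(datetime.date.today()) + ".h5")
    x = np.zeros((2, default_number_of_mfcc, mfcc_length, 1))
    for i, wav in enumerate((wav_a, wav_b)):
        mfcc = load_wav_mfcc(wav)
        x[i, :, 0:mfcc.shape[1], 0] = mfcc
    score = float(model.predict([x[0:1], x[1:2]])[0, 0])
    return score, score >= threshold

#print(same_command(speech_data_dir + "/five/56eab10e_nohash_0.wav",
#                   speech_data_dir + "/six/ffd2ba2f_nohash_3.wav"))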

Result

conv2d 3 -> conv2d 3 -> dense 1024 -> dropout 0.5 -> siamese with abs -> dense with sigmoid: train 0.6590, test 0.69

conv2d 7 -> conv2d 5 -> conv2d 3 -> dense 1024 -> dropout 0.5 -> siamese with abs -> dense with sigmoid: train 0.6750, test 0.61

The accuracy is not good enough so far. Basically it is not overfitting, it is underfitting, which means the model cannot express the problem. There are two possible ways to improve: 1, improve the data quality; 2, add batch normalization (BN). 10-21-2018:

Try 1: batch normalization improves performance quite a bit: train from 0.68 to 0.94, test from 0.69 to 0.77, with the conv stack at 64 filters 7x7, 64 filters 5x5, 64 filters 3x3.

Try 2: add more filters, the first conv2d layer from 64 to 256, the second from 64 to 128, filter size 3. It takes much longer to train: train 0.95, test 0.69, so it is overfitting.

Try 3: keep the filter count at 64 for all conv2d layers, filter size 3x3: train 0.95, test 0.8.

Try 4: add dropout to reduce the overfitting: performance is quite poor on both train and test, around 0.62 and 0.44.

Try 5: keep dropout at 0.5 but add more filters in the first two conv2d layers (64 to 128): no improvement.

The facts

1. If the test words are in the training set, the performance is not too bad: training 0.85, evaluation 0.77, test 0.59.
2. If the test words are not in the training set, the test performance is quite bad: training 0.85, test 0.45.
3. If some of the test words are in the training set, the performance is: training 0.85, test 0.58.