This is a notebook for a speech siamese network. The goal is to add a siamese network on top of the speech command network to build a one-shot speech command model. The model takes two pieces of audio as input and tells whether they are the same speech command or not. If the accuracy is good enough, we may take it into products for voice trigger or voice command, which are useful for all kinds of products.
The open question is whether the siamese network can make one-shot recognition accurate enough.
In [ ]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import hashlib
import math, time, datetime
import os.path
import random
import re
import sys
import tarfile
import matplotlib.pyplot as plt
import numpy as np
import librosa as rosa
import librosa.display
from six.moves import urllib
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Dropout, Flatten, Lambda, BatchNormalization, Activation
#from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio
#from tensorflow.python.ops import io_ops
#from tensorflow.python.platform import gfile
#from tensorflow.python.util import compat
default_number_of_mfcc=128
default_sample_rate=16000
default_hop_length=512
default_wav_duration=1 # 1 second
default_train_samples=10000
default_test_samples=100
default_epochs=10
default_batch_size=32
default_wanted_words=["one", "two", "bed", "backward", "bird", "cat", "dog", "eight", "five", "follow", "forward", "four", "go", "happy", "house", "learn", "left", "marvin", "nine", "no", "off", "right", "seven", "sheila", "stop", "three", "tree", "visual", "wow", "zero","up"]
#for mac
#speech_data_dir="/Users/hermitwang/Downloads/speech_dataset"
#default_model_path="/Users/hermitwang/Downloads/pretrained/speech_siamese"
#for ubuntu
speech_data_dir="/home/hermitwang/TrainingData/datasets/speech_dataset"
default_model_path="/home/hermitwang/TrainingData/pretrained/speech_siamese"
Here is another implementation of one-shot learning of a keyword trigger with librosa MFCC. librosa cannot be put into the TensorFlow graph, so the MFCC computation will be done before the conv network. That means load_wav_mfcc has to convert every wav file to an MFCC matrix. Things to understand and build:
1. What is a good MFCC vector dimension? (20, 127) may not be the right input for a conv network.
2. Even though the MFCC output of librosa is not the same as tensorflow contrib decode_wav, it is enough as long as it keeps all the audio features. Feeding the librosa MFCC output into the conv net should still give good feature abstraction.
3. The conv net may not be that difficult: conv2d -> maxpooling -> conv2d -> flatten -> dense with softmax.
4. Build the training network with librosa and the conv net.
5. Take the dense vector output as the feature extractor.
6. Build the siamese network with the feature extractor.
7. Maybe add a couple of dense layers to learn the feature mapping and comparison for the siamese part.
8. If that works, we get one-shot learning for a keyword trigger.
9. In reality, we still have to work out how to split the audio stream into audio clips as input to the librosa MFCC (see the sketch after this list).
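For step 9, here is a minimal sketch (not used by the rest of the notebook) of cutting a longer recording into fixed 1-second clips that could then be fed to the MFCC loader below; the recording path in the commented call is just a placeholder.
In [ ]:
# Minimal sketch for step 9: slice a longer recording into non-overlapping 1-second
# clips; each clip can then be converted to MFCCs like a single command wav file.
def split_stream_to_clips(filename, clip_duration=default_wav_duration,
                          sample_rate=default_sample_rate):
    samples, _ = rosa.load(filename, sr=sample_rate)
    clip_length = int(clip_duration * sample_rate)
    clips = []
    for start in range(0, len(samples) - clip_length + 1, clip_length):
        clips.append(samples[start:start + clip_length])
    return clips

#clips = split_stream_to_clips(speech_data_dir + "/some_long_recording.wav")  # placeholder path
#print(len(clips))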
Extract MFCC from a wav file: what wav parameters determine the MFCC output?
TensorFlow speech command parameters: {'desired_samples': 16000, 'window_size_samples': 480, 'window_stride_samples': 160, 'spectrogram_length': 98, 'fingerprint_width': 40, 'fingerprint_size': 3920, 'label_count': 12, 'sample_rate': 16000, 'preprocess': 'mfcc', 'average_window_width': -1}
Mel-frequency cepstral coefficients (MFCCs)
Parameters:
y:np.ndarray [shape=(n,)] or None
audio time series
sr:number > 0 [scalar]
sampling rate of y
S:np.ndarray [shape=(d, t)] or None
log-power Mel spectrogram
n_mfcc: int > 0 [scalar]
number of MFCCs to return
Returns:
M:np.ndarray [shape=(n_mfcc, t)]
MFCC sequence
Need more study of the MFCC output.
Short Answer
You can change the length by changing the parameters used in the stft calculations. The following code will double the size of your output (20 x 113658):
data = librosa.feature.mfcc(y=y, sr=sr, n_fft=1012, hop_length=256, n_mfcc=20)
Long Answer
Librosa's librosa.feature.mfcc() function really just acts as a wrapper to librosa's librosa.feature.melspectrogram() function (which is a wrapper to librosa.core.stft and librosa.filters.mel functions).
All of the parameters pertaining to segmentation of the audio signal - namely the frame and overlap values - are utilized in the Mel-scaled power spectrogram function (with other tunable parameters specified for nested core functions). You specify these parameters as keyword arguments in the librosa.feature.mfcc() function.
All extra **kwargs parameters are fed to librosa.feature.melspectrogram() and subsequently to librosa.filters.mel()
By Default, the Mel-scaled power spectrogram window and hop length are the following:
n_fft=2048
hop_length=512
So assuming you used the default sample rate (sr=22050), the output of your mfcc function makes sense:
output length = (seconds) * (sample rate) / (hop_length)
(1319) * (22050) / (512) = 56804 samples
So with the defaults here, the MFCC matrix size is 128 * 32:
1 second * 16000 / 512 = 31.25, rounded up to 32 frames
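As a quick sanity check of that formula (not part of the pipeline), computing MFCCs on a synthetic 1-second all-zero signal with the defaults above should print (128, 32):
In [ ]:
# Sanity check of the frame-count formula: 1 s at 16000 Hz with hop_length 512
# gives 31.25 -> 32 frames, so the shape should be (128, 32).
silence = np.zeros(default_sample_rate, dtype=np.float32)
check_mfcc = rosa.feature.mfcc(y=silence, sr=default_sample_rate,
                               n_mfcc=default_number_of_mfcc,
                               hop_length=default_hop_length)
print(check_mfcc.shape)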
In [ ]:
def load_wav_mfcc(filename):
wav_loader, sample_rate = rosa.load(filename, sr=default_sample_rate)
#print(rosa.get_duration(wav_loader, sample_rate))
wav_mfcc = rosa.feature.mfcc(y=wav_loader, sr=default_sample_rate, n_mfcc=default_number_of_mfcc)
return wav_mfcc
def get_default_mfcc_length(default_wav_duration=1):
length = int(math.ceil(default_wav_duration * default_sample_rate / default_hop_length))
return length
def mfcc_display(mfccs):
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()
wav_mfcc_data = load_wav_mfcc(speech_data_dir + "/six/ffd2ba2f_nohash_3.wav")
print(wav_mfcc_data.shape)
mfcc_display(wav_mfcc_data)
wav_mfcc_data = load_wav_mfcc(speech_data_dir + "/six/ffd2ba2f_nohash_3.wav") #""/five/56eab10e_nohash_0.wav")
print(wav_mfcc_data.shape)
mfcc_display(wav_mfcc_data)
Wav file loader that exports MFCC sequences.
0. Go through all wav files and add background noise into the command wav files (see the sketch below).
1. Go through all wav files and convert them to MFCC sequences.
2. Construct pairs of MFCC sequences plus a target (0 or 1; 0 for different commands, 1 for the same command). For the same-word pairs (e.g. 1000), randomly generate a key index plus the indexes of the first and second wav. For the different-word pairs (e.g. 1000), randomly generate two key indexes plus a wav index for each. The format will be [mfcc 1, mfcc 2, 0/1 for different/same].
3. Prepare the pairs of MFCCs and targets according to the batch size.
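Step 0 (mixing background noise into a command clip) is not implemented in the loader below; here is a minimal sketch of how it could be done, assuming the _background_noise_ folder of the speech_commands dataset is available under speech_data_dir.
In [ ]:
# Minimal sketch of step 0 (not used by WavMFCCLoader below): add a random slice of a
# long background-noise recording to a command clip at a given volume.
def mix_background(command_samples, noise_file, noise_volume=0.1,
                   sample_rate=default_sample_rate):
    noise, _ = rosa.load(noise_file, sr=sample_rate)
    start = random.randint(0, len(noise) - len(command_samples) - 1)
    noise_clip = noise[start:start + len(command_samples)]
    return command_samples + noise_volume * noise_clip

#command, _ = rosa.load(speech_data_dir + "/six/ffd2ba2f_nohash_3.wav", sr=default_sample_rate)
#noisy = mix_background(command, speech_data_dir + "/_background_noise_/white_noise.wav")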
In [ ]:
class WavMFCCLoader(object):
def __init__(self, data_dir, wanted, validation_percentage=0, testing_percentage=0):
self.data_dir = data_dir
self.wanted = wanted
self.default_mfcc_length=get_default_mfcc_length(default_wav_duration)
self.wav_files = dict()
self.wav_file_index()
def wav_file_index(self):
for dirpath, dirnames, files in os.walk(self.data_dir):
for name in files:
if name.lower().endswith('.wav'):
word_name = dirpath.rsplit('/', 1)[1];
if word_name in self.wanted:
file_name = os.path.join(dirpath, name)
#print(file_name, dirpath, word_name)
if word_name in self.wav_files.keys():
self.wav_files[word_name].append(file_name)
else:
self.wav_files[word_name] = [file_name]
return self.wav_files
    def wavs_to_mfcc_pair(self):
        how_many_words = len(self.wanted)
        a_index = random.randint(0, how_many_words - 1)
        b_index = random.randint(0, how_many_words - 1)
        if (a_index > b_index):
            # different words: one wav from word a, one wav from word b
            a_wav_index = random.randint(0, len(self.wav_files[self.wanted[a_index]]) - 1)
            b_wav_index = random.randint(0, len(self.wav_files[self.wanted[b_index]]) - 1)
            mfcc_1 = load_wav_mfcc(self.wav_files[self.wanted[a_index]][a_wav_index])
            mfcc_2 = load_wav_mfcc(self.wav_files[self.wanted[b_index]][b_wav_index])
            mfcc_pair = 0
        else:
            # same word: two wavs of word a
            a_wav_index = random.randint(0, len(self.wav_files[self.wanted[a_index]]) - 1)
            b_wav_index = random.randint(0, len(self.wav_files[self.wanted[a_index]]) - 1)
            mfcc_1 = load_wav_mfcc(self.wav_files[self.wanted[a_index]][a_wav_index])
            mfcc_2 = load_wav_mfcc(self.wav_files[self.wanted[a_index]][b_wav_index])
            mfcc_pair = 1
        #print("aaa", mfcc_1.shape, mfcc_2.shape)
        return mfcc_1, mfcc_2, mfcc_pair
def get_mfcc_pairs(self, how_many):
mfcc1_data = np.zeros((how_many, default_number_of_mfcc, self.default_mfcc_length))
mfcc2_data = np.zeros((how_many, default_number_of_mfcc, self.default_mfcc_length))
same_data = np.zeros(how_many)
        for i in range(0, how_many):  # was how_many - 1, which left the last pair all zeros
mfcc1_data_, mfcc2_data_, same_data[i] = self.wavs_to_mfcc_pair()
mfcc1_data[i, :, 0:mfcc1_data_.shape[1]] = mfcc1_data_
mfcc2_data[i, :, 0:mfcc2_data_.shape[1]] = mfcc2_data_
#np.append(mfcc1_data, mfcc1_)
#np.append(mfcc2_data, mfcc2_)
#np.append(same_data, same_)
#print(mfcc_pairs)
return mfcc1_data, mfcc2_data, same_data
loader = WavMFCCLoader(speech_data_dir, wanted=["one", "two", "bed", "backward", "bird", "cat", "dog", "eight", "five", "follow", "forward", "four", "go", "happy", "house", "learn", "left", "marvin", "nine", "no", "off", "right", "seven", "sheila", "stop", "three", "tree", "visual", "wow", "zero"])
#wav_list = loader.wav_file_index()
mfcc1_data, mfcc2_data, same_pair = loader.get_mfcc_pairs(100)
print(same_pair)
Create a Keras conv network that takes the MFCC matrix as input.
The speech command MFCC input shape is (?, mfcc_number, hop_number, 1).
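For a single clip that just means adding a batch dimension and a channel dimension to the (n_mfcc, frames) MFCC matrix, for example:
In [ ]:
# Illustration only: reshape one MFCC matrix into the (1, n_mfcc, frames, 1)
# layout the conv network below expects.
single_input = wav_mfcc_data[np.newaxis, :, :, np.newaxis]
print(single_input.shape)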
In [ ]:
def create_keras_model(fingerprint_shape, is_training=True):
model = Sequential()
model.add(Conv2D(input_shape=fingerprint_shape, filters=64, kernel_size=3, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D())
#if (is_training):
# model.add(Dropout(0.5))
model.add(Conv2D(filters=64, kernel_size=3, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D())
#if (is_training):
# model.add(Dropout(0.5))
model.add(Conv2D(filters=64, kernel_size=3, use_bias=False))
model.add(BatchNormalization())
model.add(Activation("relu"))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(4096))
model.add(BatchNormalization())
model.add(Activation("sigmoid"))
if (is_training):
model.add(Dropout(0.5))
#model.add(Dense(labels_count, activation="softmax"))
return model
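As a quick check (illustration only), the feature extractor can be instantiated on the (n_mfcc, frames, 1) shape used below and inspected; the 4096-unit sigmoid Dense layer is the embedding that the siamese network compares.
In [ ]:
# Build the feature extractor on the default MFCC input shape and print its layers;
# the final Dense(4096) output is used as the embedding by the siamese network.
feature_extractor = create_keras_model((default_number_of_mfcc,
                                        get_default_mfcc_length(default_wav_duration), 1))
feature_extractor.summary()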
In [ ]:
def create_siamese_model(input_shape, siamese_mode='abs'):  # only 'minus' and 'abs' are implemented below
right_input = Input(input_shape)
left_input = Input(input_shape)
keras_model = create_keras_model(input_shape)
right_encoder = keras_model(right_input)
left_encoder = keras_model(left_input)
if (siamese_mode == 'minus'):
concatenated_layer = Lambda(lambda x: x[0]-x[1], output_shape=lambda x: x[0])([right_encoder, left_encoder])
elif (siamese_mode == 'abs'):
concatenated_layer = Lambda(lambda x: tf.abs(x[0]-x[1]), output_shape=lambda x: x[0])([right_encoder, left_encoder])
#elif (siamese_mode == "eu"):
# concatenated_layer = Lambda(lambda x: tf.sqrt(tf.reduce_sum(tf.square(x[0]-x[1]), 2)), output_shape=lambda x: x[0])([right_encoder, left_encoder])
else:
raise ValueError("unknown siamese_mode")
output_layer = Dense(1, activation='sigmoid')(concatenated_layer)
siamese_model = Model([right_input, left_input], output_layer)
return siamese_model
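# NOTE: only 'minus' and 'abs' are implemented above. A 'concat' mode, if wanted, could
# merge the two encodings with keras.layers.Concatenate before the final Dense layer,
# for example (illustration only, not used in the experiments below):
#     elif (siamese_mode == 'concat'):
#         concatenated_layer = keras.layers.Concatenate()([right_encoder, left_encoder])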
def siamese_train(local_siamese_mode='abs', train_samples=default_train_samples, wanted_words=default_wanted_words):
default_mfcc_length = get_default_mfcc_length(default_wav_duration)
siamese_model = create_siamese_model((default_number_of_mfcc,default_mfcc_length,1), siamese_mode=local_siamese_mode)
siamese_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
loader = WavMFCCLoader(speech_data_dir, wanted=wanted_words)
mfcc1_data, mfcc2_data, pairs = loader.get_mfcc_pairs(train_samples)
x1_train = mfcc1_data.reshape((train_samples, default_number_of_mfcc, default_mfcc_length, 1)) #np.random.random((1000, 98, 40, 1))
x2_train = mfcc2_data.reshape((train_samples, default_number_of_mfcc, default_mfcc_length, 1)) #np.random.random((1000, 98, 40, 1))
y_train = pairs #keras.utils.to_categorical(pairs, num_classes=1)
siamese_model.fit([x1_train, x2_train], y_train, epochs=default_epochs, batch_size=default_batch_size)
mfcc1_test, mfcc2_test, pairs_test = loader.get_mfcc_pairs(default_test_samples)
x1_test = mfcc1_test.reshape((default_test_samples, default_number_of_mfcc,default_mfcc_length, 1))
x2_test = mfcc2_test.reshape((default_test_samples, default_number_of_mfcc,default_mfcc_length, 1))
y_test = pairs_test
loss, accuracy = siamese_model.evaluate([x1_test, x2_test], y_test)
siamese_model.save(default_model_path+"/speech_siamese"+str(datetime.date.today())+".h5")
print(loss)
return accuracy
def siamese_test(test_samples=default_test_samples, wanted_words=default_wanted_words):
default_mfcc_length = get_default_mfcc_length(default_wav_duration)
loader = WavMFCCLoader(speech_data_dir, wanted=wanted_words)
siamese_model = keras.models.load_model(default_model_path+"/speech_siamese"+str(datetime.date.today())+".h5")
mfcc1_test, mfcc2_test, pairs_test = loader.get_mfcc_pairs(test_samples)
x1_test = mfcc1_test.reshape((test_samples, default_number_of_mfcc,default_mfcc_length, 1))
x2_test = mfcc2_test.reshape((test_samples, default_number_of_mfcc,default_mfcc_length, 1))
y_test = pairs_test
loss, accuracy = siamese_model.test_on_batch(x=[x1_test, x2_test], y=y_test)
print(loss)
return accuracy
Siamese Network
In [ ]:
#wav_mfcc = load_wav_mfcc("/Users/hermitwang/Downloads/speech_dataset/backward/0a2b400e_nohash_0.wav")
#print(wav_mfcc.shape)
score=siamese_train(local_siamese_mode='abs', train_samples=1000, wanted_words=["one", "two", "cat", "dog", "bed", "backward", "eight", "five", "follow", "forward", "four", "go", "happy", "house", "learn", "left", "marvin", "nine", "no", "off", "right", "seven", "sheila", "stop", "three", "tree", "visual", "wow", "zero","up"])
print(score)
score=siamese_test(wanted_words=["five", "follow", "bird"])
print(score)
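Once a model has been trained and saved, it can be used for one-shot comparison of two clips. A minimal sketch, reusing the loader helpers above; the wav paths in the commented call are just examples from the dataset.
In [ ]:
# Minimal sketch of one-shot comparison: convert two wav files to MFCCs, pad them into
# the (1, n_mfcc, frames, 1) input layout and let the siamese model score same/different.
def siamese_compare(model, wav_a, wav_b):
    length = get_default_mfcc_length(default_wav_duration)
    batch_a = np.zeros((1, default_number_of_mfcc, length, 1))
    batch_b = np.zeros((1, default_number_of_mfcc, length, 1))
    mfcc_a = load_wav_mfcc(wav_a)
    mfcc_b = load_wav_mfcc(wav_b)
    batch_a[0, :, 0:mfcc_a.shape[1], 0] = mfcc_a
    batch_b[0, :, 0:mfcc_b.shape[1], 0] = mfcc_b
    return model.predict([batch_a, batch_b])[0][0]

#model = keras.models.load_model(default_model_path+"/speech_siamese"+str(datetime.date.today())+".h5")
#print(siamese_compare(model, speech_data_dir + "/five/56eab10e_nohash_0.wav",
#                      speech_data_dir + "/six/ffd2ba2f_nohash_3.wav"))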
conv2d 3 -> conv2d 3 -> dense 1024 -> dropout 0.5 -> siamese with abs -> dense with sigmoid: train 0.6590, test 0.69
conv2d 7 -> conv2d 5 -> conv2d 3 -> dense 1024 -> dropout 0.5 -> siamese with abs -> dense with sigmoid: train 0.6750, test 0.61
The accuracy is not good enough so far. Basically it is not overfitting, it is underfitting: the model cannot express the problem.
There are two possible ways to improve:
1. Improve the data quality; add bn into the data.
10-21-2018:
try 1: Batch normalization improves performance quite a bit: train 0.68 -> 0.94, test 0.69 -> 0.77. The conv2d layers were 64 filters 7*7, 64 filters 5*5, 64 filters 3*3.
try 2: add more filters; the first conv2d layer filter count goes from 64 -> 256, the second from 64 -> 128, filter size 3*3. It takes much longer to train. train 0.95, test 0.69: it is overfitting.
try 3: keep the filter count at 64 for all conv2d layers, filter size 3*3. train 0.95, test 0.8.
try 4: add dropout to reduce the overfitting. The performance is quite poor on both train and test, around 0.62 and 0.44.
try 5: keep dropout at 0.5 but add more filters in the first two conv2d layers (64 -> 128). No improvement.
1. If the test words are in the training set, the performance is not too bad: training 0.85, evaluate 0.77, test 0.59.
2. If the test words are not in the training set, the test performance is quite bad: training 0.85, test 0.45 (see the word-split sketch below).
3. If some of the test words are in the training set: training 0.85, test 0.58.
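To measure the true one-shot behaviour (case 2 above), the wanted words can be split into disjoint train and test lists before calling siamese_train and siamese_test; a minimal sketch:
In [ ]:
# Minimal sketch of a disjoint word split: train on one subset of default_wanted_words
# and test only on words the model has never seen.
random.seed(0)                       # only to make the split reproducible
shuffled_words = list(default_wanted_words)
random.shuffle(shuffled_words)
train_words = shuffled_words[:-5]    # all but 5 words for training
test_words = shuffled_words[-5:]     # 5 held-out words for one-shot testing
#score = siamese_train(local_siamese_mode='abs', train_samples=1000, wanted_words=train_words)
#print(score, siamese_test(wanted_words=test_words))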