The goal of synthetic speech detection is to determine whether a speech segment $S$ is natural or synthetic/converted speech.
This notebook implements a Gaussian mixture model maximum likelihood (GMM-ML) classifier for synthetic (spoofed) speech detection. This approach uses regular mel frequency cepstral coefficient (MFCC) features and gives the best performance on the ASVspoof 2015 dataset among the standard classifiers (GMM-SV, GMM-UBM, ...). For more background information see: Hanilçi, Cemal, Tomi Kinnunen, Md Sahidullah, and Aleksandr Sizov. "Classifiers for synthetic speech detection: a comparison." In INTERSPEECH 2015. The scripts use the Python package Bob.Bio.SPEAR 2.04 for speaker recognition.
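Concretely, the GMM-ML approach trains one GMM on natural speech and one on spoofed speech, and scores a segment by the log-likelihood ratio of its feature frames $X$ under the two models:

$$\Lambda(X) = \log p(X \mid \lambda_{nat}) - \log p(X \mid \lambda_{synt}),$$

where $\lambda_{nat}$ and $\lambda_{synt}$ denote the natural and synthetic GMMs. The segment is labeled natural when $\Lambda(X)$ exceeds a threshold (0 by default in the scoring cell below).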
This work is part of the "DDoS Resilient Emergency Dispatch Center" project at the University of Houston, funded by the Department of Homeland Security (DHS).
April 19, 2015
Lorenzo Rossi
(lorenzo [dot] rossi [at] gmail [dot] com)
In [48]:
import os
import time
import numpy as np
import pandas as pd
from bob.bio.spear import preprocessor, extractor
from bob.bio.gmm import algorithm
from bob.io.base import HDF5File
from bob.learn import em
from sklearn.metrics import classification_report, roc_curve, roc_auc_score
WAV_FOLDER = 'Wav/' #'ASV2015dataset/wav/' # Path to folder containing speakers .wav subfolders
LABEL_FOLDER = 'CM_protocol/' #'ASV2015dataset/CM_protocol/' # Path to ground truth csv files
EXT = '.wav'
%matplotlib inline
Load the dataframes (tables) with the labels for the training, development, and evaluation (hold-out) sets. Each subfolder corresponds to a different speaker: for example, T1 and D4 indicate the subfolders associated with the utterances and spoofed segments of speakers T1 and D4, in the training and development sets respectively. Note that the number of evaluation samples >> number of development samples >> number of training samples.
You can either select the speakers in each set one by one, e.g.:
train_subfls = ['T1', 'T2']
will only load segments from speakers T1 and T2 for training,
or use all the available speakers in a certain subset by leaving the list empty, e.g.:
devel_subfls = []
will load all the available Dx speaker segments for the development stage. If you are running this notebook for the first time, you may want to start with only 2 or so speakers per set for the sake of quick testing. The scripts may take several hours to run on the full-size datasets.
In [19]:
train_subfls = ['T1']#, 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9', 'T13'] #T13 used instead of T10 for gender balance
devel_subfls = ['D1']#, 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10']
evalu_subfls = ['E1']#, 'E2', 'E3', 'E4', 'E5', 'E6','E7', 'E8', 'E9', 'E10']
train = pd.read_csv(LABEL_FOLDER + 'cm_train.trn', sep=' ', header=None, names=['folder','file','method','source'])
if len(train_subfls): train = train[train.folder.isin(train_subfls)]
train.sort_values(['folder', 'file'], inplace=True)
devel = pd.read_csv(LABEL_FOLDER + 'cm_develop.ndx', sep=' ', header=None, names=['folder','file','method','source'])
if len(devel_subfls): devel = devel[devel.folder.isin(devel_subfls)]
devel.sort_values(['folder', 'file'], inplace=True)
evalu = pd.read_csv(LABEL_FOLDER +'cm_evaluation.ndx', sep=' ', header=None, names=['folder','file','method','source'])
if len(evalu_subfls): evalu = evalu[evalu.folder.isin(evalu_subfls)]
evalu.sort_values(['folder', 'file'], inplace=True)
label_2_class = {'human':1, 'spoof':0}
print('training samples:',len(train))
print('development samples:',len(devel))
print('evaluation samples:',len(evalu))
Silence removal and MFCC feature extraction for the training segments. More details about the bob.bio.spear libraries involved are available at: https://www.idiap.ch/software/bob/docs/latest/bioidiap/bob.bio.spear/master/implemented.html
You can also skip this stage and load a set of features (see the Loading features cell).
In [20]:
# Parameters
n_ceps = 60 # number of cepstral coefficients (implicit in extractor)
silence_removal_ratio = .1
In [21]:
subfolders = train_subfls
ground_truth = train
# initialize feature matrix
features = []
y = np.zeros((len(ground_truth),))
print("Extracting features for training stage.")
vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio)
cepstrum = extractor.Cepstral()
k = 0
start_time = time.clock()
for folder in subfolders:
    print(folder, end=", ")
    folder = "".join((WAV_FOLDER, folder, '/'))
    f_list = os.listdir(folder)
    for f_name in f_list:
        # ground truth
        try:
            label = ground_truth[ground_truth.file==f_name[:-len(EXT)]].source.values[0]
        except IndexError:
            continue
        y[k] = label_2_class[label]
        # silence removal
        x = vad.read_original_data(folder+f_name)
        vad_data = vad(x)
        if not vad_data[2].max():
            # retry with a lower threshold if no speech frames were detected
            vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio*.8)
            vad_data = vad(x)
            vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio)
        # MFCC extraction
        mfcc = cepstrum(vad_data)
        features.append(mfcc)
        k += 1
Xf = np.array(features)
print(k, "files processed in", (time.clock()-start_time)/60, "minutes.")
In [ ]:
np.save('X.npy',Xf)
np.save('y.npy',y)
print('Feature and label matrices saved to disk')
In [ ]:
# Load already extracted features to skip the preprocessing-extraction stage
Xf = np.load('train_features_10.npy')
y = np.load('y_10.npy')
Train the GMMs for natural and synthetic speech. For documentation on the bob.learn.em k-means and GMM machines see: https://pythonhosted.org/bob.learn.em/guide.html
You can also skip the training stage and load an already trained GMM model (see the Loading GMM Model cell).
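For readers without a Bob installation, the same two-model ML training scheme can be sketched with scikit-learn's GaussianMixture, which also uses a k-means bootstrap before EM by default. The data, Gaussian count, and dimensions below are illustrative stand-ins, not the notebook's actual configuration (there, the class-wise frame matrices are np.vstack(Xf[y==1]) and np.vstack(Xf[y==0])):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Stand-in feature matrices (frames x cepstral coefficients)
X_nat = rng.normal(loc=0.0, scale=1.0, size=(2000, 20))
X_synt = rng.normal(loc=2.0, scale=1.0, size=(2000, 20))

# One GMM per class, trained by EM with k-means initialization (the default)
gmm_nat = GaussianMixture(n_components=8, max_iter=25, init_params='kmeans',
                          random_state=0).fit(X_nat)
gmm_synt = GaussianMixture(n_components=8, max_iter=25, init_params='kmeans',
                           random_state=0).fit(X_synt)

# Log-likelihood ratio score of a test segment: positive => natural speech
segment = rng.normal(loc=0.0, scale=1.0, size=(150, 20))
llr = gmm_nat.score(segment) - gmm_synt.score(segment)
print(llr)
```

Note that GaussianMixture.score returns the average per-frame log-likelihood, so this llr corresponds to the gmm_nat(mfcc) - gmm_synt(mfcc) score computed with the Bob machines below.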
In [22]:
# Parameters of the GMM machines
n_gaussians = 128 # number of Gaussians
max_iterats = 25 # maximum number of iterations
In [5]:
# Initialize and train k-means machine: the means will initialize EM algorithm for GMM machine
start_time = time.clock()
kmeans_nat = em.KMeansMachine(n_gaussians,n_ceps)
kmeansTrainer = em.KMeansTrainer()
em.train(kmeansTrainer, kmeans_nat, np.vstack(Xf[y==1]), max_iterations = max_iterats, convergence_threshold = 1e-5)
#kmeans_nat.means
# initialize and train GMM machine
gmm_nat = em.GMMMachine(n_gaussians,n_ceps)
trainer = em.ML_GMMTrainer(True, True, True)
gmm_nat.means = kmeans_nat.means
em.train(trainer, gmm_nat, np.vstack(Xf[y==1]), max_iterations = max_iterats, convergence_threshold = 1e-5)
#gmm_nat.save(HDF5File('gmm_nat.hdf5', 'w'))
print("Done in:", (time.clock() - start_time)/60, "minutes")
print(gmm_nat)
In [6]:
# initialize and train k-means machine: the means will initialize EM algorithm for GMM machine
start_time = time.clock()
kmeans_synt = em.KMeansMachine(n_gaussians,n_ceps)
kmeansTrainer = em.KMeansTrainer()
em.train(kmeansTrainer, kmeans_synt, np.vstack(Xf[y==0]), max_iterations = max_iterats, convergence_threshold = 1e-5)
# initialize and train GMM machine
gmm_synt = em.GMMMachine(n_gaussians,n_ceps)
trainer = em.ML_GMMTrainer(True, True, True)
gmm_synt.means = kmeans_synt.means
em.train(trainer, gmm_synt, np.vstack(Xf[y==0]), max_iterations = max_iterats, convergence_threshold = 1e-5)
print("Done in:", (time.clock() - start_time)/60, "minutes")
#gmm_synt.save(HDF5File('gmm_synt.hdf5', 'w'))
print(gmm_synt)
In [4]:
# Load previously trained GMM machines to skip the training stage
gmm_nat = em.GMMMachine()
gmm_nat.load(HDF5File('gmm_nat.hdf5', 'r'))
gmm_synt = em.GMMMachine()
gmm_synt.load(HDF5File('gmm_synt.hdf5', 'r'))
In [ ]:
# Save scores and predicted classes (run after the scoring cell below)
np.save('p_gmm_ml_eval_10.npy', llr_gmm_score)
np.save('z_gmm_ml_eval_est_10.npy', z_gmm)
In [24]:
status = 'devel' # 'devel' (= test) OR 'evalu' (= hold out)
start_time = time.clock()
if status == 'devel':
    subfolders = devel_subfls
    ground_truth = devel
elif status == 'evalu':
    subfolders = evalu_subfls
    ground_truth = evalu
n_subfls = len(subfolders)
# initialize score and class arrays
llr_gmm_score = np.zeros(len(ground_truth),)
z_gmm = np.zeros(len(ground_truth),)
print(status)
vad = preprocessor.Energy_Thr(ratio_threshold=.1)
cepstrum = extractor.Cepstral()
k = 0
thr = .5
speaker_list = ground_truth.folder.unique()
for speaker_id in speaker_list:
    f_list = list(ground_truth[ground_truth.folder==speaker_id].file)
    folder = "".join([WAV_FOLDER, speaker_id, '/'])
    print(speaker_id, end=',')
    for f in f_list:
        f_name = "".join([folder, f, EXT])
        x = vad.read_original_data(f_name)
        # voice activity detection
        vad_data = vad(x)
        if not vad_data[2].max():
            # retry with a lower threshold if no speech frames were detected
            vad = preprocessor.Energy_Thr(ratio_threshold=.08)
            vad_data = vad(x)
            vad = preprocessor.Energy_Thr(ratio_threshold=.1)
        # MFCC extraction
        mfcc = cepstrum(vad_data)
        # log-likelihood ratio computation
        llr_gmm_score[k] = gmm_nat(mfcc) - gmm_synt(mfcc)
        z_gmm[k] = int(llr_gmm_score[k] > 0)
        k += 1
ground_truth['z'] = ground_truth.source.map(lambda x: int(x=='human'))
ground_truth['z_gmm'] = z_gmm
ground_truth['score_gmm'] = llr_gmm_score
print(k, "files processed in", (time.clock()-start_time)/60, "minutes.")
# Performance evaluation (class 1 = human, class 0 = spoof)
humans = z_gmm[np.array(ground_truth.z)==1]
spoofed = z_gmm[np.array(ground_truth.z)==0]
fnr = 100*(humans < thr).sum()/len(humans)    # humans rejected as spoofed
fpr = 100*(spoofed >= thr).sum()/len(spoofed) # spoofed accepted as human
print("ROC AUC score:", roc_auc_score(ground_truth.z, llr_gmm_score))
print("False negative rate %:", fnr)
print("False positive rate %:", fpr)
print("EER %: <=", (fnr+fpr)/2)
In [47]:
# Tune the decision threshold on the LLR scores (class 1 = human, class 0 = spoof)
thr = -.115
pz = llr_gmm_score
humans = pz[np.array(ground_truth.z)==1]
spoofed = pz[np.array(ground_truth.z)==0]
fnr = 100*(humans <= thr).sum()/len(humans)   # humans rejected as spoofed
fpr = 100*(spoofed > thr).sum()/len(spoofed)  # spoofed accepted as human
print("False negative vs positive rates %:", fnr, fpr)
print("FNR - FPR %:", fnr-fpr)
if np.abs(fnr-fpr) < .25:
    print("EER =", (fnr+fpr)/2, "%")
else:
    print("EER ~", (fnr+fpr)/2, "%")
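Rather than tuning the threshold by hand until FNR and FPR meet, the EER can be read directly off the ROC curve with the already imported roc_curve. The scores and labels below are synthetic stand-ins for llr_gmm_score and ground_truth.z, used only to illustrate the computation:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.RandomState(0)
# Stand-in scores: humans (label 1) score higher than spoofed segments (label 0)
scores = np.concatenate([rng.normal(1, 1, 500), rng.normal(-1, 1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FNR ~ FPR
eer = (fnr[idx] + fpr[idx]) / 2
print("EER %:", 100 * eer, "at threshold", thresholds[idx])
```

The returned threshold can then replace the hand-tuned thr above.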