Spoofed Speech Detection via Maximum Likelihood Estimation of Gaussian Mixture Models

The goal of synthetic speech detection is to determine whether a speech segment $S$ is natural or synthetic/converted speech.

This notebook implements a Gaussian mixture model maximum likelihood (GMM-ML) classifier for synthetic (spoofed) speech detection. The approach uses regular mel-frequency cepstral coefficient (MFCC) features and gives the best performance on the ASVspoof 2015 dataset among the standard classifiers (GMM-SV, GMM-UBM, ...). For more background information see: Hanilçi, Cemal, Tomi Kinnunen, Md Sahidullah, and Aleksandr Sizov. "Classifiers for synthetic speech detection: a comparison." In INTERSPEECH 2015. The scripts use the Python package Bob.Bio.SPEAR 2.04 for speaker recognition.
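
Concretely, two GMMs are trained, one on natural speech ($\lambda_{nat}$) and one on synthetic/converted speech ($\lambda_{synt}$). A segment $S$ with MFCC feature matrix $X$ is scored by the log-likelihood ratio

$$\Lambda(S) = \log p(X \mid \lambda_{nat}) - \log p(X \mid \lambda_{synt}),$$

and labeled as human when $\Lambda(S)$ exceeds a decision threshold (zero in the GMM-ML scoring cell below).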

This work is part of the "DDoS Resilient Emergency Dispatch Center" project at the University of Houston, funded by the Department of Homeland Security (DHS).

April 19, 2015

Lorenzo Rossi

(lorenzo [dot] rossi [at] gmail [dot] com)


In [48]:
import os
import time
import numpy as np
import pandas as pd
from bob.bio.spear import preprocessor, extractor
from bob.bio.gmm import algorithm
from bob.io.base import HDF5File
from bob.learn import em
from sklearn.metrics import classification_report, roc_curve, roc_auc_score

WAV_FOLDER = 'Wav/' #'ASV2015dataset/wav/' # Path to folder containing speakers .wav subfolders
LABEL_FOLDER = 'CM_protocol/' #'ASV2015dataset/CM_protocol/' # Path to ground truth csv files

EXT = '.wav'
%matplotlib inline

Loading the Ground Truth

Load the dataframes (tables) with the labels for the training, development, and evaluation (hold-out) sets. Each subfolder corresponds to a different speaker. For example, T1 and D4 indicate the subfolders associated with the utterances and spoofed segments of speakers T1 and D4, in the training and development sets respectively. Note that number of evaluation samples >> number of development samples >> number of training samples.

You can either select the speakers in each set one by one, e.g.:

train_subfls = ['T1', 'T2']

will only load segments from speakers T1 and T2 for training,

or use all the available speakers in a certain subset by leaving the list empty, e.g.:

devel_subfls = []

will load all the available Dx speaker segments for the development stage. If you are running this notebook for the first time, you may want to start with only 2 or so speakers per set for the sake of quick testing. The scripts may take several hours to run on the full-size datasets.


In [19]:
train_subfls = ['T1']#, 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9', 'T13']  #T13 used instead of T10 for gender balance
devel_subfls = ['D1']#, 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10']
evalu_subfls = ['E1']#, 'E2', 'E3', 'E4', 'E5', 'E6','E7',  'E8', 'E9', 'E10']
train = pd.read_csv(LABEL_FOLDER + 'cm_train.trn', sep=' ', header=None, names=['folder','file','method','source'])
if len(train_subfls): train = train[train.folder.isin(train_subfls)]
train.sort_values(['folder', 'file'], inplace=True)
devel = pd.read_csv(LABEL_FOLDER + 'cm_develop.ndx', sep=' ', header=None, names=['folder','file','method','source'])
if len(devel_subfls): devel = devel[devel.folder.isin(devel_subfls)]
devel.sort_values(['folder', 'file'], inplace=True)
evalu = pd.read_csv(LABEL_FOLDER +'cm_evaluation.ndx', sep=' ', header=None, names=['folder','file','method','source'])
if len(evalu_subfls): evalu = evalu[evalu.folder.isin(evalu_subfls)]
evalu.sort_values(['folder', 'file'], inplace=True)

label_2_class = {'human':1, 'spoof':0}

print('training samples:',len(train))
print('development samples:',len(devel))
print('evaluation samples:',len(evalu))


training samples: 655
development samples: 1525
evaluation samples: 4185

Speech Preprocessing and MFCC Extraction

Silence removal and MFCC feature extraction for the training segments. More details about the bob.bio.spear libraries involved are available at: https://www.idiap.ch/software/bob/docs/latest/bioidiap/bob.bio.spear/master/implemented.html

You can also skip this stage and load a set of features (see the Loading features cell).


In [20]:
# Parameters
n_ceps = 60 # number of cepstral coefficients (implicit in extractor)
silence_removal_ratio = .1

In [21]:
subfolders = train_subfls
ground_truth = train

# initialize feature matrix
features = []
y = np.zeros((len(ground_truth),))
print("Extracting features for training stage.")

vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio)
cepstrum = extractor.Cepstral()

k = 0
start_time = time.clock()

for folder in subfolders:
    print(folder, end=", ")
    folder = "".join((WAV_FOLDER, folder, '/'))
    f_list = os.listdir(folder)
    for f_name in f_list:
        # ground truth
        try:
            label = ground_truth[ground_truth.file==f_name[:-len(EXT)]].source.values[0]
        except IndexError:
            continue
        y[k] = label_2_class[label]
        # silence removal
        x = vad.read_original_data(folder+f_name)
        vad_data = vad(x)
        if not vad_data[2].max():
            # VAD removed every frame: retry once with a 20% lower threshold
            vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio*.8)
            vad_data = vad(x)
            vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio)
        # MFCC extraction
        mfcc = cepstrum(vad_data)
        features.append(mfcc)
        k += 1

Xf = np.array(features)
print(k,"files processed in",(time.clock()-start_time)/60,"minutes.")


Extracting features for training stage.
T1, 655 files processed in 1.8766250499999992 minutes.

Saving features


In [ ]:
np.save('X.npy',Xf)
np.save('y.npy',y)
print('Feature and label matrices saved to disk')

Loading features


In [ ]:
# Load already extracted features to skip the preprocessing-extraction stage
Xf = np.load('train_features_10.npy')
y = np.load('y_10.npy')

GMM-ML Classification

GMM Training

Train the GMMs for natural and synthetic speech. For documentation on the bob.learn.em k-means and GMM machines see: https://pythonhosted.org/bob.learn.em/guide.html
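
For reference, each class-conditional model is a mixture of $M$ Gaussians over the $d$-dimensional MFCC frames ($M = 128$, $d = 60$ in the cells below):

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, \mathcal{N}(x;\, \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} w_i = 1,$$

where the weights $w_i$, means $\mu_i$, and (diagonal) covariances $\Sigma_i$ are fit by EM to maximize the likelihood of the training frames, with k-means providing the initial means.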

You can also skip the training stage and load an already trained GMM model (see cell Loading GMM Model).


In [22]:
# Parameters of the GMM machines
n_gaussians = 128 # number of Gaussians
max_iterats = 25 # maximum number of iterations

GMM for natural speech


In [5]:
# Initialize and train the k-means machine: its means will initialize the EM algorithm for the GMM machine
start_time = time.clock()
kmeans_nat = em.KMeansMachine(n_gaussians,n_ceps)
kmeansTrainer = em.KMeansTrainer()
em.train(kmeansTrainer, kmeans_nat, np.vstack(Xf[y==1]), max_iterations = max_iterats, convergence_threshold = 1e-5)
#kmeans_nat.means

# initialize and train GMM machine
gmm_nat = em.GMMMachine(n_gaussians,n_ceps)
trainer = em.ML_GMMTrainer(True, True, True) # update means, variances and weights
gmm_nat.means = kmeans_nat.means
em.train(trainer, gmm_nat, np.vstack(Xf[y==1]), max_iterations = max_iterats, convergence_threshold = 1e-5)
#gmm_nat.save(HDF5File('gmm_nat.hdf5', 'w'))
print("Done in:", (time.clock() - start_time)/60, "minutes")
print(gmm_nat)


Done in: 1.7726312666666666 minutes
<bob.learn.em.GMMMachine object at 0x7fc500be12d0>

GMM for synthetic speech


In [6]:
# Initialize and train the k-means machine: its means will initialize the EM algorithm for the GMM machine
start_time = time.clock()
kmeans_synt = em.KMeansMachine(n_gaussians,n_ceps)
kmeansTrainer = em.KMeansTrainer()
em.train(kmeansTrainer, kmeans_synt, np.vstack(Xf[y==0]), max_iterations = max_iterats, convergence_threshold = 1e-5)

# initialize and train GMM machine
gmm_synt = em.GMMMachine(n_gaussians,n_ceps)
trainer = em.ML_GMMTrainer(True, True, True) # update means, variances and weights
gmm_synt.means = kmeans_synt.means
em.train(trainer, gmm_synt, np.vstack(Xf[y==0]), max_iterations = max_iterats, convergence_threshold = 1e-5)
print("Done in:", (time.clock() - start_time)/60, "minutes")
#gmm_synt.save(HDF5File('gmm_synt.hdf5', 'w'))
print(gmm_synt)


Done in: 6.424915316666667 minutes
<bob.learn.em.GMMMachine object at 0x7fc500be1330>

Loading GMM model


In [4]:
gmm_nat = em.GMMMachine()
gmm_nat.load(HDF5File('gmm_nat.hdf5', 'r'))
gmm_synt = em.GMMMachine()
gmm_synt.load(HDF5File('gmm_synt.hdf5','r'))

Saving scores


In [ ]:
np.save('p_gmm_ml_eval_10.npy',llr_gmm_score)
np.save('z_gmm_ml_eval_est_10.npy',z_gmm)

GMM-ML Scoring

Extract the features for the testing data, compute the log-likelihood ratio scores, and report the ROC AUC and an estimated EER.


In [24]:
status = 'devel' # 'devel'(= test) OR 'evalu'(= hold out)
start_time = time.clock()

if status == 'devel':
    subfolders = devel_subfls
    ground_truth = devel
elif status == 'evalu':
    subfolders = evalu_subfls
    ground_truth = evalu
# initialize score and class arrays
llr_gmm_score = np.zeros(len(ground_truth),)
z_gmm = np.zeros(len(ground_truth),)
print(status)

vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio)
cepstrum = extractor.Cepstral()

k = 0
thr = .5
speaker_list = ground_truth.folder.unique()

for speaker_id in speaker_list:
    #speaker = ground_truth[ground_truth.folder==speaker_id]
    f_list = list(ground_truth[ground_truth.folder==speaker_id].file)
    folder = "".join(['Wav/',speaker_id,'/'])
    print(speaker_id, end=',')

    for f in f_list:
        f_name = "".join([folder,f,'.wav'])
        x = vad.read_original_data(f_name)
        # voice activity detection
        vad_data = vad(x)
        if not vad_data[2].max():
            # VAD removed every frame: retry once with a 20% lower threshold
            vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio*.8)
            vad_data = vad(x)
            vad = preprocessor.Energy_Thr(ratio_threshold=silence_removal_ratio)
        # MFCC extraction 
        mfcc = cepstrum(vad_data)
        # Log likelihood ratio computation
        llr_gmm_score[k] = gmm_nat(mfcc)-gmm_synt(mfcc)
        z_gmm[k] = int(llr_gmm_score[k]>0)
        k += 1

ground_truth['z'] = ground_truth.source.map(lambda x: int(x=='human'))
ground_truth['z_gmm'] = z_gmm
ground_truth['score_gmm'] = llr_gmm_score
print(roc_auc_score(ground_truth.z, ground_truth.z_gmm))
print(k,"files processed in",(time.clock()-start_time)/60,"minutes.")

# Performance evaluation
humans = z_gmm[np.array(ground_truth.z)==1]
spoofed = z_gmm[np.array(ground_truth.z)==0]
fnr = 100*(humans<thr).sum()/len(humans)    # human segments classified as spoofed
fpr = 100*(spoofed>=thr).sum()/len(spoofed) # spoofed segments classified as human
print("ROC AUC score:", roc_auc_score(ground_truth.z, z_gmm))
print("False negative rate %:", fnr)
print("False positive rate %:", fpr)
print("EER %: <=", (fnr+fpr)/2)


devel
D1,0.950263157895
1525 files processed in 5.7873589999999995 minutes.
ROC AUC score: 0.950263157895
False negative rate %: 7.0
False positive rate %: 2.94736842105
EER %: <= 4.97368421053

EER computation

Adjust the threshold $thr$ to reduce $|FNR-FPR|$ for a more accurate estimate of the $EER$.

The Equal Error Rate ($EER$) is the value where the false negative rate ($FNR$) equals the false positive rate ($FPR$). It's an error metric commonly used to characterize biometric systems.
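
Formally, if $FNR(\theta)$ and $FPR(\theta)$ denote the two error rates as functions of the decision threshold $\theta$, the $EER$ is the common value at the crossing point $\theta^*$:

$$EER = FNR(\theta^*) = FPR(\theta^*).$$

The cell below brackets $\theta^*$ by hand; see also the ROC-based computation at the end of the notebook.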


In [47]:
thr = -.115
pz = llr_gmm_score
humans = pz[np.array(ground_truth.z)==1]
spoofed = pz[np.array(ground_truth.z)==0]
fnr = 100*(humans<=thr).sum()/len(humans)  # human segments rejected as spoofed
fpr = 100*(spoofed>thr).sum()/len(spoofed) # spoofed segments accepted as human
print("False negative vs positive rates %:", fnr, fpr)
print("FNR - FPR %:", fnr-fpr)
if np.abs(fnr-fpr) < .25:
    print("EER =", (fnr+fpr)/2, "%")
else:
    print("EER ~", (fnr+fpr)/2, "%")


False negative vs positive rates %: 3.0 3.36842105263
FNR - FPR %: -0.368421052632
EER ~ 3.18421052632 %
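
Alternative EER computation

As a cross-check, since roc_curve is already imported from sklearn.metrics, the EER can also be estimated without manual threshold tuning by sweeping all score thresholds and locating the FNR/FPR crossing. This is a minimal sketch, not part of the original pipeline; up to interpolation between neighboring thresholds, it should agree with the manual estimate above.


In [ ]:
# Sweep the ROC operating points and find where the false positive
# rate is closest to the false negative rate (1 - TPR).
fpr_curve, tpr_curve, thresholds = roc_curve(ground_truth.z, llr_gmm_score)
fnr_curve = 1 - tpr_curve
idx = np.nanargmin(np.abs(fnr_curve - fpr_curve))
print("EER ~", 100*(fnr_curve[idx] + fpr_curve[idx])/2, "%")
print("EER threshold ~", thresholds[idx])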