Phoneme-Agnostic Automatic English Accent Classification

Alexander Huras

Group 32?

"Speech Recognition is Hard"

-pretty much everyone

The typical speech recognition pipeline:

1. Get audio

2. Filter like it's going out of style ...

3. Extract acoustic information (MFCC, LPC, various derivatives); a toy sketch of this step follows the list

4. Segment the acoustic feature stream into phonological units, or sections likely to be phonemes

5. Pass the stream into a massive HHMM/GMM (may-or-may-not also contain a hilariously deep NN) ...

6. Label the phonemes

7. Segment the phonemes into words (typically using EM on another HHMM)

8. Profit (Google, Apple, Microsoft, Nuance ...).
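For a flavour of what step 3 looks like in code, here is a toy sketch using librosa (not the library used later in this notebook; the file name and every parameter are placeholders):

# Toy sketch of step 3: MFCCs plus first/second derivatives.
# Assumes librosa is installed; "speaker.wav" and all parameter values
# are placeholders, not the settings used in this project.
import numpy as np
import librosa

y, fs = librosa.load("speaker.wav", sr=22050)        # load and resample
mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13)   # shape (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # first derivative
d2 = librosa.feature.delta(mfcc, order=2)            # second derivative
feats = np.vstack([mfcc, d1, d2]).T                  # shape (n_frames, 39)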

Well that was easy!

-said no one ever.

There is a problem here:

The same phonemes can be pronounced differently by different people/groups of people. This is colloquially referred to as an "accent".

The models used throughout SR pipelines typically have to be exceptionally well tuned to be robust to this phonological variation; it is usually addressed only implicitly, through the HHMM structure.

A reliable accent classification system can reduce the problem space considerably.

Repeat after me:

"pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ wɪf hɜ fɹʌ̃ɱ ðə stɔɹ siks spunz̥ əv̥ fɹɪʃ sn̥oʊ piːs faɪf θɪk slæb̥s əv blu ʧiːz ɛn măɪbi ɜ snæk˺ foɹ̥ hɜ bɹɑɾə̆ ʔə brʌðə bɑp wi ɔlˠsŏ nid ə smɔlˠ plæstɪk sneɪk ɛn ə bɪk tʊi fɹɔɡ̥ fɛ̆ ðə kids̥ ʃi kɛ̆n skøp ðiz θɪ̃ŋs ɪntu fɹi ɹɛd bæɡz̥ ɛn wi wɪl ɡoʊ miːd ɜ̆ wĕnz̥d̥eɪ ɛt d̪ə tɹeɪn steɪʃən"

"Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."

The Question:

Can we identify a speaker's accent using purely acoustic features?

The Answer

There have been successes in the field using a variety of mixed acoustic/HHMM techniques, but none that exploit restricted elicitations.

There have been successes separating 'obvious' dialects using acoustic information alone, but these involve very small test sets.

I'm using the GMU Speech Accent Archive, which contains ~3000 labelled recordings of this elicitation and has traditionally been used by and for linguistics research.

The System:

  1. Resample audio streams to 22 kHz.
  2. Extract MFCCs for each stream, using ~4 ms frames.
  3. Generate first- and second-derivative MFCC frames and concatenate them.
  4. Apply a high-energy-pass filter (remove the 'weak' frames, typically associated with silence and dramatic ... pauses).
  5. Cluster each stream's MFCC frames into k clusters (the fixed elicitation keeps the content comparable across speakers).
  6. Take the cluster centroids and order them by magnitude (in ~40-D space). The resulting matrix/image is the speaker's tone palette (a sketch of steps 4-6 follows the list).
  7. Assuming these features are reasonably descriptive/separable, do ML to identify accents (using the SAA metadata as labels). It is entirely possible this isn't the case ...
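Steps 4-6 boil down to an energy threshold followed by k-means. A minimal numpy/scikit-learn sketch of what that might look like is below; the percentile threshold and the use of KMeans are assumptions about what my low_energy_filter / norm_ordered_centroids helpers do, not their exact implementation:

# Minimal sketch of steps 4-6: energy filtering, clustering, centroid ordering.
# The percentile threshold and KMeans settings are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def tone_palette(frames, k=40, energy_percentile=15):
    """frames: (n_frames, n_dims) stacked MFCC + delta + delta-delta."""
    # Step 4: drop low-energy frames (silence, dramatic pauses), using the
    # frame norm as a stand-in for energy.
    energy = np.linalg.norm(frames, axis=1)
    strong = frames[energy > np.percentile(energy, energy_percentile)]

    # Step 5: cluster the surviving frames into k clusters.
    centroids = KMeans(n_clusters=k, n_init=10).fit(strong).cluster_centers_

    # Step 6: order the centroids by magnitude so palettes are comparable
    # across speakers, regardless of arbitrary cluster labels.
    order = np.argsort(np.linalg.norm(centroids, axis=1))
    return centroids[order]  # (k, n_dims) "tone palette"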

In [4]:
%pylab --no-import-all inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import features as F
import training_data
import extract

# Load one English recording and build its tone palette.
eng3, fs = training_data.get_data(4, 'english')              # samples + sample rate
eng_mfcc = F.mfcc_atomic(eng3, fs)                           # per-frame MFCCs
eng_grad = F.stack_double_deltas(eng_mfcc)                   # append 1st/2nd derivatives
eng_filtered = F.low_energy_filter(eng_grad, 15)             # drop weak (silent) frames
tone_palette = F.norm_ordered_centroids(eng_filtered, k=40)  # cluster and order centroids

# Waveform and spectrogram of a short sample from the recording.
f1, (a1, a2) = plt.subplots(2, 1, figsize=(14, 4))
a1.plot(eng3[5000:30000])
a2.specgram(eng3[4000:31000])
a1.set_title("1-Second of Speech")
a2.set_title("Spectrogram")
a1.get_xaxis().set_visible(False)
a2.get_xaxis().set_visible(False)
a1.get_yaxis().set_visible(False)
a2.get_yaxis().set_visible(False)

# Tone palette for the full recording (log-magnitude, centroids as columns).
f2, a1 = plt.subplots(1, 1, figsize=(14, 4))
a1.imshow(np.log2(np.abs(tone_palette.T)))
a1.set_title("Full Stream Tone Palette of Single Speaker");



In [5]:
f1  #  Sample from recording


Out[5]: [Figure: one-second waveform sample and its spectrogram]

... Shenanigans ...


In [6]:
f2  #  Tone Palette from full recording


Out[6]: [Figure: tone palette of the full recording]

Next Steps

  1. Extracting tone palettes from ~3000 recordings.
  2. Dealing with really shitty labels; the usable dataset is likely closer to ~1200 recordings (GMU dataset).
  3. The actual classification, expressed as membership likelihood (SVM for the benchmark, CNN for awesomeness); the SVM will likely require fewer dimensions (a sketch follows this list).
  4. Validation: rotating 70/30 splits. It's like k-fold, but less rigorous (and classes are not equally represented).
  5. I would be happy with better than 60% accuracy (Lin & Simske 2004 reached 83% ...).
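For items 3-4 the benchmark could be as simple as the scikit-learn sketch below: flatten each palette, fit an SVM, and average over repeated stratified 70/30 splits (the function, X, y, and every parameter here are placeholders, not final experimental settings):

# Benchmark sketch for items 3-4: SVM on flattened tone palettes, scored over
# rotating stratified 70/30 splits. All settings here are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit

def benchmark(palettes, labels, n_rounds=10):
    """palettes: list of (k, n_dims) arrays; labels: one accent label per speaker."""
    X = np.array([p.ravel() for p in palettes])   # flatten each palette to a vector
    y = np.array(labels)
    splitter = StratifiedShuffleSplit(n_splits=n_rounds, test_size=0.3)
    scores = []
    for train, test in splitter.split(X, y):
        clf = SVC(kernel="rbf", probability=True)  # probability=True gives membership likelihoods
        clf.fit(X[train], y[train])
        scores.append(clf.score(X[test], y[test]))
    return np.mean(scores), np.std(scores)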

Acknowledgements/References

M. Yusnita, M. Paulraj, S. Yaacob, and A. Shahriman, “Classification of speaker accent using hybrid DWT-LPC features and k-nearest neighbors in ethnically diverse Malaysian English,” in Computer Applications and Industrial Electronics (ISCAIE), 2012 IEEE Symposium on, Dec 2012, pp. 179–184.

G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.

S. Weinberger, “GMU Speech Accent Archive,” http://accent.gmu.edu, 2014.

S. Deshpande, S. Chikkerur, and V. Govindaraju, “Accent classification in speech,” in Automatic Identification Advanced Technologies, 2005. Fourth IEEE Workshop on, Oct 2005, pp. 139–143.

P. Cook, “Identification of control parameters in an articulatory vocal tract model, with applications to the synthesis of singing,” Master’s thesis, Stanford University, Stanford, California, Dec. 1990. [Online]. Available: https://ccrma.stanford.edu/files/papers/stanm68.pdf

E. Shriberg, “Higher-level features in speaker recognition,” in Speaker Classification (1), ser. Lecture Notes in Computer Science, C. Müller, Ed., vol. 4343. Springer, 2007, pp. 241–259. [Online]. Available: http://dblp.uni-trier.de/db/conf/speakerc/speakerc2007-1.html#Shriberg07

F. Biadsy, “Automatic dialect and accent recognition and its application to speech recognition,” Ph.D. dissertation, Columbia University, 2011.

U. Shrawankar and V. Thakare, “Techniques for feature extraction in speech recognition system: A comparative study,” ArXiv e-prints, May 2013.

G. Doddington, “Speaker recognition; identifying people by their voices,” Proceedings of the IEEE, vol. 73, no. 11, pp. 1651–1664, Nov 1985.

M. Marolt, “Gaussian mixture models for extraction of melodic lines from audio recordings.” in ISMIR, 2004. [Online]. Available: http://dblp.uni-trier.de/db/conf/ismir/ismir2004.html#Marolt04

P. Angkititrakul and J. H. L. Hansen, “Stochastic trajectory model analysis for accent classification,” in Spoken Language Processing, 2002. ICSLP Conf. on, Sept 2002, pp. 493–496.

X. Lin and S. Simske, “Phoneme-less hierarchical accent classification,” in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on, vol. 2, Nov 2004, pp. 1801–1804.

O. Viikki and K. Laurila, “Cepstral domain segmental feature vector normalization for noise robust speech recognition,” Speech Commun., vol. 25, no. 1-3, pp. 133–147, Aug. 1998. [Online]. Available: http://dx.doi.org/10.1016/S0167-6393(98)00033-8