The same phonemes can be pronounced differently by different people or groups of people; this variation is colloquially referred to as an "accent".
The models used throughout speech recognition (SR) pipelines typically have to be exceptionally well tuned to be robust to this phonological variation, and it is usually addressed only implicitly, through the use of hierarchical hidden Markov models (HHMMs).
A reliable accent classification system can reduce the problem space considerably.
Repeat after me:
"pʰlis kɔl stɛːlʌ ɑsk˺ ɜ tə bɹɪ̃ŋ ðiz θɪ̃ŋz̥ wɪf hɜ fɹʌ̃ɱ ðə stɔɹ siks spunz̥ əv̥ fɹɪʃ sn̥oʊ piːs faɪf θɪk slæb̥s əv blu ʧiːz ɛn măɪbi ɜ snæk˺ foɹ̥ hɜ bɹɑɾə̆ ʔə brʌðə bɑp wi ɔlˠsŏ nid ə smɔlˠ plæstɪk sneɪk ɛn ə bɪk tʊi fɹɔɡ̥ fɛ̆ ðə kids̥ ʃi kɛ̆n skøp ðiz θɪ̃ŋs ɪntu fɹi ɹɛd bæɡz̥ ɛn wi wɪl ɡoʊ miːd ɜ̆ wĕnz̥d̥eɪ ɛt d̪ə tɹeɪn steɪʃən"
"Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."
There have been successes in the field using a variety of mixed acoustic/HHMM techniques, but none that exploit restricted elicitations.
There have also been successes separating 'obvious' dialects using acoustic information alone, but these involve very small test sets.
I'm using the GMU Speech Accent Archive, which contains roughly 3,000 labelled recordings of this elicitation and is traditionally used by and for linguistic research.
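The recordings themselves are ordinary audio files. The `training_data.get_data` helper used below is local to this project; a minimal sketch of what such a loader could look like, assuming a hypothetical `<root>/<language>/<language><index>.wav` layout (the archive's real file organization will differ) and the standard `scipy.io.wavfile` reader, might be:

import os
import numpy as np
import scipy.io.wavfile as wavfile

def get_data(index, language, root='speech_accent_archive'):
    """Load one recording for a language as mono float samples.

    Assumes files live at <root>/<language>/<language><index>.wav --
    a guessed layout, to be adjusted to however the recordings were
    fetched from http://accent.gmu.edu.
    """
    path = os.path.join(root, language, '%s%d.wav' % (language, index))
    fs, samples = wavfile.read(path)
    samples = samples.astype(np.float64)
    if samples.ndim > 1:  # mix stereo down to mono
        samples = samples.mean(axis=1)
    return samples, fs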
In [4]:
%pylab --no-import-all inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Local helper modules for this project.
import features as F
import training_data
import extract

# Load one English recording (index 4) and its sample rate.
eng3, fs = training_data.get_data(4, 'english')

# Feature pipeline: MFCCs -> stacked delta/delta-delta features ->
# low-energy (silence) frame removal -> k=40 centroid "tone palette".
eng_mfcc = F.mfcc_atomic(eng3, fs)
eng_grad = F.stack_double_deltas(eng_mfcc)
eng_filtered = F.low_energy_filter(eng_grad, 15)
tone_palette = F.norm_ordered_centroids(eng_filtered, k=40)

# Figure 1: a short excerpt of the raw waveform and its spectrogram.
f1, (a1, a2) = plt.subplots(2, 1, figsize=(14, 4))
a1.plot(eng3[5000:30000])
a2.specgram(eng3[4000:31000])
a1.set_title("1-Second of Speech")
a2.set_title("Spectrogram")
for ax in (a1, a2):
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

# Figure 2: log-magnitude view of the ordered palette centroids.
f2, a1 = plt.subplots(1, 1, figsize=(14, 4))
a1.imshow(np.log2(np.abs(tone_palette.T)))
a1.set_title("Full Stream Tone Palette of Single Speaker");
In [5]:
f1 # Sample from recording
Out[5]: [Figure: one second of speech (waveform, top) and its spectrogram (bottom)]
In [6]:
f2 # Tone Palette from full recording
Out[6]: [Figure: full-stream tone palette of a single speaker]
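A palette like the one above reduces a whole recording to a fixed-size k x d matrix, so palettes from different speakers can be compared directly. As a toy illustration only (not the classifier developed here), a 1-nearest-neighbour baseline over flattened palettes could look like:

import numpy as np

def classify_accent(palette, train_palettes, train_labels):
    """Label a speaker by the nearest training palette.

    A hypothetical baseline: each speaker is a k x d tone palette,
    compared by Euclidean distance after flattening.
    """
    query = palette.ravel()
    dists = [np.linalg.norm(query - p.ravel()) for p in train_palettes]
    return train_labels[int(np.argmin(dists))]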
M. Yusnita, M. Paulraj, S. Yaacob, and A. Shahriman, “Classification of speaker accent using hybrid DWT-LPC features and k-nearest neighbors in ethnically diverse Malaysian English,” in Proc. IEEE Symposium on Computer Applications and Industrial Electronics (ISCAIE), Dec. 2012, pp. 179–184.
G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
S. Weinberger, “GMU Speech Accent Archive,” http://accent.gmu.edu, 2014.
S. Deshpande, S. Chikkerur, and V. Govindaraju, “Accent classification in speech,” in Proc. Fourth IEEE Workshop on Automatic Identification Advanced Technologies, Oct. 2005, pp. 139–143.
P. Cook, “Identification of control parameters in an articulatory vocal tract model, with applications to the synthesis of singing,” Master’s thesis, Stanford University, Stanford, California, Dec. 1990. [Online]. Available: https://ccrma.stanford.edu/files/papers/stanm68.pdf
E. Shriberg, “Higher-level features in speaker recognition,” in Speaker Classification (1), ser. Lecture Notes in Computer Science, C. Müller, Ed., vol. 4343. Springer, 2007, pp. 241–259. [Online]. Available: http://dblp.uni-trier.de/db/conf/speakerc/speakerc2007-1.html#Shriberg07
F. Biadsy, “Automatic dialect and accent recognition and its application to speech recognition,” Ph.D. dissertation, Columbia University, 2011.
U. Shrawankar and V. Thakare, “Techniques for feature extraction in speech recognition system: A comparative study,” arXiv e-prints, May 2013.
G. Doddington, “Speaker recognition: Identifying people by their voices,” Proceedings of the IEEE, vol. 73, no. 11, pp. 1651–1664, Nov. 1985.
M. Marolt, “Gaussian mixture models for extraction of melodic lines from audio recordings,” in Proc. ISMIR, 2004. [Online]. Available: http://dblp.uni-trier.de/db/conf/ismir/ismir2004.html#Marolt04
P. Angkitrakul and J. H. L. Hansen, “Stochastic trajectory model analysis for accent classification,” in Proc. Int. Conf. on Spoken Language Processing (ICSLP), Sept. 2002, pp. 493–496.
X. Lin and S. Simske, “Phoneme-less hierarchical accent classification,” in Proc. Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, vol. 2, Nov. 2004, pp. 1801–1804.
O. Viikki and K. Laurila, “Cepstral domain segmental feature vector normalization for noise robust speech recognition,” Speech Commun., vol. 25, no. 1-3, pp. 133–147, Aug. 1998. [Online]. Available: http://dx.doi.org/10.1016/S0167-6393(98)00033-8