In general, the major enrichment sections can be performed in any order (e.g., speaker enrichment is independent of syllable encoding) and the result is the same, unless a section states that another section must be completed first. Within a section, however (e.g., Encoding syllables), the steps must be performed in order (e.g., syllabic segments must be specified before syllables can be encoded).
First we begin with our standard imports and the path to the downloaded corpus:
In [ ]:
import os
from polyglotdb import CorpusContext
corpus_root = '/mnt/e/Data/pg_tutorial'
Creating syllables requires two steps. The first is to specify the subset of phones in the corpus that are syllabic segments and function as syllabic nuclei. In general these will be vowels, but they can also include syllabic consonants. Subsets in PolyglotDB are completely arbitrary sets of labels that speed up querying and allow for simpler references; see Subset enrichment for more details.
In [ ]:
syllabics = ["ER0", "IH2", "EH1", "AE0", "UH1", "AY2", "AW2", "UW1", "OY2", "OY1", "AO0", "AH2", "ER1", "AW1",
"OW0", "IY1", "IY2", "UW0", "AA1", "EY0", "AE1", "AA0", "OW1", "AW0", "AO1", "AO2", "IH0", "ER2",
"UW2", "IY0", "AE2", "AH0", "AH1", "UH2", "EH2", "UH0", "EY1", "AY0", "AY1", "EH0", "EY2", "AA2",
"OW2", "IH1"]
with CorpusContext('pg_tutorial') as c:
    c.encode_type_subset('phone', syllabics, 'syllabic')
Once the syllabic segments have been marked as such in the phone inventory, the next step is to actually create the syllable annotations as follows:
In [ ]:
with CorpusContext('pg_tutorial') as c:
    c.encode_syllables(syllabic_label='syllabic')
The encode_syllables function uses a maximum onset algorithm. It takes all existing word-initial sequences of phones outside the syllabic subset (labeled syllabic in this case) as the possible onsets, and then maximizes onsets between syllabic segments. As an example, a word like
astringent would have a phone sequence of AH0 S T R IH1 N JH EH0 N T. In any reasonably-sized corpus of English, the
list of possible onsets would include S T R and JH, but not N JH, so the sequence would be syllabified as
AH0 . S T R IH1 N . JH EH0 N T.
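The syllabification of astringent described above can be reproduced with a small standalone sketch. This is not PolyglotDB's implementation: the function name syllabify and the hand-written onset inventory are assumptions made for illustration, whereas encode_syllables derives its onsets from the corpus itself.

```python
def syllabify(phones, syllabics, onsets):
    """Split one word's phones into syllables, maximizing onsets.

    phones: list of phone labels for a single word
    syllabics: set of labels that act as syllable nuclei
    onsets: set of tuples, each an allowed onset sequence (hypothetical here)
    """
    nuclei = [i for i, p in enumerate(phones) if p in syllabics]
    if not nuclei:
        return [list(phones)]
    boundaries = []  # index at which each non-initial syllable starts
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1:nxt]
        split = nxt  # fallback: no onset, all consonants stay in the coda
        # Try the longest candidate onset first, then progressively shorter ones.
        for k in range(len(cluster)):
            if tuple(cluster[k:]) in onsets:
                split = prev + 1 + k
                break
        boundaries.append(split)
    starts = [0] + boundaries
    ends = boundaries + [len(phones)]
    return [list(phones[s:e]) for s, e in zip(starts, ends)]


# Hand-picked onset inventory for the example; a real one would come from
# the word-initial sequences observed in the corpus.
onsets = {("S", "T", "R"), ("T", "R"), ("S",), ("T",), ("R",), ("N",), ("JH",)}
syllabics = {"AH0", "IH1", "EH0"}
phones = "AH0 S T R IH1 N JH EH0 N T".split()

syllables = syllabify(phones, syllabics, onsets)
print(" . ".join(" ".join(s) for s in syllables))
# AH0 . S T R IH1 N . JH EH0 N T
```

Because N JH is not in the onset inventory, the N stays in the coda of the second syllable and only JH is taken as the onset of the third.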
See Creating syllable units for more details on syllable enrichment.
As with syllables, encoding utterances consists of two steps. The first is marking the "words" that are actually non-speech
elements within the transcript. When a corpus is first imported, every annotation is treated as speech. As such, encoding
labels like <SIL> as pause elements and not actual speech sounds is a crucial first step.
In [ ]:
pause_labels = ['<SIL>']
with CorpusContext('pg_tutorial') as c:
    c.encode_pauses(pause_labels)
Once pauses are encoded, the next step is to actually create the utterance annotations as follows:
In [ ]:
with CorpusContext('pg_tutorial') as c:
    c.encode_utterances(min_pause_length=0.15)
In many cases, it is not desirable to split groups of words at every pause: small pauses might be inserted by forced alignment, or can signify a smaller break than an utterance break. Thus there is usually a minimum pause length that determines the breaks between utterances (min_pause_length=0.15 above).
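The effect of a minimum pause length can be illustrated with a small standalone sketch. It does not use PolyglotDB; the function group_utterances, the toy word list, and the choice to treat a pause of exactly the minimum length as a break are all assumptions made for illustration.

```python
def group_utterances(words, pause_labels, min_pause_length=0.15):
    """Group time-ordered (label, begin, end) tuples into utterances.

    Pauses at least min_pause_length long end the current utterance;
    shorter pauses are skipped and the utterance continues.
    """
    utterances, current = [], []
    for label, begin, end in words:
        if label in pause_labels:
            if end - begin >= min_pause_length and current:
                utterances.append(current)
                current = []
        else:
            current.append(label)
    if current:
        utterances.append(current)
    return utterances


# Hypothetical aligned words: one short pause and one long pause.
words = [
    ("this", 0.00, 0.20),
    ("<SIL>", 0.20, 0.25),  # short pause, does not split
    ("is", 0.25, 0.40),
    ("<SIL>", 0.40, 0.90),  # long pause, splits the utterance
    ("fine", 0.90, 1.20),
]

utterances = group_utterances(words, pause_labels={"<SIL>"})
print(utterances)
# [['this', 'is'], ['fine']]
```

Raising the threshold merges more words into each utterance, while a threshold of zero would split at every pause.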
See Creating utterance units for more details on utterance enrichment.
In [ ]:
speaker_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'speaker_info.csv')
with CorpusContext('pg_tutorial') as c:
    c.enrich_speakers_from_csv(speaker_enrichment_path)
Once enrichment is complete, we can then query and extract information about the speaker characteristics loaded from the CSV file.
See Enrichment via CSV files for more details on enrichment from CSV files.
Stress enrichment requires that the Encoding syllables step has been completed.
Once syllables have been encoded, there are a couple of ways to encode the stress level of each syllable (i.e., primary stress,
secondary stress, or unstressed). This tutorial uses a lexical enrichment file included in the tutorial
corpus. This file has a field named stress_pattern that gives the stress level of each syllable in the word. For
example, astringent will have a stress pattern of 0-1-0.
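To make the stress pattern format concrete, here is a minimal sketch that pairs a hyphen-separated pattern with the syllables of a word. It is an illustration only, not PolyglotDB code: the function stress_per_syllable is hypothetical, and raising an error when the counts differ is an assumption rather than documented library behavior.

```python
def stress_per_syllable(stress_pattern, syllables):
    """Pair each syllable with its stress level from a pattern like '0-1-0'."""
    levels = stress_pattern.split("-")
    if len(levels) != len(syllables):
        # Assumed handling: a pattern that does not match the syllable count
        # is treated as an error in this sketch.
        raise ValueError("stress pattern and syllable count differ")
    return list(zip(syllables, levels))


# Syllables of astringent as produced by the syllabification example.
syllables = ["AH0", "S T R IH1 N", "JH EH0 N T"]
stressed = stress_per_syllable("0-1-0", syllables)
print(stressed)
# [('AH0', '0'), ('S T R IH1 N', '1'), ('JH EH0 N T', '0')]
```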
In [ ]:
lexicon_enrichment_path = os.path.join(corpus_root, 'enrichment_data', 'iscan_lexicon.csv')
with CorpusContext('pg_tutorial') as c:
    c.enrich_lexicon_from_csv(lexicon_enrichment_path)
In [ ]:
with CorpusContext('pg_tutorial') as c:
    c.encode_stress_from_word_property('stress_pattern')
Following this enrichment step, words will have a type property of stress_pattern and syllables will have a token property
of stress that can be queried on and extracted.
See Encoding stress for more details on stress enrichment.
Speech rate enrichment requires that both the Encoding syllables and Encoding utterances steps have been completed.
One of the final enrichments in this tutorial is to encode speech rate onto utterance annotations. The speech rate measure used here is syllables per second.
In [ ]:
with CorpusContext('pg_tutorial') as c:
    c.encode_rate('utterance', 'syllable', 'speech_rate')
Next we will encode the number of syllables per word:
In [ ]:
with CorpusContext('pg_tutorial') as c:
    c.encode_count('word', 'syllable', 'num_syllables')
Once the enrichments are complete, a token property of speech_rate will be available for query and export on utterance
annotations, as well as one for num_syllables on word tokens.
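As a rough illustration of what these two properties measure, the sketch below computes a syllables-per-second rate for one utterance and a syllable count for each of its words. The toy data and the helper names are hypothetical, and dividing by the full utterance duration (end minus begin) is an assumption about how the rate is defined.

```python
# Hypothetical toy data: one utterance with its words and their syllables.
utterance = {
    "begin": 0.0,
    "end": 2.0,
    "words": [
        {"label": "an", "syllables": ["AE1 N"]},
        {"label": "astringent", "syllables": ["AH0", "S T R IH1 N", "JH EH0 N T"]},
        {"label": "taste", "syllables": ["T EY1 S T"]},
    ],
}


def count_syllables(word):
    """Number of syllable annotations contained in one word."""
    return len(word["syllables"])


def syllables_per_second(utt):
    """Total syllables in the utterance divided by its duration."""
    duration = utt["end"] - utt["begin"]
    if duration <= 0:
        raise ValueError("utterance must have positive duration")
    return sum(count_syllables(w) for w in utt["words"]) / duration


num_syllables = {w["label"]: count_syllables(w) for w in utterance["words"]}
speech_rate = syllables_per_second(utterance)
print(num_syllables)  # {'an': 1, 'astringent': 3, 'taste': 1}
print(speech_rate)    # 2.5
```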
See Hierarchical enrichment for more details on encoding properties based on the rate/count/position of lower annotations (e.g., phones or syllables) within higher annotations (e.g., syllables, words, or utterances).