Section 1.1: An introduction
Section 1.2: Pitch manipulation tutorial (opens in a new notebook)
In [1]:
%matplotlib inline
In [2]:
# Setting things up
from os.path import join

import matplotlib.pyplot as plt
import numpy as np
import IPython

inputPath = join('..', 'examples', 'files')

from praatio import dataio

# Some convenience functions -- we'll be using these a lot
def pitchForPlots(pitchFN):
    pitchTier = dataio.open2DPointObject(pitchFN)
    x, y = zip(*pitchTier.pointList)
    return x, y

def doPlot(axis, title, pitchFN):
    axis.plot(*pitchForPlots(pitchFN))
    axis.set_title(title)
    axis.set_xlabel("time(s)")
    axis.set_ylabel("F0(hz)")

def displayAudio(audioTuple):
    for title, wavFN in audioTuple:
        print(title)
        IPython.display.display(IPython.display.Audio(wavFN))
ProMo is a library for making some complicated prosody manipulations simple. It comes with code for working with both duration and pitch.
The PSOLA algorithm is used to resynthesize the speech. However, ProMo doesn't implement PSOLA. It offloads that work to Praat via the praatIO library.
Instead, ProMo automates the nitty-gritty work necessary to do certain kinds of manipulations. Given an audio recording and a duration or pitch target, ProMo will output the original audio with its pitch or duration replaced by the target. Or, given a source and a target audio file, the target pitch or duration can be determined automatically and the desired audio generated. Intermediate steps between the source and the target can also be generated.
This is how ProMo got its name (morphing from one target to another), although ProMo's manipulation functions don't actually have to be used for morphing.
ProMo is a library for manipulating prosody. What is prosody? Prosody is the melodic aspect of speech: the pitch contour or intonation of an utterance, the length of a syllable, the loudness of a word. These are aspects of speech that speakers manipulate to alter the meaning of what they're saying.
Different languages use prosody in different ways:
* In Mandarin, each word is assigned one of five pitch contours or tones. Words differing only in tone have different meanings.
* In Arabic, there is a distinction between short and long vowels. Words differing only in vowel length have different meanings.
* In English multisyllabic words, one syllable has greater emphasis than the others--the stressed syllable. Stressed syllables are longer than unstressed syllables and are pronounced more carefully.
* In English, the use of pauses can alter the interpretation of how the pieces of a sentence are connected.
* In many languages, a rising intonation on a sentence designates a question while a falling intonation designates a statement.
In addition to linguistic meaning, prosody can also be used to convey information about the speaker such as their emotional state or social identity.
To understand duration manipulation, you should have a thorough understanding of consonants and vowels. We won't be reviewing those here, but you might find it handy to reference the International Phonetic Alphabet, with sounds or with links to Wikipedia pages for more detailed information.
Intuitively, some sounds can be lengthened naturally: vowels (a, i, u), fricatives (s, sh), nasals (m, n), and liquids (r, l).
Other sounds, however, cannot be lengthened naturally: stops (t, k) and affricates (ch, dj).
In some languages, the duration of speech segments is relatively fixed (Japanese) while in others there is a lot more flexibility (English, Dutch). If we want to lengthen a word, do we equally lengthen each segment? Or only part of it? In Japanese, we could equally lengthen all parts of a word. For English, it might be necessary to weight some segments more than others.
Similarly, what about entire words? In English, function words like 'if', 'on', or 'the' are generally reduced and pronounced quicker relative to content words (nouns, verbs, adjectives, etc). If we wanted to increase the length of an utterance, it might sound better if only the content words are increased. Or maybe not. It's something to keep in mind.
In summary, in a duration resynthesis task:
* Not all sounds are equally lengthened
* There may be language-specific considerations at the syllable or word level.
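To make this concrete, here is a minimal sketch in plain Python (this is not ProMo's API) of one way to stretch a word to a target duration while giving easily lengthened sounds more of the extra time than stops. The phone labels, durations, and weights below are invented purely for illustration.
In [ ]:
# Illustrative only -- not ProMo's API. One way to weight segments unevenly
# when stretching a word to a new target duration.
phones = [("k", 0.05), ("ae", 0.12), ("t", 0.06)]  # made-up durations for "cat"

# Sounds that lengthen naturally get full weight; stops get a small weight
STRETCHABLE = {"a", "ae", "i", "u", "s", "sh", "m", "n", "r", "l"}

def scaleDurations(phoneList, targetDur):
    # Distribute the extra time in proportion to (weight * current length)
    weights = [1.0 if p in STRETCHABLE else 0.2 for p, _ in phoneList]
    extra = targetDur - sum(d for _, d in phoneList)
    weightedDur = sum(w * d for w, (_, d) in zip(weights, phoneList))
    return [(p, d + extra * (w * d) / weightedDur)
            for w, (p, d) in zip(weights, phoneList)]

print(scaleDurations(phones, 0.4))  # the vowel absorbs most of the lengthening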
In [3]:
displayAudio((("The original audio file:", join(inputPath, "mary1.wav")),
("The same file doubled in length:", join(inputPath, "mary1_double_length.wav"))))
Pitch is the fundamental frequency of speech--the lowest frequency produced by the vocal folds during speech. Speech contains both voiced and unvoiced segments. There is no voicing for consonants such as [t, k, sh, f] and thus there is no pitch information in these segments.
Consequently, only voiced segments are affected by pitch resynthesis.
The resynthesis process also depends on the pitch tracker. For very low pitch or creaky voice, automatic pitch tracking software may struggle to accurately follow the pitch. Low pitch is very common at the end of an utterance, when the vocal folds tend to "wind down". Speakers with deep voices may also have more trouble having their pitch resynthesized, because their voice will more often be 'too low' for the tracker than that of speakers with higher-pitched voices.
Furthermore, pitch tracking software can be configured via a number of parameters (such as the minimum and maximum pitch values to consider). Changing these parameters can lead to dramatically different results. For these reasons, the output of a pitch tracker should never be assumed to be the absolute 'true' pitch.
Noisy recording environments can also cause problems for pitch tracking software. In these cases, the quality of the resynthesized pitch can be poor. By poor I mean that either the desired effect will not be perceived (e.g. an utterance-final rising pitch to signal a question) or the output will be distorted, with the speaker sounding robotic or with "pops" in the output.
Similarly, if the target pitch differs too dramatically from the original audio, the resynthesized output can also sound distorted.
In summary, in a pitch resynthesis task:
* Voiced sounds are affected by resynthesis; unvoiced sounds aren't
* Pitch resynthesis in Praat depends on its pitch tracking software
* Pitch trackers estimate pitch--extracted pitch should not be treated as 'truth'.
* Changing the parameters of the pitch tracker can lead to very different analyses
* Poor recording quality, deep voices, and large differences in pitch can lead to less accurate pitch tracking
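As a quick illustration, the convenience functions defined at the top of this notebook can plot a pitch contour produced by Praat's pitch tracker; unvoiced segments simply contribute no points to the contour. The PitchTier file name below is an assumption--substitute any pitch file you have extracted with Praat.
In [ ]:
# A minimal sketch reusing doPlot() from above. The PitchTier file name is
# hypothetical -- point it at any pitch file extracted with Praat.
fig, axis = plt.subplots()
doPlot(axis, "Pitch contour (voiced regions only)", join(inputPath, "mary1.PitchTier"))
plt.show()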
ProMo requires my Python library praatIO, as well as Praat itself. Praat doesn't require installation.
ProMo can also create visualizations of the morph process if you have matplotlib installed.
Finally, if you want to use the pitch interpolation feature, you need to have scipy installed.
matplotlib and scipy can both be complicated to install, unless you use pip, like so:
pip install matplotlib
pip install scipy
or
python -m pip install matplotlib
python -m pip install scipy
ProMo can easily be installed with pip. If you have trouble installing ProMo with pip, please visit the ProMo GitHub page for other installation options.
In [4]:
!pip install promo --upgrade
Section 1.2: Pitch manipulation (opens in a new notebook)