Section 1.1: An introduction
Section 1.2: Pitch manipulation tutorial (opens in a new notebook)
In [1]:
%matplotlib inline
In [2]:
# Setting things up
from os.path import join

import matplotlib.pyplot as plt
import numpy as np
import IPython

inputPath = join('..', 'examples', 'files')

from praatio import dataio

# Some convenience functions -- we'll be using these a lot
def pitchForPlots(pitchFN):
    pitchTier = dataio.open2DPointObject(pitchFN)
    x, y = zip(*pitchTier.pointList)
    return x, y

def doPlot(axis, title, pitchFN):
    axis.plot(*pitchForPlots(pitchFN))
    axis.set_title(title)
    axis.set_xlabel("time(s)")
    axis.set_ylabel("F0(hz)")

def displayAudio(audioTuple):
    for title, wavFN in audioTuple:
        print(title)
        IPython.display.display(IPython.display.Audio(wavFN))
ProMo is a library for making some complicated prosody manipulations simple. It comes with code for working with both duration and pitch.
The PSOLA algorithm is used to resynthesize the speech. However, ProMo doesn't implement PSOLA. It offloads that work to Praat via the praatIO library.
Instead, ProMo automates the nitty-gritty work necessary to do certain kinds of manipulations. Given an audio recording and a duration or pitch target, ProMo will output the original audio with its pitch or duration replaced by the target. Or, given a source and a target audio file, the target pitch or duration can be determined automatically and the desired audio generated. Intermediate steps between the source and the target can also be generated.
This is how ProMo got its name (morphing from one target to another), although ProMo's manipulation functions don't actually have to be used for morphing.
ProMo is a library for manipulating prosody. What is prosody? Prosody is the melodic aspect of speech: the pitch contour or intonation of an utterance, the length of a syllable, the loudness of a word. These are aspects of speech that speakers manipulate to alter the meaning of what they're saying.
Different languages use prosody in different ways:
* In Mandarin, each word is assigned one of five pitch contours or tones. Words differing only in tone have different meanings.
* In Arabic, there is a distinction between short and long vowels. Words differing only in vowel length have different meanings.
* In English multisyllabic words, one syllable has greater emphasis than the others--the stressed syllable. Stressed syllables are longer than unstressed syllables and are pronounced more carefully.
* In English, the use of pauses can alter the interpretation of how the pieces of a sentence are connected.
* In many languages, a rising intonation on a sentence designates a question while a falling intonation designates a statement.
In addition to linguistic meaning, prosody can also be used to convey information about the speaker such as their emotional state or social identity.
To understand duration manipulation, you should have a thorough understanding of consonants and vowels. We won't be reviewing those here, but you might find it handy to reference the International Phonetic Alphabet, with sounds or with links to Wikipedia pages for more detailed information.
Intuitively, some sounds can be lengthened naturally: vowels (a, i, u), fricatives (s, sh), nasals (m, n), and liquids (r, l).
Other sounds, however, cannot be lengthened naturally: stops (t, k) and affricates (ch, dj).
In some languages, the duration of speech segments is relatively fixed (Japanese) while in others there is a lot more flexibility (English, Dutch). If we want to lengthen a word, do we equally lengthen each segment? Or only part of it? In Japanese, we could equally lengthen all parts of a word. For English, it might be necessary to weight some segments more than others.
Similarly, what about entire words? In English, function words like 'if', 'on', or 'the' are generally reduced and pronounced quicker relative to content words (nouns, verbs, adjectives, etc). If we wanted to increase the length of an utterance, it might sound better if only the content words are increased. Or maybe not. It's something to keep in mind.
In summary, in a duration resynthesis task:
* Not all sounds are equally lengthened
* There may be language-specific considerations at the syllable or word level.
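To make this concrete, here is a minimal sketch in plain Python (this is not ProMo's API) of one way to stretch a word to a target duration while giving easily lengthened sounds more of the extra time than stops. The phone labels, durations, and weights below are invented purely for illustration.
In [ ]:
# Illustrative only -- not ProMo's API. One way to weight segments unevenly
# when stretching a word to a new target duration.
phones = [("k", 0.05), ("ae", 0.12), ("t", 0.06)]  # made-up durations for "cat"

# Sounds that lengthen naturally get full weight; stops get a small weight
STRETCHABLE = {"a", "ae", "i", "u", "s", "sh", "m", "n", "r", "l"}

def scaleDurations(phoneList, targetDur):
    # Distribute the extra time in proportion to (weight * current length)
    weights = [1.0 if p in STRETCHABLE else 0.2 for p, _ in phoneList]
    extra = targetDur - sum(d for _, d in phoneList)
    weightedDur = sum(w * d for w, (_, d) in zip(weights, phoneList))
    return [(p, d + extra * (w * d) / weightedDur)
            for w, (p, d) in zip(weights, phoneList)]

print(scaleDurations(phones, 0.4))  # the vowel absorbs most of the lengthening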
In [3]:
displayAudio((("The original audio file:", join(inputPath, "mary1.wav")),
("The same file doubled in length:", join(inputPath, "mary1_double_length.wav"))))
Pitch is the fundamental frequency of speech--the lowest frequency produced by the vocal folds during speech. Speech contains both voiced and unvoiced segments. There is no voicing for consonants such as [t, k, sh, f] and thus there is no pitch information in these segments.
Consequently, only voiced segments are affected by pitch resynthesis.
The resynthesis process also depends on the pitch tracker. For very low pitch or creaky voice, automatic pitch tracking software may struggle to accurately follow the pitch. Low pitch is very common at the end of an utterance, when the vocal folds tend to "wind down". Speakers with deep voices may also have more trouble having their pitch resynthesized, because their voice will more often be 'too low' for the tracker than that of speakers with higher-pitched voices.
Furthermore, pitch tracking software can be configured via a number of parameters (such as the minimum and maximum pitch values to consider). Changing these parameters can lead to dramatically different results. For these reasons, the output of a pitch tracker should never be assumed to be the absolute 'true' pitch.
Noisy recording environments can also cause problems for pitch tracking software. In these cases, the quality of the resynthesized pitch can be poor. By poor I mean that either the desired effect will not be perceived (e.g. an utterance-final rising pitch to signal a question) or the output will be distorted, with the speaker sounding robotic or with "pops" in the output.
Similarly, if the target pitch differs too dramatically from the original audio, the resynthesized output can also sound distorted.
In summary, in a pitch resynthesis task:
* Voiced sounds are affected by resynthesis; unvoiced sounds aren't
* Pitch resynthesis in Praat depends on its pitch tracking software
* Pitch trackers estimate pitch--extracted pitch should not be treated as 'truth'.
* Changing the parameters of the pitch tracker can lead to very different analyses
* Poor recording quality, deep voices, and large differences in pitch can lead to less accurate pitch tracking
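As a quick illustration, the convenience functions defined at the top of this notebook can plot a pitch contour produced by Praat's pitch tracker; unvoiced segments simply contribute no points to the contour. The PitchTier file name below is an assumption--substitute any pitch file you have extracted with Praat.
In [ ]:
# A minimal sketch reusing doPlot() from above. The PitchTier file name is
# hypothetical -- point it at any pitch file extracted with Praat.
fig, axis = plt.subplots()
doPlot(axis, "Pitch contour (voiced regions only)", join(inputPath, "mary1.PitchTier"))
plt.show()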
ProMo requires my Python library praatIO, as well as Praat itself. Praat doesn't require installation.
ProMo can also create visualizations of the morph process if you have matplotlib installed.
Finally, if you want to use the pitch interpolation feature, you need to have scipy installed.
matplotlib and scipy can both be complicated to install, unless you use pip, like so:
pip install matplotlib
pip install scipy
or
python -m pip install matplotlib
python -m pip install scipy
ProMo can easily be installed with pip. If you have trouble installing ProMo with pip, please visit the ProMo GitHub page for other installation options.
In [4]:
!pip install promo --upgrade
Section 1.2: Pitch manipulation (opens in a new notebook)