In [1]:
%matplotlib inline
import numpy, scipy, matplotlib.pyplot as plt, IPython.display as ipd
import librosa, librosa.display
import stanford_mir; stanford_mir.init()

Audio Representation

In performance, musicians convert sheet music representations into sound, which travels through the air as oscillations in air pressure. In essence, sound is simply air vibrating (Wikipedia). Sound propagates through the air as longitudinal waves, i.e., the oscillations are parallel to the direction of propagation.

Audio refers to the production, transmission, or reception of sounds that are audible by humans. An audio signal is a representation of sound that encodes the fluctuation in air pressure caused by the vibration as a function of time. Unlike sheet music or symbolic representations, audio representations encode everything necessary to reproduce an acoustic realization of a piece of music. However, note parameters such as onsets, durations, and pitches are not encoded explicitly. This makes converting from an audio representation to a symbolic representation a difficult and ill-defined task.

Waveforms and the Time Domain

The basic representation of an audio signal is in the time domain.

Let's listen to a file:


In [2]:
x, sr = librosa.load('audio/c_strum.wav')
ipd.Audio(x, rate=sr)


Out[2]:

(If you get an error using librosa.load, you may need to install ffmpeg.)

The change in air pressure at a certain time is graphically represented by a pressure-time plot, or simply waveform.

To plot a waveform, use librosa.display.waveplot (renamed to librosa.display.waveshow in librosa 0.10 and later):


In [3]:
plt.figure(figsize=(15, 5))
librosa.display.waveplot(x, sr, alpha=0.8)


Out[3]:
<matplotlib.collections.PolyCollection at 0x10f956a20>

Digital computers can only capture this data at discrete moments in time. The rate at which a computer captures audio data is called the sampling frequency (often abbreviated fs) or sampling rate (often abbreviated sr). CD recordings are sampled at 44100 Hz; in this workshop, however, we will mostly work at 22050 Hz, the default rate to which librosa.load resamples audio.
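
By default, librosa.load resamples whatever it reads to 22050 Hz; you can override this with the sr argument. A minimal sketch (not one of the original cells) of the three common choices:

x_default, sr_default = librosa.load('audio/c_strum.wav')          # librosa's default rate, 22050 Hz
x_native, sr_native = librosa.load('audio/c_strum.wav', sr=None)   # keep the file's native rate
x_cd, sr_cd = librosa.load('audio/c_strum.wav', sr=44100)          # resample to the CD rate, 44100 Hz
print(sr_default, sr_native, sr_cd)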

Timbre: Temporal Indicators

Timbre is the quality of sound that distinguishes the tone of different instruments and voices even if the sounds have the same pitch and loudness.

One characteristic of timbre is its temporal evolution. The envelope of a signal is a smooth curve that approximates the amplitude extremes of a waveform over time.
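
One rough way to visualize the envelope of the c_strum.wav recording loaded above is to overlay its frame-wise RMS energy on the waveform. This is only a sketch, assuming librosa 0.7 or later for librosa.feature.rms:

x, sr = librosa.load('audio/c_strum.wav')
rms = librosa.feature.rms(y=x)[0]                        # frame-wise RMS energy
t_rms = librosa.frames_to_time(range(len(rms)), sr=sr)   # frame centers in seconds
plt.figure(figsize=(15, 5))
librosa.display.waveplot(x, sr, alpha=0.4)
plt.plot(t_rms, rms, color='r')                          # rough amplitude envelope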

Envelopes are often modeled by the ADSR model (Wikipedia) which describes four phases of a sound: attack, decay, sustain, release.

During the attack phase, the sound builds up, usually with noise-like components over a broad frequency range. Such a short, noise-like burst at the start of a sound is often called a transient.

During the decay phase, the amplitude falls from its initial peak toward the sustain level, and the sound stabilizes into a steady periodic pattern.

During the sustain phase, the energy remains fairly constant.

During the release phase, the sound fades away.

The ADSR model is a simplification and does not necessarily model the amplitude envelopes of all sounds.
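
To make the four phases concrete, here is a small sketch that builds a piecewise-linear ADSR envelope and applies it to a sine tone. The durations and sustain level below are arbitrary illustrative values, not taken from any particular instrument:

sr = 22050
f0 = 440.0
attack, decay, sustain_time, release = 0.05, 0.1, 0.5, 0.3   # phase durations in seconds (made up)
sustain_level = 0.6                                          # fraction of peak amplitude (made up)
env = numpy.concatenate([
    numpy.linspace(0, 1, int(attack*sr)),                    # attack: ramp up to the peak
    numpy.linspace(1, sustain_level, int(decay*sr)),         # decay: fall to the sustain level
    numpy.full(int(sustain_time*sr), sustain_level),         # sustain: hold roughly constant
    numpy.linspace(sustain_level, 0, int(release*sr)),       # release: fade to silence
])
t = numpy.arange(len(env))/sr
x_adsr = 0.3*env*numpy.sin(2*numpy.pi*f0*t)
ipd.Audio(x_adsr, rate=sr)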


In [4]:
ipd.Image("https://upload.wikimedia.org/wikipedia/commons/thumb/e/ea/ADSR_parameter.svg/640px-ADSR_parameter.svg.png")


Out[4]:

Timbre: Spectral Indicators

Another property used to characterize timbre is the existence of partials and their relative strengths. Partials are the dominant frequencies in a musical tone, with the lowest partial being the fundamental frequency.

The partials of a sound are visualized with a spectrogram. A spectrogram shows the intensity of frequency components over time. (See Fourier Transform and Short-Time Fourier Transform for more.)
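
As a quick illustration (not one of the original cells), here is a spectrogram of the c_strum.wav recording loaded earlier, computed with a short-time Fourier transform and displayed on a decibel scale:

x, sr = librosa.load('audio/c_strum.wav')
S = librosa.stft(x)                                 # short-time Fourier transform
S_db = librosa.amplitude_to_db(numpy.absolute(S))   # magnitude on a decibel scale
plt.figure(figsize=(14, 5))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz')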

Pure Tone

Let's synthesize a pure tone at 1047 Hz, concert C6:


In [5]:
T = 2.0 # seconds
f0 = 1047.0
sr = 22050
t = numpy.linspace(0, T, int(T*sr), endpoint=False) # time variable
x = 0.1*numpy.sin(2*numpy.pi*f0*t)
ipd.Audio(x, rate=sr)


Out[5]:

Display the spectrum of the pure tone:


In [6]:
import scipy.fft                 # in SciPy 1.4+, fft lives in the scipy.fft submodule
X = scipy.fft.fft(x[:4096])      # spectrum of the first 4096 samples
X_mag = numpy.absolute(X)        # spectral magnitude
f = numpy.linspace(0, sr, 4096, endpoint=False)  # frequency of each FFT bin
plt.figure(figsize=(14, 5))
plt.plot(f[:2000], X_mag[:2000]) # magnitude spectrum
plt.xlabel('Frequency (Hz)')


Out[6]:
Text(0.5,0,'Frequency (Hz)')

Oboe

Let's listen to an oboe playing a C6:


In [7]:
x, sr = librosa.load('audio/oboe_c6.wav')
ipd.Audio(x, rate=sr)


Out[7]:

In [8]:
print(x.shape)


(23625,)

Display the spectrum of the oboe:


In [9]:
X = scipy.fft.fft(x[10000:14096])  # 4096 samples from the sustained portion of the note
X_mag = numpy.absolute(X)
plt.figure(figsize=(14, 5))
plt.plot(f[:2000], X_mag[:2000]) # magnitude spectrum
plt.xlabel('Frequency (Hz)')


Out[9]:
Text(0.5,0,'Frequency (Hz)')

Clarinet

Let's listen to a clarinet playing a concert C6:


In [10]:
x, sr = librosa.load('audio/clarinet_c6.wav')
ipd.Audio(x, rate=sr)


Out[10]:

In [11]:
print(x.shape)


(51386,)

In [12]:
X = scipy.fft.fft(x[10000:14096])  # 4096 samples from the sustained portion of the note
X_mag = numpy.absolute(X)
plt.figure(figsize=(14, 5))
plt.plot(f[:2000], X_mag[:2000]) # magnitude spectrum
plt.xlabel('Frequency (Hz)')


Out[12]:
Text(0.5,0,'Frequency (Hz)')

Notice the difference in the relative amplitudes of the partial components. All three signals have approximately the same pitch and fundamental frequency, yet their timbres differ.
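
To make the comparison a bit more concrete, here is a rough sketch that samples the clarinet magnitude spectrum computed above near integer multiples of the 1047 Hz fundamental and prints the strength of each partial. The five-bin search window on each side is an arbitrary allowance for tuning error, so treat the numbers as approximate:

f0 = 1047.0
for k in range(1, 6):
    bin_k = int(round(k*f0/sr*4096))        # FFT bin nearest the k-th harmonic
    mag_k = X_mag[bin_k-5:bin_k+6].max()    # peak magnitude within a small neighborhood
    print('partial %d (~%.0f Hz): magnitude %.2f' % (k, k*f0, mag_k))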