In [1]:
%matplotlib inline
import numpy, scipy, matplotlib.pyplot as plt, IPython.display as ipd
import librosa, librosa.display
import stanford_mir; stanford_mir.init()
To detect note onsets, we want to locate sudden changes in the audio signal that mark the beginning of transient regions. Often, an increase in the signal's amplitude envelope will denote an onset candidate. However, that is not always the case, for notes can change from one pitch to another without changing amplitude, e.g. a violin playing slurred notes.
Novelty functions denote local changes in signal properties such as energy or spectral content. We will look at two novelty functions: an energy-based novelty function and a spectral-based novelty function.
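Playing a note often coincides with a sudden increase in signal energy. To detect this sudden increase, we will compute an energy novelty function (FMP, p. 307): compute the short-time energy of the signal, take the first-order difference of that energy, and half-wave rectify the difference.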
First, load an audio file into the NumPy array x and sampling rate sr.
In [2]:
x, sr = librosa.load('audio/simple_loop.wav')
print(x.shape, sr)
Plot the signal:
In [3]:
plt.figure(figsize=(14, 5))
librosa.display.waveplot(x, sr)
Out[3]:
Listen:
In [4]:
ipd.Audio(x, rate=sr)
Out[4]:
librosa.feature.rmse returns the root-mean-square (RMS) energy for each frame of audio. We will compute the RMS energy as well as its first-order difference.
In [5]:
hop_length = 512
frame_length = 1024
rmse = librosa.feature.rmse(x, frame_length=frame_length, hop_length=hop_length).flatten()
rmse_diff = numpy.zeros_like(rmse)
rmse_diff[1:] = numpy.diff(rmse)
In [6]:
print(rmse.shape)
print(rmse_diff.shape)
To obtain an energy novelty function, we perform half-wave rectification (FMP, p. 307) on rmse_diff, i.e. any negative values are set to zero. Equivalently, we can apply the function $\max(0, x)$:
In [7]:
# Half-wave rectification: elementwise max(0, x).
energy_novelty = numpy.maximum(0, rmse_diff)
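The three steps above (frame-wise RMS, first-order difference, half-wave rectification) can also be sketched in plain NumPy. This is a minimal illustration and not how librosa computes it: the helper names frame_rms and energy_novelty_sketch are hypothetical, the frames are not centered or padded, and the test signal (silence followed by a 440 Hz tone) is made up for the example.

```python
import numpy

def frame_rms(x, frame_length=1024, hop_length=512):
    """RMS energy of each frame (no centering or padding; an assumption of this sketch)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    return numpy.array([
        numpy.sqrt(numpy.mean(x[i*hop_length : i*hop_length + frame_length]**2))
        for i in range(n_frames)
    ])

def energy_novelty_sketch(x, frame_length=1024, hop_length=512):
    """First-order difference of the RMS energy, followed by half-wave rectification."""
    rms = frame_rms(x, frame_length, hop_length)
    diff = numpy.zeros_like(rms)
    diff[1:] = numpy.diff(rms)          # keep the same length as rms
    return numpy.maximum(0, diff)       # set negative values to zero

# Hypothetical test signal: 0.5 s of silence followed by 0.5 s of a 440 Hz tone.
sr_demo = 22050
n_demo = sr_demo // 2
tone_demo = numpy.sin(2 * numpy.pi * 440 * numpy.arange(n_demo) / sr_demo)
x_demo = numpy.concatenate([numpy.zeros(n_demo), tone_demo])

novelty_demo = energy_novelty_sketch(x_demo)
peak_frame = int(numpy.argmax(novelty_demo))
print(peak_frame, peak_frame * 512 / sr_demo)   # frame index and start time of the peak
```

The largest value should fall on a frame that straddles the tone's entry at 0.5 seconds, and the function is zero throughout the silent part.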
Plot all three functions together:
In [8]:
frames = numpy.arange(len(rmse))
t = librosa.frames_to_time(frames, sr=sr)
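For intuition, the conversion from frame indices to seconds is simple arithmetic. The sketch below assumes a hop length of 512 samples (the value used above, which to our understanding is also the default in librosa.frames_to_time) and no additional offset. The function name frames_to_time_sketch is hypothetical.

```python
import numpy

def frames_to_time_sketch(frames, sr=22050, hop_length=512):
    """Time in seconds at which each frame starts: frame * hop_length / sr."""
    return numpy.asarray(frames) * hop_length / sr

print(frames_to_time_sketch([0, 1, 2, 43]))
```

Consecutive frames are therefore hop_length/sr seconds apart, which sets the time resolution of the novelty function.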
In [9]:
plt.figure(figsize=(15, 6))
plt.plot(t, rmse, 'b--', t, rmse_diff, 'g--^', t, energy_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('RMSE', 'delta RMSE', 'energy novelty'))
Out[9]:
The human perception of sound intensity is logarithmic in nature. To account for this property, we can apply a logarithm function to the energy before taking the first-order difference.
Because $\log(x)$ diverges as $x$ approaches zero, a common alternative is to use $\log(1 + \lambda x)$. This function equals zero when $x$ is zero, but it behaves like $\log(\lambda x)$ when $\lambda x$ is large. This operation is sometimes called logarithmic compression (FMP, p. 310).
In [10]:
log_rmse = numpy.log1p(10*rmse)
log_rmse_diff = numpy.zeros_like(log_rmse)
log_rmse_diff[1:] = numpy.diff(log_rmse)
In [11]:
# Half-wave rectification: elementwise max(0, x).
log_energy_novelty = numpy.maximum(0, log_rmse_diff)
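The two properties of logarithmic compression mentioned above can be checked numerically. This is a small sketch using $\lambda = 10$, the same factor applied to rmse above. The sample values in x_vals are arbitrary and chosen only for illustration.

```python
import numpy

lam = 10.0                                   # compression factor (lambda)
x_vals = numpy.array([0.0, 1e-4, 1e-2, 1.0, 100.0])

compressed = numpy.log1p(lam * x_vals)       # log(1 + lam*x), finite at x = 0
with numpy.errstate(divide='ignore'):
    plain = numpy.log(lam * x_vals)          # log(lam*x), -inf at x = 0

for x_val, c, p in zip(x_vals, compressed, plain):
    print(f"x={x_val:g}  log(1+lam*x)={c:.4f}  log(lam*x)={p:.4f}")
```

At x = 0 the compressed value is exactly zero while the plain logarithm diverges, and for the largest x the two columns nearly coincide.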
In [12]:
plt.figure(figsize=(15, 6))
plt.plot(t, log_rmse, 'b--', t, log_rmse_diff, 'g--^', t, log_energy_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('log RMSE', 'delta log RMSE', 'log energy novelty'))
Out[12]:
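There are two problems with the energy novelty function: it is sensitive to energy fluctuations within a single note, and it is not sensitive to spectral changes between notes whose amplitude stays the same.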
For example, consider the following audio signal composed of pure tones of equal magnitude:
In [13]:
sr = 22050
def generate_tone(midi):
    T = 0.5  # duration of each tone in seconds
    t = numpy.linspace(0, T, int(T*sr), endpoint=False)
    f = librosa.midi_to_hz(midi)
    return numpy.sin(2*numpy.pi*f*t)
In [14]:
x = numpy.concatenate([generate_tone(midi) for midi in [48, 52, 55, 60, 64, 67, 72, 76, 79, 84]])
Listen:
In [15]:
ipd.Audio(x, rate=sr)
Out[15]:
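Because every tone has the same amplitude, the RMS energy remains roughly constant and the energy novelty function stays near zero, even at the note changes: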
In [16]:
hop_length = 512
frame_length = 1024
rmse = librosa.feature.rmse(x, frame_length=frame_length, hop_length=hop_length).flatten()
rmse_diff = numpy.zeros_like(rmse)
rmse_diff[1:] = numpy.diff(rmse)
In [17]:
energy_novelty = numpy.maximum(0, rmse_diff)
In [18]:
frames = numpy.arange(len(rmse))
t = librosa.frames_to_time(frames, sr=sr)
In [19]:
plt.figure(figsize=(15, 4))
plt.plot(t, rmse, 'b--', t, rmse_diff, 'g--^', t, energy_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('RMSE', 'delta RMSE', 'energy novelty'))
Out[19]:
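Instead, we will compute a spectral novelty function (FMP, p. 309): compute a log-amplitude spectrogram, apply the first-order difference and half-wave rectification within each frequency bin, and sum the result across all frequency bins.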
Luckily, librosa has librosa.onset.onset_strength, which computes a novelty function using spectral flux.
In [20]:
spectral_novelty = librosa.onset.onset_strength(x, sr=sr)
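To make the idea of spectral flux concrete, here is a minimal NumPy sketch of one common formulation: a log-compressed magnitude spectrogram, differenced along time, half-wave rectified, and summed over frequency bins. It is not librosa's implementation, which to our understanding works on a mel spectrogram and will give different values. The name spectral_novelty_sketch, the Hann window, the compression factor lam, and the two-tone test signal are all assumptions made for this example.

```python
import numpy

def spectral_novelty_sketch(x, n_fft=1024, hop_length=512, lam=10.0):
    """Spectral flux from a log-compressed magnitude spectrogram (no centering or padding)."""
    window = numpy.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length
    frames = numpy.stack([
        x[i*hop_length : i*hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    mag = numpy.abs(numpy.fft.rfft(frames, axis=1))   # shape: (n_frames, n_fft//2 + 1)
    log_mag = numpy.log1p(lam * mag)                  # logarithmic compression
    diff = numpy.diff(log_mag, axis=0)                # first-order difference per bin
    flux = numpy.maximum(0, diff).sum(axis=1)         # rectify, then sum over bins
    return numpy.concatenate([[0.0], flux])           # pad so there is one value per frame

# Hypothetical test signal: two tones of equal amplitude, 0.5 s each.
sr_demo = 22050
t_demo = numpy.arange(sr_demo // 2) / sr_demo
x_demo = numpy.concatenate([
    numpy.sin(2 * numpy.pi * 440 * t_demo),
    numpy.sin(2 * numpy.pi * 660 * t_demo),
])

novelty_demo = spectral_novelty_sketch(x_demo)
peak_frame = int(numpy.argmax(novelty_demo))
print(peak_frame, peak_frame * 512 / sr_demo)   # frame index and start time of the peak
```

Although the amplitude never changes, the largest value should land on a frame that straddles the pitch change at 0.5 seconds, which is the case the energy novelty function misses.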
In [21]:
frames = numpy.arange(len(spectral_novelty))
t = librosa.frames_to_time(frames, sr=sr)
In [22]:
plt.figure(figsize=(15, 4))
plt.plot(t, spectral_novelty, 'r-')
plt.xlim(0, t.max())
plt.xlabel('Time (sec)')
plt.legend(('Spectral Novelty',))
Out[22]:
Novelty functions are dependent on frame_length and hop_length. Adjust these two parameters. How do they affect the novelty function?
Try with other audio files. How do the novelty functions compare?
In [23]:
ls audio