After a description of some theoretical approaches to pitch detection, first in the time domain and then in the frequency domain, algorithms used in industry such as the Yin algorithm are covered.
Sources:
Philip McLeod: Fast, Accurate Pitch Detection Tools for Music Analysis (doctoral thesis)
Stanford MIR author: Steve Tjoa
Librosa author: Brian McFee
Pyo has the Yin algorithm. author: ?
Python code for some of the simpler pitch-detection algorithms described below, such as zero-crossing, FFT peak, or autocorrelation.
Here is a C++ library with a Python wrapper: aubio
Calculate pitch for a window of audio of length W.
Autocorrelation function (ACF), type I. Idea: look at every possible lag $ \theta $:
$$ r(\theta) = \sum_{n = 0}^{\frac{W}{2} - 1} x_n \, x_{n+\theta}, \qquad 0 \leq \theta \leq \frac{W}{2} $$
Autocorrelation function (ACF), type II:
$$ r(\theta) = \sum_{n = 0}^{W - \theta - 1} x_n \, x_{n+\theta}, \qquad 0 \leq \theta < W $$
The type II sum decreases as the value of $\theta$ increases. To get the same result as type I, pad with zeros to double the window size and keep $\theta$ in $0 \leq \theta \leq W/2$. Type II works better for non-stationary signals, as the effective center $t'_c$ stays stationary(?).
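As a direct time-domain sketch of the type II sum (the function and variable names are illustrative assumptions, not taken from the thesis):
import numpy as np

def acf_type2(frame, max_lag=None):
    """Type II autocorrelation: r(theta) = sum_{n=0}^{W-theta-1} x[n] * x[n+theta]."""
    W = len(frame)
    if max_lag is None:
        max_lag = W // 2
    r = np.zeros(max_lag + 1)
    for theta in range(max_lag + 1):
        r[theta] = np.dot(frame[:W - theta], frame[theta:])
    return r

# The pitch period is the lag of the highest peak away from theta = 0, so f0 ≈ fs / that lag.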
Autocorrelation via FFT:
Zero-pad the signal window with the number of output terms desired, usually p = W/2
Compute the FFT
Calculate the squared magnitude of each complex term (the power spectral density)
The inverse FFT yields the autocorrelation
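A minimal NumPy sketch of this recipe (the `frame` and `p` names are assumptions; the p = W/2 default follows the text above):
import numpy as np

def acf_via_fft(frame, p=None):
    """Autocorrelation of one window via the power spectral density."""
    W = len(frame)
    if p is None:
        p = W // 2                                   # number of lag terms desired
    padded = np.concatenate([frame, np.zeros(p)])    # zero-pad by p samples
    spectrum = np.fft.rfft(padded)
    psd = np.abs(spectrum) ** 2                      # squared magnitude of each complex term
    return np.fft.irfft(psd, n=len(padded))[:p]      # inverse FFT gives the autocorrelation for lags 0..p-1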
Pros
Cons
Square difference function (SDF)
Much like autocorrelation. Idea: adjacent periods have similar shapes, so subtract the shifted window from the original and find the global minimum.
$$ d'(\theta) = \sum_{i = 0}^{\frac{W}{2} - 1} \left( x_i - x_{i+\theta} \right)^2, \qquad 0 \leq \theta \leq \frac{W}{2} $$
Only a small number of periods per window is needed, unlike autocorrelation. The SDF is the safer option, because unless the window happens to contain a whole number of signal periods, partial waveforms will contribute different amounts depending on phase. There is no need to apply a window function to the signal.
This method only works with a large enough FFT window, leading to lower time resolution.
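A minimal time-domain sketch of $d'(\theta)$ as defined above (the `frame` name is an assumption):
import numpy as np

def square_difference(frame):
    """d'(theta) = sum_{i=0}^{W/2-1} (x[i] - x[i+theta])^2 for theta in 0..W/2."""
    half = len(frame) // 2
    d = np.zeros(half + 1)
    for theta in range(half + 1):
        diff = frame[:half] - frame[theta:theta + half]
        d[theta] = np.dot(diff, diff)
    return d

# The pitch period is the lag of the deepest minimum away from theta = 0.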
Subharmonic-to-harmonic ratio (Sun)
Choose the correct octave: given $F_0$ or its subharmonic.
Sum of harmonic amplitudes: $SH = \sum_{i=1}^{n} X(i F_0)$
Sum of subharmonic amplitudes: $SS = \sum_{i=1}^{n} X\left((i - \tfrac{1}{2}) F_0\right)$
This gives the subharmonic-to-harmonic ratio $SHR = SS / SH$.
Choose $F_0$ or $F_0/2$ as the fundamental depending on whether SHR exceeds a threshold.
Commonly used in speech. The algorithm is modified to use a logarithmic frequency axis.
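A rough sketch of the ratio on a linear frequency axis (the published algorithm works on a logarithmic axis; `spectrum_mag`, `fs`, `n_fft`, and the harmonic count are assumed names and parameters):
import numpy as np

def subharmonic_to_harmonic_ratio(spectrum_mag, f0, fs, n_fft, n_harmonics=5):
    """SHR = sum of amplitudes at (i - 1/2) * f0 over sum of amplitudes at i * f0."""
    def amp_at(freq):
        k = int(round(freq * n_fft / fs))            # nearest FFT bin
        return spectrum_mag[k] if k < len(spectrum_mag) else 0.0
    sh = sum(amp_at(i * f0) for i in range(1, n_harmonics + 1))
    ss = sum(amp_at((i - 0.5) * f0) for i in range(1, n_harmonics + 1))
    return ss / sh if sh > 0 else 0.0

# If the ratio exceeds a chosen threshold, F0/2 is taken as the fundamental instead of F0.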
Cepstral analysis
Bogert, Healy, Tukey -> Noll
The cepstrum can be defined as the power spectrum of the logarithm of the power spectrum of a signal.
The algorithm is a mathematical trick based on the following assumption: the signal $S(n)$, made of fast-varying components, goes through a filter with impulse response $H(n)$ (e.g. the vocal tract), which governs the overall shape of the spectrum with its slowly varying components. We want a linear filter that separates the slow and fast varying components from each other; taking the logarithm of the magnitude spectrum turns their product into a sum, which a further transform can then pick apart.
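A minimal sketch of a cepstral pitch estimate along these lines (not the notebook's code; `frame`, `fs`, and the frequency limits are assumptions):
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate F0 from the real cepstrum: a peak at quefrency q implies f0 = fs / q."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)       # log turns the product of S and H into a sum
    cepstrum = np.fft.irfft(log_mag)
    q_min, q_max = int(fs / fmax), int(fs / fmin)    # search only plausible quefrencies
    peak = q_min + np.argmax(cepstrum[q_min:q_max])
    return fs / peak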
In [1]:
%matplotlib inline
import numpy, scipy, matplotlib.pyplot as plt, sklearn, librosa, librosa.display, urllib.request, IPython.display
#import essentia, essentia.standard as ess
plt.rcParams['figure.figsize'] = (14,4)
In [4]:
url = 'http://audio.musicinformationretrieval.com/simple_loop.wav'
urllib.request.urlretrieve(url, filename='simple_loop.wav')
Out[4]:
In [6]:
x, fs = librosa.load('simple_loop.wav')
librosa.display.waveplot(x, sr=fs)
Out[6]:
In [7]:
IPython.display.Audio(x, rate=fs)
Out[7]:
In [8]:
mfccs = librosa.feature.mfcc(y=x, sr=fs)
print(mfccs.shape)
In this case, mfcc computed 20 MFCCs over 130 frames. The very first MFCC, the 0th coefficient, does not convey information relevant to the overall shape of the spectrum. It only conveys a constant offset, i.e. adding a constant value to the entire spectrum. Therefore, many practitioners will discard the first MFCC when performing classification. For now, we will use the MFCCs as is. Display the MFCCs:
In [9]:
librosa.display.specshow(mfccs, sr=fs, x_axis='time')
Out[9]:
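If you do later want to discard the 0th coefficient as discussed above, a slice is enough (an aside, not part of the original notebook):
mfccs_no_offset = mfccs[1:, :]   # drop the 0th MFCC (constant spectral offset), keeping the other 19 coefficients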
de Cheveigné, Kawahara
The main article is here. The algorithm combines several of the theoretical methods described above, such as autocorrelation, to get the best out of each. It is possible to restrict the pitch range if additional information about it is available.
The steps, with the error rate after each step in parentheses (a sketch of the core steps follows the list):
Autocorrelation (10.0)
Difference function (1.95)
Cumulative mean normalized difference function (weights the value of each lag to help distinguish F0 from F1) (1.69)
Absolute threshold (avoids octave errors when one of the higher-order dips is deeper than the one at F0: choose the smallest lag with a dip below a certain threshold; if no dip falls below it, choose the global minimum) (.78)
Parabolic interpolation (increases precision) (.77)
Best local estimate (.50)
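A compact sketch of steps 2 to 4 (difference function, cumulative mean normalization, absolute threshold); the 0.1 threshold and all names are assumptions rather than the paper's reference implementation, and parabolic interpolation is omitted:
import numpy as np

def yin_f0(frame, fs, threshold=0.1):
    """Rough YIN core: difference function -> CMNDF -> smallest lag below threshold."""
    half = len(frame) // 2
    d = np.zeros(half)
    for tau in range(1, half):                        # step 2: difference function
        diff = frame[:half] - frame[tau:tau + half]
        d[tau] = np.dot(diff, diff)
    cmndf = np.ones(half)                             # step 3: cumulative mean normalization
    cumsum = np.cumsum(d[1:])
    cmndf[1:] = d[1:] * np.arange(1, half) / np.where(cumsum == 0, 1, cumsum)
    below = np.where(cmndf[1:] < threshold)[0] + 1    # step 4: smallest lag whose dip falls below the threshold
    tau = below[0] if len(below) else np.argmin(cmndf[1:]) + 1
    return fs / tau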
Stanford MIR has an implementation, but it relies on Essentia (`ess`), which is deprecated here.
In [ ]:
yin = ess.PitchYinFFT()
spectrum = ess.Spectrum()

def get_pitch(segment):
    if len(segment) < 4096:  # hack to get around Essentia error
        N = len(segment) if len(segment) % 2 == 0 else len(segment) - 1
    else:
        N = 4096
    pitch, pitch_conf = yin(spectrum(segment[:N]))
    return pitch
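If Essentia is not available, newer librosa releases include a Yin implementation; a hedged alternative sketch, assuming librosa >= 0.8 and the `x`, `fs` variables loaded earlier:
import numpy as np
import librosa

# Frame-wise F0 estimates over the clip; fmin/fmax restrict the pitch range, as noted above
f0_track = librosa.yin(x, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=fs)
print(np.median(f0_track))  # crude single-number summary of the clip's pitch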
Algorithm outline (Pseudocode and code are available)