The Sound of Shazam

Sonified audio fingerprints.

(simplified version)

We'll be using IPython for playback and librosa for audio analysis:


In [2]:
%pylab inline
import IPython
import librosa as lr


Populating the interactive namespace from numpy and matplotlib

Step 1: load some sample audio and get the spectrogram


In [3]:
# read audio
y, sr = lr.load('rick.wav', sr=44100, mono=True)

In [4]:
# short-time Fourier transform
n = 4096
hop = n // 2   # hop length must be an integer (n/2 is a float in Python 3)
Y = lr.stft(y, n_fft=n, hop_length=hop)

In [5]:
# crop the spectrogram: keep only bins below fmax
fmax = 8000.0
maxbin = int(fmax / sr * n)   # frequency-to-bin conversion
Yc = Y[:maxbin, :]

# magnitudes
Ym = np.abs(Yc)

# normalize to [0, 1]
Yn = Ym / np.max(Ym)
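
As a quick sanity check (not in the original notebook), librosa's fft_frequencies helper maps bin indices back to Hz; the crop above keeps 743 bins, topping out just below 8 kHz:

freqs = lr.fft_frequencies(sr=sr, n_fft=n)
print(maxbin, freqs[maxbin - 1])   # 743, roughly 7989 Hz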

In [6]:
# plot sample
fig, ax = subplots(figsize=(12, 4))
ax.imshow(Yn, origin='lower', interpolation='nearest', aspect='auto', cmap='binary')


Out[6]:
<matplotlib.image.AxesImage at 0x1082310d0>

Step 2: peak detection


In [7]:
from scipy.ndimage import grey_dilation   # the scipy.ndimage.morphology path is deprecated in recent SciPy

In [8]:
# neighbourhood size: (frequency bins, time frames)
mask_size = (32, 20)

# dilate: each pixel becomes the maximum over its neighbourhood
mask = grey_dilation(Yn, size=mask_size)

# plot mask
fig, ax = subplots(figsize=(12, 4))
ax.imshow(mask, origin='lower', interpolation='nearest', aspect='auto', cmap='binary')


Out[8]:
<matplotlib.image.AxesImage at 0x108718d10>

In [9]:
# peak-picking trick: a pixel is a local maximum exactly where the image equals its dilation
peaks = Yn * (Yn == mask)

# plot peaks
fig, ax = subplots(figsize=(12, 4))
ax.imshow(Yn==mask, origin='lower', interpolation='nearest', aspect='auto', cmap='binary')


Out[9]:
<matplotlib.image.AxesImage at 0x10d88c910>
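
Why this works: grey dilation replaces every pixel with the maximum over its neighbourhood, so a pixel equals its dilated value exactly at the local maxima (ties on flat regions also pass). A minimal standalone 1-D sketch of the same trick:

import numpy as np
from scipy.ndimage import grey_dilation

x = np.array([0., 1., 3., 2., 0., 5., 4., 1.])
d = grey_dilation(x, size=3)   # sliding-window maximum of width 3
print(x == d)                  # True only at the local maxima (indices 2 and 5)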

Step 3: reconstruct a peak-only signal using the inverse STFT


In [10]:
# embed the real-valued peaks into a full-size, zero-phase complex spectrogram
Z = np.zeros(Y.shape, dtype=complex)
Z[:peaks.shape[0], :] = peaks

In [11]:
# istft the result...
z = lr.istft(Z, hop_length=hop)
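
One design note: Z carries the peak magnitudes with zero phase, so the inverse STFT above reconstructs with a flat phase. If a smoother result is preferred, one alternative (not what this notebook does) is to estimate phase from the magnitudes with Griffin-Lim, roughly along these lines, assuming a librosa version that ships griffinlim:

Zm = np.zeros(Y.shape)                # magnitude-only spectrogram, full height
Zm[:peaks.shape[0], :] = peaks
z_gl = lr.griffinlim(Zm, hop_length=hop, n_fft=n)   # iterative phase estimation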

In [12]:
# let's listen!
IPython.display.Audio(data=z, rate=sr)


Out[12]:
(embedded audio player)
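
To keep the sonified fingerprint around outside the notebook, the signal can be written to disk; a minimal sketch assuming the soundfile package is available (the filename is just an example):

import soundfile as sf
sf.write('rick_peaks.wav', z, sr)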