In [74]:
import numpy as np
from scipy.io import wavfile
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy.ndimage import generate_binary_structure, iterate_structure, binary_erosion
import hashlib

Audio Fingerprinting

The goal here is to be able to take a sample audio file and create a set of fingerprints that define that file. To that end, we're going to do the following:

  • create a spectrogram of the file
  • find the 'peaks' in the audio
  • filter out low peaks
  • create pairs of peaks separated by some time delta
  • hash those pairs with the time delta information

those hashes are our fingerprints.

Creating the Spectrogram


In [2]:
samplerate, channels = wavfile.read('../data/Sor1929.wav')  # -> (sample rate, samples x channels)
mono = np.mean(channels, axis=1)  # average the stereo channels down to mono

In [3]:
# constants
WINDOW_SIZE = 4096    # FFT window size: granularity of our time chunks
SAMPLE_RATE = 44100   # CD-quality rate; should match the samplerate we read from the file
OVERLAP_RATIO = 0.5   # fraction by which consecutive chunks overlap
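
Quick sanity check on what these constants buy us - the time/frequency resolution they imply (plain arithmetic, nothing read from the file):

In [ ]:
freq_resolution = SAMPLE_RATE / float(WINDOW_SIZE)  # ~10.77 Hz per frequency bin
hop = WINDOW_SIZE * (1 - OVERLAP_RATIO)             # 2048 samples between chunk starts
time_resolution = hop / SAMPLE_RATE                 # ~46 ms per time bin
freq_resolution, time_resolution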

In [4]:
# specgram returns (spectrum, freqs, bins, image); we only want the spectrum
res = plt.specgram(mono,
                   NFFT=WINDOW_SIZE,
                   Fs=SAMPLE_RATE,
                   window=mlab.window_hanning,
                   noverlap=int(WINDOW_SIZE * OVERLAP_RATIO))[0]


Put the spectrogram on a log (dB) scale:


In [5]:
# log it!
res = 10 * np.log10(res)

# log10(0) gives -inf, which is terrible downstream; replace with zeros.
res[res == -np.inf] = 0
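
Those -inf values come from taking log10 of empty (all-zero) spectrogram bins:

In [ ]:
# log10 of a zero bin is -inf (numpy warns about the divide-by-zero)
10 * np.log10(np.array([1.0, 0.0]))  # -> array([  0., -inf])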

In [8]:
def heatmap(m):
    # quick helper: colored matrix plot plus a colorbar
    plt.pcolor(m)
    plt.colorbar()

In [9]:
heatmap(res)


Peak Detection

Okay, so what we want is to detect all the peaks in the spectrogram.

How do we do this?

First, we can define a neighborhood around a particular point, and define some point as a peak if it's greater than all the points in its neighborhood.

How do we define a neighborhood? Make a boolean structuring element (just a numpy array) using scipy's generate_binary_structure:


In [10]:
from scipy.ndimage import generate_binary_structure, iterate_structure

s = generate_binary_structure(2, 1)
s


Out[10]:
array([[False,  True, False],
       [ True,  True,  True],
       [False,  True, False]], dtype=bool)

We can extend this filter by using iterate_structure:


In [11]:
neighborhood = iterate_structure(s, 5).astype(int)
neighborhood


Out[11]:
array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

In [12]:
# make it a little bigger
neighborhood = iterate_structure(s, 20).astype(int)
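
(Each iteration dilates the structure with itself, so iterating $n$ times turns the 3x3 cross into a $(2n+1) \times (2n+1)$ diamond - here that means 41x41:)

In [ ]:
neighborhood.shape  # -> (41, 41)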

Okay, great. How do we apply this to our spectrogram?

We can use a maximum filter, which redefines each point in our matrix to be the maximum in its neighborhood:


In [13]:
from scipy.ndimage import maximum_filter
local_max = maximum_filter(res, footprint=neighborhood)
local_max


Out[13]:
array([[ 47.70849849,  47.70849849,  47.70849849, ...,  53.78828931,
         53.78828931,  53.78828931],
       [ 47.70849849,  47.70849849,  47.70849849, ...,  53.78828931,
         53.78828931,  53.78828931],
       [ 51.62175634,  47.70849849,  47.70849849, ...,  53.78828931,
         53.78828931,  53.78828931],
       ..., 
       [-47.48694411, -47.48694411, -47.48694411, ..., -46.33640966,
        -46.33640966, -46.33640966],
       [-47.48694411, -47.48694411, -47.48694411, ..., -46.33640966,
        -46.33640966, -46.33640966],
       [-47.48694411, -47.48694411, -47.48694411, ..., -46.33640966,
        -46.33640966, -46.33640966]])

Then, we can check where the local_max (maximum in the neighborhood) is equal to our original values.

The points for which this is true are our peaks:


In [14]:
peaks = local_max == res
peaks


Out[14]:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ..., 
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]], dtype=bool)
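
Before moving on, here's a toy 1-D version of that equality test, just to see what it does - note how the flat runs of zeros at the ends also come back True, which is the background problem we deal with next:

In [ ]:
toy = np.array([0, 0, 3, 1, 0, 0, 5, 2, 0, 0])
toy == maximum_filter(toy, size=3)
# -> array([ True, False,  True, False, False, False,  True, False, False,  True])
#    real peaks at indices 2 and 6, but the flat zeros at the edges "tie" too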

So now, we can get the actual maximum points.

One thing we should do is filter out the background around the peaks - the equality test above is also true anywhere the spectrogram is flat (our zeroed-out background ties with itself), not just at real peaks.

So we do a binary erosion of the background, shrinking it a little, and then knock it out of the peak mask, in order to get the actual peaks and not a bunch of space around the peaks as well:


In [17]:
background = (res == 0)
eroded_background = binary_erosion(background, structure=neighborhood,
                                   border_value=1)
actual_peaks = peaks & ~eroded_background  # keep peaks, drop the flat background
actual_peaks


Out[17]:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ..., 
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]], dtype=bool)

Great! Now we have all the peaks from this spectrogram.

Maybe we can plot it?


In [33]:
fig = plt.figure()
ax1 = fig.add_subplot(111)

y, x = actual_peaks.astype(int).nonzero()
ax1.pcolor(res)
#cax = ax1.imshow(res, interpolation='nearest', cmap=cm.coolwarm)
#fig.colorbar(cax)
ax1.scatter(x, y, c='blue')

#plt.pcolor(res)
plt.show()


Peak Filtering

I think it's weird that we're seeing a bunch of peaks in the low blue areas, but maybe that's just because of the file?

Maybe we should remove peaks that are below some threshold:


In [38]:
# threshold for amplitudes
AMPLITUDE_THRESHOLD = 20

# get amplitudes of peaks
amplitudes = res[actual_peaks].flatten()

# x & y of peaks
y, x = actual_peaks.astype(int).nonzero()

# tuple up the (time, frequency, amplitude) triples
all_peaks = list(zip(x, y, amplitudes))  # list() so we can slice it below

now, we've got a list of tuples of all the peaks, where:

  • t[0] = the time (remember, we made a spectrogram? time bins.)
  • t[1] = the frequency
  • t[2] = the amplitude

(time, frequency, amplitude)

here are the last 5 peaks in the list:


In [43]:
all_peaks[-5:]


Out[43]:
[(201, 2041, -46.336409661576212),
 (120, 2042, -46.963677847479246),
 (1, 2043, -47.486944106157551),
 (147, 2044, -45.355392898475841),
 (173, 2047, -47.413819600093831)]

So now we can filter them:


In [42]:
# filter the peaks
filtered_peaks = [p for p in all_peaks if p[2] > AMPLITUDE_THRESHOLD]
filtered_peaks[-5:]


Out[42]:
[(196, 763, 26.803332308264206),
 (0, 779, 21.460365502635383),
 (156, 786, 22.355241916207387),
 (210, 829, 21.9251353126052),
 (147, 858, 23.023283655961642)]

Now, we can create the fingerprint from the times/frequencies (tossing the amplitudes):


In [44]:
fingerprint = ([p[0] for p in filtered_peaks], 
               [p[1] for p in filtered_peaks])

In [50]:
fig = plt.figure()
ax1 = fig.add_subplot(111)

x, y = fingerprint[0], fingerprint[1]
ax1.pcolor(res)
#cax = ax1.imshow(res, interpolation='nearest', cmap=cm.coolwarm)
#fig.colorbar(cax)
ax1.scatter(x, y, c='blue')

#plt.pcolor(res)
plt.show()


Looking pretty good to me!

How many peaks do we have for this file?


In [51]:
len(fingerprint[0])


Out[51]:
194

Fingerprinting

194 is a good feature reduction from thousands of samples, I think :).

Next step: how do we use these peaks to find matches?

Well, really, a song can be thought of as a series of pairs of peaks, separated by time deltas.

We can fingerprint a song by creating a series of hashed (peak_freq_1, peak_freq_2, time_delta) tuples.

Then, when we look for matches for a given song $B$, we can:

  1. fingerprint song $B$, collecting a series of $N$ hashed (peak_freq_1, peak_freq_2, time_delta) tuples (let's call them $t_i$).
  2. for each $t_i$ from song $B$ (i.e. $\forall i \in \{1, \dots, N\}$), check if we have a hash collision for $t_i$ among our stored files.
  3. check which song file has the most hash collisions (maybe set a threshold?). If the number of collisions for some file is above our threshold, return that song as a match - there's a sketch of this lookup right after the list.
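
Here's a minimal sketch of that lookup. Everything in it is hypothetical for now - stored_index would map each hash to the songs containing it (we haven't built anything like that yet), and the threshold of 50 is made up:

In [ ]:
from collections import Counter

def best_match(query_hashes, stored_index, threshold=50):
    # count hash collisions per stored song
    collisions = Counter()
    for h in query_hashes:
        for song_id in stored_index.get(h, []):
            collisions[song_id] += 1
    if not collisions:
        return None
    song_id, count = collisions.most_common(1)[0]
    # only call it a match if it clears the threshold
    return song_id if count >= threshold else None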

First, we need to define how many pairs of peaks are created for each individual peak. Let's use $\frac{N}{10}$ to start, where $N$ is our peak count ($194 / 10 \approx 19$):


In [65]:
fingerprint_pairs = 19

We also need to define a time range - we don't want one peak to be paired with another if they're too far apart in time. Maybe 10 time bins is good?


In [64]:
fingerprint_time_delta = 10

Okay, let's actually make the fingerprints. This can probably be faster.


In [86]:
# sort by time (I think the other guy does it by frequency? not sure why...)
sorted_peaks = sorted(filtered_peaks, key=lambda p: p[0])

# of the form (f1, f2, time_delta)
fingerprints = []
for i, peak in enumerate(sorted_peaks):
    # get the next (fingerprint_pairs - 1) peaks after the current one
    potential_pairs = sorted_peaks[i+1:i+fingerprint_pairs]
    # get rid of the ones that are too far away in time
    potential_pairs = [p for p in potential_pairs if p[0] - peak[0] < fingerprint_time_delta]
    # create the (f1, f2, time_delta) tuples
    prints = [(peak[1], p[1], (p[0] - peak[0])) for p in potential_pairs]
    fingerprints.extend(prints)
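
One way this could be faster without changing the result: since sorted_peaks is time-sorted, the deltas within each slice only grow, so we can break out early instead of building and then filtering intermediate lists. A sketch (make_fingerprints is my name for it, not from any library; the defaults mirror the constants above):

In [ ]:
def make_fingerprints(peaks, fan_out=19, max_dt=10):
    out = []
    for i, (t1, f1, _) in enumerate(peaks):
        # same window as the slice above: at most fan_out - 1 partners per peak
        for t2, f2, _ in peaks[i + 1 : i + fan_out]:
            dt = t2 - t1
            if dt >= max_dt:
                break  # time-sorted, so every later peak is even further away
            out.append((f1, f2, dt))
    return out

make_fingerprints(sorted_peaks) should hand back the same list of tuples as the loop above.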

Okay, how many fingerprints is that?


In [68]:
len(fingerprints)


Out[68]:
1740

Still not bad. These are gonna be hashes, remember, and this is the whole song.

Let's just check what they look like:


In [73]:
fingerprints[:10]


Out[73]:
[(115, 137, 0),
 (115, 183, 0),
 (115, 252, 0),
 (115, 275, 0),
 (115, 321, 0),
 (115, 390, 0),
 (115, 458, 0),
 (115, 481, 0),
 (115, 505, 0),
 (115, 550, 0)]

I'm not sure about having pairs at time_delta == 0, but I'll leave them there for now.

Okay, great. Now we just have to hash them....

MD5?


In [81]:
hashes = []
for fprint in fingerprints:
    # md5 wants bytes, so encode the "f1|f2|dt" string
    h = hashlib.md5('{0}|{1}|{2}'.format(fprint[0], fprint[1], fprint[2]).encode('utf-8'))
    hashes.append(h.hexdigest())

Okay, what do they look like?


In [83]:
hashes[:5]


Out[83]:
['d525254e088e88327c2598b2a0037f84',
 'd324142f471966af15f16ac8a6c0aa73',
 '7e94cac1b350c9a6fa3c99169efd8aac',
 '59ecdbfc8d113ea2203601f213fb0414',
 'b830f81888f436bab78b35f4603b530e']

Cool! Now there's just the question of how we're going to do efficient searches over these hashes.

Can we use a database? Is that cheating? Not sure...
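
For a first pass, though, a plain dict already gives us average O(1) lookups. A minimal in-memory sketch, tagging every hash with a made-up song id (a real system would want persistence - probably that database):

In [ ]:
index = {}
for h in hashes:
    index.setdefault(h, []).append('Sor1929')  # hypothetical song id

# looking up one query hash is then just:
index.get(hashes[0], [])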