In [74]:
import numpy as np
from scipy.io import wavfile
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy.ndimage.morphology import generate_binary_structure, iterate_structure, binary_erosion
import hashlib
The goal here is to be able to take a sample audio file and create a set of fingerprints that define that file. To that end, we're going to do the following:
- read the audio and mix it down to mono
- compute a spectrogram of it
- find the peaks in the spectrogram
- pair up nearby peaks and hash each (freq1, freq2, time_delta) tuple
Those hashes are our fingerprints.
In [2]:
samplerate, channels = wavfile.read('../data/Sor1929.wav')
mono = np.mean(channels, axis=1)
In [3]:
# constants
WINDOW_SIZE = 4096 # granularity of chunks
SAMPLE_RATE = 44100 # sample rate of the audio, in Hz (should match the samplerate we read above)
OVERLAP_RATIO = 0.5 # amount our chunks can overlap
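A quick sanity check on what these constants mean for the spectrogram's resolution (just arithmetic on the values above):
# each FFT bin covers SAMPLE_RATE / WINDOW_SIZE Hz
freq_resolution = SAMPLE_RATE / float(WINDOW_SIZE)                 # ~10.8 Hz per frequency bin
# each time bin advances by the hop size (window minus overlap)
time_resolution = WINDOW_SIZE * (1 - OVERLAP_RATIO) / SAMPLE_RATE  # ~46 ms per time bin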
In [4]:
res = plt.specgram(mono,
                   NFFT=WINDOW_SIZE,
                   Fs=SAMPLE_RATE,
                   window=mlab.window_hanning,
                   noverlap=int(WINDOW_SIZE * OVERLAP_RATIO))[0]
Put the spectrogram on a log (decibel) scale:
In [5]:
# log it!
res = 10 * np.log10(res)
# log10 of zero gives -np.inf, which is terrible; replace those with zeros.
res[res == -np.inf] = 0
In [8]:
def heatmap(m):
    plt.pcolor(m)
    plt.colorbar()
In [9]:
heatmap(res)
Okay, so what we want is to detect all the peaks in the spectrogram.
How do we do this?
First, we can define a neighborhood around a particular point, and define some point as a peak if it's greater than all the points in its neighborhood.
How do we define a neighborhood? Make a boolean footprint using scipy's generate_binary_structure:
In [10]:
from scipy.ndimage.morphology import generate_binary_structure, iterate_structure
s = generate_binary_structure(2, 1)
s
Out[10]:
array([[False,  True, False],
       [ True,  True,  True],
       [False,  True, False]])
We can extend this filter by using iterate_structure:
In [11]:
neighborhood = iterate_structure(s, 5).astype(int)
neighborhood
Out[11]:
In [12]:
# make it a little bigger
neighborhood = iterate_structure(s, 20).astype(int)
Okay, great. How do we apply this to our spectrogram?
We can use a maximum filter, which will redefine each point in our matrix to be the maximum in its neighborhood:
In [13]:
from scipy.ndimage.filters import maximum_filter
local_max = maximum_filter(res, footprint=neighborhood)
local_max
Out[13]:
Then, we can check where the local_max (the maximum in the neighborhood) is equal to our original values.
The points for which this is true are our peaks:
In [14]:
peaks = local_max==res
peaks
Out[14]:
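To see what that maximum-filter-plus-equality trick is doing, here it is on a tiny toy array (not our spectrogram, just an illustration):
import numpy as np
from scipy.ndimage import maximum_filter

# a tiny "spectrogram" with one obvious peak in the middle
toy = np.array([[1, 2, 1],
                [2, 9, 2],
                [1, 2, 1]])

local = maximum_filter(toy, size=3)   # every cell becomes its neighborhood max (9 here)
print(local == toy)                   # only the true peak survives the comparison
# [[False False False]
#  [False  True False]
#  [False False False]]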
So now, we can get the actual maximum points.
One problem: local_max == res is also true across the flat zero "background" of the spectrogram, so those points get flagged as peaks too - we only want the real peaks, not that background.
To fix this, we find the background (res == 0), shrink it slightly with a binary erosion (so we don't also wipe out real peaks sitting right at its border), and then remove it from our peak mask:
In [17]:
background = (res == 0)
eroded_background = binary_erosion(background, structure=neighborhood,
                                   border_value=1)
# keep only the peaks that aren't part of the eroded background
actual_peaks = peaks & ~eroded_background
actual_peaks
Out[17]:
Great! Now we have all the peaks from this spectrogram.
Maybe we can plot it?
In [33]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
y, x = actual_peaks.astype(int).nonzero()
ax1.pcolor(res)
#cax = ax1.imshow(res, interpolation='nearest', cmap=cm.coolwarm)
#fig.colorbar(cax)
ax1.scatter(x, y, c= 'blue')
#plt.pcolor(res)
plt.show()
In [38]:
# threshold for amplitudes
AMPLITUDE_THRESHOLD = 20
# get amplitudes of peaks
amplitudes = res[actual_peaks].flatten()
# x & y of peaks
y, x = actual_peaks.astype(int).nonzero()
# tuple up the frequency/time stuff:
all_peaks = list(zip(x, y, amplitudes))
Now we've got a list of tuples of all the peaks, where:
- t[0] = the time (remember, we made a spectrogram? time bins.)
- t[1] = the frequency
- t[2] = the amplitude
So each tuple is (time, frequency, amplitude).
Here are the last 5 peaks, with their amplitudes:
In [43]:
all_peaks[-5:]
Out[43]:
So now we can filter them:
In [42]:
# filter the peaks
filtered_peaks = [p for p in all_peaks if p[2] > AMPLITUDE_THRESHOLD]
filtered_peaks[-5:]
Out[42]:
Now, we can create the fingerprint from the times/frequencies (tossing the amplitudes):
In [44]:
fingerprint = ([p[0] for p in filtered_peaks],
               [p[1] for p in filtered_peaks])
In [50]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
x, y = fingerprint[0], fingerprint[1]
ax1.pcolor(res)
#cax = ax1.imshow(res, interpolation='nearest', cmap=cm.coolwarm)
#fig.colorbar(cax)
ax1.scatter(x, y, c= 'blue')
#plt.pcolor(res)
plt.show()
Looking pretty good to me!
How many peaks do we have for this file?
In [51]:
len(fingerprint[0])
Out[51]:
194
194 is a good feature reduction from thousands of samples, I think :).
Next step: how do we use these peaks to find matches?
Well, really, a song can be thought of as a series of pairs of peaks, each pair separated by a time value.
We can fingerprint a song by creating a series of hashed (peak_freq_1, peak_freq_2, time_delta) tuples.
Then, when we look for matches for a given song $B$, we can hash $B$'s peak pairs the same way and see which known song shares the most hashes with it (a rough sketch of that lookup follows).
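Something like this, for instance (just an in-memory sketch - known_songs, build_index, and best_match are made-up names, not anything we've built yet):
from collections import Counter, defaultdict

def build_index(known_songs):
    # known_songs: hypothetical dict of song_id -> list of fingerprint hashes
    index = defaultdict(set)
    for song_id, song_hashes in known_songs.items():
        for h in song_hashes:
            index[h].add(song_id)
    return index

def best_match(index, query_hashes):
    # count how many of the query's hashes each known song shares
    counts = Counter()
    for h in query_hashes:
        for song_id in index.get(h, ()):
            counts[song_id] += 1
    return counts.most_common(1)[0] if counts else None

# e.g. (hypothetical):
# index = build_index({'Sor1929': hashes, ...})
# best_match(index, hashes_from_unknown_clip)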
First, we need to define how many pairs of peaks are created for each individual peak. Let's use roughly $\frac{N}{10}$ (with $N$ being the total number of peaks) to start:
In [65]:
fingerprint_pairs = 19
We also need to define a time range - we don't want one peak to be paired with another if they're too far apart in time. Maybe 10 time bins (about half a second with our hop size) is good?
In [64]:
fingerprint_time_delta = 10
Okay, let's actually make the fingerprints. This can probably be faster.
In [86]:
# sort by time (I think the other guy does it by frequency? not sure why...)
sorted_peaks = sorted(filtered_peaks, key=lambda p: p[0])
# of the form (f1, f2, time_delta)
fingerprints = []
for i, peak in enumerate(sorted_peaks):
    # get all peaks within `fingerprint_pairs` of the current peak
    potential_pairs = sorted_peaks[i+1:i+fingerprint_pairs]
    # get rid of the ones that are too far away in time
    potential_pairs = [p for p in potential_pairs if p[0] - peak[0] < fingerprint_time_delta]
    # create the (f1, f2, time_delta) tuples
    prints = [(peak[1], p[1], p[0] - peak[0]) for p in potential_pairs]
    fingerprints.extend(prints)
Okay, how many fingerprints is that?
In [68]:
len(fingerprints)
Out[68]:
Still not bad. These are gonna be hashes, remember, and this is the whole song.
Let's just check what they look like:
In [73]:
fingerprints[:10]
Out[73]:
I'm not sure about having pairs at time_delta == 0, but I'll leave them there for now.
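(If we change our minds later, dropping them is a one-liner - leaving it commented out for now:)
# optional: drop pairs that land in the same time bin
# fingerprints = [f for f in fingerprints if f[2] > 0]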
Okay, great. Now we just have to hash them....
MD5?
In [81]:
hashes = []
for fprint in fingerprints:
    # hash "f1|f2|time_delta" (encode to bytes so md5 works on Python 3 too)
    h = hashlib.md5('{0}|{1}|{2}'.format(fprint[0], fprint[1], fprint[2]).encode('utf-8'))
    hashes.append(h.hexdigest())
Okay, what do they look like?
In [83]:
hashes[:5]
Out[83]:
Cool! Now there's just the question of how we're going to do efficient searches over these hashes.
Can we use a database? Is that cheating? Not sure...
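For what it's worth, the standard library's sqlite3 would be one way to sketch it - the table layout and the 'Sor1929' label here are just made up for illustration:
import sqlite3

conn = sqlite3.connect('fingerprints.db')
conn.execute('CREATE TABLE IF NOT EXISTS fingerprints (hash TEXT, song_id TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_hash ON fingerprints (hash)')

# store this song's hashes
conn.executemany('INSERT INTO fingerprints VALUES (?, ?)',
                 [(h, 'Sor1929') for h in hashes])
conn.commit()

# look up some query hashes and count matches per song
query = hashes[:100]  # pretend this came from an unknown clip
placeholders = ','.join('?' * len(query))
rows = conn.execute('SELECT song_id, COUNT(*) FROM fingerprints '
                    'WHERE hash IN ({0}) GROUP BY song_id '
                    'ORDER BY COUNT(*) DESC'.format(placeholders), query).fetchall()
print(rows[:3])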