NMF Source Separation

Piano notes consist of a very stable stack of harmonics for each note, with a smooth decay. This kind of sound event can be well captured by separable time and frequency components, i.e., a single spectral profile modulated in time by the decaying envelope of each note event. NMF (Nonnegative Matrix Factorization) tries to find the best decomposition of a matrix into a fixed number of spectrum/envelope pairs. Under the right circumstances, these components can end up being individual piano notes - although controlling it can be tricky. This practical uses the NMF implementation from sklearn (which is wrapped by librosa) to decompose a brief excerpt of piano.

In [1]:
%pylab inline
from __future__ import print_function
import os
import time
import IPython

import numpy as np

import librosa

Populating the interactive namespace from numpy and matplotlib

In [2]:
def my_specshow(S, sr, t_max=None, f_max=None, **kwargs):
    """Call librosa.specshow but limit the time and frequency ranges."""
    librosa.display.specshow(S, sr=sr, cmap='jet', x_axis='time', y_axis='linear', **kwargs)
    if t_max == None:
        t_max = librosa.frames_to_time(S.shape[1])
    tick_frames = librosa.time_to_frames(np.arange(round(t_max) + 1.0))
    plt.xlim([0, tick_frames[-1]])
    if f_max == None:
        f_max = S.shape[0] * sr / n_fft
    freqs = librosa.fft_frequencies()
    max_bin = np.flatnonzero(freqs >= f_max)[0]
    plt.ylim([0, max_bin])
    librosa.display.time_ticks(tick_frames, librosa.frames_to_time(tick_frames), n_ticks=len(tick_frames))
    librosa.display.frequency_ticks(freqs[:max_bin], axis='y')

In [3]:
y, sr = librosa.load('gould-wtc1.wav', sr=None)
print(sr, y.shape)

11025 (330750,)

In [4]:
# STFT analysis of the piano excerpt.
n_fft = 1024
S = librosa.stft(y, n_fft=n_fft)
plt.figure(figsize=(10, 4))
# Zoom in on the first few notes, and the lowest few harmonics.
t_max = 5.0
f_max = 2000.0
my_specshow(librosa.logamplitude(S), sr=sr, t_max=t_max, f_max=f_max)
IPython.display.Audio(y, rate=sr)


In [5]:
# Filter out the first three notes by a low-pass filter, applied directly
# on the spectrogram.  It looks like cutting at 450 Hz should do it, so 
# zero out all the bins above that.
cutoff_bin = int(round(450.0 / sr * n_fft))
S_lpf = S.copy()
S_lpf[cutoff_bin:] = 0
# Convert spectrogram back to audio.
y_lpf = librosa.istft(S_lpf)
plt.figure(figsize=(10, 4))
# Re-analyze back into STFT
my_specshow(librosa.logamplitude(librosa.stft(y_lpf, n_fft=n_fft)), sr=sr, t_max=t_max, f_max=f_max)
plt.clim([-56, 12])
IPython.display.Audio(y_lpf, rate=sr)