NTDS'17 assignment 3: spectral graph theory

Michaël Defferrard, EPFL LTS2

The first two assignments were designed to warm you up. This third assignment is closer to what we expect you to do for the projects. It mostly lacks the exploratory data analysis part (though there is some in sections 5.1 and 5.2). Our plan for this assignment encompasses data collection and exploitation, and goes as follows:

  1. Data collection: use a web API to collect the musical genre of 2000 songs.
  2. Feature extraction: compute features from audio tracks.
  3. Graph construction: construct a nearest-neighbor graph from the features.
  4. Eigendecomposition: factorization of the graph Laplacian (cf. spectral graph theory).
  5. Visualization & Clustering: using the graph and eigenvectors to visualize the dataset and to cluster the tracks.
  6. Conclusion: a reflection on what we did (though observations are scattered throughout).

The data we are using for this assignment is a subset of the Free Music Archive dataset, a collection of 1TB of music with metadata released under a permissive license.


In [ ]:
%matplotlib inline

import configparser
import os

import requests
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
from sklearn import preprocessing, decomposition
import librosa
import IPython.display as ipd

plt.rcParams['figure.figsize'] = (17, 5)

If the above cell fails, it's most probably because you are missing a package. Install missing packages with e.g. conda install tqdm librosa (or pip install tqdm librosa).

1 Data collection

Like in any data project, the first part of the assignment is to collect some data.

1.1 Get the genre of a single track

As is often the case, we need an API key for certain operations. Add the following to your credentials.ini file. I gave out a key during the lab on November 6. If you were not there, ask one of your classmates. Please do not over-use the key (otherwise it will be banned for everybody).

[freemusicarchive]
api_key = MY-KEY

In [ ]:
# Read the confidential API key.
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))
api_key = credentials.get('freemusicarchive', 'api_key')

Your first task is to develop a function to retrieve the genre ID of a track given its track ID using the FMA API.

Hints:

  • A track might have multiple genres associated with it. Always return the first one and discard the others.
    • Note: you should never discard data blindly. I selected the tracks so that this is not a problem.
  • The get_genre function takes an integer as input, the track ID, and returns another integer, the genre ID.

In [ ]:
def get_genre(track_id):
    """Returns the genre of a track by querying the API."""
    BASE_URL = # Your code here.
    url = # Your code here.
    response = # Your code here.
    return # Your code here.

# A correct implementation should pass the below test.
assert get_genre(1219) == 38
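
For reference, here is one possible shape of the implementation. It is only a sketch: the exact endpoint, query parameters, and response fields ('dataset', 'track_genres', 'genre_id') are assumptions to be checked against the FMA API documentation.

In [ ]:
def get_genre(track_id):
    """Returns the genre of a track by querying the API (reference sketch)."""
    # Assumed endpoint and response layout; verify against the FMA API docs.
    BASE_URL = 'https://freemusicarchive.org/api/get/tracks.json'
    url = '{}?track_id={}&api_key={}'.format(BASE_URL, track_id, api_key)
    response = requests.get(url)
    data = response.json()
    # A track may list several genres; keep only the first one.
    return int(data['dataset'][0]['track_genres'][0]['genre_id'])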

1.2 Create a table of tracks

The fma_tracks.csv file contains a list of 2'000 track IDs that we will use through the assignment.


In [ ]:
tracks = pd.read_csv('../data/fma_tracks.csv', index_col=0)
print('You got {} track IDs.'.format(len(tracks)))

Once imported by pandas, each row of the DataFrame represents a track. The index is the track ID and the genre column is the genre ID of each track.


In [ ]:
tracks.iloc[125:130]

1.3 Add the genre to the table

The genre column contains an integer which represents the genre of the track, i.e. the return value of the get_genre function you developed. Because I'm a nice guy, I retrieved the genre for most tracks. Only the genre of the first 10 tracks is missing (zero is a placeholder value).


In [ ]:
tracks.head(10)

Your task is to replace the 10 zeros with the correct genre IDs.

Hints:

  • As we want to apply one function (get_genre) to many data samples (the track IDs), you may want to use a functional approach. Check out tracks.apply() or the built-in map. In Python, you can declare an anonymous function as lambda x: x + 1.
  • Your table should look like the above table, except with the correct number instead of zeros.

In [ ]:
# Your code here.
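
For reference, one possible way to fill in the ten missing genres (a sketch): query the API only for the rows whose genre is still the placeholder value 0, then write the results back into the DataFrame. A plain loop with tracks.at[tid, 'genre'] = get_genre(tid) works just as well.

In [ ]:
# A sketch: restrict the queries to the rows with the placeholder genre 0.
missing = tracks.index[(tracks['genre'] == 0).values]
tracks.loc[missing, 'genre'] = [get_genre(tid) for tid in missing]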

1.4 Save the data

To avoid having to collect the data every time you restart the IPython kernel, save the DataFrame as a CSV file.


In [ ]:
tracks.to_csv('../data/fma_tracks_with_genre.csv')

You can now load it back with the following call instead of running the code in sections 1.1 to 1.3.


In [ ]:
tracks = pd.read_csv('../data/fma_tracks_with_genre.csv', index_col=0)

1.5 Data cleaning

Data cleaning is necessary when dealing with real (as opposed to synthetic) data. In this case, we only need to "summarize the genres". The tracks I've selected for the assignment belong to one of the following two top-level genres: Rock (genre_id=12) or Hip-Hop (genre_id=21). Their actual genre(s) might however be more specific, i.e. a sub-genre of those. For example, Punk is a sub-genre of Rock. You can explore the genre hierarchy on the Free Music Archive. The below function will return the correct top-level genre for any of the sub-genres you'll encounter.


In [ ]:
def get_top_genre(genre_id):
    if genre_id == 0:
        raise ValueError('Invalid genre! Please complete the DataFrame.')
    else:
        return 21 if genre_id in [21, 83, 100, 539, 542, 811] else 12

tracks = tracks.applymap(lambda genre: get_top_genre(genre))
tracks.head(4)

If everything went fine, you should now have 1000 Rock (genre_id=12) and 1000 Hip-Hop (genre_id=21) tracks.


In [ ]:
tracks['genre'].value_counts()

2 Feature extraction

As is often the case, the data at hand is too large to be dealt with directly. We have to represent it with a smaller set of features, chosen to be maximally relevant to the task. (Manual feature extraction can sometimes be replaced by end-to-end learning systems.)

2.1 Get raw data

In the data folder, you'll find audio excerpts for the first 10 tracks listed in the tracks table. The audio is encoded as mp3 and the filename is simply the track ID.

LibROSA is a well-designed Python package to deal with music data, i.e. to load audio, compute spectrograms and extract features. Let's use it and listen to some music.


In [ ]:
def get_path(track_id):
    return os.path.join('..', 'data', '{:06d}.mp3'.format(track_id))

# 1. Get the path to the first file.
filepath = get_path(tracks.index[0])
print('File: {}'.format(filepath))

# 2. Decode the mp3 and load the audio in memory.
audio, sampling_rate = librosa.load(filepath, sr=None, mono=True)
print('Duration: {:.2f}s, {} samples'.format(audio.shape[-1] / sampling_rate, audio.size))

# 3. Load the audio in the browser and play it.
start, end = 7, 17
ipd.Audio(data=audio[start*sampling_rate:end*sampling_rate], rate=sampling_rate)

If the above cell fails with a NoBackendError, it's most probably because you need an mp3 decoder that audioread can call. Try to install ffmpeg with e.g. conda install ffmpeg.

2.2 Audio features

For music, the mel-frequency cepstral coefficients (MFCCs) are often relevant spectral features. Complete the implementation of the compute_mfcc function, which takes a track ID as its sole parameter and returns the coefficients.

Hint:

  • Use the function librosa.feature.mfcc to compute those features.

In [ ]:
N_MFCC = 20

def compute_mfcc(track_id):
    # Your code here.

mfcc = compute_mfcc(tracks.index[0])
assert mfcc.ndim == 2
assert mfcc.shape[0] == N_MFCC
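
For reference, here is a minimal sketch of compute_mfcc. It assumes the get_path helper from section 2.1 and uses the librosa defaults for everything except the number of coefficients.

In [ ]:
def compute_mfcc(track_id):
    """Returns the MFCCs of a track (reference sketch)."""
    # Decode the mp3 at its native sampling rate, mixed down to mono.
    audio, sampling_rate = librosa.load(get_path(track_id), sr=None, mono=True)
    # One column of N_MFCC coefficients per analysis window.
    return librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=N_MFCC)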

2.3 Summary statistics

The compute_mfcc function we developed above computes N_MFCC coefficients per window in time. Notice that we computed N_MFCC x 2582 coefficients for the first track. To have a fixed representation for each complete track (and not for each window of 2048 samples), we need to aggregate those coefficients along time. We'll do it with 7 summary statistics: the minimum, the maximum, the median, and the first 4 moments, i.e. the mean, the standard deviation, the skew and the kurtosis.

Below, we construct the DataFrame that will hold our features. (Note the use of a hierarchical index.) Again, I computed the features for most tracks.


In [ ]:
features = pd.read_csv('../data/fma_features.csv', index_col=0, header=[0, 1, 2])
assert (tracks.index == features.index).all()

features.tail(4)

Though I forgot to compute them for the first 10 tracks. ;-) Complete the below code to do that.

Hints:

  • Functions to compute the mentioned statistics are found in np and stats (from scipy, see imports).
  • We use tqdm to show progress on long computations. For example: for i in tqdm(range(10)): print(i).

In [ ]:
for tid in tqdm(tracks.index[:10]):
    mfcc = compute_mfcc(tid)
    features.at[tid, ('mfcc', 'mean')] = # Your code here.
    features.at[tid, ('mfcc', 'std')] = # Your code here.
    features.at[tid, ('mfcc', 'skew')] = # Your code here.
    features.at[tid, ('mfcc', 'kurtosis')] = # Your code here.
    features.at[tid, ('mfcc', 'median')] = # Your code here.
    features.at[tid, ('mfcc', 'min')] = # Your code here.
    features.at[tid, ('mfcc', 'max')] = # Your code here.

features['mfcc','mean'].head(4)
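
For reference, the right-hand sides of the loop above could look as follows (a sketch): each statistic is taken along the time axis (axis=1), yielding N_MFCC values per summary statistic, using the np and stats functions already imported.

In [ ]:
# A sketch of the aggregation inside the loop above (tid and mfcc as in the loop).
features.at[tid, ('mfcc', 'mean')] = np.mean(mfcc, axis=1)
features.at[tid, ('mfcc', 'std')] = np.std(mfcc, axis=1)
features.at[tid, ('mfcc', 'skew')] = stats.skew(mfcc, axis=1)
features.at[tid, ('mfcc', 'kurtosis')] = stats.kurtosis(mfcc, axis=1)
features.at[tid, ('mfcc', 'median')] = np.median(mfcc, axis=1)
features.at[tid, ('mfcc', 'min')] = np.min(mfcc, axis=1)
features.at[tid, ('mfcc', 'max')] = np.max(mfcc, axis=1)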

2.4 Feature selection

Once you are done with the rest of the assignment and reach section 5.4, come back here and try to identify the most relevant features for our end goal, i.e. data visualization and clustering.


In [ ]:
# Your code here.
# Example: features.drop(('mfcc', 'mean'), axis=1, inplace=True)

2.5 Feature normalization

Most algorithms expect (or work better) if the data is centered and standardized.


In [ ]:
features -= features.mean(axis=0)
features /= features.std(axis=0)

3 Graph construction

As opposed to social networks, the brain, or a road network, this dataset does not come with a natural graph. But we can always build one! In this case, we will build a similarity graph between tracks. In such a graph, each track is represented as a node, and the weight of an edge indicates how similar two tracks are (1 meaning identical, and 0, i.e. no edge, meaning very different). The graph is a discrete approximation of a continuous manifold of unknown (but hopefully lower) dimensionality embedded in a high-dimensional ambient space. Such graphs are useful for e.g. recommendation: if 10 tracks you liked are strongly connected to an 11th one, you'll probably like that one too (if the similarity measure is relevant).

3.1 Distances

The first step in constructing a similarity graph is to define the similarity measure. A good starting point is to define it as a distance between feature vectors (the statistics on the MFCCs we computed above). But which distance? The cosine distance is a good candidate for high-dimensional data. It is defined as follows: $$d(u,v) = 1 - \frac{u \cdot v} {\|u\|_2 \|v\|_2},$$ where $u$ and $v$ are two feature vectors.

Geometrically, what does the cosine distance compute? Your answer here.

Alternatives are the $p$-norms (or $\ell_p$-norms), defined as $$d(u,v) = \|u - v\|_p,$$ of which the Euclidean distance is a special case with $p=2$.

Implement the cosine distance first. Once you are done with the rest of the assignment and reach section 5.4, come back here and try to identify the most relevant metric for our end goal, i.e. data visualization and clustering.

Hints:

  • Save yourself some pain and use pdist and squareform from spatial.distance.

In [ ]:
distances = # Your code here.
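
For reference, a minimal sketch using the scipy functions mentioned in the hint: pdist returns the condensed (upper-triangular) form, and squareform expands it into a full symmetric matrix.

In [ ]:
# A sketch: pairwise cosine distances between feature vectors.
distances = spatial.distance.squareform(spatial.distance.pdist(features, metric='cosine'))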

Looking at the computed values is good i) for data exploration and ii) to gain confidence in the correctness of the computation. Below is a histogram of the distances.


In [ ]:
plt.hist(distances.reshape(-1), bins=50);

In [ ]:
print('{} distances equal exactly zero.'.format(np.sum(distances == 0)))

Why are some distances equal to zero? Your answer here.

3.2 Weighted adjacency matrix

From our distances, we want to compute edge weights. If the distance is short, which means the tracks are similar, we want a strong edge. The most widespread kernel for that task is the Gaussian kernel, defined as $$\mathbf{W}(u,v) = \exp \left( \frac{-d^2(u, v)}{\sigma^2} \right),$$ where $\sigma$ is a parameter which controls the width of the kernel.

Bonus. Can you think of another suitable kernel? Your answer here.

Compute below the weight of each edge.

Hints:

  • At first, you can set the kernel width, $\sigma$, to the mean of the distances. (It will preserve the distribution of distances.)
  • Don't forget to set the diagonal to zero! We don't want self-connections.

In [ ]:
kernel_width = distances.mean()
weights = # Your code here.

# Your code here.
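
For reference, a sketch of the Gaussian kernel applied to the distances, with the diagonal zeroed out to remove self-connections.

In [ ]:
# A sketch: Gaussian kernel on the pairwise distances.
weights = np.exp(-distances**2 / kernel_width**2)
# No self-connections.
np.fill_diagonal(weights, 0)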

So we just created a fully connected graph. For our algorithms to be more efficient, we want to sparsify it. There are two main ways to sparsify a graph: i) thresholding, i.e. removing all the edges whose weight is smaller than some $\epsilon$, and ii) keeping the $k$ strongest edges for each node. Implement the second option below. You can start with $k=100$.

Hints:

  • You can use np.argsort to sort the weights.
  • Be sure that your weight matrix stays symmetric.
  • Look at both the sparsity pattern and the weight distribution as an indication of correctness.

In [ ]:
fig, axes = plt.subplots(2, 2, figsize=(17, 8))
def plot(weights, axes):
    axes[0].spy(weights)
    axes[1].hist(weights[weights > 0].reshape(-1), bins=50);
plot(weights, axes[:, 0])

NEIGHBORS = 100
# Your code here.

plot(weights, axes[:, 1])
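
For reference, a sketch of one way to keep the $k$ strongest edges per node. It zeroes out, row by row, all but the NEIGHBORS largest weights, then symmetrizes by keeping an edge if either of its endpoints kept it.

In [ ]:
# A sketch of the kNN sparsification (would replace the placeholder above).
weakest = np.argsort(weights, axis=1)[:, :-NEIGHBORS]  # indices of the weakest edges per row
rows = np.arange(weights.shape[0])[:, np.newaxis]
weights[rows, weakest] = 0
weights = np.maximum(weights, weights.T)  # keep the adjacency matrix symmetric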

3.3 Bonus: visualize the adjacency matrix

Can you think of a way to observe if the two genres form clusters in the graph we created? You should only use the weight matrix weights and the genre labels tracks['genre'].


In [ ]:
# Your code here.
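
One possible approach, as a sketch: reorder the nodes by genre and look at the sparsity pattern of the reordered weight matrix. Two dense diagonal blocks would suggest that the genres form clusters.

In [ ]:
# A sketch: sort the nodes by genre and display the reordered weight matrix.
order = np.argsort(tracks['genre'].values)
plt.spy(weights[order][:, order]);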

3.4 Degrees

Compute below the degree vector.

Hints:

  • Again, looking at the degree distribution will help you identify any mistake.

In [ ]:
degrees = # Your code here.

plt.hist(degrees, bins=50);
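
For reference, a minimal sketch: the (weighted) degree of a node is the sum of the weights of its incident edges, i.e. a row sum of the weight matrix.

In [ ]:
# A sketch: weighted degrees as row sums of the adjacency matrix.
degrees = weights.sum(axis=1)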

3.5 Graph Laplacian

We will later need the Fiedler vector. Shall we compute the combinatorial or the normalized Laplacian? Your answer here.

Implement your choice below.

Hints:

  • Compare the sparsity pattern of the Laplacian to the one of the weight matrix.

In [ ]:
laplacian = # Your code here.

plt.spy(laplacian);
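
Whichever Laplacian you settle on, here is a sketch of both, assuming a dense weights matrix and strictly positive degrees.

In [ ]:
# Combinatorial Laplacian: L = D - W.
#laplacian = np.diag(degrees) - weights

# Normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
d_inv_sqrt = np.diag(1 / np.sqrt(degrees))
laplacian = np.identity(len(degrees)) - d_inv_sqrt @ weights @ d_inv_sqrt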

For efficient storage and computation, we store the Laplacian as a sparse matrix in compressed sparse row (CSR) format.


In [ ]:
laplacian = sparse.csr_matrix(laplacian)

Compute the number of remaining edges in the graph.


In [ ]:
# Your code here.
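
A sketch of the edge count, assuming the weight matrix is symmetric with a zero diagonal, so that each undirected edge appears twice among the non-zero entries.

In [ ]:
# A sketch: each undirected edge is counted twice in a symmetric matrix.
n_edges = np.count_nonzero(weights) // 2
print('{} edges remain in the sparsified graph.'.format(n_edges))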

4 Eigendecomposition of the graph Laplacian

The whole point of spectral graph theory is to compute the eigendecomposition of the Laplacian. We however don't need the full eigendecomposition (a.k.a. the graph Fourier basis) here. Compute the 10 eigenvectors with the smallest eigenvalues with one of the following functions: np.linalg.eig, np.linalg.eigh, sparse.linalg.eigs, sparse.linalg.eigsh. Justify your choice.

Your answer here.


In [ ]:
eigenvalues, eigenvectors = # Your code here.
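
For reference, one possible call (a sketch) using the sparse symmetric solver, which only computes the requested eigenpairs: which='SM' asks for the eigenvalues of smallest magnitude.

In [ ]:
# A sketch: the 10 eigenpairs with the smallest eigenvalues of the symmetric Laplacian.
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')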

In [ ]:
plt.plot(eigenvalues, '.-', markersize=15);

Why can we diagonalize the graph Laplacian in the first place? Your answer here.

Why are all the eigenvalues real? Your answer here.

Why are all the eigenvalues non-negative? Your answer here.

4.1 Connectedness

Is the graph connected? Justify. Knowing how we built the graph, can we ensure it is connected?


In [ ]:
# Your code here.
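
One way to check connectedness programmatically, as a sketch using scipy's graph routines on the weight matrix (counting the zero eigenvalues of the Laplacian is another).

In [ ]:
# A sketch: count the connected components of the graph.
from scipy.sparse import csgraph
n_components, _ = csgraph.connected_components(sparse.csr_matrix(weights), directed=False)
print('The graph has {} connected component(s).'.format(n_components))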

Your answer here.

4.2 Eigenvector question

What do you expect as the result of the below computation? Justify. Do you get the value you expected? If not, why?

Note that x @ y (introduced in Python 3.5) is equivalent to np.matmul(x, y). You should prefer the former as it makes it easier to read formulas.


In [ ]:
np.sum(laplacian @ (2 * eigenvectors[:, 0]))

Your answer here.

5 Visualization and clustering

Finally, let's use the data and graph we prepared. When exploring data, it's often useful to visualize an entire dataset. Because for us humans it's hard to look at data in 140 dimensions, we need to somehow reduce the dimensionality to 2 or 3 and visualize the data in this more familiar space. While such a reduction will obviously be destructive, many algorithms have been developed to preserve certain properties.

5.1 Principal component analysis (PCA)

PCA is a standard algorithm to reduce the dimensionality of a dataset. It computes the axes of principal variance and projects the data onto them. It only uses the features we computed, not the graph. We show it here for comparison. (We use scikit-learn, a very convenient library for Machine Learning.)


In [ ]:
features_pca = decomposition.PCA(n_components=2).fit_transform(features)
genres = preprocessing.LabelEncoder().fit_transform(tracks['genre'])
plt.scatter(features_pca[:, 0], features_pca[:, 1], c=genres, cmap='RdBu', alpha=0.5);

5.2 Graph embedding

Instead of using the features directly, we can try to visualize our similarity graph. That graph is initially embedded in an ambient space of 140 dimensions (the number of features, i.e. the number of MFCCs times the number of summary statistics): each node has a position in a 140-dimensional Euclidean space. Because we cannot visualize such a high-dimensional space, we want to embed the graph in a 2D space.

One way to embed a graph is to use the values of the eigenvectors as coordinates. For the below plot, use the values of the second eigenvector as the x coordinates of the nodes, and the values of the third eigenvector as the y coordinates. The color of each node is given by its genre.

Why don't we use the first eigenvector? Your answer here.


In [ ]:
x = # Your code here.
y = # Your code here.
plt.scatter(x, y, c=genres, cmap='RdBu', alpha=0.5);
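
For reference, a minimal sketch of the coordinates used in the plot above, assuming the eigenvalues (and hence the eigenvector columns) are sorted in increasing order.

In [ ]:
# A sketch: the second and third eigenvectors (columns 1 and 2) as 2D coordinates.
x = eigenvectors[:, 1]
y = eigenvectors[:, 2]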

See how well this plot summarizes 2GB of data and 2000 tracks! We could now design a playlist generator as a random walk on this similarity graph and visualize its trajectory as it hops from track to track. :)

5.3 Clustering

Note how we did not try to build a machine to recognize the musical genre of a given track (that would have been a classification problem). We merely tried to visualize the data, by means of PCA and a graph embedding algorithm. What does it tell us that the genres clearly appear in our visualizations?

Your answer here.

As such, we can try to cluster the tracks with the Fiedler vector, and see whether the (unsupervised) clustering agrees with the ground-truth genre categorization. Reproduce the above scatter plot below, but with the sign of the Fiedler vector as the color instead of the genre.


In [ ]:
labels = # Your code here.

plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);
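
For reference, a sketch of the cluster assignment, assuming the eigenvectors are sorted by increasing eigenvalue so that the Fiedler vector is the second column.

In [ ]:
# A sketch: two clusters from the sign of the Fiedler vector.
labels = (eigenvectors[:, 1] >= 0).astype(int)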

5.4 Error rate

How many tracks were wrongly categorized by the Fiedler vector, according to the Rock / Hip-Hop ground truth?


In [ ]:
err = # Your code here.
print('{} errors ({:.2%})'.format(err, err/len(labels)))
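
For reference, a sketch of one way to count the errors. Since the two cluster labels are arbitrary, we take the better of the two possible matchings with the encoded genres from section 5.1.

In [ ]:
# A sketch: errors under the best matching between cluster labels and encoded genres.
err = min(np.sum(labels != genres), np.sum(labels == genres))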

Tune some parameters (e.g. kernel_width, NEIGHBORS), discard some features (section 2.4), or change the distance metric (section 3.1) to get fewer errors. You should reach an error rate below 15% (i.e. fewer than 300 errors in total). Try to understand the effect of each parameter. After data cleaning, parameter tuning is the other dirty work of a data scientist! Be aware that tuning the parameters on a specific dataset will lead to overfitting.

6 Conclusion

Among other things, this assignment showed us that a graph can be useful for e.g. visualization or clustering, even when there is none in the original data. One of the design goals of this assignment, while dealing with real data, was to follow the complete Data Science process, from data acquisition to interpretation of the results. The exploitation of the data showed us that a machine can discern musical genres by looking at pairwise distances between spectral features extracted from audio recordings.

6.1 Bonus

What is the name of the technique we used to visualize the data in the last two plots? What does it try to preserve when reducing the dimensionality (of the ambient space) from 140 to 2?

Your answer here.