The grading scheme below was followed for the correction of the third assignment, out of a total of 100 points plus 10 bonus points. You'll also find some comments and common mistakes.
Thanks for your work! It was quite good overall. Some of you did very well, and I read quite a few interesting comments throughout.
First, a general remark: use vectorized code (via numpy or pandas) instead of loops. It's much more efficient, because the loop is either executed in optimized C (or Fortran) code, or the operation is carried out by the CPU's single instruction, multiple data (SIMD) unit (cf. the MMX, SSE, and AVX instructions of x86 CPUs from Intel and AMD).
Below are some examples from your submissions:
err = np.sum(np.abs(labels - genres))
is better than
err = len([1 for i in range(len(labels)) if labels[i] != genres[i]])

np.mean(mfcc, axis=1)
is better than
[np.mean(x) for x in mfcc]

weights = np.exp(-distances**2 / kernel_width**2)
is better than
for i in range(2000):
    for j in range(i, 2000):
        weights[i, j] = math.exp(-distances[i, j]**2 / kernel_width**2)
If, for some reason, you cannot vectorize your code, consider using numba or Cython.
If you wrote any loops in your submission, please look at my solution for ways to avoid them. The vectorized version is both faster and easier to understand.
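To make the comparison concrete, here is a small sketch (with random, made-up data) checking that the vectorized kernel expression computes the same weights as its loop equivalent:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a matrix of pairwise distances and a kernel width.
distances = rng.random((200, 200))
kernel_width = 0.5

# Vectorized: one expression, the loop runs in optimized C code.
weights = np.exp(-distances**2 / kernel_width**2)

# Explicit Python loop: same result, computed one entry at a time.
weights_loop = np.empty_like(distances)
for i in range(distances.shape[0]):
    for j in range(distances.shape[1]):
        weights_loop[i, j] = math.exp(-distances[i, j]**2 / kernel_width**2)

assert np.allclose(weights, weights_loop)
```

On a few thousand nodes, the vectorized version is typically orders of magnitude faster (try %timeit in a notebook).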
Convert to an integer with int(), to a string with str(), etc.
The apply or map functions are handy.
Use .apply(get_genre) instead of .apply(lambda x: get_genre(x)). The anonymous function is useless if you don't alter the argument.
To compute the degrees, W.sum(0) will do.
Don't use np.linalg.inv(D)
to invert the degree matrix. Inverting a diagonal matrix is straightforward: for the normalized Laplacian, np.diag(1 / np.sqrt(degrees)) directly gives D^(-1/2).
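As a sketch on a toy graph (the 3-node weight matrix below is made up), the degrees and the normalized Laplacian follow without any call to np.linalg.inv:

```python
import numpy as np

# Made-up symmetric weight matrix of a 3-node graph.
W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])

degrees = W.sum(0)  # degree of each node, no loop needed

# D^(-1/2) without np.linalg.inv: invert the diagonal entries directly.
D_inv_sqrt = np.diag(1 / np.sqrt(degrees))

# Normalized Laplacian L = I - D^(-1/2) W D^(-1/2).
laplacian = np.identity(len(degrees)) - D_inv_sqrt @ W @ D_inv_sqrt
```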
We don't use the eigensolvers from scipy.sparse merely because they allow us to choose the number of eigenvectors to return; we use them because they implement efficient algorithms for the partial eigendecomposition of sparse matrices.
Calling eigsh
with which='SM', which='SA', or sigma=0 are all correct approaches. See the [solution] for more information.
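As a sketch (the path graph below is made up for illustration), eigsh with which='SM' computes only the k smallest-magnitude eigenpairs instead of the full eigendecomposition:

```python
import numpy as np
import scipy.sparse as sparse
from scipy.sparse.linalg import eigsh

# Made-up graph: a path on 10 nodes, as a sparse weight matrix.
n = 10
W = sparse.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1])

# Combinatorial Laplacian L = D - W.
degrees = np.asarray(W.sum(0)).ravel()
L = sparse.diags(degrees) - W

# Partial eigendecomposition: only the k=3 smallest-magnitude eigenpairs.
eigenvalues, eigenvectors = eigsh(L.tocsc(), k=3, which='SM')

# A connected graph's Laplacian always has eigenvalue 0.
assert abs(eigenvalues).min() < 1e-8
```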