The first two assignments were designed to warm you up. This third assignment is closer to what we expect you to do for the projects. It mostly lacks the exploratory data analysis part (though there is some in sections 5.1 and 5.2). As such, our plan for this assignment encompasses data collection and exploitation, and goes as follows:
The data we are using for this assignment is a subset of the Free Music Archive dataset, a collection of 1TB of music with metadata released under a permissive license.
In [ ]:
%matplotlib inline
In [ ]:
import configparser
import os
import requests
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
from sklearn import preprocessing, decomposition
import librosa
import IPython.display as ipd
In [ ]:
plt.rcParams['figure.figsize'] = (17, 5)
If the above cell fails, it's most probably because you are missing a package. Install the missing packages with e.g. conda install tqdm librosa (or pip install tqdm librosa).
As often, we need an API key for certain operations. Add the following to your credentials.ini file. I gave a key during the lab on November 6. If you were not there, ask one of your classmates. Please do not over-use the key (it'll otherwise be banned for everybody).
[freemusicarchive]
api_key = MY-KEY
In [ ]:
# Read the confidential API key.
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))
api_key = credentials.get('freemusicarchive', 'api_key')
Your first task is to develop a function to retrieve the genre ID of a track given its track ID using the FMA API.
Hints:
The get_genre function takes an integer as input, the track ID, and returns another integer, the genre ID.
In [ ]:
def get_genre(track_id):
    """Returns the genre of a track by querying the API."""
    URL = 'https://freemusicarchive.org/api/get/tracks.json'
    params = {
        'track_id': track_id,
        'api_key': api_key,
    }
    response = requests.get(URL, params).json()
    return int(response['dataset'][0]['track_genres'][0]['genre_id'])

# A correct implementation should pass the below test.
assert get_genre(1219) == 38
In [ ]:
tracks = pd.read_csv('../data/fma_tracks.csv', index_col=0)
print('You got {} track IDs.'.format(len(tracks)))
Once imported by pandas, each row of the DataFrame represents a track. The index is the track ID and the genre column is the genre ID of each track.
In [ ]:
tracks.iloc[125:130]
The genre column contains an integer which represents the genre of the track, i.e. the return value of the get_genre function you developed. Because I'm a nice guy, I retrieved the genre for most tracks. Only the genre of the first 10 tracks is missing (zero is a placeholder value).
In [ ]:
tracks.head(10)
Your task is to replace the 10 zeros with the correct genre IDs.
Hints:
To apply a function (e.g. get_genre) to many data samples (the track IDs), you may want to use a functional approach. Check out tracks.apply() or the built-in map. In Python, you can declare an anonymous function as lambda x: x + 1.
In [ ]:
tracks.loc[tracks.index[:10], 'genre'] = tracks.index[:10].map(get_genre)  # .loc avoids chained-indexing assignment
# Alternative:
# tracks.iloc[:10, 0] = tracks[:10].apply(lambda track: get_genre(track.name), axis=1)
# Alternative:
# tracks.iloc[:10, 0] = list(map(get_genre, tracks.index[:10]))
In [ ]:
tracks.to_csv('../data/fma_tracks_with_genre.csv')
You can now load it back with the following call instead of running the code in sections 1.1 to 1.3.
In [ ]:
tracks = pd.read_csv('../data/fma_tracks_with_genre.csv', index_col=0)
Data cleaning is necessary when dealing with real (as opposed to synthetic) data. In this case, we only need to "summarize the genres". The tracks I've selected for the assignment belong to one of the following two top-level genres: Rock (genre_id=12) and Hip-Hop (genre_id=21). Their actual genre(s) might however be more specific, i.e. a sub-genre of those. For example, Punk is a sub-genre of Rock. You can explore the genre hierarchy on the Free Music Archive. The below function will return the correct top-level genre for any of the sub-genres you'll encounter.
In [ ]:
def get_top_genre(genre_id):
    if genre_id == 0:
        raise ValueError('Invalid genre! Please complete the DataFrame.')
    else:
        return 21 if genre_id in [21, 83, 100, 539, 542, 811] else 12

tracks = tracks.applymap(lambda genre: get_top_genre(genre))
tracks.head(4)
If everything went fine, you should now have 1000 Rock (genre_id=12) and 1000 Hip-Hop (genre_id=21) tracks.
In [ ]:
tracks['genre'].value_counts()
In the data folder, you'll find audio excerpts for the first 10 tracks listed in the tracks table. The audio is encoded as mp3 and the filename is simply the track ID.
LibROSA is a well-designed Python package to deal with music data, e.g. to load audio, compute spectrograms, and extract features. Let's use it and listen to some music.
In [ ]:
def get_path(track_id):
    return os.path.join('..', 'data', '{:06d}.mp3'.format(track_id))

# 1. Get the path to the first file.
filepath = get_path(tracks.index[0])
print('File: {}'.format(filepath))

# 2. Decode the mp3 and load the audio in memory.
audio, sampling_rate = librosa.load(filepath, sr=None, mono=True)
print('Duration: {:.2f}s, {} samples'.format(audio.shape[-1] / sampling_rate, audio.size))

# 3. Load the audio in the browser and play it.
start, end = 7, 17
ipd.Audio(data=audio[start*sampling_rate:end*sampling_rate], rate=sampling_rate)
If the above cell fails with a NoBackendError, it's most probably because you need an mp3 decoder that audioread can call. Try to install ffmpeg with e.g. conda install ffmpeg.
For music, the mel-frequency cepstral coefficients (MFCCs) are often relevant spectral features. Complete the implementation of the compute_mfcc function, which takes a track ID as its sole parameter and returns the coefficients.
Hint:
Use librosa.feature.mfcc to compute those features.
In [ ]:
N_MFCC = 20

def compute_mfcc(track_id):
    audio, sampling_rate = librosa.load(get_path(track_id), sr=None, mono=True)
    return librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=N_MFCC)

mfcc = compute_mfcc(tracks.index[0])
assert mfcc.ndim == 2
assert mfcc.shape[0] == N_MFCC
The compute_mfcc function we developed above computes N_MFCC coefficients per window in time. Notice that we computed N_MFCC x 2582 coefficients for the first track. To have a fixed representation for each complete track (and not for each window of 2048 samples), we need to aggregate those coefficients along time. We'll do it with 7 summary statistics: the minimum, the maximum, the median, and the first 4 moments, i.e. the mean, the standard deviation, the skew and the kurtosis.
Below, we construct the DataFrame that will hold our features. (Note the use of a hierarchical index.) Again, I computed the features for most tracks.
In [ ]:
features = pd.read_csv('../data/fma_features.csv', index_col=0, header=[0, 1, 2])
assert (tracks.index == features.index).all()
features.tail(4)
Though I forgot to compute them for the first 10 tracks. ;-) Complete the below code to do that.
Hints:
Use np and stats (from scipy, see imports).
Use tqdm to show progress on long computations. For example: for i in tqdm(range(10)): print(i).
In [ ]:
for tid in tqdm(tracks.index[:10]):
    mfcc = compute_mfcc(tid)
    features.at[tid, ('mfcc', 'mean')] = np.mean(mfcc, axis=1)
    features.at[tid, ('mfcc', 'std')] = np.std(mfcc, axis=1)
    features.at[tid, ('mfcc', 'skew')] = stats.skew(mfcc, axis=1)
    features.at[tid, ('mfcc', 'kurtosis')] = stats.kurtosis(mfcc, axis=1)
    features.at[tid, ('mfcc', 'median')] = np.median(mfcc, axis=1)
    features.at[tid, ('mfcc', 'min')] = np.min(mfcc, axis=1)
    features.at[tid, ('mfcc', 'max')] = np.max(mfcc, axis=1)

features['mfcc', 'mean'].head(4)
In [ ]:
# Discard some of the MFCC summary statistics (kurtosis, skew, min, max).
features.drop(('mfcc', 'kurtosis'), axis=1, inplace=True)
features.drop(('mfcc', 'skew'), axis=1, inplace=True)
features.drop(('mfcc', 'min'), axis=1, inplace=True)
features.drop(('mfcc', 'max'), axis=1, inplace=True)
In [ ]:
# Standardize each feature column to zero mean and unit variance.
features -= features.mean(axis=0)
features /= features.std(axis=0)
As opposed to social networks, the brain, or a road network, this dataset does not exhibit a natural graph. But we can always build one! In this case, we will build a similarity graph between tracks. In such a graph, each track is represented as a node and the weight of an edge indicates how similar two tracks are (1 meaning identical, and 0, i.e. no edge, meaning very different). The graph is a discrete approximation of a continuous manifold of unknown (but hopefully lower) dimensionality embedded in a high-dimensional ambient space. Such graphs are useful for e.g. recommendation: if 10 tracks you liked are strongly connected to an 11th one, you'll probably like that one too (provided the similarity measure is relevant).
The first step to construct a similarity graph is to define the similarity measure. A natural choice is to derive it from a distance between the feature vectors (the statistics on the MFCCs we computed above). But which distance? The cosine distance is a good candidate for high-dimensional data. It is defined as follows: $$d(u,v) = 1 - \frac{u \cdot v} {\|u\|_2 \|v\|_2},$$ where $u$ and $v$ are two feature vectors.
Geometrically, what does the cosine distance compute? Your answer here. The distance is a monotonically increasing function of the angle formed by the two vectors (0 if they are collinear and point in the same direction, 1 if they are orthogonal, 2 if they point in opposite directions).
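As a quick sanity check (not part of the original notebook), the cell below evaluates spatial.distance.cosine on collinear, orthogonal, and opposite vectors; the three cases should give 0, 1, and 2 respectively.
In [ ]:
# Illustration only: verify the geometric interpretation of the cosine distance.
u = np.array([1., 0.])
print(spatial.distance.cosine(u, 2 * u))                # collinear, same direction -> 0
print(spatial.distance.cosine(u, np.array([0., 1.])))   # orthogonal -> 1
print(spatial.distance.cosine(u, -u))                   # opposite direction -> 2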
Alternatives are the $p$-norms (or $\ell_p$-norms), defined as $$d(u,v) = \|u - v\|_p,$$ of which the Euclidean distance is a special case with $p=2$.
Implement the cosine distance first. Once you are done with the rest of the assignment and reach section 5.4, come back here and try to identify the most relevant metric for our end goal, i.e. data visualization and clustering.
Hints:
Use pdist and squareform from spatial.distance.
In [ ]:
#distances = spatial.distance.pdist(features, metric='euclidean') # Lowest error rate.
distances = spatial.distance.pdist(features, metric='cosine') # Nicest LE plot.
#distances = spatial.distance.pdist(features, metric='cityblock')
#distances = spatial.distance.pdist(features, metric='minkowski', p=1.4) # Lowest error rate.
distances = spatial.distance.squareform(distances)
Looking at the computed values is good for i) data exploration and ii) gaining confidence in the correctness of the computation. Below is a histogram of the distances.
In [ ]:
plt.hist(distances.reshape(-1), bins=50);
In [ ]:
print('{} distances equal exactly zero.'.format(np.sum(distances == 0)))
Why are some distances equal to zero? Your answer here. The diagonal represents the distance from a track to itself, and $d(u,u) = 0$. As there are 2000 tracks, there should be at least 2000 zeros. Moreover, tracks 74348 and 86761 are duplicates (the same song from the same artist released on a different album) and thus share the same feature vectors. Leaving aside numerical differences, we should thus have two (due to symmetry) additional zeros.
In [ ]:
features.loc[74348].equals(features.loc[86761])
From our distances, we want to compute edge weights. If the distance is short, which means the tracks are similar, we want a strong edge. The most widespread kernel for that task is the Gaussian kernel, defined as $$\mathbf{W}(u,v) = \exp \left( \frac{-d^2(u, v)}{\sigma^2} \right),$$ where $\sigma$ is a parameter which controls the width of the kernel.
Bonus. Can you think of another suitable kernel? Your answer here.
The role of the kernel is to transform a distance into a similarity measure which will weight the edges of our graph. As such, the kernel should be positive and monotonically decreasing with respect to the distance. It should take its maximum value when the distance is zero, and tend to zero when the distance increases. Note that distances are non-negative by definition. So any function $f : \mathbb{R}^+ \rightarrow [0,C]$ that verifies $f(0)=C$ and $\lim_{x \rightarrow +\infty}f(x)=0$ and is strictly decreasing would be suitable. How well a kernel fits a problem should be tested empirically.
Some examples:
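For instance (a sketch with an arbitrary width $\sigma = 1$, not the list from the original notebook), the Gaussian, exponential, and Cauchy kernels below all satisfy the above requirements:
In [ ]:
# Illustration: three candidate kernels that map a non-negative distance to a
# similarity in (0, 1] and decrease monotonically with the distance.
d = np.linspace(0, 4, 100)
sigma = 1
plt.plot(d, np.exp(-d**2 / sigma**2), label='Gaussian')
plt.plot(d, np.exp(-d / sigma), label='exponential')
plt.plot(d, 1 / (1 + (d / sigma)**2), label='Cauchy')
plt.legend();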
Compute below the weight of each edge.
Hints:
In [ ]:
kernel_width = distances.mean()
weights = np.exp(-distances**2 / kernel_width**2)
np.fill_diagonal(weights, 0)
So we just created a fully connected graph. For our algorithms to be more efficient, we want to sparsify it. There are two main ways to sparsify a graph: i) thresholding, i.e. removing all edges with weights smaller than some $\epsilon$, and ii) keeping the $k$ strongest edges per node. Implement the second option below. You can start with $k=100$.
Hints:
Use np.argsort to sort the weights.
In [ ]:
fig, axes = plt.subplots(2, 2, figsize=(17, 8))

def plot(weights, axes):
    axes[0].spy(weights)
    axes[1].hist(weights[weights > 0].reshape(-1), bins=50)

plot(weights, axes[:, 0])
NEIGHBORS = 100
# Alternative:
# indices = np.argpartition(weights, -NEIGHBORS, axis=1)[:, :-NEIGHBORS]
# weights[np.arange(weights.shape[0])[:, None], indices] = 0
# weights = np.maximum(weights, weights.T)
# Alternative:
# mask = weights < np.sort(weights, axis=1)[:, -NEIGHBORS]
# weights[mask & mask.T] = 0
partition = np.argpartition(weights, -NEIGHBORS, axis=1)
maximum = weights[np.arange(weights.shape[0]), partition[:, -NEIGHBORS]]
mask = weights < maximum
weights[mask & mask.T] = 0
# Epsilon graph. Not good because not connected.
# epsilon = np.percentile(weights, 80)
# weights[weights < epsilon] = 0
# A correct implementation should pass the below tests.
assert np.all(weights == weights.T)
assert np.all(np.count_nonzero(weights, 0) >= NEIGHBORS)
plot(weights, axes[:, 1])
In [ ]:
# Saving the graph for the fourth assignment.
w = sparse.csr_matrix(weights)
sparse.save_npz('../data/fma_graph.npz', w)
In [ ]:
indices = np.argsort(tracks['genre'])
tracks = tracks.iloc[indices]
features = features.iloc[indices]
weights = weights[indices, :][:, indices]
distances = distances[indices, :][:, indices]
fig, axes = plt.subplots(1, 2)
axes[0].spy(weights)
axes[1].imshow(distances)
line = mpl.lines.Line2D((1000, 1000), (0, 2000))
axes[0].add_line(line)
line = mpl.lines.Line2D((0, 2000), (1000, 1000))
axes[0].add_line(line);
It is clear from the above matrices that there are more connections (and shorter distances) between tracks of the same genre than across genres.
In [ ]:
degrees = weights.sum(0)
plt.hist(np.count_nonzero(weights, 0), bins=50, label='non-weighted degree distribution')
plt.hist(degrees, bins=50, label='weighted degree distribution')
plt.legend();
We will later need the Fiedler vector. Shall we compute the combinatorial or the normalized Laplacian? Your answer here. Both choices are valid. Clustering with the sign of the Fiedler vector of the combinatorial Laplacian is a relaxation of the RatioCut. A relaxation of the NormalizedCut is obtained with the normalized Laplacian. The RatioCut is normalized by the number of nodes in the clusters we wish to separate. The NormalizedCut is normalized by the volume of each cluster, which is the sum of its degrees. Both normalizations of the MinCut seek to balance the clusters: for example, we'd like to avoid separating one node from the rest of the graph. Both methods are viable in our case, as we know that the nodes are perfectly balanced between the two genres. In practice, the normalized Laplacian is often preferred. In von Luxburg's "A tutorial on spectral clustering", it is claimed that the normalized Laplacian is preferable when the degree distribution is broad. In the end, the choice of Laplacian is task-dependent: think of your end goal. If you value clustering, choose the one that gives you the better clusters. If you value visualization, choose the one that gives a better visual representation of the data at hand. In our case, the normalized Laplacian does better (from my point of view).
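For reference, for a bipartition of the nodes into $A$ and its complement $\bar{A}$, the two objectives mentioned above are (as defined in von Luxburg's tutorial) $$\operatorname{RatioCut}(A, \bar{A}) = \operatorname{cut}(A, \bar{A}) \left( \frac{1}{|A|} + \frac{1}{|\bar{A}|} \right), \qquad \operatorname{NCut}(A, \bar{A}) = \operatorname{cut}(A, \bar{A}) \left( \frac{1}{\operatorname{vol}(A)} + \frac{1}{\operatorname{vol}(\bar{A})} \right),$$ where $\operatorname{cut}(A, \bar{A})$ is the total weight of the edges crossing the partition, $|A|$ the number of nodes in $A$, and $\operatorname{vol}(A)$ the sum of their degrees.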
Implement your choice below.
Hints:
In [ ]:
# Combinatorial Laplacian.
laplacian = np.diag(degrees) - weights
# Normalized Laplacian.
deg_inv = np.diag(1 / np.sqrt(degrees))
laplacian = deg_inv @ laplacian @ deg_inv
# Alternatively:
# laplacian = np.identity(weights.shape[0]) - deg_inv @ weights @ deg_inv
plt.spy(laplacian);
For efficient storage and computation, we can store this sparse matrix in a compressed sparse row (CSR) format.
In [ ]:
laplacian = sparse.csr_matrix(laplacian)
Compute the number of remaining edges in the graph.
In [ ]:
n_edges = (laplacian.nnz - laplacian.shape[0]) // 2
print('{} edges out of {} x {} = {} ({:.2%})'.format(n_edges, *weights.shape, weights.size, n_edges/weights.size))
max_edges = NEIGHBORS * len(weights)
print('Our kNN construction allows between {:d} and {:d} edges'.format(max_edges//2, max_edges))
The whole point of spectral graph theory is to compute the eigendecomposition of the Laplacian. We however don't need the full eigendecomposition (a.k.a. the graph Fourier basis) here. Compute the 10 eigenvectors with the smallest eigenvalues with one of the following functions: np.linalg.eig, np.linalg.eigh, sparse.linalg.eigs, sparse.linalg.eigsh. Justify your choice.
Your answer here. The matrix is Hermitian (real symmetric) and sparse, so we can use eigsh. As we only need a few eigenvectors, the implicitly restarted Lanczos method (a variant of the power method) is efficient.
To find the 10 smallest eigenvalues, we can pass which='SM' (smallest in magnitude) to the method. As the Laplacian's eigenvalues are non-negative, which='SA' (smallest algebraic) should give the same answers, albeit the underlying algorithm is often slower.
The alternative is to use ARPACK's shift-invert mode, which allows quick determination of interior eigenvalues (those not at the extremes of the spectrum). It involves transforming the eigenvalue problem into an equivalent problem with different eigenvalues. In this case, we hope to find eigenvalues near zero, so we can set sigma=0.
While ARPACK is said to be ill-suited to finding small eigenvalues, I found that the shift-invert mode was actually slower in our case. (You can measure execution time with IPython's magic command %timeit; see the sketch below.)
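A minimal timing sketch (results will vary with your machine and scipy version):
In [ ]:
# Compare the Lanczos iteration on the smallest-magnitude eigenvalues with ARPACK's
# shift-invert mode around sigma=0. Illustration only.
%timeit sparse.linalg.eigsh(laplacian, k=10, which='SM')
%timeit sparse.linalg.eigsh(laplacian, k=10, sigma=0)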
In [ ]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')
# Alternative (actually slower):
# eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, sigma=0)
# That's much slower:
# eigenvalues, eigenvectors = np.linalg.eigh(laplacian.toarray())
In [ ]:
plt.plot(eigenvalues, '.-', markersize=15);
Why can we diagonalize the graph Laplacian in the first place? Your answer here. Because the Laplacian is a real symmetric matrix by construction. See the spectral theorem.
Why are all the eigenvalues real? Your answer here. The spectral theorem, again.
Why are all the eigenvalues non-negative? Your answer here. Because the Laplacian is a positive semi-definite matrix (PSD). We can indeed write $$ \newcommand{\x}{\mathbf{x}} \newcommand{\L}{\mathbf{L}} \newcommand{\W}{\mathbf{W}} \x^\intercal \L \x = \sum_{i \sim j} \W_{ij} (x_i - x_j)^2 \geq 0. $$
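As a numerical illustration (not part of the original notebook), the quadratic form evaluated on a random vector should be non-negative:
In [ ]:
# Evaluate x^T L x for a random vector x; the result should be >= 0 up to numerical precision.
x = np.random.normal(size=laplacian.shape[0])
print(x @ (laplacian @ x))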
In [ ]:
print(eigenvalues)
Is the graph connected? Your answer here. The graph is connected because only one eigenvalue equals zero (the eigenvalue zero has a multiplicity of one). While the kNN construction ensures that each node is connected to at least $k$ other nodes, it does not force the graph to be connected. If our Rock and Hip-Hop songs formed very distinct clusters, the graph might not be connected. Connectivity can only be guaranteed if $\text{NEIGHBORS} > \frac{|G|}{2}$ (and such a graph would not be sparse at all).
Note that the number of connected components is the dimension of the nullspace of $L$, and that the corresponding eigenvectors form a basis of this nullspace. These eigenvectors are indicator vectors of the connected components. Clustering is straightforward if the clusters are disconnected!
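As a cross-check (illustration only), scipy.sparse.csgraph can count the connected components directly from the weight matrix; we expect a single component, consistent with the eigenvalues above:
In [ ]:
from scipy.sparse import csgraph
# Count connected components of the undirected similarity graph.
n_components, _ = csgraph.connected_components(sparse.csr_matrix(weights), directed=False)
print('{} connected component(s)'.format(n_components))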
In [ ]:
np.sum(laplacian @ (2 * eigenvectors[:, 0]))
Your answer here. We expect zero, because the first eigenvalue is zero: the first eigenvector lies in the nullspace of the Laplacian. The small residual is due to numerical precision (finite representation of real numbers and accuracy of the eigendecomposition).
Finally, let's use the data and graph we prepared. When exploring data, it's often useful to visualize an entire dataset. Because for us humans it's hard to look at data in 140 dimensions, we need to somehow reduce the dimensionality to 2 or 3 and visualize the data in this more familiar space. While such a reduction will obviously be destructive, many algorithms have been developed to preserve certain properties.
PCA is a standard algorithm to reduce the dimensionality of a dataset. It computes the axes of principal variance and projects the data onto them. It only uses the features we computed, not the graph. We show it here for comparison. (We use scikit-learn, a very convenient library for machine learning.)
In [ ]:
features_pca = decomposition.PCA(n_components=2).fit_transform(features)
genres = preprocessing.LabelEncoder().fit_transform(tracks['genre'])
plt.scatter(features_pca[:, 0], features_pca[:, 1], c=genres, cmap='RdBu', alpha=0.5);
Instead of using the features directly, we can try to visualize our similarity graph. That graph is initially embedded in an ambient space of 140 dimensions (the number of features, that is the number of MFCCs times the number of summary statistics), i.e. each node has a position in a 140-dimensional Euclidean space. Because we cannot visualize such a high-dimensional space, we want to embed the graph in a 2D space.
One way to embed a graph is to use the values of the eigenvectors as coordinates. For the below plot, use the value of the second eigenvector as the x coordinate of a node, and the value of the third eigenvector as the y coordinate. The color indicates the genre.
Why don't we use the first eigenvector? Your answer here. Because it is not informative w.r.t. genres. The first eigenvector is proportional to $\mathbf{1}$ for the combinatorial Laplacian and to $\mathbf{D}^{1/2} \mathbf{1}$ for the normalized Laplacian, where $\mathbf{1}$ is the vector of all ones and $\mathbf{D}$ the diagonal degree matrix. As such, it is either constant or merely an indication of the degree.
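A quick check (illustration only, assuming the normalized Laplacian computed above was kept) that the first eigenvector is indeed proportional to $\mathbf{D}^{1/2} \mathbf{1}$:
In [ ]:
# The absolute cosine similarity between the first eigenvector and sqrt(degrees)
# should be close to 1 (the sign of an eigenvector is arbitrary).
u = eigenvectors[:, 0] / np.linalg.norm(eigenvectors[:, 0])
v = np.sqrt(degrees) / np.linalg.norm(np.sqrt(degrees))
print(np.abs(u @ v))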
In [ ]:
x = eigenvectors[:, 1]
y = eigenvectors[:, 2]
plt.scatter(x, y, c=genres, cmap='RdBu', alpha=0.5);
See how well this plot summarizes 2GB of data and 2000 tracks! We could now design a playlist generator as a random walk on this similarity graph and visualize its trajectory as it hops from track to track. :)
Note how we did not try to build a machine to recognize the musical genre given a track (that would have been a classification problem). We merely tried to visualize the data, by means of PCA and a graph embedding algorithm. What does it tell us that the genre clearly appears in our visualization?
Your answer here. First, it means that the data in our high-dimensional feature space lies on an intrinsically low-dimensional manifold, such that we can meaningfully visualize it in 2 dimensions. At a higher level, it can tell us that genres are useful to categorize songs and organize music collections. Or that the chosen features are (overly?) sensitive to musical genre. Or that I overfitted them to genre recognition while designing the assignment. Beware the pitfalls!
As such, we can try to cluster the tracks with the Fiedler vector, and see whether the (unsupervised) clustering agrees with the ground-truth genre categorization. Reproduce the above scatter plot below, but with the sign of the Fiedler vector as the color instead of the genre.
In [ ]:
labels = (eigenvectors[:, 1] > 0)
plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);
In [ ]:
err = np.count_nonzero(labels - genres)
err = err if err < len(labels) / 2 else len(labels) - err
print('{} errors ({:.2%})'.format(err, err/len(labels)))
Tune some parameters (e.g. kernel_width, NEIGHBORS), discard some features (section 2.4), or change the distance metric (section 3.1) to get fewer errors. You should get an error rate below 15% (i.e. less than 300 errors in total). Try to understand the effect of each parameter. After data cleaning, parameter tuning is the other dirty work of a data scientist! Be aware that tuning the parameters on a specific dataset will lead to overfitting. A minimal tuning loop is sketched below.
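A minimal sketch of such a tuning loop (illustration only, not part of the original notebook; it reuses the distances and genres variables computed above and re-runs the graph construction, normalized Laplacian, and Fiedler-vector clustering for each setting):
In [ ]:
def error_rate(distances, genres, width_factor=1, neighbors=100):
    """Re-run the pipeline for a given kernel width factor and number of neighbors."""
    weights = np.exp(-distances**2 / (width_factor * distances.mean())**2)
    np.fill_diagonal(weights, 0)
    # Keep the `neighbors` strongest edges per node (symmetrized).
    partition = np.argpartition(weights, -neighbors, axis=1)
    maximum = weights[np.arange(weights.shape[0]), partition[:, -neighbors]]
    mask = weights < maximum
    weights[mask & mask.T] = 0
    # Normalized Laplacian and Fiedler vector.
    degrees = weights.sum(0)
    deg_inv = np.diag(1 / np.sqrt(degrees))
    laplacian = sparse.csr_matrix(deg_inv @ (np.diag(degrees) - weights) @ deg_inv)
    _, eigenvectors = sparse.linalg.eigsh(laplacian, k=2, which='SM')
    labels = eigenvectors[:, 1] > 0
    err = np.count_nonzero(labels - genres)
    err = min(err, len(labels) - err)
    return err / len(labels)

for neighbors in [50, 100, 200]:
    print('{} neighbors: {:.2%} errors'.format(neighbors, error_rate(distances, genres, neighbors=neighbors)))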
Among other things, this assignment showed us that a graph can be useful for e.g. visualization or clustering, even when there is no natural graph in the original data. One of the design goals of this assignment, while dealing with real data, was to follow the complete Data Science process, from data acquisition to interpretation of the results. The exploitation of the data showed us that a machine can discern musical genres by looking at pairwise distances between spectral features extracted from audio recordings.
What is the name of the technique we used to visualize the data in the last two plots? What does it try to preserve when reducing the dimensionality (of the ambient space) from 140 to 2?
Your answer here. The technique is named Laplacian eigenmaps and was introduced by Belkin and Niyogi in their paper Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. The algorithm tries to preserve the similarity between data samples as follows: $$ \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\Y}{\mathbf{Y}} \newcommand{\L}{\mathbf{L}} \newcommand{\D}{\mathbf{D}} \newcommand{\I}{\mathbf{I}} \arg\min_{\y_1,\ldots,\y_N}\sum_{i\sim j}w_{ij}\|\y_i - \y_j\|_2^2 = \arg\min_{\Y \in \mathbb{R}^{N \times P}} \operatorname{tr}(\Y^\intercal \L \Y) \text{ , s.t. } \Y^\intercal \Y = \I, $$ where $\y_i$ is the coordinate vector of the $i$-th data point in the $P$-dimensional space. The objective requires $\y_i$ and $\y_j$ to be close together if the corresponding data points $\x_i$ and $\x_j$ are strongly connected, i.e. if their similarity $w_{ij}$ is large. It turns out, as seen in class, that the solution to this problem is given by the eigenvectors of the graph Laplacian.
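To make the preservation objective concrete, the sketch below (illustration only, not part of the original notebook) evaluates $\operatorname{tr}(\mathbf{Y}^\intercal \mathbf{L} \mathbf{Y})$ on the 2D spectral embedding used above and on a random embedding with unit-norm columns; the spectral embedding should achieve a much smaller value.
In [ ]:
# Compare the smoothness objective of the spectral embedding with a random one.
L = laplacian.toarray()
Y_spectral = eigenvectors[:, 1:3]
Y_random = np.random.normal(size=Y_spectral.shape)
Y_random /= np.linalg.norm(Y_random, axis=0)
print('spectral embedding: {:.4f}'.format(np.trace(Y_spectral.T @ L @ Y_spectral)))
print('random embedding:   {:.4f}'.format(np.trace(Y_random.T @ L @ Y_random)))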