A Network Tour of Data Science

      Xavier Bresson, Winter 2016/17

Exercise 4 - Code 1: Graph Science

Construct Network of Text Documents


In [ ]:
# Load libraries

# Math
import numpy as np

# Visualization 
%matplotlib notebook 
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import ndimage

# High-res visualization (but no rotation possible)
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png2x','pdf')

# Print output of LFR code
import subprocess

# Sparse matrix
import scipy.sparse
import scipy.sparse.linalg

# 3D visualization
import pylab
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot

# Import data
import scipy.io

# Import functions in lib folder
import sys
sys.path.insert(1, 'lib')

# Import helper functions
%load_ext autoreload
%autoreload 2
from lib.utils import compute_ncut
from lib.utils import reindex_W_with_classes
from lib.utils import nldr_visualization
from lib.utils import construct_knn_graph
from lib.utils import compute_purity

# Import distance function
import sklearn.metrics.pairwise

# Remove warnings
import warnings
warnings.filterwarnings("ignore")

In [ ]:
# Load 5 classes of text documents (20 Newsgroups subset)
mat = scipy.io.loadmat('datasets/20news_5classes_raw_data.mat')
X = mat['X']
n = X.shape[0]
d = X.shape[1]
Cgt = mat['Cgt'] - 1; Cgt = Cgt.squeeze()
nc = len(np.unique(Cgt))
print('Number of data =',n)
print('Data dimensionality =',d)
print('Number of classes =',nc)

Question 1a: Compute the k-NN graph (k=10) with the L2/Euclidean distance.
Hint: You may use the function W=construct_knn_graph(X,k,'euclidean').
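
The helper construct_knn_graph is provided in the lib folder and its implementation is not shown here. Purely for reference, a minimal sketch of one common way to build a symmetric k-NN graph with scikit-learn is given below; the symmetrization rule and the binary edge weights are assumptions, not necessarily what the helper does.

In [ ]:
# Sketch (assumption): symmetric, unweighted k-NN graph built with scikit-learn
import scipy.sparse
from sklearn.neighbors import kneighbors_graph

def knn_graph_sketch(X, k=10, metric='euclidean'):
    # directed k-NN adjacency: row i has ones at the k nearest neighbors of x_i
    A = kneighbors_graph(X, n_neighbors=k, metric=metric, mode='connectivity')
    # symmetrize: keep an edge if either endpoint selects the other
    W = A.maximum(A.T)
    return scipy.sparse.csr_matrix(W)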


In [ ]:
# Your code here
W_euclidean = construct_knn_graph(X,10,'euclidean')
print('size of W=',W_euclidean.shape)
print('number of edges: {}'.format(W_euclidean.nnz))

Question 1b: Plot the adjacency matrix of the graph.
Hint: Use the function plt.spy(W, markersize=1).


In [ ]:
# Your code here
plt.figure(1)
plt.spy(W_euclidean, markersize=1)
plt.title('Adjacency Matrix W')
plt.show()

Question 1c: Reindex the adjacency matrix of the graph w.r.t. ground truth communities. Plot the reindexed adjacency matrix of the graph.
Hint: You may use the function [W_reindex,C_classes_reindex]=reindex_W_with_classes(W,C_classes).
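
The helper reindex_W_with_classes is also provided in the lib folder. Conceptually, it permutes the rows and columns of W so that vertices of the same class become contiguous, which makes the community block structure visible in the spy plot. A minimal sketch of that idea (the helper itself may differ in details):

In [ ]:
# Sketch (assumption): permute W so that vertices of the same class are contiguous
import numpy as np

def reindex_sketch(W, C):
    idx = np.argsort(C)               # group vertex indices by class label
    W_reindexed = W[idx, :][:, idx]   # apply the same permutation to rows and columns
    return W_reindexed, C[idx]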


In [ ]:
# Your code here
[reindexed_W_gt,reindexed_C_gt] = reindex_W_with_classes(W_euclidean,Cgt)

In [ ]:
# Your code here
plt.figure(2)
plt.spy(reindexed_W_gt,markersize=1)
plt.title('Adjacency Matrix W indexed according to GROUND TRUTH communities')
plt.show()

Question 1d: Perform graph clustering with the NCut technique. What is the clustering accuracy of the NCut solution? What is the clustering accuracy of a random partition? Reindex the adjacency matrix of the graph w.r.t. the NCut communities and plot the reindexed adjacency matrix.
Hint: You may use the function C_ncut, accuracy = compute_ncut(W,C_solution,n_clusters) that performs NCut clustering.
Hint: You may use the function accuracy = compute_purity(C_computed,C_solution,n_clusters) that returns the accuracy of a computed partition w.r.t. the ground-truth partition. A random partition can be generated with the function np.random.randint.
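
Both helpers are provided in the lib folder. NCut is typically solved through its spectral relaxation (leading eigenvectors of the normalized graph Laplacian followed by k-means), while purity measures how well each computed cluster matches a single ground-truth class. Below is a minimal sketch of the purity metric, under the assumption that it is the fraction of vertices falling in the majority ground-truth class of their cluster (the percentage scaling is also an assumption):

In [ ]:
# Sketch (assumption): clustering purity w.r.t. a ground-truth partition
import numpy as np

def purity_sketch(C_computed, C_gt, n_clusters):
    C_computed = np.asarray(C_computed).squeeze()
    C_gt = np.asarray(C_gt).squeeze().astype(int)
    correct = 0
    for k in range(n_clusters):
        members = C_gt[C_computed == k]            # ground-truth labels inside cluster k
        if members.size > 0:
            correct += np.bincount(members).max()  # size of the majority class
    return 100.0 * correct / C_gt.size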


In [ ]:
# Your code here
Cncut,acc = compute_ncut(W_euclidean, Cgt, nc)
print('accuracy NCut=',acc)

Crandom = np.random.randint(nc,size=(n,1))
acc = compute_purity(Crandom,Cgt,nc)
print('accuracy Random=',acc)

[reindexed_W_ncut,reindexed_C_ncut] = reindex_W_with_classes(W_euclidean,Cncut)

plt.figure(3)
plt.spy(reindexed_W_ncut, markersize=1)
plt.title('Adjacency Matrix W indexed according to NCut communities')
plt.show()

Question 2a: Compute the k-NN graph (k=10) with the Cosine distance.
Answer questions 1b-1d for this graph.
Hint: You may use the function W=construct_knn_graph(X,10,'cosine').
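
Recall that the cosine distance between two documents equals half the squared Euclidean distance between their L2-normalized feature vectors, so a cosine k-NN graph is a Euclidean k-NN graph built on row-normalized data. The short check below illustrates this identity; the scikit-learn helpers are used only for the demonstration and are not part of the exercise.

In [ ]:
# Sketch: cosine distance = squared Euclidean distance / 2 on L2-normalized rows
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

Xs = np.random.rand(5, 20)                  # small dense example
Xn = normalize(Xs, norm='l2', axis=1)       # unit-length rows
D_cos = cosine_distances(Xs)                # 1 - cosine similarity
D_euc = euclidean_distances(Xn)**2 / 2.0    # squared Euclidean on normalized rows
print(np.allclose(D_cos, D_euc))            # True (up to numerical precision)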


In [ ]:
# Reload data matrix
X = mat['X']

In [ ]:
# Your code here
# Compute the k-NN graph with Cosine distance
W_cosine = construct_knn_graph(X,10,'cosine')
print('size of W=',W_cosine.shape)
print('number of edges: {}'.format(W_cosine.nnz))

In [ ]:
# Your code here
plt.figure(11)
plt.spy(W_cosine,precision=0.01, markersize=1)
plt.title('Adjacency Matrix W')
plt.show()

In [ ]:
# Your code here
[reindexed_W_gt,reindexed_C_gt] = reindex_W_with_classes(W_cosine,Cgt)

plt.figure(12)
plt.spy(reindexed_W_gt,precision=0.01, markersize=1)
plt.title('Adjacency Matrix W indexed according to GROUND TRUTH communities')
plt.show()

In [ ]:
# Your code here
Cncut,acc = compute_ncut(W_cosine, Cgt, nc)
print('accuracy NCut=',acc)

Crandom = np.random.randint(nc,size=(n,1))
acc = compute_purity(Crandom,Cgt,nc)
print('accuracy Random=',acc)

[reindexed_W_ncut,reindexed_C_ncut] = reindex_W_with_classes(W_cosine,Cncut)

plt.figure(13)
plt.spy(reindexed_W_ncut,precision=0.01, markersize=1)
plt.title('Adjacency Matrix W indexed according to NCut communities')
plt.show()

Question 2b: Visualize the graph in 2D and 3D with a non-linear dimensionality reduction technique.
Hint: You may use the function [X,Y,Z] = nldr_visualization(W).
Hint: You may use the function plt.scatter(X,Y,c=Cncut) for the 2D visualization and ax.scatter(X,Y,Z,c=Cncut) for the 3D visualization.
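
The helper nldr_visualization is provided in the lib folder. One standard non-linear embedding of a graph is Laplacian eigenmaps: take as coordinates the eigenvectors of the normalized graph Laplacian associated with the smallest non-zero eigenvalues. The sketch below follows that assumption; the course helper may use a different embedding.

In [ ]:
# Sketch (assumption): Laplacian-eigenmaps-style 3D embedding of a graph
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

def laplacian_eigenmaps_3d(W):
    W = scipy.sparse.csr_matrix(W).astype(float)
    deg = np.asarray(W.sum(axis=1)).squeeze()
    deg[deg == 0] = 1.0                                # guard against isolated vertices
    Dinv_sqrt = scipy.sparse.diags(1.0 / np.sqrt(deg))
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = scipy.sparse.identity(W.shape[0]) - Dinv_sqrt @ W @ Dinv_sqrt
    # four smallest eigenpairs; the first eigenvector is (near) constant and discarded
    lam, U = scipy.sparse.linalg.eigsh(L, k=4, which='SM')
    U = U[:, np.argsort(lam)]
    return U[:, 1], U[:, 2], U[:, 3]                   # x, y, z coordinates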


In [ ]:
# Your code here
[X,Y,Z] = nldr_visualization(W_cosine)

plt.figure(14)
size_vertex_plot = 10
plt.scatter(X, Y, s=size_vertex_plot*np.ones(n), c=Cncut)
plt.title('2D Visualization')
plt.show()

# 3D Visualization
fig = plt.figure(15)
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X, Y, Z, c=Cncut)
plt.title('3D Visualization')
plt.show()
