Python 3.6 Jupyter Notebook

Finding connected components using clustering


Note that this notebook contains advanced exercises applicable only to students who wish to deepen their understanding and qualify for bonus marks on the course. You will be able to achieve 100% for this notebook by only completing Exercise 1. Optional, additional exercises can be completed to qualify for bonus marks.

Your completion of the notebook exercises will be graded based on your ability to do the following:

Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?

Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?

Analyze: Are you able to pick the relevant method or library to resolve specific stated questions?

Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?

Notebook objectives

By the end of this notebook, you will be expected to:

  • Find connected components in networks (using the techniques of hierarchical clustering, modularity maximization, and spectral graph partitioning); and
  • Interpret clustering results.

List of exercises

  • Exercise 1: Understanding hierarchical clustering.
  • Exercise 2 [Advanced]: Interpreting the results of hierarchical clustering.
  • Exercise 3 [Advanced]: Summarizing clustering results based on modularity maximization and spectral graph partitioning.

Notebook introduction

Community detection is an important task in social network analysis. The idea behind it is to identify groups of people that share a common interest, based on the assumption that these people tend to link to each other more than to the rest of the network. Specifically, real-world networks exhibit clustering behavior that can be observed in the graph representation of these networks by the formation of clusters or partitions. These groups of nodes on a graph (clusters) correspond to communities that share common properties, or have a common role in the system under study.

Intuitively, it is expected that such clusters are associated with a high concentration of nodes. In the following examples, you will explore the identification of these clusters using the following approaches, as discussed in the video content:

  • Hierarchical clustering (using a distance matrix)
  • The Louvain Algorithm (using modularity maximization)
  • Spectral graph partitioning

Import required modules


In [ ]:
import networkx as nx
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pylab as plt
from networkx.drawing.nx_agraph import graphviz_layout

from collections import defaultdict, Counter
import operator

## For hierarchical clustering.
from scipy.cluster import hierarchy
from scipy.spatial import distance

## For spectral graph partitioning.
from sklearn.cluster import spectral_clustering as spc

## For Community Detection (Louvain Method).
import community


import sys
sys.path.append('..')
from utils import draw_partitioned_graph
from utils import fancy_dendrogram

plt.rcParams['figure.figsize'] = (15, 9)
plt.rcParams['axes.titlesize'] = 'large'

1. Data preparation

You are going to read the graph from an adjacency matrix saved in earlier exercises.


In [ ]:
call_adjmatrix = pd.read_csv('./call.adjmatrix', index_col=0)
call_graph     = nx.from_numpy_matrix(call_adjmatrix.as_matrix())

In [ ]:
# Display call graph object.
plt.figure(figsize=(10,10))
plt.axis('off')

pos = graphviz_layout(call_graph, prog='dot')
nx.draw_networkx(call_graph, pos=pos, node_color='#11DD11', with_labels=False)
_ = plt.axis('off')

2. Hierarchical clustering

This notebook makes use of a hierarchical clustering algorithm, as implemented in Scipy. The following example uses the average distance measure. Since the graph is weighted, you can also use the single linkage inter-cluster distance measure (see exercises).


In [ ]:
def create_hc(G, linkage='average'):
    """
    Creates a hierarchical clustering of graph G from its shortest-path distance matrix.
    """
    # Build the all-pairs shortest-path distance matrix.
    path_length = dict(nx.all_pairs_shortest_path_length(G))
    nodes = list(G.nodes())
    index = {node: i for i, node in enumerate(nodes)}
    distances = np.zeros((G.order(), G.order()))

    for u, p in path_length.items():
        for v, d in p.items():
            distances[index[u]][index[v]] = d
            distances[index[v]][index[u]] = d

    # Convert the square distance matrix to condensed form, then create the
    # hierarchical cluster (HC) using the requested linkage.
    Y = distance.squareform(distances)
    if linkage == 'max':
        # Creates HC using farthest-point (complete) linkage.
        Z = hierarchy.complete(Y)
    elif linkage == 'single':
        # Creates HC using closest-point linkage.
        Z = hierarchy.single(Y)
    else:
        # Creates HC using average linkage (the default).
        Z = hierarchy.average(Y)

    return Z

def get_cluster_membership(Z, maxclust):
    '''
    Assigns cluster membership by specifying the maximum number of clusters.
    '''
    hc_out = list(hierarchy.fcluster(Z, maxclust, criterion='maxclust'))

    # Map each observation index to its cluster label.
    cluster_membership = {i: label for i, label in enumerate(hc_out)}

    return cluster_membership

Below is a demonstration of hierarchical clustering when applied to the call graph.


In [ ]:
# Perform hierarchical clustering using 'average' linkage. 
Z = create_hc(call_graph, linkage='average')

The dendrogram corresponding to the partitioned graph is obtained as follows:


In [ ]:
hierarchy.dendrogram(Z)
plt.show()

You will notice that the full dendrogram is unwieldy, and difficult to read or use. Fortunately, the dendrogram method has a truncation feature that allows you to show only the last $p$ merged clusters, where $p$ is the number of merged clusters you want displayed.


In [ ]:
plt.title('Hierarchical Clustering Dendrogram (pruned)')
plt.xlabel('sample index (or leaf size)')
plt.ylabel('distance')
hierarchy.dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=10,                   # show only the last p merged clusters
    show_leaf_counts=True,  # numbers in brackets are counts for each leaf
    leaf_rotation=90,
    leaf_font_size=12)
plt.show()

This dendrogram can help explain what happens as a result of the agglomerative method of hierarchical clustering. Starting at the bottom-most level, each node is assigned its own cluster. The closest pair of nodes (according to a distance function) are then merged into a new cluster. The distance matrix is recomputed, treating the merged cluster as an individual node. This process is repeated until the entire network has been merged into a single, large cluster, which the top level in the dendrogram above represents. You can now understand why this method is agglomerative.

The linkage function is used to determine the distance between a cluster and a node, or between two clusters, using the following possibilities:

  • Single: Merge two clusters with the smallest minimum pairwise distance.
  • Average: Merge two clusters with the smallest average pairwise distance.
  • Maximum or complete: Merge the two clusters with the smallest maximum pairwise distance.
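
To make the effect of the linkage choice concrete, the short sketch below applies the three criteria to a small, hand-crafted distance matrix (a toy example, not part of the call data) so that you can compare the resulting linkage matrices.


In [ ]:
# Toy example: compare linkage criteria on a small, hand-crafted distance matrix.
# The distances below are illustrative only.
toy_distances = np.array([[0., 1., 4., 5.],
                          [1., 0., 3., 6.],
                          [4., 3., 0., 2.],
                          [5., 6., 2., 0.]])

# Convert the square matrix to the condensed form expected by scipy.
toy_condensed = distance.squareform(toy_distances)

for toy_method in ['single', 'average', 'complete']:
    # Each row of the linkage matrix records one merge: the two clusters merged,
    # the distance at which they were merged, and the size of the new cluster.
    print(toy_method)
    print(hierarchy.linkage(toy_condensed, method=toy_method), '\n')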

Now, you can finally retrieve the clusters, based on the analysis of the dendrogram. In this post-processing, there are different ways of determining $k$, the number of clusters to partition the data into. Scipy's hierarchical flat clustering function - "hierarchy.fcluster()" - is used to assign cluster membership by specifying a distance threshold, or the number of clusters required. In the function definition (above), you have been provided with a utility function, "get_cluster_membership()", which does the latter.
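
For completeness, the short sketch below shows the former option: cutting the dendrogram at a distance threshold (an illustrative value) instead of requesting a fixed number of clusters.


In [ ]:
# Assign cluster membership by cutting the dendrogram at a distance threshold
# (illustrative value), rather than requesting a fixed number of clusters.
labels_by_distance = hierarchy.fcluster(Z, t=3.5, criterion='distance')
print('Number of clusters at this threshold:', len(set(labels_by_distance)))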

Selecting the number of clusters $k$ is, in general, an ill-posed problem. Different interpretations are possible, depending on the nature of the problem, the scale of the distribution of points in the data set, and the required clustering resolution. In agglomerative clustering, as used in the example above, you can obtain zero error for the objective function by treating each data point as its own cluster. Hence, the selection of $k$ invariably involves a trade-off between maximum compression of the data (using a single cluster) and maximum accuracy (assigning each data point to its own cluster). The selection of an optimal $k$ can be done using automated techniques or manually.
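
As an example of an automated technique, the sketch below uses one simple heuristic (an illustration, not a method prescribed in this notebook): it looks for the largest gap between successive merge distances in the linkage matrix, and infers the number of clusters just before that gap.


In [ ]:
# Simple automated heuristic: find the largest gap between successive merge
# distances in the linkage matrix Z, and infer a cluster count from it.
merge_distances = Z[:, 2]            # the third column of Z holds merge distances
gaps = np.diff(merge_distances)      # gaps between successive merges
largest_gap_index = np.argmax(gaps)  # merge step preceding the largest gap
suggested_k = len(merge_distances) - largest_gap_index
print('Suggested number of clusters:', suggested_k)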

Here, the identification of an appropriate number of clusters is done manually, which has the advantage of providing some insight into your data, as well as an opportunity to perform sanity checks. To select the number of clusters, look for a large shift in the distance metric. In the dendrogram plots shown above, suppose a case has been made for an ideal cutoff of 3.5. The number of clusters is then simply the number of intersections of a horizontal line (at a height of 3.5) with the vertical lines of the dendrogram. In this case, 3 clusters would be obtained, as shown below.


In [ ]:
fancy_dendrogram( Z, truncate_mode='lastp', p=12, leaf_rotation=90.,
    leaf_font_size=12.0,
    show_contracted=False,
    annotate_above=10,
    max_d=3.5)
plt.show()

In [ ]:
opt_clust = 3
opt_clust

You can now assign the data to these "opt_clust" clusters.


In [ ]:
cluster_assignments = get_cluster_membership(Z, maxclust=opt_clust)

The partitioned graph, corresponding to the dendrogram above, can now be visualized.


In [ ]:
clust = list(set(cluster_assignments.values()))
clust

In [ ]:
cluster_centers = sorted(set(cluster_assignments.values()))
freq = [list(cluster_assignments.values()).count(x) for x in cluster_centers]

In [ ]:
# Create a DataFrame object containing the list of cluster centers and the number of objects in each cluster.
df = pd.DataFrame({'cluster_centres':cluster_centers, 'number_of_objects':freq})
df.head(10)
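
The sketch below colours the nodes of the call graph by their cluster label. It assumes that the keys of "cluster_assignments" (0, 1, 2, ...) follow the same ordering as the nodes of "call_graph", which holds here because the graph was built directly from the adjacency matrix.


In [ ]:
# Colour each node of the call graph according to its hierarchical cluster.
# Assumes the keys of cluster_assignments follow the node order of call_graph.
node_colors = [cluster_assignments[node] for node in call_graph.nodes()]

plt.figure(figsize=(10, 10))
pos = graphviz_layout(call_graph, prog='dot')
nx.draw_networkx_nodes(call_graph, pos, cmap=plt.cm.RdYlBu, node_color=node_colors)
nx.draw_networkx_edges(call_graph, pos, alpha=0.5)
plt.axis('off')
plt.show()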


Exercise 1 Start.

Instructions

  1. How many clusters are obtained after the final step of a generic agglomerative clustering algorithm (before post-processing)?

    Note: Post-processing involves determining the optimal clusters for the problem at hand.

  2. Based on your answer above, would you consider agglomerative clustering a top-down approach, or a bottom-up approach?
  3. Which of the three linkage functions (i.e. single, average, or maximum or complete) do you think is likely to be most sensitive to outliers?

    Hint: Look at this single-link and complete-link clustering resource.

Your markdown answer here.


Exercise 1 End.

Exercise complete:
This is a good time to "Save and Checkpoint".

Exercise 2 [Advanced] Start.

Instructions

In this exercise, you will investigate the structural properties of the clusters generated above.

  1. Assign the values from your "cluster_assignments" to a Pandas DataFrame named "df1", with the column name "cluster_label".

    Hint: The variable "cluster_assignments" is of type dict. You will need to get the values component of this dict, not the keys.

  2. Add a field called "participantID" to "df1", and assign to this the index values from the previously-loaded "call_adjmatrix" DataFrame.
  3. Load the DataFrame containing the centrality measures that you saved in Notebook 1 of this module, into "df2".
  4. Perform an inner join by merging "df1" and "df2" on the field "participantID". Assign the result of this join to variable "df3".
  5. Perform a groupby on "df3" (using "cluster_label" field), and then evaluate the mean of the four centrality measures (using the "agg()" method). Assign the aggregation result to "df4".
  6. Review "df4", and plot its barplot.
  7. Merge clusters that share the same mean values for all of the centrality measures into a single cluster (see the note below). Assign the smallest of the cluster labels in the set to the merged cluster.

    Note:
    Combine clusters such that, given a cluster with centrality measures $[x_1, x_2, x_3, x_4]$, and another cluster with centrality measures $[z_1, z_2, z_3, z_4]$, the following holds true:
    $x_1 = z_1$
    $x_2 = z_2$
    $x_3 = z_3$
    $x_4 = z_4$

  8. Print the size of each cluster, in descending order, after performing the cluster merging in the preceding step.

In [ ]:
# Your code here.


Exercise 2 [Advanced] End.

Exercise complete:
This is a good time to "Save and Checkpoint".

3. Community detection

Community detection is an important component in the analysis of large and complex networks. Identifying these subgraph structures helps in understanding organizational and functional characteristics of the underlying physical networks. In this section, you will study a few approaches that are widely used in community detection using graph representations.

3.1 The Louvain modularity-maximization approach

The Louvain method is one of the most widely-used methods for detecting communities in large networks. It was developed by a team of researchers at the Université catholique de Louvain. The method can unveil hierarchies of communities, and allows you to zoom within communities in order to discover sub-communities, sub-sub-communities, and so forth. The modularity $Q$ quantifies how good a "community", or partition, is, and is defined as follows:

$$Q =\frac{1}{2m}\sum_{ij} \left[ A_{ij}-\frac{k_ik_j}{2m} \right] \delta(c_i, c_j)$$

where $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, $k_i$ is the sum of the weights of the edges attached to node $i$, $m$ is the total edge weight in the network, and $\delta(c_i, c_j)$ equals 1 if nodes $i$ and $j$ belong to the same community, and 0 otherwise. The higher the $Q$ of a partition is, the better the partition is.

The Louvain method is a greedy optimization method that attempts to optimize the "modularity" of a partition of the network via two steps:

  1. Locally optimize the modularity to identify "small" communities.
  2. Aggregate nodes belonging to the same community, and create a new network with aggregated nodes as individual nodes.

Steps 1 and 2 are repeated iteratively until a maximum of modularity is reached, producing a hierarchy of communities.
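
The "community" module imported at the top of this notebook (the python-louvain package) implements this method. Below is a minimal sketch, assuming that package's "best_partition()" and "modularity()" functions, applied to the call graph.


In [ ]:
# Detect communities in the call graph using the Louvain method,
# then report how many communities were found and the modularity of the partition.
louvain_partition = community.best_partition(call_graph)
print('Number of communities found:', len(set(louvain_partition.values())))
print('Modularity of the partition:', community.modularity(louvain_partition, call_graph))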

3.2 Spectral graph partitioning

Spectral graph partitioning and clustering is based on the spectrum (the eigenvalues and associated eigenvectors) of the Laplacian matrix that corresponds to a given graph. The approach is mathematically complex but, in essence, involves performing a $k$-means clustering on a spectral projection of the graph (using the adjacency matrix as the affinity), with $k=2$ in the simplest two-way case.
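
To make the idea concrete, the sketch below illustrates the simplest two-way case on the call graph (a general illustration of the technique, not the Scikit-Learn implementation used further down): it computes the graph Laplacian, extracts the eigenvector associated with the second-smallest eigenvalue (the Fiedler vector), and splits the nodes on its sign.


In [ ]:
# Illustrative two-way spectral partition of the call graph:
# split nodes on the sign of the Fiedler vector of the graph Laplacian.
laplacian = np.asarray(nx.laplacian_matrix(call_graph).todense(), dtype=float)
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # eigh: the Laplacian is symmetric
fiedler_vector = eigenvectors[:, 1]                    # eigenvector of the second-smallest eigenvalue
two_way_partition = (fiedler_vector >= 0).astype(int)  # split nodes on the sign
print('Nodes in each part:', Counter(two_way_partition))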

Optional: You can read more about spectral graph processing.

Now, apply spectral graph partitioning to your call graph, and visualize the resulting community structure. You can read more about Scikit-Learn, and the Spectral Clustering function utilized in this section. Spectral graph partitioning needs input in the form of the number of clusters sought (default setting is 8). There are various approaches one can take to optimize the final number of clusters, depending on problem domain knowledge. Below you will use a value of $k=9$.


In [ ]:
# Create the spectral partition using the spectral clustering function from Scikit-Learn.
spectral_partition = spc(call_adjmatrix.as_matrix(), 9, assign_labels='discretize')

In [ ]:
pos = graphviz_layout(call_graph, prog='dot')
nx.draw_networkx_nodes(call_graph, pos, cmap=plt.cm.RdYlBu, node_color=spectral_partition)
nx.draw_networkx_edges(call_graph, pos, alpha=0.5)
plt.axis('off')
plt.show()

In [ ]:
print(spectral_partition)


Exercise 3 [Advanced] Start.

Instructions

Compute the size of each of the clusters obtained using the spectral graph partitioning method.


In [ ]:
# Your code here.


Exercise 3 [Advanced] End.

Exercise complete:
This is a good time to "Save and Checkpoint".

4. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.