Python 3.6 Jupyter Notebook

Network analysis using NetworkX

This notebook includes an advanced exercise that is only applicable to students who wish to deepen their understanding and qualify for bonus marks on this course. You will be able to achieve 100% for this notebook by successfully completing Exercises 1, 2, 3, 5, 6, and 7. An optional, additional exercise (Exercise 4) can be completed to qualify for bonus marks.

Your completion of the notebook exercises will be graded based on your ability to do the following:

Understand: Do your pseudo-code and comments show evidence that you recall and understand technical concepts?

Apply: Are you able to execute code (using the supplied examples) that performs the required functionality on supplied or generated data sets?

Analyze: Are you able to pick the relevant method or library to resolve specific stated questions?

Evaluate: Are you able to interpret the results and justify your interpretation based on the observed data?

Notebook objectives

By the end of this notebook, you will be expected to:

  • Prepare a data set for graph analysis, using NetworkX;
  • Evaluate and compare structural properties of a graph object;
  • Interpret what information the structural properties provide in the physical world; and
  • Develop a basic understanding of small-world networks.

List of exercises:

Exercise 1: Compute the number of call interactions between a pair of nodes.

Exercise 2: Evaluate structure qualitatively in a graph based on visualization.

Exercise 3: Create a graph object using the SMS data set.

Exercise 4 [Advanced]: Compare the centrality structural properties evaluated on a graph.

Exercise 5: Describe the effect on the average clustering coefficient when nodes of lower degree are removed.

Exercise 6: List the two criteria of a small-world network.

Exercise 7: Identify small-world networks, given the values for the characteristic path length and clustering coefficient.

Notebook introduction

The use of phone logs to infer relationships between the volume of communication and other parameters has been an area of major research interest. In his seminal paper, the first application of phone logs, George Kingsley Zipf (1949) investigated the influence of distance on communication. Many studies have since followed. Big data is characterized by significant increases in structured and unstructured data, generated by mobile phones, that are sampled and captured at high velocity. Its emergence, together with the availability of computer processing technologies able to store and process these data sets efficiently, has made it possible to expand these studies and improve our understanding of human behavior at unprecedented resolution. Mobile phone data allows the inference of real social networks from call detail records, or CDRs (i.e., phone call, short message service (SMS), and multimedia message (MMS) communications). When these records are combined with GPS and WiFi data sets, browsing habits, application logs, and tower data, they reveal a superposition of several social actors.

According to Blondel et al. (2015):

The mobile nature of a mobile phone brings two advantages: first, the temporal patterns of communications [are] reflected in great detail due to the fact that the owner of the device usually carries the device with them and therefore the possibility of receiving the call exists in almost all cases, and second, the positioning data of a mobile phone allows tracking the displacements of its owner.

Unlike self-reported surveys – which are often subjective, limited to a very small subset of the population, and have been the only avenue used to gather data in the past – mobile phone CDRs contain information on verifiable communications between millions of people at a time. Further enrichment from geolocation data, which invariably is also collected alongside CDRs, as well as other external data that is available for the target segment (typically demographics), makes mobile phone CDRs an extremely rich and informative source of data for scientists and analysts.

These interactions via mobile phones can be represented by a large network where nodes represent individuals, and links are drawn between individuals that have had a phone call, or exchanged messages or other media.

The study of the structure of such networks provides useful insights into their organization, and can assist in improving communication infrastructure, understanding human behavior, traffic planning, and marketing initiatives, among others. According to Gautier Krings (2012), these applications are informed by the extraction and analysis of different kinds of information from large networks, including the following:

  1. Associating every node with geographical coordinates. This can facilitate the study of how geography influences the creation of links. More specifically, the intensity of communication between nodes decreases as a power of the geographical distance that separates them.

  2. Studying how links in networks change over time (i.e., dynamical networks). In these networks, new nodes enter or leave the network, and the strength of their connections rises and wanes during the observation period. Of particular interest is the influence of time scales on the emergence of different structural properties of dynamical networks.

  3. Detecting communities in networks. Communities are groups of nodes that are densely connected to each other.

Load libraries and set global parameters for Matplotlib


In [ ]:
# Load the relevant libraries to your notebook. 
import pandas as pd            # Processing csv files and manipulating the DataFrame.
import networkx as nx          # Graph-like object representation and manipulation module.
import matplotlib.pylab as plt # Plotting and data visualization module. 
                               # This is used for basic graph visualization.
import numpy as np       

from networkx.drawing.nx_agraph import graphviz_layout

import random

from IPython.display import Image, display

# Set global parameters for plotting. 
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

In [ ]:
def pydot(G):
    pdot = nx.drawing.nx_pydot.to_pydot(G)
    display(Image(pdot.create_png()))

1. Graph structures using NetworkX

In this notebook, you will continue working with the empirical dataset from the "Friends and Family" study used in Module 2.

1.1 Data preparation

As before, the first step is preparing the data for analysis. In the following, you will load the data into a DataFrame object, filter and retain the records of interest, and select the fields or data columns to use when creating graph objects.

1.1.1 Load the data into a DataFrame

In this data, each record or row is typical of what is available in a CDR, i.e., the actors involved, the starting time of the interaction, the duration of the interaction, who initiated it, and who was the recipient, among other details not included here (such as the geolocation of the sender and receiver).


In [ ]:
# Read the CallLog.csv file, print the number of records loaded as well as the first 5 rows.
calls = pd.read_csv('../data/CallLog.csv')
print('Loaded {0} rows of call log.'.format(len(calls)))
calls.head()

1.1.2 Row filtering

In the data set, calls to outsiders appear as entries where one of the participant IDs is "NaN". These are not relevant to the current exercise, and need to be removed before you proceed. Remove all calls where one of the participant IDs is missing. First, check the number of records in your DataFrame, using the DataFrame's shape attribute.


In [ ]:
# Initial number of records.
calls.shape[0]

Next, review the data using the info() method.


In [ ]:
calls.info()

Next, you will clean the data by removing interactions involving outsiders as discussed above. Removing missing values is very common in data analysis, and Pandas has a convenient method, appropriately named dropna(), designed to automate this cleaning process.


In [ ]:
# Drop rows with NaN in either of the participant ID columns.
calls = calls.dropna(subset = ['participantID.A', 'participantID.B'])
print('{} rows remaining after dropping missing values from selected columns.'.format(len(calls)))
calls.head(n=5)

1.1.3 Column selection

For the purpose of this study, you should only focus on the social actors involved in the call interaction. Therefore, you can remove all columns not relevant to the network being analyzed.


In [ ]:
# Create a new object containing only the columns of interest.
interactions = calls[['participantID.A', 'participantID.B']]

Finally, exclude rows where the actors are the same.


In [ ]:
# Get a list of rows with different participants.
row_with_different_participants = interactions['participantID.A'] != interactions['participantID.B']

# Update "interactions" to contain only the rows identified. 
interactions = interactions.loc[row_with_different_participants,:]
interactions.head()

1.2 Creating graph objects with NetworkX

The call interactions captured above are directed, meaning that edges (u,v) and (v,u) are different.

First, let's try to capture the number of interactions between social actors, irrespective of who initiated the call. This will be done using an undirected graph. You will need to capture the number of interactions between any pair of actors with a link in the graph. Therefore, the graph object that needs to be created is a weighted undirected graph.
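
As a quick illustration of this distinction, here is a minimal sketch (independent of the call data) that contrasts a directed and an undirected graph built from the same edge list:


In [ ]:
# In a directed graph, (u,v) and (v,u) are distinct edges;
# in an undirected graph, they collapse into a single edge.
d = nx.DiGraph([('a', 'b'), ('b', 'a')])
u = nx.Graph([('a', 'b'), ('b', 'a')])
print(d.number_of_edges(), u.number_of_edges())  # prints: 2 1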

NetworkX can create graphs directly from a Pandas DataFrame. The following demonstration illustrates how to build an unweighted, undirected graph.


In [ ]:
# Create an unweighted undirected graph using the NetworkX's from_pandas_edgelist method.
# The column participantID.A is used as the source and participantID.B as the target.
G = nx.from_pandas_edgelist(interactions, 
                             source='participantID.A', 
                             target='participantID.B', 
                             create_using=nx.Graph())

Review basic information on your graph.


In [ ]:
# Print the number of nodes in our network.
print('The undirected graph object G has {0} nodes.'.format(G.number_of_nodes()))

# Print the number of edges in our network.
print('The undirected graph object G has {0} edges.'.format(G.number_of_edges()))

In the following cells, the neighbors of five of the nodes are saved in a Python dict, with the node label as key, and then printed.


In [ ]:
# Declare a variable for number of nodes to get neighbors of.
max_nodes = 5

In [ ]:
# Variable initialization.
count = 0
ndict = {}

# Loop through G and store each node's neighbors in ndict. Do this for at most 'max_nodes' nodes.
for node in list(G.nodes()):
    ndict[node] = tuple(G.neighbors(node))
    count = count + 1
    if count >= max_nodes:
        break

In [ ]:
print(ndict)

In [ ]:
# Print only the first item in the dict.
first_node = list(ndict)[0]
print(first_node, ndict[first_node])
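
The counter-based loop above can also be written more compactly; a sketch using itertools.islice:


In [ ]:
# Equivalent, more idiomatic version of the loop above (a sketch using itertools.islice).
from itertools import islice
ndict_alt = {node: tuple(G.neighbors(node)) for node in islice(G.nodes(), max_nodes)}
print(ndict_alt)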

Your original objective is to create a weighted undirected graph for call interactions, with the weights representing the number of interactions between two distinct participants. As illustrated above, you can use the "from_pandas_edgelist" method to build an undirected graph between the pairs of actors, by specifying the graph structure through the "create_using=" argument. To get the correct weights in the undirected graph, however, you will need to add the weight information separately. Unfortunately, you cannot rely on NetworkX to do this, because when the directed edges (u,v) and (v,u) collapse into a single undirected edge, there is no control over which edge's data is kept. Below is a description of how to add the necessary weights to the undirected graph.
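
Before stepping through that construction, note that the same weighted undirected graph can also be built in a single pass by normalizing the order of each pair first. A minimal sketch, assuming the "interactions" DataFrame from above (the names "pairs", "pair_counts", and "G_alt" are illustrative):


In [ ]:
# One-pass alternative (a sketch): sort each pair so that (u,v) and (v,u) map to the
# same key, count the interactions, and build the undirected graph directly.
pairs = pd.DataFrame(np.sort(interactions.values, axis=1), columns=['u', 'v'])
pair_counts = pd.DataFrame(pairs.groupby(['u', 'v']).size(), columns=['counts']).reset_index()
G_alt = nx.from_pandas_edgelist(pair_counts, source='u', target='v',
                                edge_attr='counts', create_using=nx.Graph())
# G_alt should match the graph G constructed step by step below.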

The first task is to compute the number of interactions between participants. You will use the Pandas "groupby" DataFrame method to achieve this.


In [ ]:
# Get the count of interactions between participants and display the top 5 rows.
grp_interactions = pd.DataFrame(interactions.groupby(['participantID.A', 'participantID.B']).size(), 
                                columns=['counts']).reset_index()

grp_interactions.head(5)

In [ ]:
# View the documentation for the method used in the next cell.
nx.from_pandas_edgelist?

In [ ]:
# Create a directed graph with an edge_attribute labeled counts.
g = nx.from_pandas_edgelist(grp_interactions, 
                             source='participantID.A', 
                             target='participantID.B', 
                             edge_attr='counts', 
                             create_using=nx.DiGraph())

Instantiate a weighted undirected graph, and populate edge information using the edges list from the directed graph.


In [ ]:
# Initialize all the counts to 0 at this stage. We will add the correct weight information in the next step.
G = nx.Graph()
G.add_edges_from(g.edges(), counts=0)

Now, iterate through each link from the directed graph, adding the attribute weight (counts) to the corresponding link in the undirected graph.


In [ ]:
for u, v, d in g.edges(data=True):
    G[u][v]['counts'] += d['counts']

Look at some of the edges and their corresponding weights.


In [ ]:
# Print a sample of the edges, with corresponding attribute data.
max_number_of_edges = 5
count = 0
for n1,n2,attr in G.edges(data=True): # unpacking
    print(n1,n2,attr)
    count = count + 1
    if count >= max_number_of_edges:
        break

You can verify whether the steps you executed above have worked using the following:


In [ ]:
# Verify our attribute data is correct using a selected (u,v) pair from the data.
u = 'fa10-01-77'
v = 'fa10-01-78'
print('Number of undirected call interactions between {0} and {1} is {2}.'.format(u,
                                                                    v,
                                                            G.get_edge_data(v,u)['counts']))

In [ ]:
# Compare our data set to the interactions data set.
is_uv_pair = ((interactions['participantID.A'] == u) & (interactions['participantID.B'] == v)) 
is_vu_pair = ((interactions['participantID.A'] == v) & (interactions['participantID.B'] == u))
print('Number of undirected call interactions between {0} and {1} is {2}'.format(u,
                                            v, 
                                            interactions[is_uv_pair | is_vu_pair].shape[0]))

Based on the comparison above, it can be said with confidence that your graph object captures the interactions as expected.


Exercise 1 Start.

Instructions

Calculate the number of call interactions between participant sp10-01-52 and participant fa10-01-81 captured in your graph, using any of the above approaches.


In [ ]:
# Your answer here.


Exercise 1 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

1.3 Graph visualization

The next step is to visualize the graph object – a topic that you briefly touched on in Notebook 1. NetworkX is not primarily a graph drawing package, but provides basic drawing capabilities using Matplotlib. More advanced graph visualization packages can be used. However, these are outside of the scope of this course.

NetworkX documentation (2015) states:

Proper graph visualization is hard, and we highly recommend that people visualize their graphs with tools dedicated to that task. Notable examples of dedicated and fully-featured graph visualization tools are Cytoscape, Gephi, Graphviz and, for LaTeX typesetting, PGF/TikZ.

A graph is an abstract mathematical object without a specific representation in Cartesian coordinate space, so graph visualization is not a well-defined problem with a unique solution. Depending on which structures in the graph object are of interest, several layout algorithms exist that optimize node positioning for display. Whenever you want to visualize a graph, you first have to find a mapping from vertices to Cartesian coordinates, preferably in a way that is aesthetically pleasing. A separate branch of graph theory, namely graph drawing, attempts to solve this problem via several graph layout algorithms.

You will use the interface provided by Graphviz for node positioning in most of your visualizations in this course, because considering other possibilities may distract from the core objectives. Two node positioning algorithms can be accessed using the Graphviz interface provided by NetworkX. They are the following:

  • dot: "hierarchical" or layered drawings of directed graphs. This is the default to use if edges have directionality. The dot algorithm produces a ranked layout of a graph honoring edge directions. It is particularly appropriate for displaying hierarchies or directed acyclic graphs.
  • neato: "spring model" layouts. This is the default to use if the graph is not too large (about 100 nodes), and you don't know anything else about it. Neato attempts to minimize a global energy function, which is equivalent to statistical multidimensional scaling. An ideal spring is placed between every pair of nodes, such that its length is set to the shortest path distance between the endpoints. The springs push the nodes so their geometric distance in the layout approximates their path distance in the graph.

Below is a visual display of your weighted undirected call graph, using different visualization approaches.

1.3.1 Graphviz layout using the "dot" engine


In [ ]:
pos = graphviz_layout(G, prog='dot') # you can also try using the "neato" engine
nx.draw_networkx(G, pos=pos, with_labels=False)
_ = plt.axis('off')

1.3.2 Graph visualization using NetworkX's in-built "spring layout"


In [ ]:
layout = nx.spring_layout(G)
nx.draw_networkx(G, pos=layout, node_color='green', with_labels=False)
_ = plt.axis('off')

1.3.3 Graph visualization with Pydot rendering


In [ ]:
pydot(G)


Exercise 2 Start.

Instructions

Based on the various visualizations explored above, what can you tell about these networks and the types of interactions they capture? Please provide written feedback (a sentence or two), based on your insights into the call log data, in the markdown cell below.

Hint:

  • In your answer, indicate whether there appears to be some structure in the graph, or whether the connections between nodes appear random (i.e., do some nodes have more links than others?). Do the participants cluster into identifiable communities or not?

Your markdown answer here.


Exercise 2 End.

Exercise complete:

This is a good time to "Save and Checkpoint".


Exercise 3 Start.

Instructions

You will now need to reproduce the steps above for SMS records.

  1. Load the file "SMSLog.csv" from the data folder in your home directory, into a variable "sms".
  2. Create a weighted undirected graph using the number of interactions between participants as weights.
  3. Assign the graph to variable "H" (do not overwrite "G" as you will still use it below).
  4. Ignore all interactions where one of the parties is missing or unknown (i.e. "NaN").
  5. Disregard any self-interactions.
  6. Display a visualization of the obtained graph network, using the spring_layout algorithm for node positioning.

Hints:

  • Make sure that you use different variables when loading the datasets, and remember that you can always insert additional cells in the notebook, should you prefer to break up steps or perform additional investigations.

  • It is good practice to make clear comments (start the line with #) in your code when sharing your work or if you need to review it at a later stage. Make sure that you add comments to enable your tutor to understand your thinking process.

  • The number of cells below is only indicative. You can insert additional cells as required.


In [ ]:
# Your answer here.

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


Exercise 3 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2. Computing and visualizing structural properties of networks

Physical networks exhibit different behaviors and, since graph objects are abstractions of these networks, you might expect the corresponding graphs to differ as well. To characterize these differences, you need more than visualizations that are pleasing to the eye. To this end, a number of metrics have been developed that characterize the structural properties of graphs. These properties help you understand and describe physical networks with more mathematical rigor. You will now explore the characteristics discussed in the video content.

2.1 Degree distribution

The degree of a node in a network is the number of connections it has to other nodes, and the degree distribution (also referred to as the neighbor distribution) is the probability distribution of these degrees over the whole network. Specifically, the degree distribution $p(k)$ is the probability that a randomly-chosen node has $k$ connections (or neighbors).
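
As a minimal sketch of this definition (assuming the call graph G from above), the empirical $p(k)$ can be computed by counting how many nodes have each degree and dividing by the total number of nodes:


In [ ]:
# Empirical degree distribution p(k): the fraction of nodes with degree k (a sketch).
from collections import Counter
degree_counts = Counter(dict(G.degree()).values())
p_k = {k: count / G.number_of_nodes() for k, count in sorted(degree_counts.items())}
print(p_k)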

2.1.1 Degree distribution histogram

A degree distribution histogram is a plot of the frequency of occurrence of the number of connections or neighbors, based on the relationships (edges) between entities (nodes), as represented by a graph object.

Continuing with the call data (graph G), from the preceding sections, you will now compute and plot the degree distribution of the data.


In [ ]:
# Extract the degree values for all the nodes of G
degrees = []
for (nd,val) in G.degree():
    degrees.append(val)

In [ ]:
# Plot the degree distribution histogram.
out = plt.hist(degrees, bins=50)
plt.title("Degree Histogram")
plt.ylabel("Frequency Count")
plt.xlabel("Degree")

2.1.2 Logarithmic plot of the degree distribution

In many cases, the histogram distribution is best represented using a log-log plot.


In [ ]:
# Logarithmic plot of the degree distribution.
values = sorted(set(degrees))
hist = [list(degrees).count(x) for x in values]
out = plt.loglog(values, hist, marker='o')
plt.title("Degree Histogram")
plt.ylabel("Log(Frequency Count)")
plt.xlabel("Log(Degree)")

2.2 Node centrality

Centrality measures provide relative measures of importance in a network. There are many different centrality measures, and each measures a different type of importance. In the video content, you were introduced to the following centrality measures (a small worked example follows the list):

  1. Degree centrality: Number of connections. An important node is involved in a large number of interactions. For directed graphs, the in-degree and out-degree concepts are used. The in-degree of a node v is the number of edges with v as the terminal vertex, and the out-degree of v is the number of edges with v as the initial vertex.

  2. Closeness centrality: Average length of the shortest paths between a specific node and all other nodes in the graph. An important node is typically close to, and can communicate quickly with, the other nodes in the network.

  3. Betweenness centrality: Measures the extent to which a particular vertex lies on the path between all other vertices. An important node will lie on a high proportion of paths between other nodes in the network.

  4. Eigenvector centrality: An important node is connected to important neighbors.
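
To make these definitions concrete, here is a small sketch computing all four measures on a toy graph (the graph itself is an illustrative assumption):


In [ ]:
# Compare the four centrality measures on a small toy graph (a sketch).
toy = nx.Graph([(1, 2), (2, 3), (3, 4), (3, 5)])
for name, fn in [('degree', nx.degree_centrality),
                 ('closeness', nx.closeness_centrality),
                 ('betweenness', nx.betweenness_centrality),
                 ('eigenvector', nx.eigenvector_centrality)]:
    print(name, {n: round(c, 2) for n, c in fn(toy).items()})

Node 3 scores highest on all four measures in this toy graph; on larger graphs, the rankings typically diverge, as you will see below.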

The following schematic is a demonstration and comparison of the first three of the centrality metrics discussed above. In this figure, Node X always has the highest centrality measure, although it measures different behaviors in each case.

NetworkX provides the functionality to evaluate these metrics for graph objects, which will be described in the next section.

2.2.1 Degree centrality


In [ ]:
# Plot degree centrality.
call_degree_centrality = nx.degree_centrality(G)
colors =[call_degree_centrality[node] for node in G.nodes()]
pos = graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos, node_color=colors, node_size=300, with_labels=False)
_ = plt.axis('off')

The visual above uses different colors to highlight the degree centrality of each node. Blue and purple nodes have low values, while yellow and green nodes have the highest centrality values in the network. Although it is possible to add label information to the nodes, the visual can become too busy and difficult to read. In the following example, the data is sorted in descending order of the degree centrality measure, so that the node with the highest value appears at the top, followed by the node with the next-highest value, and so forth.


In [ ]:
# Arrange in descending order of centrality and return the result as a tuple, i.e. (participant_id, deg_centrality).
t_call_deg_centrality_sorted = sorted(call_degree_centrality.items(), key=lambda kv: kv[1], reverse=True)

# Convert tuple to pandas dataframe.
df_call_deg_centrality_sorted = pd.DataFrame([[x,y] for (x,y) in t_call_deg_centrality_sorted], 
                                             columns=['participantID', 'deg.centrality'])

Note:

In NetworkX, the degree centrality values are normalized by dividing by the maximum possible degree in a simple graph ($n-1$), where $n$ is the number of nodes in the graph. To get integer values, when required, the computed degree centrality values are multiplied by ($n-1$).
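
A quick check of this normalization on the top-ranked node (a sketch, assuming the objects computed above):


In [ ]:
# Degree centrality multiplied by (n-1) should recover the raw degree (a sketch).
node = df_call_deg_centrality_sorted.participantID[0]
n = G.number_of_nodes()
print(call_degree_centrality[node] * (n - 1), G.degree(node))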

You can print the nodes with the highest degree centrality measure using "head()".


In [ ]:
# Top 5 participants with the highest degree centrality measure.
df_call_deg_centrality_sorted.head()

Here are some immediate questions to ask:

  1. How many unique actors are associated with each of the five participants with the highest degree centrality measure?
  2. How many total call interactions are associated with each of those five participants?

These questions are answered below.


In [ ]:
# Number of unique actors associated with each of the five participants with highest degree centrality measure.
for node in df_call_deg_centrality_sorted.head().participantID:
    print('Node: {0}, \t num_neighbors: {1}'.format(node, len(list(G.neighbors(node)))))

In [ ]:
# Total call interactions are associated with each of these five participants with highest degree centrality measure.
for node in df_call_deg_centrality_sorted.head().participantID:
    outgoing_call_interactions = interactions['participantID.A']==node
    incoming_call_interactions = interactions['participantID.B']==node
    all_call_int = interactions[outgoing_call_interactions | incoming_call_interactions]
    print('Node: {0}, \t total number of calls: {1}'.format(node, all_call_int.shape[0]))

2.2.2 Closeness centrality


In [ ]:
# Plot closeness centrality.
call_closeness_centrality = nx.closeness_centrality(G)
colors = [call_closeness_centrality[node] for node in G.nodes()]
pos = graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos=pos,node_color=colors, with_labels=False)
_ = plt.axis('off')

The single node with the highest closeness centrality is distinguished by its yellow color, whereas those with lower values are depicted in a gradation of colors from green down to purple. To propagate information quickly in the network, one would need to involve nodes with a high closeness centrality measure.

Below, you will identify these nodes explicitly, and store the data in a separate DataFrame.


In [ ]:
# Arrange participants according to closeness centrality measure, in descending order. 
# Return the result as a tuple, i.e. (participant_id, cl_centrality).
t_call_clo_centrality_sorted = sorted(call_closeness_centrality.items(), key=lambda kv: kv[1], reverse=True)

# Convert tuple to pandas dataframe.
df_call_clo_centrality_sorted = pd.DataFrame([[x,y] for (x,y) in t_call_clo_centrality_sorted], 
                                             columns=['participantID', 'clo.centrality'])

In [ ]:
# Top 5 participants with the highest closeness centrality measure.
df_call_clo_centrality_sorted.head()

2.2.3 Betweenness centrality


In [ ]:
# Plot betweenness centrality.
call_betweenness_centrality = nx.betweenness_centrality(G)
colors =[call_betweenness_centrality[node] for node in G.nodes()]
pos = graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos=pos, node_color=colors, with_labels=False)
_ = plt.axis('off')

Betweenness centrality is a measure of the influence a node has over the spread of information through the network. Such nodes are strategically positioned, and dictate information flow across the network. In the visual above, two nodes (one yellow and the other blue-green) are highlighted as the key nodes that govern information flow in the network. You can identify these nodes explicitly by re-arranging the data in descending order of the betweenness centrality measure.


In [ ]:
# Arrange participants according to betweenness centrality measure, in descending order. 
# Return the result as a tuple, i.e. (participant_id, btn_centrality). 
t_call_btn_centrality_sorted = sorted(call_betweenness_centrality.items(), key=lambda kv: kv[1], reverse=True)

# Convert tuple to a Pandas DataFrame.
df_call_btn_centrality_sorted = pd.DataFrame([[x,y] for (x,y) in t_call_btn_centrality_sorted], 
                                             columns=['participantID', 'btn.centrality'])

In [ ]:
# Top 5 participants with the highest betweenness centrality measure.
df_call_btn_centrality_sorted.head()

2.2.4 Eigenvector centrality

The eigenvector centrality measure is based on the idea that a node is important if it is linked to other important nodes. Eigenvector centrality characterizes the "global" (as opposed to "local") prominence of a vertex in a graph. Google's PageRank algorithm is a variant of eigenvector centrality.


In [ ]:
# Plot eigenvector centrality.
call_eigenvector_centrality = nx.eigenvector_centrality(G)
colors = [call_eigenvector_centrality[node] for node in G.nodes()]
pos = graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos=pos, node_color=colors,with_labels=False)
_ = plt.axis('off')

Now, identify the nodes with the highest eigenvector centrality.


In [ ]:
# Arrange participants according to eigenvector centrality measure, in descending order. 
# Return the result as a tuple, i.e. (participant_id, eig_centrality).
t_call_eig_centrality_sorted = sorted(call_eigenvector_centrality.items(), key=lambda kv: kv[1], reverse=True)

# Convert tuple to pandas dataframe.
df_call_eig_centrality_sorted = pd.DataFrame([[x,y] for (x,y) in t_call_eig_centrality_sorted], 
                                             columns=['participantID', 'eig.centrality'])

In [ ]:
# Top 5 participants with the highest eigenvector centrality measure. 
df_call_eig_centrality_sorted.head()
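
Since PageRank is a variant of eigenvector centrality, as noted above, it can be instructive to compare the two rankings; a sketch using NetworkX's pagerank:


In [ ]:
# Top 5 nodes by PageRank (a damped variant of eigenvector centrality), for comparison (a sketch).
pagerank = nx.pagerank(G)
sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)[:5]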

2.3 Comparing the connectedness measures

Based on the above examples, you should now have a sense that the centrality metrics are similar regarding how they rank important nodes, since a number of nodes appear near the top across the different metrics. However, they also differ, because they do not always return the same nodes. This section provides analysis tools that assist in comparing these metrics.

Below, you are provided with a plotting function that accepts two objects containing different centrality measures as Python dicts. The metrics you found above are all in the form of Python dict objects, and can be used as is. The function plots a scatter plot of the metrics against each other, so as to compare the centrality measures.

Note:

You do not need to understand the function and its syntax. It is included for advanced students, and to make it possible to demonstrate the concepts in this section to you more easily. All you need to do is execute the cell below.


In [ ]:
# Execute this cell to define a function that produces a scatter plot.
def centrality_scatter(dict1, dict2, path="", ylab="", xlab="", title="", line=False):
    '''
    The function accepts two dicts containing centrality measures, and outputs a
    scatter plot showing the relationship between the two centrality measures.
    '''
    # Set up the figure and drawing axis.
    fig, ax1 = plt.subplots(figsize=(8, 8))

    # Sort the items and extract the centrality values.
    items1 = sorted(list(dict1.items()), key=lambda kv: kv[1], reverse=True)
    items2 = sorted(list(dict2.items()), key=lambda kv: kv[1], reverse=True)
    xdata = [b for a, b in items1]
    ydata = [b for a, b in items2]
    ax1.scatter(xdata, ydata)

    if line:
        # Use NumPy to calculate the line of best fit.
        slope, yint = np.polyfit(xdata, ydata, 1)
        xline = plt.xticks()[0]
        yline = [slope * x + yint for x in xline]
        ax1.plot(xline, yline, ls='--', color='b')
        # Set new x- and y-axis limits.
        plt.xlim((0.0, max(xdata) + (.15 * max(xdata))))
        plt.ylim((0.0, max(ydata) + (.15 * max(ydata))))

    # Add labels.
    ax1.set_title(title)
    ax1.set_xlabel(xlab)
    ax1.set_ylabel(ylab)

2.3.1 Compare betweenness centrality and degree centrality

Use a scatter plot to visually compare the betweenness centrality and degree centrality.


In [ ]:
# Let us compare call_betweenness_centrality and call_degree_centrality.
centrality_scatter(call_betweenness_centrality, call_degree_centrality,
                   xlab='betweenness centrality', ylab='degree centrality', line=True)

The distribution of the points in the scatter plot above is unclear due to the effect of the two nodes with very high centrality values. To better understand the distribution of other nodes, remove these nodes from the data set, and redraw the scatter plot.


In [ ]:
# Make a (deep) copy of the graph; we work on the copy.
g2 = G.copy()

# Remove the 2 nodes with the highest centrality measures, as discussed above.
g2.remove_node('fa10-01-04')
g2.remove_node('fa10-01-13')

# Recompute the centrality measures.
betweenness2 = nx.betweenness_centrality(g2)
centrality2 = nx.degree_centrality(g2)

# Scatter plot comparison of the recomputed measures.
centrality_scatter(betweenness2, centrality2, 
                   xlab='betweenness centrality',ylab='degree centrality',line=True)

2.3.2 Merge the centrality measures into a single DataFrame

You can also use Pandas to merge all of the centrality measures that you have computed into a single DataFrame. The merge method accepts two DataFrames, and merges them on a column that is common to both. Below, merge is called repeatedly, eventually producing a single DataFrame.


In [ ]:
m1 = pd.merge(df_call_btn_centrality_sorted, df_call_clo_centrality_sorted)
m2 = pd.merge(m1, df_call_deg_centrality_sorted)
df_merged  = pd.merge(m2, df_call_eig_centrality_sorted)
df_merged.head()

The above Pandas functionality is generally quite useful when you are presented with data from different sources, and would like to combine them into a single DataFrame using a shared column.

Save the merged DataFrame for future use.


In [ ]:
df_merged.to_csv('centrality.csv', index=False)


Exercise 4 [Advanced] Start.

Instructions

  1. Make a copy of the Graph G, and assign it to variable "g3".

  2. Remove the two nodes with the highest betweenness centrality measures from "g3".

  3. Recompute the degree and eigenvector centrality measure for "g3", and assign the output to the variables "deg_centrality3" and "eig_centrality3", respectively.

  4. Make a scatter plot comparison of the centrality measures, computed in the previous step, using the "centrality_scatter" plot function provided.


In [ ]:
# Your answer here.

In [ ]:


Exercise 4 [Advanced] End.

Exercise complete:

This is a good time to "Save and Checkpoint".

2.4 Average path length and network diameter

The size of a network can be calculated using two measures. The first measure of network size is simply the average distance within the network, which is equal to the average of the distances between all possible pairs of vertices. The average path length shows, on average, the number of steps it takes to get from one member of the network to another.

The second, alternative measure of network size is the diameter, which is defined as the shortest distance between the two most distant nodes in the network. It is representative of the linear size of a network. To compute the diameter, one looks at all of the shortest-path distances between pairs of nodes in the network, and takes the largest.

Both are measures of the size of the network, and can be understood as the distance between the nodes. Unlike the connectedness centrality measures, which are node-focused, average path length and diameter are global metrics on the structure of the graph. Along with degree distribution and clustering coefficient, average path length provides a robust measure of network topology.


In [ ]:
print('Diameter {}'.format(nx.diameter(G)))
print('Average path length {:0.2f}'.format(nx.average_shortest_path_length(G)))
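
The diameter can be cross-checked via node eccentricities (a node's eccentricity is its greatest shortest-path distance to any other node); a sketch:


In [ ]:
# The diameter equals the maximum eccentricity over all nodes (a sketch cross-check).
eccentricities = nx.eccentricity(G)
print('Maximum eccentricity: {}'.format(max(eccentricities.values())))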

2.5 Clustering coefficient

The clustering coefficient is used to measure the extent to which nodes tend to cluster together. This measure can be understood as the "friends of my friends are friends" measure. In most real-world networks, such as social networks, nodes tend to create tight-knit groups, characterized by a relatively high density of connections between nodes. This likelihood tends to be greater than the average probability of a tie randomly established between two nodes. A high clustering coefficient for a network is an indication of a small world, which is a phenomenon in which two strangers often find that they have a friend in common. Human social networks, such as on Facebook, Twitter, or LinkedIn, typically exhibit the feature that, in any cluster of friends, each friend is also connected to other friends.

Two definitions of the clustering coefficient of a graph are commonly used:

  1. Global clustering
  2. Average local clustering

The global clustering coefficient, or transitivity, was discussed in the video content, and is a measure designed to give an overall indication of the clustering in the network. It is based on the concept of triplets of nodes. A triplet is three nodes that are connected by either two (open triplet) or three (closed triplet) undirected ties. A triangle consists of three closed triplets, one centered on each of the nodes. The global clustering coefficient is the number of closed triplets (or 3 $\times$ triangles) over the total number of triplets (both open and closed). In NetworkX, this measure can be obtained using a method called "transitivity", as demonstrated below.


In [ ]:
print('The global clustering for our graph G is {0:.4f}'.format(nx.transitivity(G)))
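
The triplet definition can be verified directly by counting triangles and triplets by hand, and comparing with NetworkX's transitivity; a sketch, assuming G from above:


In [ ]:
# Cross-check (a sketch): global clustering = 3 x triangles / triplets.
closed_triplets = sum(nx.triangles(G).values())             # each triangle is counted 3 times
all_triplets = sum(d * (d - 1) / 2 for _, d in G.degree())  # open + closed triplets
print('{0:.4f}'.format(closed_triplets / all_triplets))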

Watts and Strogatz (1998) proposed another clustering definition, referred to as the local clustering coefficient. More detail can be found in the caption of Figure 2 of their freely-available paper on the collective dynamics of small-world networks. The local clustering coefficient gives an indication of the embeddedness of a single node, or how concentrated that node's neighborhood is. It is given by the ratio of the number of actual edges between a node's neighbors to the number of potential edges between those neighbors. The clustering coefficient of the network is then given as the average of the vertex clustering coefficients.

The intuition behind the local clustering coefficient is illustrated below, using the following network.


In [ ]:
# Create a graph object from a provided edges list and visualize the graph. 
e = [(1,2), (2,3), (3,4), (4,5), (4,6), (5,6), (5,7), (6,7)]
g =  nx.Graph()
g.add_edges_from(e)
pos = graphviz_layout(g, prog='dot')
nx.draw_networkx(g, pos=pos, with_labels=True, node_size=500)
_ = plt.axis('off')

Choosing Node 5 as the node of interest, let's calculate its clustering coefficient (i.e., how concentrated its neighborhood is).


In [ ]:
# Number of actual edges between the neighbors of Node 5.
actual_edges = len([(4,6), (6,7)])

# Total possible edges between the neighbors of Node 5.
total_possible_edges = len([(4,6), (6,7), (4,7)])

# Local clustering coefficient of Node 5.
local_clustering_coef = 1.0 * actual_edges / total_possible_edges

print(local_clustering_coef)

NetworkX contains a method for calculating the clustering coefficient for all the nodes in a graph. You can specify a list of nodes as an argument when calling the method on a graph.

Note:

The clustering coefficient is defined as zero when a node has fewer than two neighbors.


In [ ]:
# Local clustering for Node 5.
print(nx.clustering(g, nodes=[5])[5])

Do all nodes in a network or graph have the same local clustering coefficient? Call the method without specifying the "nodes" argument.


In [ ]:
nx.clustering(g)

When applied to a single node, the clustering coefficient is a measure of how complete its neighborhood is. When applied to an entire network, it is the average clustering coefficient over all of the nodes in the network. Again, you can also compute this using NetworkX.


In [ ]:
# Compute the average clustering coefficient for our graph g directly.
print('(Direct) The average local clustering coefficient of the network is {:0.3f}'.format(
    np.mean(list(nx.clustering(g).values()))))

# Or using NetworkX.
print('(NetworkX) The average local clustering coefficient of the network is {:0.3f}'.format(
    nx.average_clustering(g)))

With this background, you can now calculate the average clustering coefficient of the call data network from above.

Note:

On the network level, there are two versions of the clustering coefficient. The first one is the global clustering coefficient that you computed at the top of Section 2.5, and is discussed in the video content by Xiaowen Dong. The second one is the average local clustering coefficient of all the nodes in the network.

In most cases, nodes with a degree below a certain threshold are of little or no interest. For visualization purposes in particular, dense graphs are computationally costly to draw, and do not render well when all of the nodes are included. Hence, you may want to exclude low-degree nodes from further analysis.

The next section investigates the effect of removing low-degree nodes on the clustering coefficient.

Note:

You do not need to understand the function and its syntax. It is included for advanced users, and to make it possible to demonstrate the concepts in this section to you more easily. All you need to do is execute the cell below.


In [ ]:
# Define a function that trims.
def trim_degrees(g, degree=1):
    """
    Trim the graph by removing nodes with degree less than or equal to the 'degree' parameter.
    Returns a copy of the graph.
    """
    g2 = g.copy()
    d = dict(g2.degree())
    for n in g.nodes():
        if d[n] <= degree:
            g2.remove_node(n)
    return g2

In [ ]:
# Effect of removing weakly connected nodes.
G1 = trim_degrees(G, degree=1)
G3 = trim_degrees(G, degree=3)
G5 = trim_degrees(G, degree=5)

# Compare the clustering coefficient of the resulting network objects.
print((round(nx.average_clustering(G),3), round(nx.average_clustering(G1),3), round(nx.average_clustering(G3),3),
round(nx.average_clustering(G5),3)))


Exercise 5 Start.

Instructions

In the previous example, nodes with lower degrees were removed from the graph, and the average clustering coefficient recomputed.

Describe, in one sentence, the effect on the average clustering coefficient as these nodes were removed.

Your markdown answer.


Exercise 5 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

3. Small-world networks

Networks incorporating the two characteristics of a high clustering coefficient $C(p)$, and a low mean shortest path or characteristic path length $L(p)$ (though slightly above that of a random network), i.e.,

\begin{equation} C(p) \gg C_{\text{random}} \hspace{0.5cm}\text{and}\hspace{0.5cm} L(p) \gtrsim L_{\text{random}} \end{equation}

are known as small-world networks. The name comes from the so-called "small world" phenomenon in which two strangers find that they have a friend in common. Human social networks typically exhibit the feature that, in any cluster of friends, each friend is also connected to other clusters. Consequently, it usually takes only a short string of acquaintances to connect any two people in a network.
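
These criteria can be illustrated numerically with NetworkX's built-in Watts–Strogatz generator; a sketch comparing a slightly-rewired ring lattice to a random graph of the same size (the parameter values are illustrative assumptions):


In [ ]:
# Illustrate the small-world criteria (a sketch with illustrative parameters).
ws = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.05, seed=42)
rnd = nx.gnm_random_graph(n=1000, m=ws.number_of_edges(), seed=42)
# Use the largest connected component in case the random graph is disconnected.
rnd_cc = rnd.subgraph(max(nx.connected_components(rnd), key=len))
print('C(ws) = {:.3f} vs C(random) = {:.3f}'.format(nx.average_clustering(ws),
                                                    nx.average_clustering(rnd)))
print('L(ws) = {:.2f} vs L(random) = {:.2f}'.format(nx.average_shortest_path_length(ws),
                                                    nx.average_shortest_path_length(rnd_cc)))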


Exercise 6 Start.

Instructions

Which two criteria are typically used to identify a network as a small-world network?

Type your answer in the markdown cell below.

Your markdown answer.


Exercise 6 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

Consider the following data for three different real-world networks (Watts and Strogatz 1998), which were also mentioned in the video content. The first system is a collaboration graph of feature-film actors. The second is the electrical power grid of the western United States. The third is the neural network of the nematode worm Caenorhabditis elegans (C. elegans). The graph of film actors is a surrogate for a social network, with the advantage of being much easier to specify. The graph of the power grid is relevant to the efficiency and robustness of power networks, and C. elegans is the sole example of a completely mapped neural network. The graphs are defined as follows:

  • Two actors are joined by an edge if they have acted in a film together.
  • For the power grid, vertices represent generators, transformers, and substations, and edges represent high-voltage transmission lines between them.
  • For C. elegans, an edge joins two neurons if they are connected by either a synapse or a gap junction.

The characteristic path lengths and clustering coefficients are reproduced for the three networks in the DataFrame, and the published comparison to random graphs (for the same number of vertices and average number of edges per vertex) is provided.


In [ ]:
# Reproduce the table from the study.
tbl = pd.DataFrame(np.transpose(np.array([[3.65, 18.7, 2.65],
                                          [2.99, 12.4, 2.25],
                                          [0.79, 0.080, 0.28],
                                          [0.00027, 0.005, 0.05]])),
             index=['Film actors', 'Power grid', 'C.elegans'],
            columns = ['L.actual','L.random','C.actual','C.random'])
tbl

Now, extend the table structure from above by adding two Boolean-type columns that give the results of comparing "L.actual" to "L.random", and "C.actual" to "C.random".


In [ ]:
# Compare 'L.actual' to 'L.random' to return a Boolean based on the above
tbl['L.actual gt L.random'] = tbl['L.actual'] > tbl['L.random']
tbl

Finally, the relative values "C.actual" to "C.random" can be compared.


In [ ]:
tbl['C.actual gt C.random'] = tbl['C.actual'] > tbl['C.random']
tbl


Exercise 7 Start.

Instructions

Based on the above computations, and the definition of a small-world network in the introduction to Section 3 of this notebook, list which of the three networks (film actors, power grid, and C. elegans) exhibit small-world phenomena.

Hint:

You can refer to the video content for the solution.

Your markdown answer here.


Exercise 7 End.

Exercise complete:

This is a good time to "Save and Checkpoint".

4. Saving and cleaning up

Save your network data for use in the next notebook. NetworkX conveniently provides methods that, combined with Pandas, enable you to save the adjacency matrix.


In [ ]:
# Create the adjacency matrix for call records using NetworkX and Pandas functions.
nodes = list(G.nodes())
call_adjmatrix = nx.to_pandas_adjacency(G,nodelist=nodes)
call_adjmatrix.to_csv('./call.adjmatrix')
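
As a quick sanity check, the saved matrix can be read back and turned into a graph again; a sketch using from_pandas_adjacency:


In [ ]:
# Round-trip check (a sketch): reload the adjacency matrix and rebuild the graph.
adj = pd.read_csv('./call.adjmatrix', index_col=0)
G_check = nx.from_pandas_adjacency(adj)
print(G_check.number_of_edges() == G.number_of_edges())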

5. References

Zipf, George Kingsley. 1949. Human behavior and the principle of least effort: An introduction to human ecology. Reading: Addison-Wesley Press.

Blondel, Vincent D., Adeline Decuyper, and Gautier Krings. 2015. “A survey of results on mobile phone datasets analysis.” EPJ Data Science 4:1-55. doi:10.1140/epjds/s13688-015-0046-0.

Krings, Gautier. 2012. “Extraction of information from large networks.” PhD thesis, Université catholique de Louvain.

Watts, Duncan J., and Steven H. Strogatz. 1998. “Collective dynamics of "small-world" networks.” Nature 393:440-442. doi:10.1038/30918.

NetworkX. 2015. “Drawing - NetworkX 1.10 documentation.” Last modified October 26. https://networkx.github.io/documentation/networkx-1.10/reference/drawing.html.

6. Submit your notebook

Please make sure that you:

  • Perform a final "Save and Checkpoint";
  • Download a copy of the notebook in ".ipynb" format to your local machine using "File", "Download as", and "IPython Notebook (.ipynb)"; and
  • Submit a copy of this file to the Online Campus.