In [ ]:
import networkx as nx
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Nodes and Edges: How do we represent relationships between individuals using NetworkX?

As mentioned earlier, networks, also known as graphs, are composed of individual entities and the relationships between them. The technical terms for these are nodes and edges, and when we draw them we typically use circles (nodes) and lines (edges).

In this notebook, we will work with a synthetic (i.e. simulated) social network, in which nodes are individual people and edges represent their relationships. If two nodes have an edge between them, then those two individuals know one another.

Data Representation

In the networkx implementation, graph objects store their data in dictionaries.

Nodes are part of the attribute Graph.node, which is a dictionary where the key is the node ID and the values are a dictionary of attributes.

Edges are part of the attribute Graph.edge, which is a nested dictionary. Data are accessed as such: G.edge[node1][node2]['attr_name'].

Because the graph is implemented with dictionaries, any hashable object can be a node. This means strings and tuples can be nodes, but lists and sets cannot.
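To make the layout concrete, here is a minimal plain-Python sketch of a dictionary-backed graph. It is only an illustration of the idea, not NetworkX's actual code; the names node_attrs, adjacency, add_node and add_edge are made up for this example.

```python
# Hypothetical plain-dict mimic of the layout described above (not NetworkX code).
node_attrs = {}   # node ID -> attribute dictionary
adjacency = {}    # node1 -> {node2 -> edge attribute dictionary}


def add_node(n, **attrs):
    # Using n as a dictionary key is what forces nodes to be hashable.
    node_attrs.setdefault(n, {}).update(attrs)
    adjacency.setdefault(n, {})


def add_edge(n1, n2, **attrs):
    add_node(n1)
    add_node(n2)
    # Undirected: both directions share one attribute dictionary.
    data = adjacency[n1].get(n2, {})
    data.update(attrs)
    adjacency[n1][n2] = data
    adjacency[n2][n1] = data


add_node('alice', age=30)
add_node((0, 1), age=25)              # a tuple is hashable, so it can be a node
add_edge('alice', (0, 1), weight=2)

print(adjacency['alice'][(0, 1)]['weight'])

try:
    add_node(['a', 'list'])           # a list is unhashable
except TypeError as err:
    print('cannot use a list as a node:', err)
```

The nested lookup adjacency[node1][node2]['attr_name'] mirrors the access pattern described above.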

Synthetic Social Network

With this synthetic social network, we will attempt to answer the following basic questions using the NetworkX API:

  1. How many people are present in the network?
  2. What is the distribution of attributes of the people in this network?
  3. How many relationships are represented in the network?
  4. What is the distribution of the number of friends that each person has?

First off, let's load up the synthetic social network. This will walk you through some of the basics of NetworkX.

For those who are interested, I simply created an Erdős-Rényi graph with n=30 and p=0.1. I used randomized functions that I wrote to generate attributes and append them to each node and edge. I then pickled the graph to disk.
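For the curious, a graph of this kind could be generated roughly as follows. This is a sketch and not the script I actually used: the seed, the age range, the date range and the edge attribute name 'date' are all assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta

import networkx as nx

random.seed(42)  # arbitrary seed so the sketch is repeatable

# Erdős-Rényi graph: 30 nodes, each possible edge present with probability 0.1
G_demo = nx.erdos_renyi_graph(n=30, p=0.1, seed=42)

# Re-adding an existing node with keyword arguments updates its attributes.
for n in list(G_demo.nodes()):
    G_demo.add_node(n,
                    age=random.randint(18, 40),               # assumed age range
                    sex=random.choice(['Male', 'Female']))

# Re-adding an existing edge with keyword arguments updates its attributes.
start = datetime(2000, 1, 1)                                  # assumed start date
for u, v in list(G_demo.edges()):
    G_demo.add_edge(u, v, date=start + timedelta(days=random.randint(0, 3650)))

print(G_demo.number_of_nodes(), G_demo.number_of_edges())
```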


In [ ]:
G = nx.read_gpickle('Synthetic Social Network.pkl')  # If you are using Python 2.7, read in 'Synthetic Social Network 27.pkl' instead
nx.draw(G)

Basic Network Statistics

Let's first understand how many people and relationships are represented in the network.


In [ ]:
# Who is represented in the network?
G.nodes()

Exercise

Can you write a single line of code that returns the number of individuals represented?


In [ ]:


In [ ]:
# Who is connected to whom in the network?
G.edges()

Exercise

Can you write a single line of code that returns the number of relationships represented?


In [ ]:
len(G.edges())
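Besides taking the length of the node and edge lists, NetworkX graphs also have dedicated counting methods. The sketch below uses a small made-up graph (demo) rather than the synthetic social network, so the numbers are easy to check by hand.

```python
import networkx as nx

# Small hypothetical graph, used only to illustrate the calls
demo = nx.Graph()
demo.add_edges_from([(1, 2), (2, 3), (3, 1), (3, 4)])

# Two equivalent ways of counting nodes
print(len(demo.nodes()), demo.number_of_nodes())

# Two equivalent ways of counting edges
print(len(demo.edges()), demo.number_of_edges())
```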

Since this is a social network of people, each individual has attributes, such as age and sex. We can read that data from the attributes that are stored with each node.


In [ ]:
# Let's get a list of nodes with their attributes.
G.nodes(data=True)

# NetworkX will return a list of tuples in the form (node_id, attribute_dictionary)

Exercise

Can you count how many males and females are represented in the graph?

Hint: You may want to use the Counter object from the collections module.


In [ ]:
from collections import Counter
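If you have not used Counter before, here is how it behaves on a small stand-in for the output of G.nodes(data=True). The three entries in node_data are made up for illustration; with the real graph you would iterate over G.nodes(data=True) instead.

```python
from collections import Counter

# Hypothetical stand-in for the (node_id, attribute_dictionary) tuples
# returned by G.nodes(data=True)
node_data = [
    (0, {'age': 21, 'sex': 'Male'}),
    (1, {'age': 34, 'sex': 'Female'}),
    (2, {'age': 28, 'sex': 'Female'}),
]

# Counter tallies how often each value of the 'sex' attribute appears.
sex_counts = Counter(d['sex'] for _, d in node_data)
print(sex_counts)
```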

Edges can also store attributes in their attribute dictionary.


In [ ]:
G.edges(data=True)

In this synthetic social network, I have stored the date of each relationship as a datetime object. Datetime objects have attributes such as .year, .month and .day.

Exercise

Can you figure out the range of dates during which these relationships were forged?
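As a hint, datetime objects can be compared directly, so min() and max() work on them. The sketch below uses a made-up stand-in for G.edges(data=True); the attribute key 'date' is an assumption, so check the real edge dictionaries for the actual key.

```python
from datetime import datetime

# Hypothetical stand-in for the (node1, node2, attribute_dictionary) tuples
# returned by G.edges(data=True); the 'date' key is assumed
edge_data = [
    (0, 1, {'date': datetime(2005, 3, 14)}),
    (1, 2, {'date': datetime(2001, 7, 2)}),
    (0, 2, {'date': datetime(2009, 11, 30)}),
]

dates = [d['date'] for _, _, d in edge_data]

# datetime objects are ordered chronologically
earliest, latest = min(dates), max(dates)
print(earliest.year, latest.year)
```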


In [ ]:

Exercise

We found out that two individuals were left out of the network: individuals no. 31 and 32. Individual 31 is male and 22 years old, and individual 32 is female and 24 years old. They got to know each other on 2010-01-09, and both of them got to know individual 7 on 2009-12-11. Use the functions G.add_node() and G.add_edge() to introduce this data into the network.

If you need more help, check out https://networkx.github.io/documentation/latest/tutorial/tutorial.html


In [ ]:
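G.add_node(31, age=22, sex='Male')
G.add_node(32, age=24, sex='Female')

G.add_edge(31, 32, datetime=datetime(2010, 1, 9))
G.add_edge(31, 7, datetime=datetime(2009, 12, 11))
G.add_edge(32, 7, datetime=datetime(2009, 12, 11))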

Live Exercise

While we're on the matter of graph construction, let's take a look at our tutorial class. On your sheet of paper, you should have a list of names. These are the people whose names you knew prior to coming to class.

As we iterate over the class, I would like you to holler out your name, your nationality, and in a very slow fashion, the names of the people who you knew in the class.


In [ ]:
## You may choose to join me in this endeavor together.

ptG = nx.DiGraph() #ptG stands for PyCon Tutorial Graph.

# Add in nodes and edges
nodes = [('Eric', {'status':'In School'}),
('Arya', {'status':'In School'}),
('Sofiya', {'status':'In School'}),
('Christa', {'status':'In School'}),
('Bhavin',{'status':'Working'}),
('Lichao',{'status':'Working'}),
('Geoff',{'status':'Working'}),
('Thomas', {'status':'Working'}),
('Janet',{'status':'Working'}),
('Russ',{'status':'Working'}),
('Brian',{'status':'Working'}),
('Roman', {'status':'Working'}),
('Aditi',{'status':'Working'}),
('Horacio',{'status':'Working'}),
('Dave', {'status':'Working'}),
('Bob',{'status':'Working'}),
('Daniel',{'status':'Working'}),
('Jeremy',{'status':'Working'}),
('Eunyoung', {'status':'Working'}),
('Pranitha', {'status':'In School'}),
('Perry', {'status':'In School'}),
('Dylan',{'status':'In School'}),
('Emily',{'status':'Working'}),
('Pratham', {'status':'Working'}),
('Lauren',{'status':'Working'}),
('Jing',{'status':'Working'}),
('Dan',{'status':'Working'}),
('Sawan', {'status':'Working'}),
('Jon', {'status':'Working'}),
('Paul', {'status':'In School'}),
('Hideki', {'status':'Working'}),
('Jeff',{'status':'Working'}),
('en', {'status':'Working'}),
('Shleifer', {'status':'Working'}),
('Ofer',{'status':'Working'})]

ptG.add_nodes_from(nodes)

In [ ]:
edges = [('Eric','Dan'), ('Eric', 'Ji'), ('Sawan','Eunyoung'), ('Dave', 'Jeff'), ('Sawan','Paul'), ('Karl','Lichao'), ('Brian','Lichao'), ('Geoff','Janet'), ('Janet','Geoff'), ('Janet','LP'), ('en', 'Perry',), ('Horacio','Lauren'), ('Bhavin','Sam'), ('Bob','Bryan'), ('Dylan','Lauren'), ('Daniel','Matt'), ('Arya', 'Christa'), ('Arya', 'Sofiya'), ('Sofiya', 'Christa'), ('Perry','En'), ('Roman', 'Lauren'), ('Jing','Emily'), ('Jing','Jeremy'), ('Karthik','Thomas'), ('Paul', 'Ofer'), ('Karthik','Working'), ('Jeremy','Hideki'), ('Jeremy','Jing'), ('Dan','Justine'), ('Russ','Andy'), ('en', 'Eric')]
ptG.add_edges_from(edges)

In [ ]:
# Impute status "Unknown" for nodes that were only introduced through an edge
for n, d in ptG.nodes(data=True):
    if 'status' not in d:
        d['status'] = 'Unknown'  # d is the node's own attribute dictionary

In [ ]:
# We are now going to draw the network using a hive plot, grouping the nodes by status: 'Working', 'In School',
# and 'Unknown' for the third group.

nodes = dict()
nodes['group1'] = [n for n, d in ptG.nodes(data=True) if d['status'] == 'Working'] #list comprehension here
nodes['group2'] = [n for n, d in ptG.nodes(data=True) if d['status'] == 'In School'] #list comprehension here
nodes['group3'] = [n for n, d in ptG.nodes(data=True) if d['status'] == 'Unknown'] #list comprehension here

In [ ]:
edges = dict()
edges['group1'] = [(sc, sk) for sc, sk, _ in ptG.edges(data=True)] #list comprehension here

nodes_cmap = dict()
nodes_cmap['group1'] = 'blue'
nodes_cmap['group2'] = 'green'
nodes_cmap['group3'] = 'purple'

edges_cmap = dict()
edges_cmap['group1'] = 'black'

from hiveplot import HivePlot
h = HivePlot(nodes, edges, nodes_cmap, edges_cmap)
# h.set_minor_angle(np.pi / 32) #optional
h.draw()

Coding Patterns

These are some recommended coding patterns when doing network analysis using NetworkX, which stem from my roughly two years of experience with the package.

Iterating using List Comprehensions

I would recommend that you use the following for compactness:

[d['attr'] for n, d in G.nodes(data=True)]

And if the node is unimportant, you can do:

[d['attr'] for _, d in G.nodes(data=True)]

Iterating over Edges using List Comprehensions

A similar pattern can be used for edges:

[n2 for n1, n2, d in G.edges(data=True)]

or

[n2 for _, n2, d in G.edges(data=True)]

If the graph you are constructing is a directed graph, with a "source" and "sink" available, then I would recommend the following pattern:

[(sc, sk) for sc, sk, d in G.edges(data=True)]

or

[d['attr'] for sc, sk, d in G.edges(data=True)]
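Here are the patterns above applied to a tiny directed graph. The graph demo and its attribute name 'attr' are made up for illustration; substitute your own graph and attribute names.

```python
import networkx as nx

# Tiny hypothetical directed graph: a -> b -> c
demo = nx.DiGraph()
demo.add_node('a', attr=1)
demo.add_node('b', attr=2)
demo.add_node('c', attr=3)
demo.add_edge('a', 'b', attr='x')
demo.add_edge('b', 'c', attr='y')

# Node attribute values, ignoring the node ID
node_attrs = sorted(d['attr'] for _, d in demo.nodes(data=True))

# Sink of every edge, ignoring the source and the edge data
sinks = sorted(sk for _, sk, _ in demo.edges(data=True))

# Edge attribute values, with source and sink named for readability
edge_attrs = sorted(d['attr'] for sc, sk, d in demo.edges(data=True))

print(node_attrs, sinks, edge_attrs)
```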

Drawing Graphs

As illustrated above, we can draw graphs using the nx.draw() function. The most popular format for drawing graphs is the node-link diagram.


In [ ]:
nx.draw(G)

If the network is small enough to visualize, and the node labels are small enough to fit in a circle, then you can use the with_labels=True argument.


In [ ]:
nx.draw(G, with_labels=True)

However, note that if the number of nodes in the graph gets really large, node-link diagrams can begin to look like massive hairballs. This is undesirable for graph visualization.

Instead, we can use a matrix to represent them. The nodes are on the x- and y-axes, and a filled square represents an edge between two nodes. This is done by using the nx.to_numpy_matrix(G) function.

We then use matplotlib's pcolor(numpy_array) function to plot. Because pcolor cannot take in numpy matrices, we will cast the matrix as an array of arrays, and then get pcolor to plot it.


In [ ]:
matrix = nx.to_numpy_matrix(G)

plt.pcolor(np.array(matrix))
plt.gca().set_aspect('equal') # set aspect ratio on the current axes to get a square visualization
plt.xlim(min(G.nodes()), max(G.nodes())) # set x and y limits to the range of node IDs present.
plt.ylim(min(G.nodes()), max(G.nodes()))
plt.title('Adjacency Matrix')
plt.show()
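A note on versions: recent NetworkX releases no longer ship nx.to_numpy_matrix. In NetworkX 2.0 and later, nx.to_numpy_array returns a plain NumPy array directly, so no cast is needed before plotting. The sketch below uses a small made-up graph (demo) and skips the plotting step.

```python
import networkx as nx
import numpy as np

# Small hypothetical graph, used only to illustrate the call
demo = nx.Graph()
demo.add_edges_from([(0, 1), (1, 2)])
demo.add_node(3)  # isolated node: its row and column stay empty

# Fix the row/column order explicitly so it matches the node IDs
order = sorted(demo.nodes())
A = nx.to_numpy_array(demo, nodelist=order)

print(A.shape)
print(np.array_equal(A, A.T))  # an undirected graph gives a symmetric matrix
```

The resulting array A can be passed straight to plt.pcolor(A).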

Let's try another visualization, the Circos plot. We can order the nodes in the Circos plot according to the node ID, but any other ordering is possible as well. Edges are drawn between two nodes.

Credit goes to Justin Zabilansky (MIT) for the implementation.


In [ ]:
from circos import CircosPlot

fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)

nodes = sorted(G.nodes())
edges = G.edges()

c = CircosPlot(nodes, edges, radius=10, ax=ax)
c.draw()

It's pretty obvious in this visualization that there are nodes, such as nodes 5 and 18, that are not connected to any other node via an edge. There are other nodes, like node 19, which are highly connected to other nodes.
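The same observations can be checked numerically from node degrees, which also answers the earlier question about the distribution of the number of friends. The sketch below uses a small made-up graph (demo), not the synthetic social network, so the node IDs do not correspond to the ones in the plot.

```python
from collections import Counter

import networkx as nx

# Small hypothetical graph: nodes 4 and 5 have no edges
demo = nx.Graph()
demo.add_nodes_from(range(6))
demo.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2)])

# Map each node to its number of neighbours
degrees = dict(demo.degree())

# Nodes without any edge
isolated = sorted(n for n, k in degrees.items() if k == 0)

# The most highly connected node
hub = max(degrees, key=degrees.get)

# How many nodes have each degree
degree_hist = Counter(degrees.values())

print(isolated, hub, degree_hist)
```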


In [ ]: