In [ ]:
import networkx as nx
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import warnings
from custom import load_data as cf
warnings.filterwarnings('ignore')
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
As mentioned earlier, networks, also known as graphs, are comprised of individual entities and their representatives. The technical term for these are nodes and edges, and when we draw them we typically use circles (nodes) and lines (edges).
In this notebook, we will work with a social network of seventh graders, in which nodes are individual students, and edges represent their relationships. Edges between individuals show how often the seventh graders indicated other seventh graders as their favourite.
Data credit: http://konect.uni-koblenz.de/networks/moreno_seventh
In the networkx
implementation, graph objects store their data in dictionaries.
Nodes are part of the attribute Graph.node
, which is a dictionary where the key is the node ID and the values are a dictionary of attributes.
Edges are part of the attribute Graph.edge
, which is a nested dictionary. Data are accessed as such: G.edge[node1, node2]['attr_name']
.
Because of the dictionary implementation of the graph, any hashable object can be a node. This means strings and tuples, but not lists and sets.
Let's load some real network data to get a feel for the NetworkX API. This dataset comes from a study of 7th grade students.
This directed network contains proximity ratings between studetns from 29 seventh grade students from a school in Victoria. Among other questions the students were asked to nominate their preferred classmates for three different activities. A node represents a student. An edge between two nodes shows that the left student picked the right student as his answer. The edge weights are between 1 and 3 and show how often the left student chose the right student as his favourite.
In [ ]:
G = cf.load_seventh_grader_network()
In [ ]:
len(G.nodes())
In [ ]:
# Who are represented in the network?
list(G.nodes())[0:5]
In [ ]:
len(G.nodes())
# len(G)
Let's now figure out who is connected to who in the network
In [ ]:
# Who is connected to who in the network?
# list(G.edges())[0:5]
list(G.edges())[0:5]
In [ ]:
len(G.edges())
A network, more technically known as a graph, is comprised of:
They can be represented as two lists:
Since this is a social network of people, there'll be attributes for each individual, such as a student's gender. We can grab that data off from the attributes that are stored with each node.
In [ ]:
# Let's get a list of nodes with their attributes.
list(G.nodes(data=True))[0:5]
# G.nodes(data=True)
# NetworkX will return a list of tuples in the form (node_id, attribute_dictionary)
In [ ]:
from collections import Counter
mf_counts = Counter([d['gender']
for n, d in G.nodes(data=True)])
def test_answer(mf_counts):
assert mf_counts['female'] == 17
assert mf_counts['male'] == 12
test_answer(mf_counts)
Edges can also store attributes in their attribute dictionary.
In [ ]:
list(G.edges(data=True))[0:5]
In this synthetic social network, the number of times the left student indicated that the right student was their favourite is stored in the "count" variable.
In [ ]:
# Answer
counts = [d['count'] for n1, n2, d in G.edges(data=True)]
maxcount = max(counts)
def test_maxcount(maxcount):
assert maxcount == 3
test_maxcount(maxcount)
We found out that there are two individuals that we left out of the network, individual no. 30 and 31. They are one male (30) and one female (31), and they are a pair that just love hanging out with one another and with individual 7 (count=3
), in both directions per pair. Add this information to the graph. (5 min.)
If you need more help, check out https://networkx.github.io/documentation/stable/tutorial.html
In [ ]:
# Answer
G.add_node(30, gender='male')
G.add_node(31, gender='female')
G.add_edge(30, 31, count=3)
G.add_edge(31, 30, count=3) # reverse is optional in undirected network
G.add_edge(30, 7, count=3) # but this network is directed
G.add_edge(7, 30, count=3)
G.add_edge(31, 7, count=3)
G.add_edge(7, 31, count=3)
Verify that you have added in the edges and nodes correctly by running the following cell.
In [ ]:
def test_graph_integrity(G):
assert 30 in G.nodes()
assert 31 in G.nodes()
assert G.nodes[30]['gender'] == 'male'
assert G.nodes[31]['gender'] == 'female'
assert G.has_edge(30, 31)
assert G.has_edge(30, 7)
assert G.has_edge(31, 7)
assert G.edges[30, 7]['count'] == 3
assert G.edges[7, 30]['count'] == 3
assert G.edges[31, 7]['count'] == 3
assert G.edges[7, 31]['count'] == 3
assert G.edges[30, 31]['count'] == 3
assert G.edges[31, 30]['count'] == 3
print('All tests passed.')
test_graph_integrity(G)
If you would like a challenge during the break, try figuring out which students have "unrequited" friendships, that is, they have rated another student as their favourite at least once, but that other student has not rated them as their favourite at least once.
Specifically, get a list of edges for which the reverse edge is not present.
Hint: You may need the class method G.has_edge(n1, n2)
. This returns whether a graph has an edge between the nodes n1
and n2
.
In [ ]:
unrequitted_friendships = []
for n1, n2 in G.edges():
if not G.has_edge(n2, n1):
unrequitted_friendships.append((n1, n2))
assert len(unrequitted_friendships) == 124
In a previous session at ODSC East 2018, a few other class participants provided the following solutions.
This one by @schwanne is the list comprehension version of the above solution:
In [ ]:
len([(n1, n2) for n1, n2 in G.edges() if not G.has_edge(n2, n1)])
This one by @end0 is a unique one involving sets.
In [ ]:
links = ((n1, n2) for n1, n2, d in G.edges(data=True))
reverse_links = ((n2, n1) for n1, n2, d in G.edges(data=True))
len(list(set(links) - set(reverse_links)))
A note about the tests: Testing is good practice when writing code. Well-crafted assertion statements help you program defensivel, by forcing you to explicitly state your assumptions about the code or data.
For more references on defensive programming, check out Software Carpentry's website: http://swcarpentry.github.io/python-novice-inflammation/08-defensive/
For more information on writing tests for your data, check out these slides from a lightning talk I gave at Boston Python and SciPy 2015: http://j.mp/data-test
A similar pattern can be used for edges:
[n2 for n1, n2, d in G.edges(data=True)]
or
[n2 for _, n2, d in G.edges(data=True)]
If the graph you are constructing is a directed graph, with a "source" and "sink" available, then I would recommend the following pattern:
[(sc, sk) for sc, sk, d in G.edges(data=True)]
or
[d['attr'] for sc, sk, d in G.edges(data=True)]
In [ ]:
nx.draw(G)
If the network is small enough to visualize, and the node labels are small enough to fit in a circle, then you can use the with_labels=True
argument.
In [ ]:
nx.draw(G, with_labels=True)
However, note that if the number of nodes in the graph gets really large, node-link diagrams can begin to look like massive hairballs. This is undesirable for graph visualization.
Instead, we can use a matrix to represent them. The nodes are on the x- and y- axes, and a filled square represent an edge between the nodes. This is done by using the MatrixPlot
object from nxviz
.
In [ ]:
from nxviz import MatrixPlot
m = MatrixPlot(G)
m.draw()
plt.show()
In [ ]:
from nxviz import ArcPlot
a = ArcPlot(G, node_color='gender', node_grouping='gender')
a.draw()
Let's try another visualization, the Circos plot. We can order the nodes in the Circos plot according to the node ID, but any other ordering is possible as well. Edges are drawn between two nodes.
Credit goes to Justin Zabilansky (MIT) for the implementation, Jon Charest for subsequent improvements, and nxviz
contributors for further development.
In [ ]:
from nxviz import CircosPlot
c = CircosPlot(G, node_color='gender', node_grouping='gender')
c.draw()
plt.savefig('images/seventh.png', dpi=300)
This visualization helps us highlight nodes that there are poorly connected, and others that are strongly connected.
Next up, let's try Hive Plots. HivePlots are not yet implemented in nxviz
just yet, so we're going to be using the old hiveplot
API for this. When HivePlots have been migrated over to nxviz
, its API will resemble that of the CircosPlot's.
In [ ]:
from hiveplot import HivePlot
nodes = dict()
nodes['male'] = [n for n,d in G.nodes(data=True) if d['gender'] == 'male']
nodes['female'] = [n for n,d in G.nodes(data=True) if d['gender'] == 'female']
edges = dict()
edges['group1'] = G.edges(data=True)
nodes_cmap = dict()
nodes_cmap['male'] = 'blue'
nodes_cmap['female'] = 'red'
edges_cmap = dict()
edges_cmap['group1'] = 'black'
In [ ]:
h = HivePlot(nodes, edges, nodes_cmap, edges_cmap)
h.draw()
Hive plots allow us to divide our nodes into sub-groups, and visualize the within- and between-group connectivity.
In [ ]: