In this chapter, we will introduce you to the NetworkX API. This will allow you to create and manipulate graphs in your computer memory, thus giving you a language to more concretely explore graph theory ideas.
Throughout the book, we will be using different graph datasets to help us anchor ideas. In this section, we will work with a social network of seventh graders. Here, nodes are individual students, and edges represent their relationships. Edges between individuals show how often the seventh graders indicated other seventh graders as their favourite.
The data are taken from the Konect graph data repository
In NetworkX, graph data are stored in a dictionary-like fashion.
They are placed under a Graph
object,
canonically instantiated with the variable G
as follows:
G = nx.Graph()
Of course, you are free to name the graph anything you want!
Nodes are part of the attribute G.nodes
.
There, the node data are housed in a dictionary-like container,
where the key is the node ID
and the values are a dictionary of attributes.
Node data are accessible using syntax that looks like:
G.nodes[node1]
Edges are part of the attribute G.edges
,
which is also stored in a dictionary-like container.
Edge data are accessible using syntax that looks like:
G.edges[node1, node2]
Because of the dictionary-like implementation of the graph, any hashable object can be a node. This means strings and tuples, but not lists and sets.
Let's load some real network data to get a feel for the NetworkX API. This dataset comes from a study of 7th grade students.
This directed network contains proximity ratings between students from 29 seventh grade students from a school in Victoria. Among other questions the students were asked to nominate their preferred classmates for three different activities. A node represents a student. An edge between two nodes shows that the left student picked the right student as his or her answer. The edge weights are between 1 and 3 and show how often the left student chose the right student as his/her favourite.
In the original dataset, students were from an all-boys school. However, I have modified the dataset to instead be a mixed-gender school.
In [ ]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import networkx as nx
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import warnings
from nams import load_data as cf
warnings.filterwarnings('ignore')
In [ ]:
G = cf.load_seventh_grader_network()
When you get graph data, one of the first things you'll want to do is to check its basic graph statistics: the number of nodes and the number of edges that are represented in the graph. This is a basic sanity-check on your data that you don't want to skip out on.
The first thing you need to know is the type
of the graph:
In [ ]:
type(G)
Because the graph is a DiGraph
,
this tells us that the graph is a directed one.
If it were undirected, the type would change:
In [ ]:
H = nx.Graph()
type(H)
In [ ]:
list(G.nodes())[0:5]
G.nodes()
returns a "view" on the nodes.
We can't actually slice into the view and grab out a sub-selection,
but we can at least see what nodes are present.
For brevity, we have sliced into G.nodes()
passed into a list()
constructor,
so that we don't pollute the output.
Because a NodeView
is iterable, though,
we can query it for its length:
In [ ]:
len(G.nodes())
If our nodes have metadata attached to them,
we can view the metadata at the same time
by passing in data=True
:
In [ ]:
list(G.nodes(data=True))[0:5]
G.nodes(data=True) returns a NodeDataView
,
which you can see is dictionary-like.
Additionally, we can select out individual nodes:
In [ ]:
G.nodes[1]
Now, because a NodeDataView
is dictionary-like,
looping over G.nodes(data=True)
is very much like looping over key-value pairs of a dictionary.
As such, we can write things like:
for n, d in G.nodes(data=True):
# n is the node
# d is the metadata dictionary
...
This is analogous to how we would loop over a dictionary:
for k, v in dictionary.items():
# do stuff in the loop
Naturally, this leads us to our first exercise.
Can you count how many males and females are represented in the graph?
In [ ]:
from nams.solutions.intro import node_metadata
#### REPLACE THE NEXT LINE WITH YOUR ANSWER
mf_counts = node_metadata(G)
Test your implementation by checking it against the test_answer
function below.
In [ ]:
from typing import Dict
def test_answer(mf_counts: Dict):
assert mf_counts['female'] == 17
assert mf_counts['male'] == 12
test_answer(mf_counts)
With this dictionary-like syntax, we can query back the metadata that's associated with any node.
In [ ]:
list(G.edges())[0:5]
Similar to the NodeView
, G.edges()
returns an EdgeView
that is also iterable.
As with above, we have abbreviated the output inside a sliced list
to keep things readable.
Because G.edges()
is iterable, we can get its length to see the number of edges
that are present in a graph.
In [ ]:
len(G.edges())
Likewise, we can also query for all of the edge's metadata:
In [ ]:
list(G.edges(data=True))[0:5]
Additionally, it is possible for us to select out individual edges, as long as they exist in the graph:
In [ ]:
G.edges[15, 10]
This yields the metadata dictionary for that edge.
If the edge does not exist, then we get an error:
>>> G.edges[15, 16]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-21-ce014cab875a> in <module>
----> 1 G.edges[15, 16]
~/anaconda/envs/nams/lib/python3.7/site-packages/networkx/classes/reportviews.py in __getitem__(self, e)
928 def __getitem__(self, e):
929 u, v = e
--> 930 return self._adjdict[u][v]
931
932 # EdgeDataView methods
KeyError: 16
As with the NodeDataView
, the EdgeDataView
is dictionary-like,
with the difference being that the keys are 2-tuple-like
instead of being single hashable objects.
Thus, we can write syntax like the following to loop over the edgelist:
for n1, n2, d in G.edges(data=True):
# n1, n2 are the nodes
# d is the metadata dictionary
...
Naturally, this leads us to our next exercise.
In [ ]:
from nams.solutions.intro import edge_metadata
#### REPLACE THE NEXT LINE WITH YOUR ANSWER
maxcount = edge_metadata(G)
Likewise, you can test your answer using the test function below:
In [ ]:
def test_maxcount(maxcount):
assert maxcount == 3
test_maxcount(maxcount)
Great stuff! You now know how to query a graph for:
Now, let's learn how to manipulate the graph. Specifically, we'll learn how to add nodes and edges to a graph.
The NetworkX graph API lets you add a node easily:
G.add_node(node, node_data1=some_value, node_data2=some_value)
It also allows you to add an edge easily:
G.add_edge(node1, node2, edge_data1=some_value, edge_data2=some_value)
In both cases, the keyword arguments that are passed into .add_node()
are automatically collected into the metadata dictionary.
Knowing this gives you enough knowledge to tackle the next exercise.
We found out that there are two students that we left out of the network, student no. 30 and 31. They are one male (30) and one female (31), and they are a pair that just love hanging out with one another and with individual 7 (i.e.
count=3
), in both directions per pair. Add this information to the graph.
In [ ]:
from nams.solutions.intro import adding_students
#### REPLACE THE NEXT LINE WITH YOUR ANSWER
G = adding_students(G)
You can verify that the graph has been correctly created by executing the test function below.
In [ ]:
def test_graph_integrity(G):
assert 30 in G.nodes()
assert 31 in G.nodes()
assert G.nodes[30]['gender'] == 'male'
assert G.nodes[31]['gender'] == 'female'
assert G.has_edge(30, 31)
assert G.has_edge(30, 7)
assert G.has_edge(31, 7)
assert G.edges[30, 7]['count'] == 3
assert G.edges[7, 30]['count'] == 3
assert G.edges[31, 7]['count'] == 3
assert G.edges[7, 31]['count'] == 3
assert G.edges[30, 31]['count'] == 3
assert G.edges[31, 30]['count'] == 3
print('All tests passed.')
test_graph_integrity(G)
A similar pattern can be used for edges:
[n2 for n1, n2, d in G.edges(data=True)]
or
[n2 for _, n2, d in G.edges(data=True)]
If the graph you are constructing is a directed graph, with a "source" and "sink" available, then I would recommend the following naming of variables instead:
[(sc, sk) for sc, sk, d in G.edges(data=True)]
or
[d['attr'] for sc, sk, d in G.edges(data=True)]
For a deeper look at the NetworkX API, be sure to check out the NetworkX docs.
Here's some further exercises that you can use to get some practice.
Try figuring out which students have "unrequited" friendships, that is, they have rated another student as their favourite at least once, but that other student has not rated them as their favourite at least once.
Hint: the goal here is to get a list of edges for which the reverse edge is not present.
_Hint: You may need the class method G.has_edge(n1, n2)
. This returns whether a graph has an edge between the nodes n1
and n2
._
In [ ]:
from nams.solutions.intro import unrequitted_friendships_v1
#### REPLACE THE NEXT LINE WITH YOUR ANSWER
unrequitted_friendships = unrequitted_friendships_v1(G)
assert len(unrequitted_friendships) == 124
In a previous session at ODSC East 2018, a few other class participants provided the following solutions, which you can take a look at by uncommenting the following cells.
This first one by @schwanne is the list comprehension version of the above solution:
In [ ]:
from nams.solutions.intro import unrequitted_friendships_v2
# unrequitted_friendships_v2??
This one by @end0 is a unique one involving sets.
In [ ]:
from nams.solutions.intro import unrequitted_friendships_v3
# unrequitted_friendships_v3??
In [ ]:
import nams.solutions.intro as solutions
import inspect
print(inspect.getsource(solutions))