In [ ]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
If you remember, at the beginning of this book, we saw a quote from John Quackenbush that essentially said that the reason a graph is interesting is because of its edges. In this chapter, we'll see this in action once again, as we are going to figure out how to leverage the edges to find special structures in a graph.
The first structure that we are going to learn about is triangles. Triangles are super interesting! They are what one might consider to be "the simplest complex structure" in a graph. Triangles can also have semantically-rich meaning depending on the application. To borrow a bad example, love triangles in social networks are generally frowned upon, while on the other hand, when we connect two people that we know together, we instead complete a triangle.
To learn about triangles, we are going to leverage a physician trust network. Here's the data description:
This directed network captures innovation spread among 246 physicians for towns in Illinois, Peoria, Bloomington, Quincy and Galesburg. The data was collected in 1966. A node represents a physician and an edge between two physicians shows that the left physician told that the right physician is his friend or that he turns to the right physician if he needs advice or is interested in a discussion. There always only exists one edge between two nodes even if more than one of the listed conditions are true.
In [ ]:
from nams import load_data as cf
G = cf.load_physicians_network()
In [ ]:
from nams.solutions.structures import triangle_finding_strategies
# triangle_finding_strategies()
Let's now get down to implementing this next piece of code.
Write a function that identifies whether a node is or is not in a triangle relationship. It should take in a graph
G
and a noden
, and return a boolean True if the noden
is in any triangle relationship and boolean False if the noden
is not in any triangle relationship.
A hint that may help you:
Every graph object
G
has aG.has_edge(n1, n2)
method that you can use to identify whether a graph has an edge betweenn1
andn2
.
Also:
itertools.combinations
lets you iterate over every K-combination of items in an iterable.
In [ ]:
def in_triangle(G, node):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import in_triangle
# UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
# in_triangle??
Now, test your implementation below! The code cell will not error out if your answer is correct.
In [ ]:
from random import sample
import networkx as nx
def test_in_triangle():
nodes = sample(G.nodes(), 10)
for node in nodes:
assert in_triangle(G, 3) == bool(nx.triangles(G, 3))
test_in_triangle()
As you can see from the test function above,
NetworkX provides an nx.triangles(G, node)
function.
It returns the number of triangles that a node is involved in.
We convert it to boolean as a hack to check whether or not
a node is involved in a triangle relationship
because 0 is equivalent to boolean False
,
while any non-zero number is equivalent to boolean True
.
We're going to leverage another piece of knowledge that you already have: the ability to extract subgraphs. We'll be plotting all of the triangles that a node is involved in.
Given a node, write a function that extracts out all of the neighbors that it is in a triangle relationship with. Then, in a new function, implement code that plots only the subgraph that contains those nodes.
In [ ]:
def get_triangle_neighbors(G, n):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import get_triangle_neighbors
# UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
# get_triangle_neighbors??
In [ ]:
def plot_triangle_relations(G, n):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import plot_triangle_relations
plot_triangle_relations(G, 3)
In professional circles, making connections between two people is one of the most valuable things you can do professionally. What you do in that moment is what we would call triadic closure. Algorithmically, we can do the same thing if we maintain a graph of connections!
Essentially, what we are looking for are "open" or "unfinished" triangles".
In this section, we'll try our hand at implementing a rudimentary triadic closure system.
What graph logic would you use to identify triadic closure opportunities? Try writing out your general strategy, or discuss it with someone.
In [ ]:
from nams.solutions.structures import triadic_closure_algorithm
# UNCOMMENT FOR MY ANSWER
# triadic_closure_algorithm()
In [ ]:
def get_open_triangles_neighbors(G, n):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import get_open_triangles_neighbors
# UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
# get_open_triangles_neighbors??
In [ ]:
def plot_open_triangle_relations(G, n):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import plot_open_triangle_relations
plot_open_triangle_relations(G, 3)
Triangles are interesting in a graph theoretic setting because triangles are the simplest complex clique that exist.
But wait! What is the definition of a "clique"?
A "clique" is a set of nodes in a graph that are fully connected with one another by edges between them.
Given this definition, what is the simplest "clique" possible?
In [ ]:
from nams.solutions.structures import simplest_clique
# UNCOMMENT THE NEXT LINE TO SEE MY ANSWER
# simplest_clique()
Cliques are identified by their size $k$, which is the number of nodes that are present in the clique.
A triangle is what we would consider to be a $k$-clique where $k=3$.
A square with cross-diagonal connections is what we would consider to be a $k$-clique where $k=4$.
By now, you should get the gist of the idea.
Related to this idea of a $k$-clique is another idea called "maximal cliques".
Maximal cliques are defined as follows:
A maximal clique is a subgraph of nodes in a graph
- to which no other node can be added to it and
- still remain a clique.
NetworkX provides a way to find all maximal cliques:
In [ ]:
# I have truncated the output to the first 5 maximal cliques.
list(nx.find_cliques(G))[0:5]
Write a generator function that yields all maximal cliques of size $k$.
I'm requesting a generator as a matter of good practice; you never know when the list you return might explode in memory consumption, so generators are a cheap and easy way to reduce memory usage.
In [ ]:
def size_k_maximal_cliques(G, k):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import size_k_maximal_cliques
Now, test your implementation against the test function below.
In [ ]:
def test_size_k_maximal_cliques(G, k):
clique_generator = size_k_maximal_cliques(G, k)
for clique in clique_generator:
assert len(clique) == k
test_size_k_maximal_cliques(G, 5)
One super neat property of cliques is that every clique of size $k$ can be decomposed to the set of cliques of size $k-1$.
Does this make sense to you? If not, think about triangles (3-cliques). They can be decomposed to three edges (2-cliques).
Think again about 4-cliques. Housed within 4-cliques are four 3-cliques. Draw it out if you're still not convinced!
Knowing this property of $k$-cliques, write a generator function that yields all $k$-cliques in a graph, leveraging the
nx.find_cliques(G)
function.
Some hints to help you along:
If a $k$-clique can be decomposed to its $k-1$ cliques, it follows that the $k-1$ cliques can be decomposed into $k-2$ cliques, and so on until you hit 2-cliques. This implies that all cliques of size $k$ house cliques of size $n < k$, where $n >= 2$.
In [ ]:
def find_k_cliques(G, k):
# your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import find_k_cliques
def test_find_k_cliques(G, k):
for clique in find_k_cliques(G, k):
assert len(clique) == k
test_find_k_cliques(G, 3)
In [ ]:
from nxviz import CircosPlot
In [ ]:
c = CircosPlot(G)
c.draw()
In [ ]:
from nams.solutions.structures import visual_insights
# UNCOMMENT TO SEE MY ANSWER
# visual_insights()
From Wikipedia:
In graph theory, a connected component (or just component) of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices in the supergraph.
NetworkX provides a function to let us find all of the connected components:
In [ ]:
ccsubgraph_nodes = list(nx.connected_components(G))
Let's see how many connected component subgraphs are present:
In [ ]:
len(ccsubgraph_nodes)
In this exercise, we're going to draw a circos plot of the graph, but colour and order the nodes by their connected component subgraph.
Recall Circos API:
c = CircosPlot(G, node_order='node_attribute', node_color='node_attribute')
c.draw()
plt.show() # or plt.savefig(...)
Follow the steps along here to accomplish this.
Firstly, label the nodes with a unique identifier for connected component subgraph that it resides in. Use
subgraph
to store this piece of metadata.
In [ ]:
def label_connected_component_subgraphs(G):
# Your answer here
return G
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import label_connected_component_subgraphs
G_labelled = label_connected_component_subgraphs(G)
# UNCOMMENT TO SEE THE ANSWER
# label_connected_component_subgraphs??
Now, draw a CircosPlot with the node order and colouring dictated by the
subgraph
key.
In [ ]:
def plot_cc_subgraph(G):
# Your answer here
pass
# COMMENT OUT THE IMPORT LINE TO TEST YOUR ANSWER
from nams.solutions.structures import plot_cc_subgraph
plot_cc_subgraph(G_labelled)
Voila! It looks quite clear to me that there are indeed four disjoint group of physicians.
In [ ]:
from nams.solutions import structures
import inspect
print(inspect.getsource(structures))