In [1]:
name = '2018-02-22-networkx'
title = 'Introduction to networkx'
tags = 'networkx, dataviz'
author = 'Maria Zamyatina'

In [2]:
from nb_tools import connect_notebook_to_post
from IPython.core.display import HTML

html = connect_notebook_to_post(name, title, tags, author)

NetworkX is a Python package for creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

A complex network is a graph (network) elements of which are connected neither purely regular nor purely random. Some cool examples:

All these networks have different sizes, densities, degrees of centrality, clustarisation and so on, and one of the ways to study these properties is to use networkx.


In [3]:
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

There are two major components in a network - nodes (or vertices) and edges connecting nodes. They are not specified as networkx objects, leaving you to use something meaningful as nodes and edges. Nodes can be anything: strings, floats, files, etc. Edges are the relationships between nodes, so they can be directed or undirected, and can hold arbitrary data (e.g., weights).

Constructing network

The graph can be grown in several ways. The simplest is to add one or more nodes and edges at time, without using any graph generator functions. Note that networkx automatically adds a node if there is an edge containing this node.


In [4]:
G = nx.Graph() # create an empty graph
G.add_node('Luke') # add one node
G.add_nodes_from(['Leia', 'Han']) # add multiple nodes
G.add_edge('Luke', 'Leia') # add one edge
G.add_edges_from([('Luke', 'Han'), ('Leia', 'Han'), ('Leia', 'Lando')]) # add multiple edges
G.remove_edge('Leia', 'Lando') # remove one edge
G.remove_node('Lando') # remove one node
nx.draw(G, with_labels=True, node_size=3000) # draw network


Depending on the nature of your task, you may decide to use different types of graphs. There are:

  • Graph() - undirected graph, ignores multiple edges between two nodes
  • DiGraph() - graph with directed edges
  • MultiGraph() - allows multiple undirected edges between pairs of nodes
  • MultiDiGraph() - directed version of a MultiGraph()

In [5]:
G = nx.DiGraph() # create an empty directed graph
G.add_edges_from([('Luke', 'Han'), ('Luke', 'Leia'), ('Leia', 'Han')])
nx.draw(G, with_labels=True, node_size=3000)


Data structure

networkx uses a “dictionary of dictionaries of dictionaries” as the basic network data structure. This allows fast lookup with reasonable storage for large sparse networks. To access nodes's or edges's attributes:


In [6]:
G['Luke']


Out[6]:
AtlasView({'Han': {}, 'Leia': {}})

In [7]:
G['Luke']['Leia']


Out[7]:
{}

To add an attribute to nodes or edges:


In [8]:
G.node['Luke']['Force'] = 1
G.node['Leia']['Force'] = 1
G.node['Han']['Force'] = 0
G.nodes(data=True)


Out[8]:
NodeDataView({'Han': {'Force': 0}, 'Leia': {'Force': 1}, 'Luke': {'Force': 1}})

In [9]:
G['Luke']['Han']['weight'] = 10
G['Luke']['Leia']['weight'] = 10
G['Leia']['Han']['weight'] = 10
G.edges(data=True)


Out[9]:
OutEdgeDataView([('Leia', 'Han', {'weight': 10}), ('Luke', 'Han', {'weight': 10}), ('Luke', 'Leia', {'weight': 10})])

Analysing network

The structure of the graph can be analyzed using various graph-theoretic functions. For example, to find out how many nodes are connected together in a cluster and how many clusters there are use nx.connected_components().


In [10]:
G = nx.Graph()
G.add_edges_from([('Luke', 'Han'), ('Luke', 'Leia'), ('Leia', 'Han')])
G.add_node('Lando')
list(nx.connected_components(G)) # generate a sorted list of connected components, largest first


Out[10]:
[{'Han', 'Leia', 'Luke'}, {'Lando'}]

In [11]:
nx.draw(G, with_labels=True, node_size=3000)


Or find the shortest path between point A and point Dusing Dijkstra’s algorithm:


In [12]:
G = nx.Graph()
e = [('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')]
G.add_edges_from(e)
print(nx.dijkstra_path(G, 'A', 'D'))
nx.draw(G, with_labels=True)


['A', 'C', 'D']

Drawing network

networkx is not primarily a graph drawing package but basic drawing with matplotlib as well as an interface to use the open source graphviz software package are included. The basic drawing functions essentially place the nodes on a scatterplot using the positions you provide via a dictionary or the positions are computed with a layout function. The edges are lines between those dots.


In [13]:
G = nx.complete_graph(10) # return a graph where all nodes are connected to all edges

fig, ax = plt.subplots(1, 3, figsize=(15,4))
nx.draw_random(G, ax=ax[0])
ax[0].set_title('random')

nx.draw_circular(G, ax=ax[1])
ax[1].set_title('circular')

nx.draw_spectral(G, ax=ax[2])
ax[2].set_title('spectral');



In [14]:
r = 2 # branching factor of the tree
h = 4 # height of the tree

btree = nx.balanced_tree(r,h) # return the perfectly balanced r-tree of height h
pos_neato = nx.nx_pydot.graphviz_layout(btree, prog='neato') # set initial positions for nodes using neato layout
pos_dot = nx.nx_pydot.graphviz_layout(btree, prog='dot') # set initial positions for nodes using dot layout

fig, ax = plt.subplots(1, 2, figsize=(15,4))
nx.draw(btree, pos_neato, ax=ax[0])
ax[0].set_title('neato')

nx.draw(btree, pos_dot, ax=ax[1])
ax[1].set_title('dot');


dot draws hierarchical layouts of directed graphs while neato creates spring model layouts (http://www.adp-gmbh.ch/misc/tools/graphviz/index.html).

Using networkx for analysing the Master Chemical Mechanism (MCM) and the Common Representative Intermediates (CRI) Mechanism

The idea here is to see visually by how much the CRI mechanism condenses the same subset of tropospheric chemistry (namely inorganic+C$_{1}$-C$_{5}$ alkane oxidation) in comparison with the MCM.

The input data was downloaded from the corresponding websites and pre-processed into json files following the approach described here.


In [15]:
import os
import json
import numpy as np
import collections

In [16]:
# Choose chemical mechanism
model_name = 'MCM_C1C5' # or 'CRI_C1C5
# Load reaction descriptions
eqs_json_path = '../data/'
with open(os.path.join(eqs_json_path, model_name+'.json'), 'r') as f:
    eqs = json.load(f)

In [17]:
eqs[0:3]


Out[17]:
[{'coef': '5.6D-34*N2*(TEMP/300)**(-2.6)*O2+6.0D-34*O2*(TEMP/300)**(-2.6)*O2',
  'num': 73501.55952675984,
  'prod': ['O3'],
  'reac': ['O']},
 {'coef': '8.0D-12*EXP(-2060/TEMP)',
  'num': 7.960128498945702e-15,
  'prod': ['O2', 'O2'],
  'reac': ['O', 'O3']},
 {'coef': 'KMT01',
  'num': 2.258741017875256e-12,
  'prod': ['NO2'],
  'reac': ['O', 'NO']}]

In [18]:
# Convert equations's info to nodes, edges and edge labels
nodes = []
edges = []
edge_labels = collections.OrderedDict()
major_reactants = ['CL', 'H2', 'HO2', 'NO', 'NO2', 'NO3', 'OH', 'SO2', 'SO3']
for eq in eqs:
    if len(eq['reac']) == 1:
        # photolysis
        if 'J' in eq['coef'] and len(eq['prod']) == 1:
            edge = (eq['reac'][0], eq['prod'][0])
            edges.append(edge)
            edge_labels[edge] = 'hv'
        elif 'J' in eq['coef'] and len(eq['prod']) != 1:
            for prod in eq['prod']:
                edge = (eq['reac'][0], prod)
                edges.append(edge)
                edge_labels[edge] = 'hv'
        # thermal decomposition
        elif 'J' not in eq['coef'] and len(eq['prod']) == 1:
            edge = (eq['reac'][0], eq['prod'][0])
            edges.append(edge)
            edge_labels[edge] = '' 
        elif 'J' not in eq['coef'] and len(eq['prod']) != 1:
            for prod in eq['prod']:
                edge = (eq['reac'][0], prod)
                edges.append(edge)
                edge_labels[edge] = ''
    # bimolecular
    else: # len(eq['reac']) == 2:
        reac1, reac2 = eq['reac']
        if reac1 in major_reactants and reac2 not in major_reactants:
            reac1, reac2 = reac2, reac1 # for consistency in edge labels (if possible, always use a major reactant as an edge label)
        if len(eq['prod']) == 1:
            for reac in eq['reac']:
                edge = (reac, eq['prod'][0])
                edges.append(edge)
                edge_labels[edge] = reac2
        elif len(eq['prod']) != 1:
            for reac in eq['reac']:
                for prod in eq['prod']:
                    edge = (reac, prod)
                    edges.append(edge)
                    edge_labels[edge] = reac2
    # make sure that all reactants and products have a node
    for reac in eq['reac']:
        if reac not in nodes:
            nodes.append(reac)
    for prod in eq['prod']:
        if prod not in nodes:
            nodes.append(prod)

In [19]:
# Create network layout
scheme = nx.MultiDiGraph()
scheme.add_edges_from(edges)
scheme.add_nodes_from(nodes)
pos = nx.nx_pydot.graphviz_layout(scheme)
# Choose x, y limits for the graph
if model_name.split('_')[0] == 'MCM':
    pass
if model_name.split('_')[0] == 'CRI':
    x1, x2 = -400, 400
    y1, y2 = -400, 400

In [20]:
# Draw network
fig, ax = plt.subplots(figsize=(15,15))
nx.draw_networkx_nodes(scheme, pos, ax=ax, node_color='#669999')
nx.draw_networkx_edges(scheme, pos, ax=ax, edge_color='k', width=0.1, arrows=True)

# Add node and edge labels, but it makes the graph too busy
# nx.draw_networkx_labels(scheme, pos, {k: k for k in nodes}, ax=ax, font_size=9)
# nx.draw_networkx_edge_labels(scheme, pos, edge_labels=edge_labels)

_ = ax.axis('off')
ax.collections[0].set_edgecolor('k') # change color of the nodes's outline
if model_name.split('_')[0] == 'CRI':
    ax.set_xlim(x1, x2)
    ax.set_ylim(y1, y2)


Task 1: Find the species that has the maximum number of connections.


In [21]:
d = nx.degree_centrality(scheme) # calculate degree of centrality

In [22]:
{k: d[k] for k in sorted(d.keys())[:10]} # show the first 10 entries of a dictionary with degrees of centrality


Out[22]:
{'ACETOL': 0.03231939163498099,
 'BIACET': 0.019011406844106463,
 'BIACETO': 0.017110266159695818,
 'BIACETO2': 0.039923954372623575,
 'BIACETOH': 0.009505703422053232,
 'BIACETOOH': 0.013307984790874524,
 'BUT2CHO': 0.030418250950570342,
 'BUT2CO2H': 0.0076045627376425855,
 'BUT2CO3': 0.03612167300380228,
 'BUT2CO3H': 0.009505703422053232}

In [23]:
max([val for key, val in d.items()]) # max degree of centrality in the whole network


Out[23]:
1.9315589353612168

In [24]:
# Sort a dictionary with degrees of centrality by value and print the first 11 key, value pairs
sorted_degree_centrality = [(k, d[k]) for k in sorted(d, key=d.get, reverse=True)]
for n, i in enumerate(sorted_degree_centrality[0:11]):
    print(n+1, i)


1 ('OH', 1.9315589353612168)
2 ('NO2', 1.4600760456273765)
3 ('HO2', 1.0874524714828897)
4 ('NO3', 0.6273764258555133)
5 ('NO', 0.5551330798479087)
6 ('CO', 0.3269961977186312)
7 ('HCHO', 0.18631178707224336)
8 ('CH3CHO', 0.16159695817490494)
9 ('HNO3', 0.15399239543726237)
10 ('CH3CO3', 0.11977186311787072)
11 ('O3', 0.11596958174904944)

So, unsurprisingly, the hydroxyl radical (OH) has the maximum number of interactions in the network (actually in both, the MCM and the CRI). Ozone is on 11th place in the MCM, but only 40th in the CRI.

Task 2. Find species reactions between which directly lead to ozone formation.


In [25]:
list(scheme.predecessors('O3'))


Out[25]:
['HOC3H6CO3',
 'HCOCO3',
 'IBUDIALCO3',
 'CO2C4CO3',
 'PRNO3CO3',
 'CO3C4CO3',
 'IPRHOCO3',
 'C41CO3',
 'IPRCO3',
 'O',
 'HO2',
 'CH3CO3',
 'HOC2H4CO3',
 'BUT2CO3',
 'HOIPRCO3',
 'C4H9CO3',
 'HO2C4CO3',
 'C3H7CO3',
 'HOBUT2CO3',
 'HOCH2CO3',
 'HO2C3CO3',
 'HCOCH2CO3',
 'HO2C43CO3',
 'C2H5CO3',
 'NO3CH2CO3']

Task 3. Find species that are produced directly in ozone destroying reactions.


In [26]:
list(scheme.successors('O3'))


Out[26]:
['CH2OOB',
 'CH2OOA',
 'HCHO',
 'NO2',
 'O2',
 'O1D',
 'O',
 'HO2',
 'CH3CHOOA',
 'NO3',
 'CH3CHO',
 'OH']

We were able to find the last two answers only because at the beginning we used MultiDiGraph(), which is a directed graph.

If we load the CRI mechanism and rerun the same cells, the layout of the CRI network is:


In [27]:
HTML(html)


Out[27]:

This post was written as an IPython (Jupyter) notebook. You can view or download it using nbviewer.