Efficient subgraph enumeration

  • paper: Wernicke (2006). Efficient Detection of Network Motifs.

Notation

  • graph G = (V, E)
  • all vertices in V are uniquely labeled with integers $1,...,n$ with $n = |V|$
  • $u > v$: the label of vertex $u$ is larger than that of vertex $v$
  • edges in G (directed or undirected) are identified using set notation
  • if $G$ is directed and contains both $(u,v)$ and $(v,u)$, these edges are represented by a bidirectional edge $\{u,v\}$
  • relative complement set: $B \backslash A = \{ x \in B | x \notin A \}$ (often called difference)
  • size-k subgraph: a connected subgraph induced by a node set of cardinality $k$
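
The last definition can be made concrete with a brute-force check over all k-node subsets (a standalone networkx sketch, not part of the notebook's library code):

```python
# Brute-force reading of "size-k subgraph": a connected subgraph
# induced by a node set of cardinality k.
import itertools
import networkx as nx

def size_k_subgraphs_bruteforce(graph, k):
    """All connected induced subgraphs with exactly k nodes."""
    return [graph.subgraph(nodes)
            for nodes in itertools.combinations(graph.nodes(), k)
            # connectivity is checked on the undirected view
            if nx.is_connected(graph.subgraph(nodes).to_undirected())]

path = nx.path_graph(5)  # chain 0-1-2-3-4
# only the contiguous 3-node stretches are connected when induced
print(len(size_k_subgraphs_bruteforce(path, 3)))  # → 3
```

This is exponential in $k$ and only serves to pin down the definition; ESU below avoids visiting the disconnected subsets in the first place.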

neighborhoods

  • for a set $V' \subseteq V$, its open neighborhood $N(V')$ is the set of all vertices in $V \backslash V'$ that are adjacent to at least one vertex in $V'$

In [1]:
from discoursekernels.util import print_source
from discoursekernels.subgraph_enumeration import (
    open_neighborhood, exclusive_neighborhood, enumerate_all_size_k_subgraphs,
    enumerate_all_size_k_subgraphs_parellel,
    extend_subgraph,
    enumerate_all_subgraphs_upto_size_k,
    enumerate_all_subgraphs_upto_size_k_parallel)

print_source(open_neighborhood)


Out[1]:

def open_neighborhood(graph, node_subset):
    """
    $N(V')$: returns the set of all nodes that are in the graph's node set
    (but not in the given subset) and are adjacent to at least one node
    in the subset. Based on Wernicke (2006).

    WARNING: different results for directed vs. undirected graphs
    """
    open_nbh = set()
    node_set = set(graph.nodes())
    nodes_not_in_subset = node_set - node_subset
    for node_not_in_subset in nodes_not_in_subset:
        if any(neighbor in node_subset
               for neighbor in graph.neighbors(node_not_in_subset)):
            open_nbh.add(node_not_in_subset)
    return open_nbh
  • $N_{excl}(v, V')$: for a vertex $v \in V \backslash V'$, its exclusive neighborhood w.r.t. $V'$ consists of all vertices neighboring $v$ that do not belong to $V' \cup N(V')$

In [2]:
print_source(exclusive_neighborhood)


Out[2]:

def exclusive_neighborhood(graph, node, node_subset):
    """
    given a node v that doesn't belong to the given node subset V',
    returns all nodes that are neighbors of v, but don't belong to
    the node subset V' or its open neighborhood N(V').
    Based on Wernicke (2006).

    WARNING: different results for directed vs. undirected graphs
    """
    assert node not in node_subset
    open_nbh = open_neighborhood(graph, node_subset)

    exclusive_nbh = set()
    for neighbor in graph.neighbors(node):
        if neighbor not in node_subset and neighbor not in open_nbh:
            exclusive_nbh.add(neighbor)
    return exclusive_nbh

In [3]:
from nxpd import draw
from discoursekernels.test_dependency_graph import the_man_saw_the_woman_with_the_telescope

draw(the_man_saw_the_woman_with_the_telescope, show='ipynb')


Out[3]:

In [4]:
for n, attrs in the_man_saw_the_woman_with_the_telescope.nodes(data=True):
    print n, attrs['label']


1 *
2 saw
3 man
4 the
5 woman
6 the
7 with
8 telescope
9 the

In [5]:
open_neighborhood(the_man_saw_the_woman_with_the_telescope, set([5]))


Out[5]:
{2}

In [6]:
open_neighborhood(the_man_saw_the_woman_with_the_telescope.to_undirected(), set([5]))


Out[6]:
{2, 6}

In [7]:
exclusive_neighborhood(the_man_saw_the_woman_with_the_telescope.to_undirected(), 2, set([5, 6]))


Out[7]:
{1, 3, 7}

Enumerating all size-k subgraphs

  • starting from a node $v$ of the input graph, only those nodes are added to the $V_{Extension}$ set that satisfy these properties:

    • their label must be larger than that of $v$
    • they must be neighbors of the newly added node $w$, but not of any node already in $V_{Subgraph}$
    • i.e. they must lie in the exclusive neighborhood of $w$ with respect to $V_{Subgraph}$
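
These rules are what guarantees that ESU enumerates every size-k subgraph exactly once. A self-contained sketch (Python 3; a deterministic `set.pop()` stands in for the random choice used in the library code) can be cross-checked against brute force on a small cycle:

```python
# Standalone sketch of the ESU extension rules above; simplified from,
# not identical to, the notebook's library implementation.
import itertools
import networkx as nx

def esu(graph, k):
    """Enumerate each connected induced size-k subgraph exactly once."""
    found = []

    def extend(sub, ext, v):
        if len(sub) == k:
            found.append(frozenset(sub))
            return
        ext = set(ext)
        while ext:
            w = ext.pop()  # the paper allows removing an arbitrary element
            # exclusive neighbors of w w.r.t. sub: labeled larger than v,
            # adjacent to w, and neither in sub nor adjacent to any node in sub
            excl = {u for u in graph.neighbors(w)
                    if u > v and u not in sub
                    and not any(graph.has_edge(s, u) for s in sub)}
            extend(sub | {w}, ext | excl, v)

    for v in graph.nodes():
        extend({v}, {u for u in graph.neighbors(v) if u > v}, v)
    return found

g = nx.cycle_graph(range(1, 6))   # cycle 1-2-3-4-5-1, integer labels
esu_result = esu(g, 3)
brute = {frozenset(c) for c in itertools.combinations(g, 3)
         if nx.is_connected(g.subgraph(c))}
print(len(esu_result), set(esu_result) == brute)  # → 5 True
```

The five results are exactly the five contiguous arcs of the cycle; no subgraph appears twice, which is the whole point of the label and exclusive-neighborhood restrictions.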

In [8]:
print_source(enumerate_all_size_k_subgraphs)


Out[8]:

def enumerate_all_size_k_subgraphs(graph, k):
    """
    returns all subgraphs of the given graph that have k nodes.
    The algorithm is called ``ESU`` in Wernicke (2006).
    """
    assert all(isinstance(node, int) for node in graph.nodes_iter())
    if not 1 <= k <= len(graph):
        return []

    all_subgraphs = []
    for node in graph.nodes_iter():
        extension = {neighbor for neighbor in graph.neighbors(node)
                     if neighbor > node}
        subgraphs = extend_subgraph(graph, k, {node}, extension, node)
        if isinstance(subgraphs, list):
            all_subgraphs.extend(subgraphs)
        else: # isinstance(subgraphs, nx.Graph)
            all_subgraphs.append(subgraphs)
    return all_subgraphs

In [9]:
print_source(extend_subgraph)


Out[9]:

def extend_subgraph(graph, k, subgraph_nodes, extension_nodes, node):
    """
    This function is the recursively called part of the ``ESU`` algorithm
    in Wernicke (2006).
    """
    if len(subgraph_nodes) == k:
        return graph.subgraph(subgraph_nodes)

    all_subgraphs = []
    while extension_nodes:
        # remove random node w from extension_nodes
        random_extension_node = random.choice(list(extension_nodes))
        extension_nodes.remove(random_extension_node)
        exclusive_neighbors = {neighbor for neighbor in exclusive_neighborhood(graph,
                                                                               random_extension_node,
                                                                               subgraph_nodes)
                               if neighbor > node}
        vbar_extension = extension_nodes | exclusive_neighbors # union
        extended_subgraph_nodes = subgraph_nodes | {random_extension_node}
        subgraphs = extend_subgraph(graph, k, extended_subgraph_nodes, vbar_extension, node)

        if isinstance(subgraphs, list):
            all_subgraphs.extend(subgraphs)
        else: # isinstance(subgraphs, nx.Graph)
            all_subgraphs.append(subgraphs)
    return all_subgraphs

In [10]:
import os
import networkx as nx
from discoursegraphs.readwrite import TigerDocumentGraph

TIGER_FILE = os.path.expanduser('~/corpora/potsdam-commentary-corpus-2.0.0/syntax/maz-1423.xml')
tdg = TigerDocumentGraph(TIGER_FILE)

In [11]:
tdg_digraph = nx.DiGraph(tdg)
# nx.convert_node_labels_to_integers(tdg, first_label=1, label_attribute='node_id')
tdg_with_in_labels = nx.convert_node_labels_to_integers(tdg_digraph, first_label=1, label_attribute='node_id')
# enumerate_all_size_k_subgraphs(tdg_with_in_labels, 10)

In [12]:
print_source(enumerate_all_subgraphs_upto_size_k)


Out[12]:

def enumerate_all_subgraphs_upto_size_k(document_graph, k):
    """
    returns all subgraphs of a DiscourseDocumentGraph (i.e. a MultiDiGraph)
    with up to k nodes.

    WARNING: This just calls ESU / enumerate_all_size_k_subgraphs() for
    1 ... k.

    TODO: optimize the algorithm to use the results of ESU(k-1) to calculate
    ESU(k).
    """
    document_nodes = len(document_graph)
    if k > document_nodes:
        k = document_nodes

    all_subgraphs = []
    int_graph = nx.convert_node_labels_to_integers(nx.DiGraph(document_graph),
                                                   first_label=1,
                                                   label_attribute='node_id')
    for i in xrange(1, k+1):
        size_k_subgraphs = enumerate_all_size_k_subgraphs(int_graph, i)
        all_subgraphs.extend(size_k_subgraphs)
    return all_subgraphs

In [13]:
%timeit len(enumerate_all_subgraphs_upto_size_k(tdg_with_in_labels, 5))


1 loops, best of 3: 862 ms per loop

In [14]:
# %timeit len(enumerate_all_subgraphs_upto_size_k(tdg_with_in_labels, 10))  # 20s per loop

In [15]:
from discoursegraphs import DiscourseDocumentGraph, print_dot
from discoursegraphs.readwrite import RSTGraph, MMAXDocumentGraph

In [16]:
rdg = RSTGraph(os.path.expanduser('~/corpora/potsdam-commentary-corpus-2.0.0/rst/maz-1423.rs3'))
mdg = MMAXDocumentGraph(os.path.expanduser('~/corpora/potsdam-commentary-corpus-2.0.0/coreference/maz-1423.mmax'))

In [17]:
merged_graph = DiscourseDocumentGraph()
merged_graph.merge_graphs(tdg.copy())
merged_graph.merge_graphs(rdg.copy())
merged_graph.merge_graphs(mdg.copy())

Speed test: enumerate all subgraphs up to size k

  • graph: MAZ-1423 with syntax, RST and coreference annotation
  • 356 nodes

In [18]:
%load_ext gvmagic

In [19]:
# %dotstr print_dot(merged_graph)

In [20]:
len(merged_graph)


Out[20]:
356

In [21]:
%time len(enumerate_all_subgraphs_upto_size_k(merged_graph, 5))


CPU times: user 4.22 s, sys: 64 ms, total: 4.28 s
Wall time: 4.22 s
Out[21]:
5142

In [22]:
%time len(enumerate_all_subgraphs_upto_size_k(merged_graph, 6))


CPU times: user 10.7 s, sys: 60 ms, total: 10.8 s
Wall time: 10.8 s
Out[22]:
12587

In [23]:
%time len(enumerate_all_subgraphs_upto_size_k(merged_graph, 7))


CPU times: user 29.6 s, sys: 144 ms, total: 29.8 s
Wall time: 29.6 s
Out[23]:
33942

In [24]:
%time len(enumerate_all_subgraphs_upto_size_k(merged_graph, 8))


CPU times: user 1min 24s, sys: 372 ms, total: 1min 24s
Wall time: 1min 24s
Out[24]:
100715

In [27]:
print_source(enumerate_all_size_k_subgraphs_parellel)


Out[27]:

def enumerate_all_size_k_subgraphs_parellel(graph, k, num_of_workers=4):
    """
    Trivial parallelization of the ESU algorithm from Wernicke (2006).
    """
    assert all(isinstance(node, int) for node in graph.nodes_iter())
    if not 1 <= k <= len(graph):
        return []

    pool = Pool(num_of_workers) # number of CPUs / workers
    results = [pool.apply_async(__extract_subgraphs_from_node,
                                args=(graph, node, k))
               for node in graph.nodes_iter()]

    pool.close()
    pool.join()
    return [result.get() for result in results]
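
Note that, unlike the serial driver, this returns one result list per start node, i.e. a list of lists. If a flat list matching enumerate_all_size_k_subgraphs() is wanted, the per-node results can be chained (`flatten` is a hypothetical helper, not part of the library):

```python
# Flattening per-worker result lists into one flat list, mirroring the
# extend()/append() logic of the serial ESU driver.
from itertools import chain

def flatten(results):
    """Chain per-node result lists; wrap bare items in a list first."""
    return list(chain.from_iterable(
        r if isinstance(r, list) else [r] for r in results))

print(flatten([[1, 2], 3, [4]]))  # → [1, 2, 3, 4]
```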

In [69]:
%time _x = enumerate_all_size_k_subgraphs_parellel(tdg_with_in_labels, 10)


CPU times: user 1.7 s, sys: 128 ms, total: 1.83 s
Wall time: 3.72 s

In [70]:
%time _x = enumerate_all_size_k_subgraphs(tdg_with_in_labels, 5)


CPU times: user 496 ms, sys: 4 ms, total: 500 ms
Wall time: 497 ms

In [71]:
%time _x = enumerate_all_size_k_subgraphs_parellel(tdg_with_in_labels, 7)


CPU times: user 1.72 s, sys: 140 ms, total: 1.86 s
Wall time: 2.18 s

In [72]:
%time _x = enumerate_all_size_k_subgraphs(tdg_with_in_labels, 7)


CPU times: user 1.41 s, sys: 8 ms, total: 1.42 s
Wall time: 1.42 s

In [73]:
%time _x = enumerate_all_size_k_subgraphs_parellel(tdg_with_in_labels, 10)


CPU times: user 1.72 s, sys: 200 ms, total: 1.92 s
Wall time: 3.76 s

In [74]:
%time _x = enumerate_all_size_k_subgraphs(tdg_with_in_labels, 10)


CPU times: user 5.99 s, sys: 32 ms, total: 6.02 s
Wall time: 6 s

In [75]:
%time _x = enumerate_all_size_k_subgraphs_parellel(tdg_with_in_labels, 12)


CPU times: user 2.36 s, sys: 228 ms, total: 2.59 s
Wall time: 13.5 s

In [76]:
%time _x = enumerate_all_size_k_subgraphs(tdg_with_in_labels, 12)


CPU times: user 15.9 s, sys: 64 ms, total: 15.9 s
Wall time: 15.8 s

In [77]:
print_source(enumerate_all_subgraphs_upto_size_k_parallel)


Out[77]:

def enumerate_all_subgraphs_upto_size_k_parallel(document_graph, k, num_of_workers=4):
    """
    returns all subgraphs of a DiscourseDocumentGraph (i.e. a MultiDiGraph)
    with up to k nodes. This is a trivially parallelized version of
    enumerate_all_subgraphs_upto_size_k()
    """
    document_nodes = len(document_graph)
    if k > document_nodes:
        k = document_nodes

    int_graph = nx.convert_node_labels_to_integers(nx.DiGraph(document_graph),
                                                   first_label=1,
                                                   label_attribute='node_id')

    pool = Pool(processes=num_of_workers) # number of CPUs
    results = [pool.apply_async(enumerate_all_size_k_subgraphs, args=(int_graph, i))
                for i in xrange(1, k+1)]
    pool.close()
    pool.join()

    subgraphs = []
    for result in results:
        tmp_result = result.get()
        if isinstance(tmp_result, list):
            subgraphs.extend(tmp_result)
        else:
            subgraphs.append(tmp_result)
    return subgraphs

In [78]:
%time _result = enumerate_all_subgraphs_upto_size_k_parallel(merged_graph, 3)


CPU times: user 48 ms, sys: 76 ms, total: 124 ms
Wall time: 593 ms

In [79]:
%time _result = enumerate_all_subgraphs_upto_size_k(merged_graph, 3)


CPU times: user 548 ms, sys: 20 ms, total: 568 ms
Wall time: 550 ms

In [80]:
%time _result = enumerate_all_subgraphs_upto_size_k_parallel(merged_graph, 5)


CPU times: user 244 ms, sys: 136 ms, total: 380 ms
Wall time: 3.06 s

In [81]:
%time _result = enumerate_all_subgraphs_upto_size_k_parallel(merged_graph, 5)


CPU times: user 284 ms, sys: 144 ms, total: 428 ms
Wall time: 3.36 s

In [82]:
%time _result = enumerate_all_subgraphs_upto_size_k(merged_graph, 5)


CPU times: user 4.1 s, sys: 32 ms, total: 4.14 s
Wall time: 4.11 s

In [ ]:
# %time _result = enumerate_all_subgraphs_upto_size_k_parallel(merged_graph, 10) # takes too long

In [ ]:
# %time _result = enumerate_all_subgraphs_upto_size_k_naive(merged_graph, 3)

In [ ]:
len(merged_graph)

In [ ]:
len(merged_graph) ** 3

In [ ]:
from math import factorial

def possible_combinations(num_of_elements, k):
    # https://en.wikipedia.org/wiki/Combination
    if k > num_of_elements:
        return 0
    else:
        return factorial(num_of_elements) / ( factorial(k) * factorial((num_of_elements - k)) )

In [ ]:
def possible_combinations_up_to_k(num_of_elements, k):
    assert 1 <= k <= num_of_elements
    result = 0
    for i in xrange(1, k+1):
        result += possible_combinations(num_of_elements, i)
    return result

In [ ]:
possible_combinations_up_to_k(356, 2)

In [ ]:
possible_combinations_up_to_k(356, 3)

In [ ]:
possible_combinations_up_to_k(356, 4)

In [ ]:
possible_combinations_up_to_k(356, 10)
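
These counts show why brute force over all node subsets is hopeless on a 356-node graph. The factorial-based binomial above can be sanity-checked against math.comb (a standalone Python 3.8+ sketch restating the helpers, since the notebook's versions use xrange):

```python
# Cross-check of the hand-rolled binomial against math.comb, and a look
# at how quickly the subset counts explode for n = 356.
from math import comb, factorial

def possible_combinations(n, k):
    # exact integer division; factorials always divide evenly here
    return factorial(n) // (factorial(k) * factorial(n - k)) if k <= n else 0

def possible_combinations_up_to_k(n, k):
    return sum(possible_combinations(n, i) for i in range(1, k + 1))

assert all(possible_combinations(356, k) == comb(356, k)
           for k in range(1, 11))
print(possible_combinations_up_to_k(356, 3))  # already in the millions
```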

optimizing subgraph enumeration

first try: shelve intermediate results to save memory

  • don't return subgraphs as nx.Graph() objects (return edge lists instead)
  • write each completed subgraph to disk immediately to reduce memory usage
  • use shelve instead of pickle
  • one shelve per (document, k) combination
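
The first bullet rests on the fact that an induced subgraph round-trips through a plain edge list, which pickles/shelves far more compactly than a full nx.Graph object. A standalone illustration (note the caveat: edge lists lose isolated nodes, so size-1 subgraphs would need their node sets stored too):

```python
# Round-tripping an induced subgraph through a plain edge list.
import networkx as nx

g = nx.path_graph(5)                             # chain 0-1-2-3-4
sub_nodes = [1, 2, 3]
edge_list = list(g.subgraph(sub_nodes).edges())  # cheap to shelve/pickle

restored = nx.Graph(edge_list)                   # rebuild on demand
print(sorted(restored.edges()))  # → [(1, 2), (2, 3)]
```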

In [20]:
import shelve

def enumerate_all_subgraphs_upto_size_k_shelved(document_graph, k):
    """
    """
    shelve_name = '/tmp/{}-k-{}.db'.format(document_graph.name, k)
    # writeback is important for nested dicts etc.)
    subgraph_shelve = shelve.open(shelve_name, writeback=True)
    subgraph_shelve['graph'] = document_graph
    
    document_nodes = len(document_graph)
    if k > document_nodes:
        k = document_nodes

    int_graph = nx.convert_node_labels_to_integers(nx.DiGraph(document_graph),
                                                   first_label=1,
                                                   label_attribute='node_id')
    try:
        for i in xrange(1, k+1):
            enumerate_all_size_k_subgraphs_shelved(int_graph, i, subgraph_shelve)
    finally:
        subgraph_shelve.close() # this calls .sync() as well
    return shelve_name


def enumerate_all_size_k_subgraphs_shelved(graph, k, graph_shelve):
    """
    shelve : shelve
        an open shelve to store the subgraphs in
    
    Returns
    -------
    num_of_k_subgraphs : the number of subgraphs of size-k generated
    """
    k_shelve = graph_shelve[str(k)] = dict()

    assert all(isinstance(node, int) for node in graph.nodes_iter())
    if not 1 <= k <= len(graph):
        return 0

    num_of_k_subgraphs = 1  # 1-based counter that doubles as the shelve key
    for node in graph.nodes_iter():
        extension = {neighbor for neighbor in graph.neighbors(node)
                     if neighbor > node}
        subgraphs = extend_subgraph(graph, k, {node}, extension, node)
        if isinstance(subgraphs, list):
            for subgraph in subgraphs:
                k_shelve[str(num_of_k_subgraphs)] = subgraph
                num_of_k_subgraphs += 1
        else: # isinstance(subgraphs, nx.Graph)
            k_shelve[str(num_of_k_subgraphs)] = subgraphs
            num_of_k_subgraphs += 1
    return num_of_k_subgraphs - 1  # counter is 1-based, so undo the final increment

In [63]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 3)


CPU times: user 580 ms, sys: 4 ms, total: 584 ms
Wall time: 632 ms
Out[63]:
'/tmp/maz-1423.xml-k-3.db'

In [64]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 4)


CPU times: user 1.63 s, sys: 20 ms, total: 1.65 s
Wall time: 1.68 s
Out[64]:
'/tmp/maz-1423.xml-k-4.db'

In [65]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 5)


CPU times: user 4.28 s, sys: 16 ms, total: 4.3 s
Wall time: 4.35 s
Out[65]:
'/tmp/maz-1423.xml-k-5.db'

In [67]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 6)


CPU times: user 11.3 s, sys: 32 ms, total: 11.4 s
Wall time: 11.4 s
Out[67]:
'/tmp/maz-1423.xml-k-6.db'

In [68]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 7)


CPU times: user 31 s, sys: 96 ms, total: 31.1 s
Wall time: 31.3 s
Out[68]:
'/tmp/maz-1423.xml-k-7.db'

In [21]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 8)


CPU times: user 1min 26s, sys: 644 ms, total: 1min 26s
Wall time: 1min 27s
Out[21]:
'/tmp/maz-1423.xml-k-8.db'

In [22]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 9)


CPU times: user 4min 31s, sys: 1.86 s, total: 4min 33s
Wall time: 4min 36s
Out[22]:
'/tmp/maz-1423.xml-k-9.db'

In [21]:
%time enumerate_all_subgraphs_upto_size_k_shelved(merged_graph, 10) # ca. 12mins, started: 1714


CPU times: user 14min 49s, sys: 7.81 s, total: 14min 57s
Wall time: 15min 49s
Out[21]:
'/tmp/maz-1423.xml-k-10.db'

In [1]:
import shelve

k10_shelve = shelve.open('/tmp/maz-1423.xml-k-10.db')

In [2]:
k10_shelve.keys()


Out[2]:
['2', '4', '6', '8', 'graph', '10', '1', '3', '5', '7', '9']

In [3]:
for key in k10_shelve.keys():
    print key, len(k10_shelve[key])

graph 356
1 356
2 234
3 486
4 1182
5 2884
6 7445
7 21355
8 66773
9 220405
10 752187