CS446/546 Class Session 5 - Adjacency Forests

Exploring the running time for testing if there is an edge between a pair of vertices

In this exercise, we'll compare the asymptotic computational running time for testing if there is an edge between a pair of vertices, averaged over all pairs of vertices in the graph. We'll do it for a series of undirected graphs (each generated using an Barabasi-Albert model), each with 1000 vertices. We will vary the number of edges in the graph; each graph will have a different average number of vertex neighbors for a vertex (i.e., each graph will have a different average vertex degree). We will time how long it takes to test all possible pairs of vertices for whether or not there is an edge between them, for each of four different graph data structures (adjacency matrix, adjacency list, edge list, and adjacency forest).

We'll need the "bintrees" python module in order to get an implementation of a binary search tree (AVLTree is the class that we will use).


In [9]:
import numpy as np
import igraph
import timeit
import itertools
import bintrees

Now we will need to define some functions for testing a pair of vertices to see if they have an edge between them, for each of three different kinds of data structures for representing the graph.

First, we need to create a function, find_matrix, that accepts an adjacency matrix (np.matrix) gmat and a row index i and a column index j, and returns True if there is an edge between vertices i and j (or False otherwise). You'll probably want to use array indexing here.


In [10]:
def find_matrix(gmat, i, j):
    return (gmat[i,j] == 1)

Next, we need to create a function, find_adj_list, that accepts an adjacency list adj_list (which is actually a list of lists of integer vertex IDs). Your function should return True if there is an edge between vertex i and vertex j, or False otherwise). You may wisth to use the built-in keyword in.


In [11]:
def find_adj_list(adj_list, i, j):
    return j in adj_list[i]

Next, we need to create a function, find_edge_list, that accepts an edge list argument edge_list (which is actually a numpy.array of lists (each of length two) containing the vertex IDs of vertices that are connected in the graph). Your function should return True if there is an edge between vertex i and vertex j, or False otherwise). You will want to use the functions numpy.where, tolist, and the keyword in.


In [27]:
def find_edge_list(edge_list, i, j):
    return (([i,j] in edge_list) or ([j,i] in edge_list))

Next we need to create a function, find_bst_forest, that accepts an "adjacency forest" argument bst_forest (which is actually a list of objects of class bintrees.AVLTree). Your function should return True if there is an edge between vertex i and vertex j, or False otherwise). You'll want to use the class method __contains__(j).


In [13]:
def find_bst_forest(bst_forest, i, j):
    return bst_forest[i].__contains__(j)

Next, we will need a function, get_bst_forest, that can create an adjacency forest representation for a graph given an adjacency list as input. Important NOTE: I have deleted the code to create the AVL tree. You should add it.


In [14]:
def get_bst_forest(theadjlist):
    g_adj_list = theadjlist
    n = len(g_adj_list)
    theforest = []
    for i in range(0,n):        
        itree = bintrees.AVLTree()
        for j in g_adj_list[i]:
            itree.insert(j,1)
        theforest.append(itree)
    return theforest

Here is the code to run the simulation (generate the graph and obtain timing statistics). To keep the code running time reasonable, I decided to only compare the running times for the "adjacency list" and "adjacency forest" (aka "adjacency trees") graph data structures. The parameter "n" is the number of vertices (fixed at 1000) and the parameter "k" is the average vertex degree (which we will vary in this exercise). For speed, I have turned off replication (by setting nrep=1 and nsubrep=1), but you can try it with larger values of nrep to see if the results hold up (I expect they will):


In [28]:
def do_sim(n, k):

    retlist = []
    
    nrep = 1
    nsubrep = 1
    
    for _ in itertools.repeat(None, nrep):
      
        # make the random undirected graph
        g = igraph.Graph.Barabasi(n, k)       
        
        # get the graph in three different representations
        g_matrix = np.matrix(g.get_adjacency().data)

        g_adj_list = g.get_adjlist()
        
        g_bst_forest = get_bst_forest(g_adj_list)
        
        start_time = timeit.default_timer()
        
        # inner loop only needs to go from i+1 to n, since the graph is undirected        
        for _ in itertools.repeat(None, nsubrep):
            for i in range(0, n):
                for j in range(i+1, n):
                    find_matrix(g_matrix, i, j)     
        
        matrix_elapsed = timeit.default_timer() - start_time
        
        start_time = timeit.default_timer()
        
        # inner loop only needs to go from i+1 to n, since the graph is undirected        
        for _ in itertools.repeat(None, nsubrep):
            for i in range(0, n):
                for j in range(i+1, n):
                    find_adj_list(g_adj_list, i, j)     
        
        adjlist_elapsed = timeit.default_timer() - start_time
            
        start_time = timeit.default_timer()
        
        # inner loop only needs to go from i+1 to n, since the graph is undirected
        for _ in itertools.repeat(None, nsubrep):
            for i in range(0, n):
                for j in range(i+1, n):
                    j in g_bst_forest[i]
                    
        forest_elapsed = timeit.default_timer() - start_time
        
        retlist.append([matrix_elapsed, adjlist_elapsed, forest_elapsed])

        # get the results in microseconds, and make sure to divide by number of vertex pairs
    return 1000000*np.mean(np.array(retlist), axis=0)/(n*(n-1)/2)

Compare the results for differing average degree (i.e., k) values. At k=50, the "adjacency forest" method (aka "adjacency tree" method) is a bit faster than the adjacency list method. By k=100, the "adjacency forest" method is substantially faster than the "adjacency list" method.


In [29]:
do_sim(1000,5)


Out[29]:
array([ 1.13781991,  0.33787408,  1.41892885])

In [30]:
do_sim(1000,10)


Out[30]:
array([ 1.13926099,  0.52187028,  1.52568663])

In [31]:
do_sim(1000,20)


Out[31]:
array([ 1.13888197,  0.86812382,  1.63835081])

In [32]:
do_sim(1000,50)


Out[32]:
array([ 1.14737513,  1.76272713,  1.77698432])

In [33]:
do_sim(1000,100)


Out[33]:
array([ 1.14888451,  3.03352372,  1.85326944])