Centralities

In this section, I'm going to explore how centrality works and interpret the results on a small real dataset. I'm using the Facebook dataset from SNAP https://snap.stanford.edu/data/egonets-Facebook.html. The data is included in this repository for easier access and is in edge-list format (source, target).

I'm going to use NetworkX, igraph and graph_tool to find all the centralities.


In [4]:
%matplotlib inline
import networkx as nx
import matplotlib.pyplot as plt
import operator
import timeit

In [4]:
g_fb = nx.read_edgelist('facebook_combined.txt', create_using = nx.Graph(), nodetype = int)

In [6]:
print nx.info(g_fb)


Name: 
Type: Graph
Number of nodes: 4039
Number of edges: 88234
Average degree:  43.6910

In [7]:
print nx.is_directed(g_fb)


False

Now let's find the celebrities. The most basic centrality is degree centrality, which is simply the number of edges attached to a node (for a directed graph, the sum of in-degree and out-degree).
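In normalized form this is C_D(v) = deg(v) / (n - 1): the fraction of the other n - 1 nodes that v is directly connected to.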


In [12]:
dg_centrality = nx.degree_centrality(g_fb)
sorted_dg_centrality = sorted(dg_centrality.items(), key=operator.itemgetter(1), reverse=True)
sorted_dg_centrality[:10]


Out[12]:
[(107, 0.258791480931154),
 (1684, 0.1961367013372957),
 (1912, 0.18697374938088163),
 (3437, 0.13546310054482416),
 (0, 0.08593363051015354),
 (2543, 0.07280832095096582),
 (2347, 0.07206537890044576),
 (1888, 0.0629024269440317),
 (1800, 0.06067360079247152),
 (1663, 0.058197127290737984)]

We can see that node 107 has the highest degree centrality, which means it has the most connections. We can verify this by getting the degree of node 107 to see how many friends it has.


In [13]:
nx.degree(g_fb, [107])


Out[13]:
{107: 1045}

Node 107 has 1045 friends, and dividing that by the number of nodes gives approximately the normalized degree centrality (NetworkX actually divides by n - 1, which is why the value below differs slightly from the one above).


In [14]:
float(nx.degree(g_fb, [107]).values()[0]) / g_fb.number_of_nodes()


Out[14]:
0.25872740777420156

Degree centrality might be the easiest number to calculate, but it only counts direct connections, which in a real social network might not mean much: you might have a million followers, but if the majority of them are bots, the number isn't telling you anything new.

Now let's try betweenness centrality, which counts the shortest paths that pass through each node. If a large share of the shortest paths go through you, you can be considered a bridge of your entire network.

Nodes with high betweenness are important in communication and information diffusion.
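As a quick sanity check of that interpretation, here is a toy example of my own (not part of the Facebook data): a barbell graph, two cliques joined through a single node. Every cross-clique shortest path must pass through the joining node, so it should rank first.

toy = nx.barbell_graph(5, 1)  # two 5-cliques joined through bridge node 5
toy_bt = nx.betweenness_centrality(toy)
print sorted(toy_bt.items(), key=operator.itemgetter(1), reverse=True)[:3]  # node 5 ranks first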

We will be using multiprocessing so we can parallelize the computation and distribute the load.


In [15]:
from multiprocessing import Pool  
import itertools

In [16]:
def partitions(nodes, n):
    "Partitions the nodes into n subsets"
    nodes_iter = iter(nodes)
    while True:
        partition = tuple(itertools.islice(nodes_iter,n))
        if not partition:
            return
        yield partition

In [17]:
def btwn_pool(G_tuple):
    # unpack (graph, normalized, weight, node subset) and compute betweenness
    # using only that subset as source nodes
    return nx.betweenness_centrality_source(*G_tuple)

In [18]:
def between_parallel(G, processes = None):
    p = Pool(processes=processes)
    # aim for roughly 4 chunks per worker process
    part_generator = 4*len(p._pool)
    node_partitions = list(partitions(G.nodes(), int(len(G)/part_generator)))
    num_partitions = len(node_partitions)

    # each worker computes betweenness from its own subset of source nodes
    bet_map = p.map(btwn_pool,
                        zip([G]*num_partitions,
                        [True]*num_partitions,
                        [None]*num_partitions,
                        node_partitions))

    # sum the partial scores from all workers into a single dict
    bt_c = bet_map[0]
    for bt in bet_map[1:]:
        for n in bt:
            bt_c[n] += bt[n]
    return bt_c

Let's try it with multiple processes.


In [20]:
start = timeit.default_timer()
bt = between_parallel(g_fb)
stop = timeit.default_timer()
top = 10

max_nodes =  sorted(bt.iteritems(), key = lambda v: -v[1])[:top]
bt_values = [5]*len(g_fb.nodes())
bt_colors = [0]*len(g_fb.nodes())
for max_key, max_val in max_nodes:
    bt_values[max_key] = 150
    bt_colors[max_key] = 2
    
print 'It takes {} seconds to finish'.format(stop - start)
print max_nodes


It takes 152.425301075 seconds to finish
[(107, 0.4805180785560146), (1684, 0.33779744973019543), (3437, 0.23611535735892838), (1912, 0.22929533958687434), (1085, 0.14901509211665437), (0, 0.1463059214744287), (698, 0.11533045020561006), (567, 0.09631033121856328), (58, 0.08436020590796522), (428, 0.06430906239323834)]

Now let's try with a single process.


In [22]:
start = timeit.default_timer()
bt = nx.betweenness_centrality(g_fb)
stop = timeit.default_timer()
top = 10

max_nodes =  sorted(bt.iteritems(), key = lambda v: -v[1])[:top]
bt_values = [5]*len(g_fb.nodes())
bt_colors = [0]*len(g_fb.nodes())
for max_key, max_val in max_nodes:
    bt_values[max_key] = 150
    bt_colors[max_key] = 2
 
print 'It takes {} seconds to finish'.format(stop - start)
print max_nodes


It takes 283.723123074 seconds to finish
[(107, 0.48051807855601475), (1684, 0.3377974497302), (3437, 0.23611535735892797), (1912, 0.2292953395868785), (1085, 0.14901509211665198), (0, 0.14630592147442847), (698, 0.11533045020560787), (567, 0.09631033121856226), (58, 0.08436020590796657), (428, 0.0643090623932386)]

PageRank

We're going to try the PageRank algorithm. This is the same idea Google made famous: a node's "popularity" is determined by the incoming links it receives from other popular nodes.
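To make the idea concrete, here is a minimal power-iteration sketch of my own (assuming numpy, a dense adjacency matrix, and no dangling nodes; it is an illustration, not the implementation NetworkX uses):

import numpy as np

def pagerank_sketch(A, d=0.85, iters=50):
    # A[i][j] = 1 if there is an edge i -> j; every node must have out-degree > 0
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    M = (A / A.sum(axis=1)[:, None]).T  # column-stochastic transition matrix
    r = np.ones(n) / n                  # start from a uniform rank vector
    for _ in range(iters):
        r = (1 - d) / n + d * M.dot(r)  # random-jump term + follow-a-link term
    return r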


In [24]:
g_fb_pr = nx.pagerank(g_fb)

In [25]:
top = 10
max_pagerank = sorted(g_fb_pr.iteritems(), key = lambda v: -v[1])[:top]
max_pagerank


Out[25]:
[(3437, 0.007614586844749602),
 (107, 0.006936420955866113),
 (1684, 0.006367162138306825),
 (0, 0.006289602618466542),
 (1912, 0.0038769716008844957),
 (348, 0.002348096972780577),
 (686, 0.0022193592598000193),
 (3980, 0.0021703235790099928),
 (414, 0.0018002990470702264),
 (698, 0.0013171153138368812)]

We can see that the ranking is now different: node 3437 comes out as more popular than node 107.

Who is the "Gray Cardinal"?

There's another metric for finding the most influential node, called eigenvector centrality. Put simply, if you're well connected to a lot of important people, then you're important (and influential) as well.
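The definition is recursive: a node's score is proportional to the sum of its neighbors' scores. Power iteration solves this directly, as in this minimal sketch of my own (assuming numpy and a dense adjacency matrix; NetworkX's implementation is what's actually used below):

import numpy as np

def eigenvector_sketch(A, iters=100):
    A = np.asarray(A, dtype=float)
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A.dot(x)               # each node accumulates its neighbors' scores
        x = x / np.linalg.norm(x)  # renormalize so values don't blow up
    return x                       # converges to the leading eigenvector of A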


In [26]:
g_fb_eg = nx.eigenvector_centrality(g_fb)

In [27]:
top = 10
max_eg = sorted(g_fb_eg.iteritems(), key = lambda v: -v[1])[:top]
max_eg


Out[27]:
[(1912, 0.09540688873596533),
 (2266, 0.08698328226321961),
 (2206, 0.08605240174265634),
 (2233, 0.08517341350597848),
 (2464, 0.0842787836468596),
 (2142, 0.08419312450068117),
 (2218, 0.08415574433673877),
 (2078, 0.08413617905810125),
 (2123, 0.08367142125897375),
 (1993, 0.08353243711860492)]

Now we get quite a different result: node 1912 is connected to more of the important people in the network, which makes it more influential than the rest even though it doesn't have the highest degree.

iGraph with SNAP Facebook Dataset

NetworkX is easy to install and great to start with. However, as it's written in pure Python, it's quite slow. I'm going to try igraph, which has a C core; I'm hoping it yields the same results, but faster.


In [5]:
from igraph import *
import timeit

In [6]:
igraph_fb = Graph.Read_Edgelist('facebook_combined.txt', directed=False)

In [7]:
print igraph_fb.summary()


IGRAPH U--- 4039 88234 -- 

Betweenness


In [8]:
def betweenness_centralization(G):
    vnum = G.vcount()
    if vnum < 3:
        raise ValueError("graph must have at least three vertices")
    denom = (vnum-1)*(vnum-2)

    # scale raw betweenness by 2/((n-1)(n-2)) to match NetworkX's
    # normalized values for undirected graphs
    temparr = [2*i/denom for i in G.betweenness()]
    return temparr

In [9]:
start = timeit.default_timer()
igraph_betweenness = betweenness_centralization(igraph_fb)
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 3.41120696068 seconds to finish

In [10]:
igraph_betweenness.sort(reverse=True)
print igraph_betweenness[:10]


[0.4805180785560146, 0.3377974497302, 0.23611535735892794, 0.22929533958687848, 0.14901509211665237, 0.14630592147442847, 0.11533045020560788, 0.09631033121856228, 0.08436020590796657, 0.0643090623932386]

Closeness
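Closeness centrality measures how near a node is to everyone else: C(v) = (n - 1) / Σ_u d(v, u), i.e. the reciprocal of the node's average shortest-path distance to all other nodes.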


In [11]:
start = timeit.default_timer()
igraph_closeness = igraph_fb.closeness()
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 3.21295285225 seconds to finish

In [12]:
igraph_closeness.sort(reverse=True)
print igraph_closeness[:10]


[0.45969945355191255, 0.3974018305284913, 0.3948371956585509, 0.3939127889961955, 0.39360561458231796, 0.37049270575282134, 0.36991572004397216, 0.3698479575013739, 0.3695433330282786, 0.36655773420479304]

Eigenvector


In [13]:
start = timeit.default_timer()
igraph_eg = igraph_fb.evcent()
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 0.0219130516052 seconds to finish

In [14]:
igraph_eg.sort(reverse=True)
print igraph_eg[:10]


[1.0, 0.9117190175727714, 0.901962632218424, 0.8927488203018591, 0.8833723691600294, 0.8824740496738581, 0.8820826938160473, 0.8818776189568721, 0.8770062766705519, 0.8755494858467869]
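Note that igraph scales the scores so the largest is 1.0, whereas NetworkX normalizes the vector to unit length, so the absolute values differ; the relative ordering is what's comparable.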

PageRank


In [15]:
start = timeit.default_timer()
igraph_pr = igraph_fb.pagerank()
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 0.0559718608856 seconds to finish

In [16]:
igraph_pr.sort(reverse=True)
print igraph_pr[:10]


[0.0075745665246258935, 0.006888375869731625, 0.0063084887921967805, 0.0062246948047383005, 0.0038165503710367713, 0.002317366308283276, 0.002216791818397391, 0.002156551114912813, 0.0017822888082657136, 0.00129416751155367]

We can see that igraph yields results similar to NetworkX's, but it's a lot quicker on the same machine.

Graph_tool with SNAP Facebook Dataset

I'm going to try another library, graph_tool, which is supposed to be faster than both NetworkX and igraph. graph_tool is also C++ based and can be built with OpenMP so that many of its algorithms run in parallel (note in the configuration below, though, that OpenMP is disabled in this build).


In [1]:
import sys

from graph_tool.all import *
import timeit

In [2]:
show_config()


version: 2.2.44 (commit 178add3a, Thu Jul 2 01:44:54 2015 +0200)
gcc version: 4.2.1
compilation flags:  -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/include -I/usr/local/lib/python2.7/site-packages/numpy/core/include -I/usr/local/lib/python2.7/site-packages/scipy -I/usr/local/Cellar/google-sparsehash/2.0.2/include  -DSPARSEHASH_PREFIX=sparsehash -Wall -Wextra -ftemplate-backtrace-limit=0  -DNDEBUG -std=gnu++11 -ftemplate-depth-250 -Wno-deprecated -Wno-unknown-pragmas -fvisibility=default -fvisibility-inlines-hidden -Wno-unknown-pragmas -I/usr/local/Cellar/google-sparsehash/2.0.2/include  -DSPARSEHASH_PREFIX=sparsehash  -undefined dynamic_lookup
install prefix: /usr/local/Cellar/graph-tool/2.2.44
python dir: /Library/Python/2.7/site-packages
graph filtering: True
openmp: False
uname: Darwin Noppanits-MacBook-Air.local 14.5.0 Darwin Kernel Version 14.5.0: Wed Jul 29 02:26:53 PDT 2015; root:xnu-2782.40.9~1/RELEASE_X86_64 x86_64

In [3]:
graph_tool_fb = Graph(directed=False)

In [4]:
with open('facebook_combined.txt', 'r') as f:
    for line in f:
        # vertex ids in the file are integers; convert them before adding the edge
        source, target = (int(x) for x in line.split())
        graph_tool_fb.add_edge(source, target)

In [5]:
print graph_tool_fb.num_vertices()
print graph_tool_fb.num_edges()


4039
88234

Betweenness


In [6]:
start = timeit.default_timer()
vertex_betweenness, edge_betweenness = betweenness(graph_tool_fb)
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 36.7536098957 seconds to finish

In [7]:
vertex_betweenness.a[107]


Out[7]:
0.48051807855601453

Closeness


In [8]:
start = timeit.default_timer()
v_closeness = closeness(graph_tool_fb)
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 33.3668859005 seconds to finish

In [9]:
v_closeness.a[107]


Out[9]:
0.45969945355191255

Eigenvector


In [11]:
start = timeit.default_timer()
eigen_value, v_eigenvector = eigenvector(graph_tool_fb)  # returns (largest eigenvalue, per-vertex centrality)
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 0.131196975708 seconds to finish

PageRank


In [13]:
start = timeit.default_timer()
v_pagerank = pagerank(graph_tool_fb)
stop = timeit.default_timer()
print 'It takes {} seconds to finish'.format(stop - start)


It takes 0.163510084152 seconds to finish

Information diffusion modelling

I'm going to use an information diffusion model to simulate how information travels through the graph.
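The model implemented below is a simple opinion-averaging process: each Person v keeps a fixed initial opinion i_v and a current attitude a_v, and at every step updates a_v <- (1 - alpha) * i_v + alpha * s, where s is the equally weighted average of a_w over v and its neighbors.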


In [14]:
%matplotlib inline
import random as r
import networkx as nx
import matplotlib.pyplot as plot


Couldn't import dot_parser, loading of dot files will not be possible.

In [15]:
class Person(object):
    def __init__(self, id):
        #Start with a single initial preference
        self.id = id
        self.i = r.random()
        self.a = self.i
        # alpha weights social influence against the initial opinion;
        # 0.8 means neighbors' attitudes dominate the update
        self.alpha = 0.8
    
    def __str__(self):
        return (str(self.id))
    
    def step(self):
        # loop through the neighbors and aggregate their preferences
        neighbors = g[self]
        # all nodes in the list of neighbors are equally weighted, including self
        w = 1/float((len(neighbors) + 1 ))
        s = w * self.a
        for node in neighbors:
            s += w * node.a

        # update my beliefs = initial belief plus sum of all influences
        self.a = (1 - self.alpha) * self.i + self.alpha * s

In [16]:
density = 0.9
g = nx.Graph()

## create a network of Person objects
for i in range(10):
    p = Person(i)
    g.add_node(p)
    
## this will be a simple random graph, every pair of distinct nodes has an
## equal probability of connection (skip x == y to avoid self-loops)
for x in g.nodes():
    for y in g.nodes():
        if x != y and r.random() <= density:
            g.add_edge(x,y)
            
## draw the resulting graph and color the nodes by their value
col = [n.a for n in g.nodes()]
pos = nx.spring_layout(g)
nx.draw_networkx(g, pos=pos, node_color=col)



In [17]:
## repeat for 30 time periods
for i in range(30):
    ## iterate through all nodes in the network and tell them to make a step
    for node in g.nodes():
        node.step()
        
    ## collect new attitude data, print it to the terminal and plot it.
    col = [n.a for n in g.nodes()]
    print col
    plot.plot(col)


[0.45050753135154914, 0.5888690787037506, 0.4423961312885849, 0.5300498998080747, 0.6988731674943652, 0.6795418820824117, 0.6785718311317842, 0.568814202159901, 0.5706430294584026, 0.49632118526941715]
[0.4855296157010143, 0.5856804164323358, 0.47093717928779294, 0.5244620739653998, 0.6496396099696642, 0.6411899282308859, 0.6484393994615059, 0.5720352467769754, 0.5728158036583522, 0.514839066871444]
[0.4852388874069484, 0.5800425201917674, 0.46682024795791915, 0.5160964412252317, 0.6378977198782583, 0.6329660942199506, 0.6430044848703405, 0.5710302747024006, 0.5712191949768373, 0.5141570563368826]
[0.4817926221791012, 0.5759778764271721, 0.46267707497376387, 0.5119459265022431, 0.6338082132037018, 0.6296889831451993, 0.6402899659669313, 0.5689477848037309, 0.568903307620578, 0.5118553740331476]
[0.47917210885991474, 0.5733724448077611, 0.4600881540183551, 0.509553410753053, 0.6315479888743104, 0.6276208808481694, 0.6383507073752878, 0.5671342971746759, 0.5670689867837366, 0.5100571093106294]
[0.47738726861490655, 0.5716494776372223, 0.4584098871501169, 0.5079763369938789, 0.6300398411813045, 0.6261814021492542, 0.6369663172075363, 0.5658001804905268, 0.5657674458613378, 0.5087969382812318]
[0.47616720799791845, 0.5704749917622336, 0.45727213655018917, 0.5068916613716337, 0.6289959891374665, 0.6251763111703803, 0.6359968245014774, 0.5648619722374798, 0.5648629462562351, 0.507924323127787]
[0.4753256958472638, 0.5696643249358213, 0.45648750404727034, 0.5061396670832817, 0.6282711588427362, 0.6244775014439462, 0.6353228787405164, 0.564209549012505, 0.5642360059552106, 0.5073198878271796]
[0.4747430265725202, 0.569102723893816, 0.45594394476769473, 0.5056179796082445, 0.6277681965172944, 0.6239925673757833, 0.6348553075540804, 0.5637569372186323, 0.5638013941480882, 0.5069009002741897]
[0.47433910911214694, 0.5687133388005052, 0.45556705072545256, 0.5052561432242426, 0.6274193474962474, 0.6236562377410398, 0.6345310482695293, 0.5634430652325614, 0.5635000447609191, 0.5066103790087721]
[0.4740590268561978, 0.5684433196406642, 0.4553056879672762, 0.5050052110408993, 0.6271774256084398, 0.6234230028728354, 0.634306189046178, 0.5632254120963035, 0.5632910785418146, 0.5064089190652907]
[0.4738648033004549, 0.5682560722155623, 0.45512444230147703, 0.5048311982196966, 0.6270096622168863, 0.6232612646097078, 0.6341502596739543, 0.5630744803151793, 0.5631461705667189, 0.5062692157825318]
[0.4737301176939477, 0.5681262239966826, 0.4549987558804413, 0.5047105276775358, 0.6268933256171756, 0.6231491063314711, 0.6340421296775939, 0.5629698159902224, 0.5630456833506607, 0.5061723376902101]
[0.47363671908283966, 0.5680361798861583, 0.4549115977773571, 0.5046268479381407, 0.6268126513137963, 0.6230713295275687, 0.6339671463105065, 0.5628972359187272, 0.56297599988634, 0.5061051569804691]
[0.47357195123592, 0.5679737382403129, 0.45485115745041627, 0.5045688197186189, 0.6267567072365211, 0.6230173947394265, 0.6339151486461881, 0.5628469048340012, 0.5629276774650982, 0.506058570108148]
[0.47352703757816206, 0.5679304377018448, 0.454809244738982, 0.5045285797000933, 0.6267179124793222, 0.6229799933356248, 0.633879090552071, 0.5628120024393705, 0.5628941679894082, 0.5060262641593287]
[0.47349589193029284, 0.5679004106815524, 0.4547801801158474, 0.5045006750162676, 0.626691010020662, 0.6229540571061962, 0.6338540858487557, 0.5627877991627975, 0.5628709306408046, 0.5060038614037401]
[0.473474293792463, 0.5678795882628218, 0.4547600250772673, 0.5044813243446339, 0.6266723543481089, 0.6229360714711223, 0.6338367461860795, 0.5627710152543307, 0.5628548165609253, 0.505988326077442]
[0.47345931643322287, 0.5678651488306864, 0.4547460484439353, 0.5044679055044214, 0.6266594174592056, 0.6229235992237557, 0.6338247218921751, 0.5627593763515408, 0.5628436421548206, 0.5059775530115085]
[0.4734489302929201, 0.5678551357191866, 0.45473635626306397, 0.5044586001285595, 0.6266504462953916, 0.6229149502684235, 0.6338163835728791, 0.5627513052840428, 0.5628358931953572, 0.5059700823638134]
[0.47344172796114287, 0.5678481920664878, 0.45472963516164944, 0.5044521472594167, 0.6266442251876138, 0.6229089525980652, 0.6338106013149579, 0.5627457083534964, 0.5628305196325749, 0.5059649017980091]
[0.47343673346063375, 0.5678433769485494, 0.45472497437324444, 0.5044476724783019, 0.6266399111228184, 0.6229047934772991, 0.6338065915732485, 0.5627418271281754, 0.56282679330298, 0.5059613093033355]
[0.47343326999406377, 0.5678400378759074, 0.4547217423213503, 0.5044445694142877, 0.6266369195085074, 0.6229019093098619, 0.6338038109936212, 0.5627391356690145, 0.5628242092571969, 0.5059588180661874]
[0.4734308682322342, 0.5678377223757991, 0.45471950103542663, 0.5044424175757889, 0.6266348449554792, 0.6228999092665177, 0.6338018827838697, 0.5627372692603893, 0.5628224173350278, 0.5059570905027274]
[0.4734292027159746, 0.5678361166782517, 0.45471794680215405, 0.5044409253703741, 0.6266334063441409, 0.6228985223243704, 0.6338005456551616, 0.5627359749880457, 0.5628211747157675, 0.5059558925134029]
[0.4734280477536563, 0.567835003197581, 0.4547168690096958, 0.5044398905915087, 0.626632408730412, 0.6228975605409541, 0.6337996184151761, 0.5627350774670652, 0.5628203130139298, 0.5059550617604639]
[0.4734272468380549, 0.5678342310476787, 0.45471612160800046, 0.5044391730178491, 0.626631716929141, 0.6228968935864301, 0.6337989754150027, 0.5627344550757155, 0.5628197154615813, 0.5059544856698164]
[0.47342669143828964, 0.5678336955956613, 0.45471560331777666, 0.5044386754120571, 0.6266312371953676, 0.6228964310827836, 0.6337985295226456, 0.5627340234746718, 0.5628193010852929, 0.5059540861762744]
[0.4734263062929649, 0.567833324283223, 0.4547152439062878, 0.5044383303443126, 0.6266309045210825, 0.6228961103568342, 0.6337982203158951, 0.5627337241783191, 0.5628190137335509, 0.5059538091450627]
[0.4734260392116112, 0.5678330667943682, 0.45471499467021825, 0.5044380910549993, 0.6266306738260885, 0.6228958879474856, 0.6337980058946049, 0.5627335166294539, 0.5628188144677484, 0.5059536170360944]

In [18]:
class Influencer(Person):
    def __init__(self, id):
        self.id = id
        self.i = r.random()
        self.a = 1 ## opinion is strong and immovable
        
    def step(self):
        ## influencers broadcast but never update their own attitude
        pass

In [19]:
influencers = 2
connections = 4
## add the influencers to the network and connect each to 4 random nodes
for i in range(influencers):
    inf = Influencer("Inf" + str(i))
    for x in range(connections):
        g.add_edge(r.choice(g.nodes()), inf)
        
## repeat for 30 time periods
for i in range(30):
    ## iterate through all nodes in the network and tell them to make a step
    for node in g.nodes():
        node.step()
        
    ## collect new attitude data, print it to the terminal and plot it.
    col = [n.a for n in g.nodes()]
    #print col
    plot.plot(col)


Networkx Independent Cascade Model


In [20]:
import copy
import networkx as nx
import random

In [28]:
def independent_cascade(G, seeds, steps = 0):
    """
    "Return the active nodes of each diffusion step by the independent cascade
    model

    Parameters
    -- -- -- -- -- -
    G: graph
    A NetworkX graph
    seeds: list of nodes
        The seed nodes for diffusion
    steps: integer
        The number of steps to diffuse.If steps <= 0, the diffusion runs until
        no more nodes can be activated.If steps > 0, the diffusion runs for at
        most "steps" rounds

    Returns
    -- -- -- -
    layer_i_nodes: list of list of activated nodes
    layer_i_nodes[0]: the seeds
    layer_i_nodes[k]: the nodes activated at the kth diffusion step

    Notes
    -- -- -
    When node v in G becomes active, it has a * single * chance of activating
    each currently inactive neighbor w with probability p_ {
        vw
    }

    Examples
    -- -- -- --
    >>> DG = nx.DiGraph() >>> DG.add_edges_from([(1, 2), (1, 3), (1, 5), (2, 1), (3, 2), (4, 2), (4, 3), \ >>> (4, 6), (5, 3), (5, 4), (5, 6), (6, 4), (6, 5)], act_prob = 0.2) >>> H = nx.independent_cascade(DG, [6])

    References
    -- -- -- -- --[1] David Kempe, Jon Kleinberg, and Eva Tardos.
    Influential nodes in a diffusion model
    for social networks.
    In Automata, Languages and Programming, 2005.
    """
    if type(G) == nx.MultiGraph or type(G) == nx.MultiDiGraph:
        raise Exception(\
            "independent_cascade() is not defined for graphs with multiedges.")

    # make sure the seeds are in the graph
    for s in seeds:
        if s not in G.nodes():
            raise Exception("seed", s, "is not in graph")

    # change to directed graph
    if not G.is_directed():
        DG = G.to_directed()
    else:
        DG = copy.deepcopy(G)

    # init activation probabilities
    for e in DG.edges():
        if 'act_prob' not in DG[e[0]][e[1]]:
            DG[e[0]][e[1]]['act_prob'] = 0.1
        elif DG[e[0]][e[1]]['act_prob'] > 1:
            raise Exception("edge activation probability:", DG[e[0]][e[1]]['act_prob'], "cannot be larger than 1")

    # perform diffusion
    A = copy.deepcopy(seeds)  # prevent side effect
    if steps <= 0:
        # perform diffusion until no more nodes can be activated
        return _diffuse_all(DG, A)
    # perform diffusion for at most "steps" rounds
    return _diffuse_k_rounds(DG, A, steps)

In [29]:
def _diffuse_all(G, A):
    tried_edges = set()
    layer_i_nodes = [ ]
    layer_i_nodes.append([i for i in A])  # prevent side effect
    while True:
        len_old = len(A)
        (A, activated_nodes_of_this_round, cur_tried_edges) = _diffuse_one_round(G, A, tried_edges)
        layer_i_nodes.append(activated_nodes_of_this_round)
        tried_edges = tried_edges.union(cur_tried_edges)
        if len(A) == len_old:
            break
    return layer_i_nodes

In [30]:
def _diffuse_k_rounds(G, A, steps):
    tried_edges = set()
    layer_i_nodes = [ ]
    layer_i_nodes.append([i for i in A])
    while steps > 0 and len(A) < len(G):
        len_old = len(A)
        (A, activated_nodes_of_this_round, cur_tried_edges) = _diffuse_one_round(G, A, tried_edges)
        layer_i_nodes.append(activated_nodes_of_this_round)
        tried_edges = tried_edges.union(cur_tried_edges)
        if len(A) == len_old:
            break
        steps -= 1
    return layer_i_nodes

In [31]:
def _diffuse_one_round(G, A, tried_edges):
    activated_nodes_of_this_round = set()
    cur_tried_edges = set()
    for s in A:
        for nb in G.successors(s):
            if nb in A or (s, nb) in tried_edges or (s, nb) in cur_tried_edges:
                continue
            if _prop_success(G, s, nb):
                activated_nodes_of_this_round.add(nb)
            cur_tried_edges.add((s, nb))
    activated_nodes_of_this_round = list(activated_nodes_of_this_round)
    A.extend(activated_nodes_of_this_round)
    return A, activated_nodes_of_this_round, cur_tried_edges

def _prop_success(G, src, dest):
    return random.random() <= G[src][dest]['act_prob']

In [32]:
run_times = 10

G = nx.DiGraph()
G.add_edge(1,2,act_prob=.5)
G.add_edge(2,1,act_prob=.5)
G.add_edge(1,3,act_prob=.2)
G.add_edge(3,1,act_prob=.2)
G.add_edge(2,3,act_prob=.3)
G.add_edge(2,4,act_prob=.5)
G.add_edge(3,4,act_prob=.1)
G.add_edge(3,5,act_prob=.2)
G.add_edge(4,5,act_prob=.2)
G.add_edge(5,6,act_prob=.6)
G.add_edge(6,5,act_prob=.6)
G.add_edge(6,4,act_prob=.3)
G.add_edge(6,2,act_prob=.4)
nx.draw_networkx(G)



In [33]:
independent_cascade(G, [1], steps=0)


Out[33]:
[[1], []]
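Here the cascade died out immediately: with steps=0 the diffusion runs until nothing new activates, and in this run node 1 failed both of its single activation attempts, on node 2 (probability 0.5) and node 3 (probability 0.2).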

In [45]:
n_A = 0.0
for i in range(run_times):
    A = independent_cascade(G, [1], steps=1)
    print A
    for layer in A:
        n_A += len(layer)
n_A / run_times
#assert_almost_equal(n_A / run_times, 1.7, places=1)


[[1], []]
[[1], [2]]
[[1], [2]]
[[1], []]
[[1], [2, 3]]
[[1], [2, 3]]
[[1], []]
[[1], []]
[[1], []]
[[1], []]
Out[45]:
1.6
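This matches the expectation: starting from seed 1, the expected number of active nodes after one round is 1 + 0.5 + 0.2 = 1.7 (the seed itself plus the activation probabilities of edges (1, 2) and (1, 3)), and the empirical average of 1.6 over 10 runs is in line with the 1.7 the commented assertion refers to.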

Conclusion

Both igraph and graph_tool are dramatically faster than NetworkX and agree on the results; igraph's betweenness was quickest here, likely because OpenMP is disabled in this graph_tool build. For the real data I'm going to use graph_tool, as its documentation is more comprehensive.