PyGraphistry Example: Graphing the Marvel Universe

Install: pip install "graphistry[igraph]"

Note: pip install igraph is the wrong package. If installing manually, use python-igraph.

Uses pandas, igraph, and PyGraphistry
Combines comic book and hero data
Near the end, computes clusters and, to avoid a hairball, weakens the edge weights between nodes in different clusters



In [1]:

    
from __future__ import print_function
from io import open
import pandas as pd
import igraph # Install Igraph with pip install python-igraph
import graphistry

#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com') #https://www.graphistry.com/api-request

Load heroes, comics, appearances



In [2]:

    
with open('../../data/characters.txt', encoding="latin-1") as f:
    lines = f.readlines()
heroes = pd.DataFrame(
    list(map(lambda x: (int(x.split(':')[0].split(' ')[1]), x.split(': ', 1)[1].split('\n')[0]), lines)),
    columns=['hero_id', 'hero_name'])
print('#Heroes:', len(heroes))
heroes[:3]









    



#Heroes: 6486

    Out[2]:

       hero_id  hero_name
    0        1  24-HOUR MAN/EMMANUEL
    1        2  3-D MAN/CHARLES CHANDLER & HAROLD CHANDLER
    2        3  4-D MAN/MERCURIO



In [3]:

    
with open('../../data/comics.txt', encoding="latin-1") as f:
    lines = f.readlines()
comics = pd.DataFrame(
    list(map(lambda x: (int(x.split(':')[0].split(' ')[1]), x.split(': ', 1)[1].split('\n')[0]), lines)),
    columns=['comic_id', 'comic_name'])
print('#Comics: ', len(comics))
comics[:3]









    



#Comics:  12942

    Out[3]:

       comic_id  comic_name
    0      6487  AA2 35
    1      6488  M/PRM 35
    2      6489  M/PRM 36



In [4]:

    
with open('../../data/appearances.txt', encoding="latin-1") as f:
    # Skip the hero and comic listings (plus 2 further lines); the rest are adjacency lines
    lines = f.readlines()[len(heroes) + len(comics) + 2:]

def expand(line):
    # Each line reads "<hero_id> <comic_id> <comic_id> ..."; emit one (hero, comic) pair per comic
    parts = list(map(int, line.split()))
    return [(parts[0], comic) for comic in parts[1:]]

appearences = pd.DataFrame(
    [pair for line in lines for pair in expand(line)],
    columns=['hero', 'comic'])
appearences[:3]
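
To make the parsing step concrete, here is a minimal sketch of how one adjacency line expands into (hero, comic) pairs. The sample lines are invented for illustration; the only assumption is the "hero_id comic_id comic_id ..." layout that the cell above relies on.

```python
# Hypothetical sample lines; the real file is read from disk in the cell above.
sample_lines = ["1 6487\n", "2 6488 6489 6490\n"]

def expand_line(line):
    """Turn '<hero_id> <comic_id> ...' into a list of (hero, comic) pairs."""
    ids = [int(token) for token in line.split()]
    hero, comics = ids[0], ids[1:]
    return [(hero, comic) for comic in comics]

# Flatten the per-line lists into one list of pairs.
pairs = [pair for line in sample_lines for pair in expand_line(line)]
print(pairs)
```

Each pair becomes one row of the hero/comic dataframe.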

Link heroes who co-appear
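
The cell below links heroes with a pandas self-join on the comic column. As a rough illustration of the same idea in plain Python (toy data and hypothetical names, not the notebook's code), group heroes by comic and count each unordered pair once:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (hero, comic) appearances, for illustration only.
toy_appearances = [(1, 100), (2, 100), (3, 100), (1, 101), (2, 101)]

heroes_by_comic = defaultdict(set)
for hero, comic in toy_appearances:
    heroes_by_comic[comic].add(hero)

pair_counts = Counter()
for cast in heroes_by_comic.values():
    # Sorting in descending order puts the larger id first in every pair,
    # mirroring the hero_x > hero_y filter in the pandas version.
    for hero_x, hero_y in combinations(sorted(cast, reverse=True), 2):
        pair_counts[(hero_x, hero_y)] += 1

print(dict(pair_counts))
```

One difference to keep in mind: the set per comic ignores repeated (hero, comic) rows, whereas a self-join would count such repeats more than once.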



In [6]:

    
# You may need to install numexpr: pip install numexpr
# Self-join appearances on comic to pair up heroes who share a comic;
# hero_x > hero_y keeps each unordered pair once and drops self-pairs
coappearences = \
    appearences\
        .merge(appearences, on='comic')\
        .merge(comics, left_on='comic', right_on='comic_id')\
        [['hero_x', 'hero_y']]\
        .query('hero_x > hero_y')
# One row per hero pair; 'counts' is the number of co-appearance rows for that pair
unique_coappearences = coappearences.drop_duplicates(['hero_x', 'hero_y']).set_index(['hero_x', 'hero_y'])
unique_coappearences['counts'] = coappearences.groupby(['hero_x', 'hero_y']).size()
unique_coappearences = unique_coappearences.reset_index()
print('#edges', len(unique_coappearences))        
unique_coappearences[:3]









    



#edges 168267

    Out[6]:

       hero_x  hero_y  counts
    0    1999       1       1
    1    6459       1       1
    2    6459    1999       1

Plot!



In [7]:

    
g = graphistry.bind(source='hero_x', destination='hero_y', edge_title='counts')



In [8]:

    
g.plot(unique_coappearences)









    Out[8]:

Label Nodes



In [9]:

    
# Here we are using two dataframes, one for edges and one for nodes
g2 = g.bind(node='hero_id', point_title='hero_name')



In [10]:

    
g2.plot(unique_coappearences, heroes)









    Out[10]:
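
When edges and nodes come from separate dataframes, it can be worth checking that every edge endpoint has a row in the node table before plotting. A small pandas sketch of that check follows; the toy frames are invented and only borrow the notebook's column names.

```python
import pandas as pd

# Toy stand-ins for unique_coappearences and heroes (values are made up).
edges = pd.DataFrame({'hero_x': [2, 3, 4], 'hero_y': [1, 1, 2]})
nodes = pd.DataFrame({'hero_id': [1, 2, 3], 'hero_name': ['A', 'B', 'C']})

# Collect every id used as an edge endpoint, then look for ids absent from the node table.
endpoints = pd.concat([edges['hero_x'], edges['hero_y']]).drop_duplicates()
missing = endpoints[~endpoints.isin(nodes['hero_id'])]
print('endpoints without a node row:', sorted(missing.tolist()))
```

An empty result means every plotted node can pick up its hero_name label.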

Color using igraph infomap

Infomap Community Detection



In [11]:

    
# Warning: slow
ig = g2.pandas2igraph(unique_coappearences, directed=False)
clusters = ig.community_infomap()
(i_edges, i_nodes) = g2.igraph2pandas(ig)
print('#clusters', len(set(clusters.membership)))









    



#clusters 212
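
clusters.membership is a flat list in which position i holds the cluster id of vertex i. A minimal sketch with a made-up membership list shows how the cluster count and the cluster sizes fall out of it:

```python
from collections import Counter

# Hypothetical membership list: vertices 0, 1 and 5 are in cluster 0,
# vertices 2 and 4 in cluster 1, vertex 3 in cluster 2.
membership = [0, 0, 1, 2, 1, 0]

cluster_sizes = Counter(membership)
print('#clusters', len(cluster_sizes))
print('largest cluster (id, size):', cluster_sizes.most_common(1)[0])
```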



In [12]:

    
nodes_colored = pd.DataFrame({'cluster': clusters.membership})\
    .reset_index().rename(columns={'index': 'denseid'})\
    .merge(i_nodes.reset_index().rename(columns={'index':'denseid'}), on='denseid')\
    .merge(heroes, left_on='hero_id', right_on='hero_id')
print('#colored nodes', len(nodes_colored))
nodes_colored[:3]









    



#colored nodes 6467

    Out[12]:

       denseid  cluster  hero_id  hero_name
    0        0       32     1999  FROST, CARMILLA
    1        1       32        1  24-HOUR MAN/EMMANUEL
    2        2       32     6459  G'RATH



In [13]:

    
# Fold cluster ids into 9 color codes (vectorized; no row-wise apply needed)
nodes_colored['color'] = nodes_colored['cluster'] % 9
nodes_colored.pivot_table(index=['color'], aggfunc=lambda x: len(x.unique()))
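
Taking the cluster id modulo 9 folds all cluster ids onto nine color codes, so unrelated clusters end up sharing a color. A quick plain-Python check, assuming the 212 clusters reported above are numbered 0 to 211:

```python
from collections import Counter

n_clusters = 212  # assumption: cluster ids run from 0 to 211

# Count how many cluster ids land on each of the nine color codes.
clusters_per_color = Counter(cluster % 9 for cluster in range(n_clusters))
for color in sorted(clusters_per_color):
    print(color, clusters_per_color[color])
```

Under that assumption each color covers 23 or 24 clusters, so color alone does not identify a community.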



In [15]:

    
g3 = g2.bind(point_color='color', edge_weight='counts')



In [16]:

    
g3.plot(unique_coappearences,  nodes_colored)









    Out[16]:

Restrict to biggest communities



In [17]:

    
big_clusters = nodes_colored\
    .pivot_table(index=['cluster'], aggfunc=lambda x: len(x.unique()))\
    .rename(columns={'hero_id': 'cluster_size'})\
    .query('cluster_size > 100')\
    .reset_index()[['cluster', 'cluster_size']]
print('# big clusters', len(big_clusters))
big_clusters[:3]









    



# big clusters 10

    Out[17]:

       cluster  cluster_size
    0        0          1260
    1        1           820
    2        2           535



In [18]:

    
good_nodes = nodes_colored.merge(big_clusters, on='cluster')
print('# nodes', len(good_nodes))
good_nodes[:3]









    



# nodes 3612

    Out[18]:

       denseid  cluster  hero_id  hero_name                                   color  cluster_size
    0        6        0     2186  GORILLA-MAN                                     0          1260
    1        7        0        2  3-D MAN/CHARLES CHANDLER & HAROLD CHANDLER      0          1260
    2        8        0     2555  HUMAN ROBOT                                     0          1260



In [19]:

    
good_edges = unique_coappearences\
    .merge(good_nodes, left_on='hero_x', right_on='hero_id')\
    .merge(good_nodes, left_on='hero_y', right_on='hero_id')\
    [['hero_x', 'hero_y', 'counts']]
print('# edges', len(good_edges))
good_edges[:3]









    



# edges 114648

    Out[19]:

       hero_x  hero_y  counts
    0    2186       2       3
    1    2555       2       3
    2    3491       2       3



In [20]:

    
g3.plot(good_edges, good_nodes)









    Out[20]:
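
The restriction above keeps clusters over a size threshold and then only the edges whose two endpoints both survive. A compact sketch of that pattern on toy data follows; the frames and the threshold of 2 are illustrative, not the notebook's values.

```python
import pandas as pd

# Toy node and edge tables (made-up ids and clusters).
nodes = pd.DataFrame({'hero_id': [1, 2, 3, 4, 5],
                      'cluster': [0, 0, 0, 1, 2]})
edges = pd.DataFrame({'hero_x': [2, 3, 4, 5], 'hero_y': [1, 1, 1, 4]})

# Look up the size of each node's cluster and keep nodes in clusters larger than 2.
cluster_size = nodes['cluster'].map(nodes['cluster'].value_counts())
kept_nodes = nodes[cluster_size > 2]

# Keep an edge only if both of its endpoints are still present.
kept_ids = kept_nodes['hero_id']
kept_edges = edges[edges['hero_x'].isin(kept_ids) & edges['hero_y'].isin(kept_ids)]
print(kept_nodes['hero_id'].tolist(), len(kept_edges))
```

The notebook reaches the same result with two merges against good_nodes; the isin filter is just a shorter way to express the inner-join condition on toy data.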

Separate communities

Give intra-community edges a strong edge weight and inter-community edges a weak edge weight



In [21]:

    
# Label each edge by whether it stays inside a cluster or connects nodes in different clusters
good_edges2 = good_edges\
    .merge(
        good_nodes[['cluster', 'hero_id']].rename(columns={'cluster': 'cluster_x'}),
        left_on='hero_x', right_on='hero_id')\
    .merge(
        good_nodes[['cluster', 'hero_id']].rename(columns={'cluster': 'cluster_y'}),
        left_on='hero_y', right_on='hero_id')
good_edges2['is_inner'] = good_edges2['cluster_x'] == good_edges2['cluster_y']

# Intra-cluster edges get the stronger weight (10), inter-cluster edges the weaker one (8);
# the 'weight' column is bound to edge_weight when plotting
good_edges2['weight'] = good_edges2['is_inner'].map({True: 10, False: 8})
good_edges2 = good_edges2[['hero_x', 'hero_y', 'counts', 'is_inner', 'weight']]
good_edges2[:3]

Plot; control the edge weight in the settings panel



In [22]:

    
g3.bind(edge_weight='weight').plot(good_edges2, good_nodes)









    Out[22]:
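
The weighting rule boils down to looking up the cluster of each endpoint. A plain-Python sketch with invented ids, using the same 10 and 8 weights as the cell above:

```python
# Hypothetical hero -> cluster assignment and edge list.
cluster_of = {1: 0, 2: 0, 3: 1}
edges = [(2, 1), (3, 1), (3, 2)]

INNER_WEIGHT, OUTER_WEIGHT = 10, 8  # same values as the notebook cell

# An edge is "inner" when both endpoints share a cluster.
weighted = [
    (x, y, INNER_WEIGHT if cluster_of[x] == cluster_of[y] else OUTER_WEIGHT)
    for x, y in edges
]
print(weighted)
```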

Filter by k-core shell



In [23]:

    
shells = ig.shell_index()
print('#shells', len(set(shells)))









    



#shells 98
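
The shell index of a vertex is the largest k for which the vertex still belongs to the k-core. The sketch below computes it by repeated peeling on a tiny invented graph; it illustrates the definition and is not igraph's implementation.

```python
def shell_index(adjacency):
    """Return {vertex: shell index} for an undirected graph given as {vertex: set(neighbors)}."""
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    remaining = set(adjacency)
    shell = {}
    k = 0
    while remaining:
        # Peel the remaining vertex with the smallest current degree.
        v = min(remaining, key=degree.get)
        # The shell level never decreases as peeling proceeds.
        k = max(k, degree[v])
        shell[v] = k
        remaining.remove(v)
        for u in adjacency[v]:
            if u in remaining:
                degree[u] -= 1
    return shell

# Triangle 1-2-3 with a pendant vertex 4 attached to vertex 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(shell_index(toy))
```

The pendant vertex sits in shell 1 while the triangle forms shell 2, which is the kind of periphery-versus-core split the histogram filter exposes on the full graph.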



In [24]:

    
nodes_shelled = pd.DataFrame({'shell': shells})\
    .reset_index().rename(columns={'index': 'denseid'})\
    .merge(nodes_colored, on='denseid')
print('#shelled nodes', len(nodes_shelled))
nodes_shelled[:3]









    



#shelled nodes 6467

    Out[24]:

       denseid  shell  cluster  hero_id  hero_name             color
    0        0     13       32     1999  FROST, CARMILLA           5
    1        1      5       32        1  24-HOUR MAN/EMMANUEL      5
    2        2      5       32     6459  G'RATH                    5

Plot: Use the histogram tool to filter for the smaller shells



In [25]:

    
g3.plot(unique_coappearences,  nodes_shelled)









    Out[25]:



Unique values per color (pivot table from In [13]):

           cluster  denseid  hero_id  hero_name
    color
    0           24     1626     1626       1626
    1           24     1171     1171       1171
    2           24      890      890        890
    3           24      574      574        574
    4           24      490      490        490
    5           23      468      468        468
    6           23      449      449        449
    7           23      413      413        413
    8           23      386      386        386

Plots a hero social network based on co-appearances between heroes