igraphs
is sort of like the Gephi for Python. It allows you to construct very nice network graphs but gives you much more control over the whole process than you get with Gephi. In this lesson, we will dive quickly into igraphs
with a focus on how to construct semantic network visualizations. We will also learn how to export our graph data in a form that can be imported into a program like Gephi and, best of all, one strategy for dealing with huge network graphs that will avoid the 'hairball' usually associated with these graphs.
So, let's jump right in.
The first step is to initialize a graph as follows:
In [14]:
from igraph import *
graph = Graph()
graph
Out[14]:
We now have an instance of an igraph.Graph
object that we can start filling with data. The first step is to add vertices, AKA nodes. We can simply tell igraph how many nodes to add with a command like:
graph.add_vertices(1000)
This command will add 1000 vertices to our graph and give them $ID$s from 0-999. But, since we are dealing with words in our semantic network visualization, we would like to initialize our vertices so that they also have $name$s that are the actual words. We do this by passing an iterator of words to the add_vertices
method. Something like this:
In [15]:
import pandas as pd
CS_graph = Graph() # so that it resets our vertices every time we run this script
NT_CS = 1-pd.read_pickle('GNT_CS_50x.pickle') #Converts cosine distance to cosine similarity
CS_graph.add_vertices(list(NT_CS.index))
print(CS_graph)
The rather cryptic information about the graph tells us the we have 262 vertices, 0 edges, and that our vertices (v) have the name
attribute. The graph now has a sequence of vertices, which are represented by a .vs
attribute on the graph. We can access the attributes that exist like this.
In [16]:
CS_graph.vs['name'][:10] #only showing the first 10
Out[16]:
And we can add other attributes to our vertices like this:
In [17]:
NT_occ = pd.read_pickle('./GNT_coll_dict.pickle') #word occurrence dictionary
CS_graph.vs['occurrences'] = [NT_occ[x] for x in CS_graph.vs['name']] #we need to pass a list of values
print(NT_occ['πνεῦμα'])
print(CS_graph.vs.select(name_eq= 'πνεῦμα')['occurrences'])
print(CS_graph.summary())
Notice the strange way of selecting vertices in the graph based on attributes. Vertex attributes are done in terms of the name of the attribute (e.g., 'name' or 'occurrences') followed by an underscore (_) and then followed by the relationship one is looking for. The possible relationships can be found here: http://igraph.org/python/doc/igraph.VertexSeq-class.html#select. A few examples below.
In [18]:
print(CS_graph.vs.select(occurrences_eq=50)['name']) #all words occurring exactly 50x
print(CS_graph.vs.select(occurrences_gt=379)['name']) #all words occurring more than 379x
Our graph is not yet very interesting because we don't have any edges. So let's take care of that.
igraph wants us to add edges by giving the $name$ of the two vertices as well as any attributes (e.g., 'weight') for each edge. We could add all of the edges and then add the weights later, as we did for occurrences
for the vertices above, but below we add this information along with the creation of the edge.
In [19]:
CS_graph.add_edge(NT_CS.ix[0].name, NT_CS.ix[1].name, weight=NT_CS.ix[0, 1])
In [20]:
CS_graph.es[0]
Out[20]:
Quiz:
Write a function below that takes as input a DataFrame with some sort of $weight$ measure (co-occurrences, log-likelihood ratio, cosine similarity values) and a Graph()
object where the $vertices==df.index$ and adds the appropriate edges to the Graph()
.
NB: Since our graph is undirected, do not add the same edge twice, i.e., don't add 0--1 and 1--0, since the values of these two edges are the same.
Hint: You can use combinations
on DataFrames as well.
In [21]:
from itertools import combinations
# The two lines below delete any existing edged in the graph
# This will allow you to run this cell more than just one time w/o repeating edges.
i = [x for x in range(len(CS_graph.es))]
CS_graph.delete_edges(i)
def add_my_edges(g, df):
#input your code here
#you shouldn't need to return any value.
for a, b in combinations(df, 2):
g.add_edge(a, b, weight = df.ix[a, b])
return None
#The lines below are to check your function.
add_my_edges(CS_graph, NT_CS)
print(len(CS_graph.es) == 34191)
print(CS_graph.es[17000].tuple == (76, 91) and CS_graph.es[12000].attributes() == {'weight': 0.15165763823647904})
In [22]:
CS_graph.summary()
Out[22]:
We are now ready to draw our first graph. So let's do it!
In [23]:
bbox = (3000, 2000)
plot(CS_graph, bbox=bbox, target = 'plot.svg')
Out[23]:
Take a look at the plot.svg
file that was created in the same directory as this IPython Notebook. It isn't really that helpful of a visualization, is it?
Take a look at the code below and the resulting graph.
In [24]:
CS_graph.summary()
reduced.summary()
In [ ]:
from copy import deepcopy
def reduce_graph_edges(graph, value):
# reduce a graph's edges based their weight being <= value
reduced = deepcopy(graph)
reduced.delete_edges(weight_le=value)
return reduced
CS_reduced = reduce_graph_edges(CS_graph, .4)
def create_word_graph(word, orig_graph):
node = orig_graph.vs.find(name = word)
word_graph = orig_graph.subgraph(node.neighbors())
return word_graph
g_theos = create_word_graph('θεός', CS_reduced)
summary(g_theos)
Notice we have cut our graph down to a more manageable size, only 85 vertices and 1191 edges.
Quiz:
Explain each part of the code above. Write your answers in the space below.
reduced = deepcopy(graph)
Write your answer here
reduced.delete_edges(weight_le=value)
Write your answer here
node = orig_graph.vs.find(name = word)
Write your answer here
word_graph = orig_graph.subgraph(node.neighbors())
Write your answer here
Now that we have a more manageable graph, let's plot it to see how it looks.
In [ ]:
plot(g_theos, layout = 'random', bbox = bbox, target = 'plot_theos.svg')
In [ ]:
g_theos['name'] = 'g_theos'
print(g_theos.summary())
g_theos['name']
Hmmm, it looks better but still pretty random
. Maybe we should choose a better layout.
Currently, the best widely available and used network-graphing layouts are the so-called "force-directed" layouts. A description of these from Wikipedia:
Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way. Their purpose is to position the nodes of a graph in two-dimensional or three-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible, by assigning forces among the set of edges and the set of nodes, based on their relative positions, and then using these forces either to simulate the motion of the edges and nodes or to minimize their energy. (source: en.wikipedia.org/wiki/Force-directed_graph_drawing)
igraph has several different force-directed layouts. Check out the list here: http://igraph.org/python/doc/tutorial/tutorial.html#layouts-and-plotting.
Quiz: Write a short function below that will take as input a list of igraph plotting layouts and a Graph() object, plots the Graph() in each of these layouts, and saves each graph to a file name that makes sense (e.g., {graphname}_{layout}.svg).
In [ ]:
def compare_layout(layouts, g):
#list of layouts and a Graph() object
for layout in layouts:
print(layout)
l_o = g.layout(layout)
target = 'plot_%s_%s.svg' % (g['name'], layout)
print(target)
plot(g, layout = l_o, bbox = bbox, target = target)
return None
# Write your own code to check your function below.
compare_layout(['drl', 'fr', 'grid_fr', 'kk', 'lgl'], g_theos)
Now check out your graphs. Which layout do you like the best for the data that we have? And why?
Your graph is still not very readable. So let's add a few things to make it better.
Adding labels is quite easy. We simply insert the vertex_label
keyword into our plot
function with the value being the iterable of names we want to use. See below.
In [ ]:
plot(g_theos,
layout = 'random',
vertex_label = g_theos.vs['name'],
bbox = bbox,
target = 'plot_theos.svg'
)
On my computer, the labels are far too small. Play around a bit with the other vertex
attributes in the plot
function (see here: http://igraph.org/python/doc/igraph.Graph-class.html#__plot__) until you get something that looks good on your computer. Also, feel free to insert your favorite layout from the compare_layouts()
function in your plot command.
In [ ]:
#insert your own plot command here.
plot(g_theos,
layout = 'fr',
vertex_label = g_theos.vs['name'],
bbox = bbox,
target = 'plot_theos.svg',
vertex_label_size = 36
)
The final step in creating out plot is to add some community detection. Below is one example of how to do this.
In [ ]:
theos_comms = g_theos.community_multilevel(weights = 'weight')
print(theos_comms)
c_algo = getattr(g_theos, 'community_multilevel')(weights = 'weight')
#g_theos.c_algo(weights = 'weight')
There are also several different community-detection algorithms in igraph. To find a list go to this page (igraph.org/python/doc/igraph.Graph-class.html) and search for 'community'.
Quiz: I bet you can guess what function I am going to ask you to write below!
Hint: Check the name of the function.
In [ ]:
def compare_comm_algos(algos, g):
#insert your code here
comms_dict = {}
for algo in algos:
print(algo)
try:
comms_dict[algo] = getattr(g, 'community_%s' % (algo))(weights = 'weight')
except:
comms_dict[algo] = getattr(g, 'community_%s' % (algo))()
return comms_dict
# Write some lines below to test your script
comms = compare_comm_algos(['leading_eigenvector', 'label_propagation',
'multilevel', 'edge_betweenness',
'walktrap'], g_theos)
In [ ]:
for algo in comms.keys():
print(algo, '\n', comms[algo], '\n')
And then, to integrate the community information into our graph, we do this:
In [ ]:
layout = g_theos.layout('fruchterman_reingold')
palette = ClusterColoringPalette(len(theos_comms))
color = [palette.get(i) for i in theos_comms.membership]
plot(g_theos, layout=layout,
bbox=bbox,
target='plot_communities.svg',
vertex_color=color,
vertex_label_size = 36,
vertex_label = g_theos.vs['name'],
edge_color = 'gray'
)
What we did above is quite nice, but how can you plot a graph will all of the co-occurrence, log-likelihood, or cosine-similarity data that we have calculated this week without getting the same hairball that we did in our plot.svg
plot?
First, let's load different data, in this case, a DataFrame of the log-likelihood relationships of every word that occurs at least 50x in the Greek New Testament. The code below should look familiar to you.
In [ ]:
def create_graph(file):
graph = Graph() # so that it resets our graph every time we run this function
df = pd.read_pickle(file)
graph.add_vertices(list(df.index))
add_my_edges(graph, df) #this will use the function you defined above
return graph
LL_graph = create_graph('GNT_LL_50x.pickle')
The next step is to create our community-based graph. This will create a simpler graph with a node for each community that will be spaced around the graph area. We will then fit the vertices belonging to each community into this community graph shell.
Quiz: Take a look at the code below and see if you can figure out what each line does.
In [ ]:
def create_comm_graph(graph):
comms = graph.community_multilevel(weights = 'weight')
contracted = comms.cluster_graph()
layout = contracted.layout('auto')
layout.scale(contracted.vcount())
return comms, contracted, layout
LL_communities, contracted_graph, outer_layout = create_comm_graph(LL_graph)
Now that we have created the graph for our communities, let's plot it. Take a look at it once it has been plotted. This will give you an idea of the skeleton that we will fit your finished graph onto.
In [25]:
plot(contracted_graph, layout = outer_layout, bbox = bbox, target='plot_outer.svg')
Out[25]:
And now the magic, where we create all of the individual subgraph.
The first 3 lines lines below set the size of our communities. The $r$ value will be subtracted from the $x$ and $y$ values for the center of each subgraph to bound that subgraph. If you want tigher communities, raise the denominator (now set at 4). If you want looser communities that will tend to overlap each other, lower it.
The rest of the lines basically go through every community in the $comms$ object that we pass, creates a subgraph for each community in the $graph$ object that we pass, places each of these subgraphs around the appropriate node in the community graph we created above, and then returns a Layout
object that will be used to create our large graph.
Take a closer look at each line and make sure you know what purpose it serves in the code. You don't have to understand each line 100%. Just know why it is there.
In [26]:
def create_inner_graphs(graph, comms, outer_layout):
outer_box = outer_layout.bounding_box()
from math import sqrt
r = sqrt(sum(outer_box.shape)/4)
inner_layout = Layout([(0,0) for _ in range(graph.vcount())])
for comm, vertices in enumerate(comms):
print('Plotting layout for community %s ...' % comm)
comm_graph = graph.induced_subgraph(vertices)
comm_layout = graph.layout('fruchterman_reingold')
cx, cy = outer_layout[comm]
inner_box = (cx-r, cy-r, cx+r, cy+r)
comm_layout.fit_into(inner_box)
for vertex, coords in zip(vertices, comm_layout):
inner_layout[vertex] = coords
return inner_layout
inner_layout = create_inner_graphs(LL_graph, LL_communities, outer_layout)
And, finally, a function to plot the graph. Take a look at each line to make sure you understand what its purpose is.
In [27]:
def plot_large_graph(comms, graph, layout, target_file):
palette = ClusterColoringPalette(len(comms))
color = [palette.get(i) for i in comms.membership]
plot(graph, layout=layout,
bbox=(3600, 2400),
target=target_file,
vertex_color=color,
vertex_label_size = 36,
vertex_label = graph.vs['name'],
edge_color = 'gray'
)
return None
plot_large_graph(LL_communities, LL_graph, inner_layout, 'plot_large.svg')
And, finally, let's take a look at how to export our graph to Gephi, just in case you want to pretty it up there.
In [28]:
def export_to_Gephi(graph, comms, destination, layout):
graph.vs["community"] = [str(m) for m in comms.membership] # Trick gephi
for vertex, coords in enumerate(layout):
x, y = coords
graph.vs[vertex]["x"], graph.vs[vertex]["y"] = x * 100, -y * 100 # Gephi uses flipped Y coordinate
save(graph, destination)
return None
export_to_Gephi(LL_graph, LL_communities, 'LL_NT.graphml', inner_layout)
Now open up the resulting file in Gephi and make it look like you want!
You now have all the functions you need to create graphs centered around a single word and those containing all the data. So use the co-occurrence, log-likelihood, and cosine-similarity data that you have produced in the last two days to create your own graphs. Experiment with layouts, settings, etc., to create something that you like.