In this class session we are going to compute the outgoing-edge PageRank centrality of each gene (vertex) in a human gene regulatory network (a directed graph) from a landmark paper on human gene regulation (Neph et al., Cell, volume 150, pages 1274-1286, 2012; see PDF on Canvas)
Using Pandas read_csv
, read in the ifle shared/neph_gene_network.txt
, which has two columns of text (first column is the regulator gene, second column is the target gene), into a data frame. The file has no header and is tab-delimited. Assign the column names of the dataframe to be regulator
and target
, respectively.
Let's load the Python packages that we will need for this exercise
In [73]:
import pandas
import igraph
import numpy
import matplotlib.pyplot
import random
Using pandas.read_csv
, read the file shared/neph_gene_network.txt
; name the two columns of the resulting data frame, regulator
and target
.
In [2]:
edge_list_neph = pandas.read_csv("shared/neph_gene_network.txt",
sep="\t",
names=["regulator","target"])
Load the edge-list data into a directed igraph.Graph
object neph_graph
, using igraph.Graph.TupleList
. Make sure to reverse the columns of the data frame when you input the data frame into Graph.TupleList
since we want the outgoing pagerank centrality not incoming pagerank centrality. Print out a summary of the graph, using the igraph.Graph.summary
method:
In [3]:
neph_graph = igraph.Graph.TupleList(edge_list_neph[["target","regulator"]].values.tolist(), directed=True)
neph_graph.summary()
Out[3]:
Compute the pagerank centrality measures of all vertices, using igraph.Graph.pagerank
. Use the resulting object to initialize a numpy.array
, pageranks
.
In [36]:
pageranks = numpy.array(neph_graph.pagerank())
Which vertex has highest pagerank centrality in the gene regulatory network, and what is its pagerank centrality value? (think numpy.max
and numpy.argmax
). Get a VertexSet
sequence using the igraph.Graph.vs
property, and then index into that sequence using Python indexing:
In [37]:
print(numpy.max(pageranks))
neph_graph.vs[numpy.argmax(pageranks)]
Out[37]:
Calculate the in-degree of all vertices in the graph, and scatter plot log(degree)
vs. log(pagerank)
. (Do you see why we used in
here? Note the column swapping we did earlier). Note-- you will have to eliminate one vertex that has zero in-degree.
In [35]:
import matplotlib.pyplot
ax = matplotlib.pyplot.gca()
ax.set_xscale("log")
ax.set_yscale("log")
degrees = numpy.array(neph_graph.indegree())
inds_keep = numpy.where(degrees > 0)
ax.scatter(degrees[inds_keep],
pageranks[inds_keep])
#ax.scatter(neph_graph.indegree(), pageranks)
matplotlib.pyplot.xlabel("degree")
matplotlib.pyplot.ylabel("pagerank")
matplotlib.pyplot.show()
See if you can calculate the pagerank centrality yourself, using the matrix inversion (Eq. 7.19 from Newman). Test your function on a small directed graph.
In [85]:
def pagerank(g):
# N is the number of vertices
N = len(g.vs)
# alpha is the damping parameter
alpha = 0.85
# beta = (1-alpha)/N
beta = (1-alpha)/N
# compute the out-degree of each vertex
degree_values = g.degree(mode="out")
# get a floating-point adjacency matrix M in the Newman format (take transpose from igraph format)
M = numpy.matrix(g.get_adjacency().data).transpose().astype(float)
# or each column in 0,N-1:
for j in range(0,N):
# get the out degree of the vertex as a float
degree_value = float(degree_values[j])
# if degree is nonzero, normalize the column of M
if degree_value > 0:
M[:,j] /= degree_value
else:
# set the column to zero
M[:,j] = 0
# use Equation on the board
pr = numpy.linalg.inv(numpy.diag([1.]*N) - alpha * M) * numpy.matrix([beta]*N).transpose()
pr = pr/numpy.sum(pr)
return(pr.transpose())
Test out your function on a small 5-vertex graph:
In [86]:
g = igraph.Graph.Barabasi(n=5, m=2)
print(pagerank(g))
print(g.pagerank())