In this class session we are going to scatter-plot the harmonic-mean closeness centralities
of the vertices in the gene regulatory network (which we will obtain from Pathway Commons) with the vertices' degree centralities. We will get the geodesic path distances using igraph
, which will use BFS for this graph.
We are going to use pandas
, igraph
, numpy
, and timeit
In [8]:
import pandas
import igraph
import numpy
import timeit
Load in the SIF file for Pathway Commons, using pandas.read_csv
and specifying the three column names species1
, interaction_type
, and species2
:
In [9]:
sif_data = pandas.read_csv("shared/pathway_commons.sif",
sep="\t", names=["species1","interaction_type","species2"])
Subset the data frame to include only rows for which the interaction_type
column contains the string controls-expression-of
; subset columns to include only columns species1
and species2
using the [
operator and the list ["species1","species2"]
; and eliminate redundant edges in the edge-list using the drop_duplicates
method.
In [10]:
interac_grn = sif_data[sif_data.interaction_type == "controls-expression-of"]
interac_grn_unique = interac_grn[["species1","species2"]].drop_duplicates()
Create a undirected graph in igraph, from the dataframe edge-list, using Graph.TupleList
and specifying directed=False
. Print out the graph summary using the summary
instance method.
In [22]:
grn_igraph = igraph.Graph.TupleList(interac_grn_unique.values.tolist(), directed=False)
grn_igraph.summary()
Out[22]:
For one vertex at a time (iterating over the vertex sequence grn_igraph.vs
), compute that vertex's harmonic mean closeness centrality using Eq. 7.30 from Newman's book. Don't forget to eliminate the "0" distance between a vertex and itself, in the results you get back from calling the shortest_paths
method on the Vertex
object. Just for information purposes, measure how long the code takes to run, in seconds, using timeit.default_timer()
.
In [23]:
N = len(grn_igraph.vs)
# allocate a vector to contain the vertex closeness centralities; initialize to zeroes
# (so if a vertex is a singleton we don't have to update its closeness centrality)
closeness_centralities = numpy.zeros(N)
# initialize a counter
ctr = 0
# start the timer
start_time = timeit.default_timer()
# for each vertex in `grn_igraph.vs`
for my_vertex in grn_igraph.vs:
# compute the geodesic distance to every other vertex, from my_vertex, using the `shortest_paths` instance method;
# put it in a numpy.array
my_dists = numpy.array(my_vertex.shortest_paths())
# filter the numpy array to include only entries that are nonzero and finite, using `> 0 & numpy.isfinite(...)`
my_dists = my_dists[numpy.isfinite(my_dists) & (my_dists > 0)]
# if there are any distance values that survived the filtering, take their element-wise reciprocals,
# then compute the sum, then divide by N-1 (following Eq. 7.30 in Newman)
if len(my_dists) > 0:
closeness_centralities[ctr] = numpy.sum(1/my_dists)/(N-1)
ctr += 1
# compute the elapsed time
ci_elapsed = timeit.default_timer() - start_time
print(ci_elapsed)
Histogram the harmonic-mean closeness centralities. Do they have a large dynamic range?
In [24]:
import matplotlib.pyplot
matplotlib.pyplot.hist(closeness_centralities)
matplotlib.pyplot.xlabel("Ci")
matplotlib.pyplot.ylabel("Frequency")
matplotlib.pyplot.show()
Scatter plot the harmonic-mean closeness centralities vs. the log10 degree. Is there any kind of relationship?
In [25]:
ax = matplotlib.pyplot.gca()
ax.scatter(grn_igraph.degree(), closeness_centralities)
ax.set_xscale("log")
matplotlib.pyplot.xlabel("degree")
matplotlib.pyplot.ylabel("closeness")
matplotlib.pyplot.show()
Which protein has the highest harmonic-mean closeness centrality in the network, and what is its centrality value? use numpy.argmax
In [27]:
print(numpy.max(closeness_centralities))
grn_igraph.vs[numpy.argmax(closeness_centralities)]["name"]
Out[27]:
Print names of the top 10 proteins in the network, by harmonic-mean closeness centrality:, using numpy.argsort
:
In [47]:
grn_igraph.vs[numpy.argsort(closeness_centralities)[::-1][0:9].tolist()]["name"]
Out[47]:
Let's do it using a Pandas DataFrame
:
In [75]:
cc_df = pandas.DataFrame(list(zip(grn_igraph.vs["name"],
closeness_centralities.tolist())),
columns=["protein","CC"])
cc_df = cc_df.set_index("protein")
cc_df.sort_values("CC", ascending=False).head(n=10)
Out[75]:
In [ ]: