CS446/546 - Class Session 10 - Closeness centrality

In this class session we are going to scatter-plot the harmonic-mean closeness centralities of the vertices in the gene regulatory network (which we will obtain from Pathway Commons) with the vertices' degree centralities. We will get the geodesic path distances using igraph, which will use BFS for this graph.

We are going to use pandas, igraph, numpy, and timeit


In [8]:
import pandas
import igraph
import numpy
import timeit

Load in the SIF file for Pathway Commons, using pandas.read_csv and specifying the three column names species1, interaction_type, and species2:


In [9]:
sif_data = pandas.read_csv("shared/pathway_commons.sif",
                           sep="\t", names=["species1","interaction_type","species2"])

Subset the data frame to include only rows for which the interaction_type column contains the string controls-expression-of; subset columns to include only columns species1 and species2 using the [ operator and the list ["species1","species2"]; and eliminate redundant edges in the edge-list using the drop_duplicates method.


In [10]:
interac_grn = sif_data[sif_data.interaction_type == "controls-expression-of"]
interac_grn_unique = interac_grn[["species1","species2"]].drop_duplicates()

Create a undirected graph in igraph, from the dataframe edge-list, using Graph.TupleList and specifying directed=False. Print out the graph summary using the summary instance method.


In [22]:
grn_igraph = igraph.Graph.TupleList(interac_grn_unique.values.tolist(), directed=False)
grn_igraph.summary()


Out[22]:
'IGRAPH UN-- 14208 110013 -- \n+ attr: name (v)'

For one vertex at a time (iterating over the vertex sequence grn_igraph.vs), compute that vertex's harmonic mean closeness centrality using Eq. 7.30 from Newman's book. Don't forget to eliminate the "0" distance between a vertex and itself, in the results you get back from calling the shortest_paths method on the Vertex object. Just for information purposes, measure how long the code takes to run, in seconds, using timeit.default_timer().


In [23]:
N = len(grn_igraph.vs)

# allocate a vector to contain the vertex closeness centralities; initialize to zeroes
# (so if a vertex is a singleton we don't have to update its closeness centrality)
closeness_centralities = numpy.zeros(N)

# initialize a counter
ctr = 0

# start the timer
start_time = timeit.default_timer()

# for each vertex in `grn_igraph.vs`
for my_vertex in grn_igraph.vs:
    
    # compute the geodesic distance to every other vertex, from my_vertex, using the `shortest_paths` instance method;
    # put it in a numpy.array
    my_dists = numpy.array(my_vertex.shortest_paths())
    
    # filter the numpy array to include only entries that are nonzero and finite, using `> 0 & numpy.isfinite(...)`
    my_dists = my_dists[numpy.isfinite(my_dists) & (my_dists > 0)]
    
    # if there are any distance values that survived the filtering, take their element-wise reciprocals, 
    # then compute the sum, then divide by N-1 (following Eq. 7.30 in Newman)
    if len(my_dists) > 0:
        closeness_centralities[ctr] = numpy.sum(1/my_dists)/(N-1)
    ctr += 1
    
# compute the elapsed time
ci_elapsed = timeit.default_timer() - start_time
print(ci_elapsed)


126.19425056292675

Histogram the harmonic-mean closeness centralities. Do they have a large dynamic range?


In [24]:
import matplotlib.pyplot
matplotlib.pyplot.hist(closeness_centralities)
matplotlib.pyplot.xlabel("Ci")
matplotlib.pyplot.ylabel("Frequency")
matplotlib.pyplot.show()


Scatter plot the harmonic-mean closeness centralities vs. the log10 degree. Is there any kind of relationship?


In [25]:
ax = matplotlib.pyplot.gca()
ax.scatter(grn_igraph.degree(), closeness_centralities)
ax.set_xscale("log")
matplotlib.pyplot.xlabel("degree")
matplotlib.pyplot.ylabel("closeness")
matplotlib.pyplot.show()


Which protein has the highest harmonic-mean closeness centrality in the network, and what is its centrality value? use numpy.argmax


In [27]:
print(numpy.max(closeness_centralities))
grn_igraph.vs[numpy.argmax(closeness_centralities)]["name"]


0.707169001197
Out[27]:
'CYP26A1'

Print names of the top 10 proteins in the network, by harmonic-mean closeness centrality:, using numpy.argsort:


In [47]:
grn_igraph.vs[numpy.argsort(closeness_centralities)[::-1][0:9].tolist()]["name"]


Out[47]:
['CYP26A1', 'TCF3', 'LEF1', 'MYC', 'MAZ', 'FOXO4', 'MAX', 'PAX4', 'SREBF1']

Let's do it using a Pandas DataFrame:


In [75]:
cc_df = pandas.DataFrame(list(zip(grn_igraph.vs["name"], 
                                  closeness_centralities.tolist())),
                        columns=["protein","CC"])
cc_df = cc_df.set_index("protein")
cc_df.sort_values("CC", ascending=False).head(n=10)


Out[75]:
CC
protein
CYP26A1 0.707169
TCF3 0.624891
LEF1 0.574036
MYC 0.566006
MAZ 0.561361
FOXO4 0.548439
MAX 0.544538
PAX4 0.536578
SREBF1 0.523398
PPARA 0.523246

In [ ]: