In this class, we will analyze the protein-protein interaction network for two classes of yeast proteins, "date hubs" and "party hubs", as defined by Han et al. in their 2004 study of protein-interaction networks and gene expression (Han et al., Nature, v430, p88, 2004). The authors of that study claimed that there is no difference in local clustering density between "date hubs" and "party hubs". We will put this claim to the test. For each "date hub" and "party hub" protein, we will compute its local clustering coefficient (Ci) in the protein-protein interaction network. We will then histogram the Ci values for the two sets of hubs so that we can compare their distributions, and we will use a statistical test (Kolmogorov-Smirnov) to compare the two distributions of Ci values.
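To make the quantity concrete before we hand the work to a library, here is a minimal sketch of the local clustering coefficient: for a vertex with k neighbors, Ci is the number of links among those neighbors divided by the k(k-1)/2 links that could exist. The function name local_clustering, the adjacency-dictionary representation, and the four-vertex toy graph are illustrative assumptions, not part of the Han et al. data. Returning nan for vertices with fewer than two neighbors is a choice made here, because the ratio is undefined in that case.

```python
import math
from itertools import combinations

def local_clustering(adjacency, vertex):
    """Ci for `vertex` in an undirected graph given as {node: set_of_neighbors}."""
    neighbors = adjacency[vertex]
    k = len(neighbors)
    if k < 2:
        # No neighbor pairs exist, so the ratio is undefined (0/0).
        return math.nan
    # Count neighbor pairs that are themselves connected.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle A-B-C plus a pendant vertex D attached to A.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
ci_a = local_clustering(toy_graph, "A")  # one of A's three neighbor pairs is linked
ci_b = local_clustering(toy_graph, "B")  # B's only neighbor pair is linked
ci_d = local_clustering(toy_graph, "D")  # degree 1, so Ci is undefined
```

In this toy graph, A has three neighbor pairs of which only (B, C) is connected, so its Ci is 1/3, while B sits in a closed triangle and gets 1.0.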
To get started, we load the modules that we will require:
In [82]:
import igraph
import numpy
import pandas
import matplotlib.pyplot
Next, we'll load the file of hub types, shared/han_hub_data.txt, using our old friend pandas.read_csv. This is a two-column TSV file in which the first column is the protein name and the second column contains the string date or party for each row; the first row of the file contains the column headers. Because the file has a header, pass header=0 to read_csv.
In [83]:
hub_data = pandas.read_csv("shared/han_hub_data.txt", sep="\t", header=0)
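If you want to check how sep and header=0 behave without touching the real file, a small in-memory stand-in works. This is only a sketch: the protein names P1 and P2 are made-up placeholders, and the column names Protein and HubType are assumed to match the headers in the real file.

```python
import io
import pandas

# Hypothetical two-row TSV with a header line, mimicking the layout described above.
toy_tsv = "Protein\tHubType\nP1\tdate\nP2\tparty\n"

# header=0 tells read_csv that the first line holds the column names.
toy_hubs = pandas.read_csv(io.StringIO(toy_tsv), sep="\t", header=0)

toy_columns = list(toy_hubs.columns)
toy_counts = toy_hubs["HubType"].value_counts().to_dict()
```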
Let's take a peek at the structure of the hub_data data frame, using head and shape. Here's what it should look like:
In [84]:
hub_data.head()
Out[84]:
In [85]:
hub_data.shape
Out[85]:
Next, let's load the file of yeast protein-protein interaction network edges, shared/han_network_edges.txt. This is a two-column file in which the first column is the first protein in the interacting pair and the second column is the second protein in the pair. This file has a header, so pass header=0 to read_csv.
In [86]:
edge_data = pandas.read_csv("shared/han_network_edges.txt", sep="\t", header=0)
Let's take a peek at the data frame edge_data, using head and shape:
In [87]:
edge_data.head()
Out[87]:
In [88]:
edge_data.shape
Out[88]:
It will be convenient to let igraph compute the local clustering coefficients. So, we'll want to make an undirected igraph.Graph object from the edgelist data, using our old friend igraph.Graph.TupleList:
In [89]:
ppi_graph = igraph.Graph.TupleList(edge_data.values.tolist(), directed=False)
As always, we'll use igraph.Graph.summary to sanity-check the Graph object:
In [90]:
ppi_graph.summary()
Out[90]:
Generate a list of the names of the proteins, in the order of the proteins' corresponding vertices in the igraph Graph object:
In [91]:
graph_vertices = ppi_graph.vs()["name"]
In [106]:
graph_vertices[0:9]
Out[106]:
Make a dataframe containing the protein names (as column "Protein") and the vertex IDs (as column "order"):
In [93]:
graph_vertices_df = pandas.DataFrame(pandas.Series(graph_vertices))
graph_vertices_df.columns = ["Protein"]
graph_vertices_df["order"]=graph_vertices_df.index
Let's take a peek at this data frame:
In [94]:
graph_vertices_df.head()
Out[94]:
Let's use the pandas.DataFrame.merge method on the graph_vertices_df object to pull in the hub type (date or party) for vertices that are hubs, by passing hub_data to merge. Don't forget to specify how='outer' and on="Protein". We then sort the result by the order column, so that the rows line up with the vertex order in the graph:
In [95]:
graph_vertices_df_merged = graph_vertices_df.merge(hub_data, how='outer', on="Protein")
graph_vertices_df_merged = graph_vertices_df_merged.sort_values("order")
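The following sketch shows what the merge does on a toy example, and how how='outer' differs from how='left'. All protein names here (P1, P2, P3, P9) are hypothetical; P9 stands for a hub that does not appear in the network at all.

```python
import pandas

# Hypothetical vertex table (three vertices) and hub table (P9 is not a vertex).
toy_vertices = pandas.DataFrame({"Protein": ["P3", "P1", "P2"], "order": [0, 1, 2]})
toy_hubs = pandas.DataFrame({"Protein": ["P1", "P2", "P9"],
                             "HubType": ["date", "party", "date"]})

# Outer merge keeps every protein from both tables; sorting by "order"
# restores vertex order and pushes rows with a missing order to the end.
outer = toy_vertices.merge(toy_hubs, how="outer", on="Protein").sort_values("order")

# Left merge keeps only proteins that are vertices in the graph.
left = toy_vertices.merge(toy_hubs, how="left", on="Protein").sort_values("order")

n_outer = len(outer)
n_left = len(left)
outer_proteins = outer["Protein"].tolist()
left_types = left["HubType"].tolist()
```

With the outer merge, a hub that is missing from the network ends up as an extra row with no vertex ID, so the merged table can be longer than the vertex list. It is worth comparing the number of merged rows against the number of vertices before using row positions as vertex indices.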
Having merged the hub type information into graph_vertices_df_merged, let's take a peek at it using head and shape:
In [96]:
graph_vertices_df_merged.head()
Out[96]:
In [97]:
graph_vertices_df_merged.shape
Out[97]:
Let's pull out the HubType column as a numpy array, using column indexing (["HubType"]) and then values.tolist():
In [98]:
vertex_types_np = numpy.array(graph_vertices_df_merged["HubType"].values.tolist())
Let's take a peek at this numpy.array that we have just created:
In [99]:
vertex_types_np
Out[99]:
Use numpy.where to find the index numbers of the proteins that are "date hubs" and of those that are "party hubs":
In [100]:
date_hub_inds = numpy.where(vertex_types_np == "date")
party_hub_inds = numpy.where(vertex_types_np == "party")
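As a quick illustration of what numpy.where returns and how its result can be used to select from a second array aligned with the vertex order, here is a sketch on made-up data. The toy_types and toy_ci arrays are hypothetical and do not come from the Han et al. files.

```python
import numpy

# Hypothetical labels and per-vertex values, aligned by position.
toy_types = numpy.array(["nan", "date", "party", "date", "nan"])
toy_ci = numpy.array([0.0, 0.2, 0.9, 0.4, 0.1])

# With a single condition, numpy.where returns a tuple holding one index array.
toy_date_inds = numpy.where(toy_types == "date")

# That tuple can be used directly to index another array of the same length.
toy_date_ci = toy_ci[toy_date_inds]

# A boolean mask selects the same way without the intermediate indices.
toy_party_ci = toy_ci[toy_types == "party"]
```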
Use the igraph.Graph.transitivity_local_undirected method to compute the local clustering coefficient for every vertex in the graph. Make a numpy.array from the resulting list of Ci values:
In [101]:
ci_values = ppi_graph.transitivity_local_undirected()
ci_values_np = numpy.array(ci_values)
Let's take a peek at the ci_values_np array that you have just created. What are the nan values, and what do they signify? Is this normal?
In [102]:
ci_values_np
Out[102]:
Make a numpy.array of the Ci values of the date hubs (ci_values_date_hubs) and of the Ci values of the party hubs (ci_values_party_hubs):
In [103]:
ci_values_date_hubs = ci_values_np[date_hub_inds]
ci_values_party_hubs = ci_values_np[party_hub_inds]
Plot the histograms of the local clustering coefficients of the "date hubs" and the "party hubs".
In [104]:
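# density=True replaces the normed argument, which was removed from recent matplotlib releases
matplotlib.pyplot.hist(ci_values_date_hubs, density=True, alpha=0.5, label="date")
matplotlib.pyplot.hist(ci_values_party_hubs, density=True, alpha=0.5, label="party")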
matplotlib.pyplot.legend(loc="upper center")
matplotlib.pyplot.xlabel("Ci")
matplotlib.pyplot.ylabel("frequency")
matplotlib.pyplot.show()
Do these histograms look the same?
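To go beyond eyeballing the histograms, the two-sample Kolmogorov-Smirnov statistic measures the largest vertical gap between the two empirical cumulative distribution functions. Below is a minimal numpy sketch of that statistic only; the function name ks_statistic and the toy samples are assumptions for illustration, and dropping nan values first is a choice made here. If SciPy is available, scipy.stats.ks_2samp computes the same statistic and also reports a p-value.

```python
import numpy

def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = numpy.sort(numpy.asarray(sample_a, dtype=float))
    b = numpy.sort(numpy.asarray(sample_b, dtype=float))
    # Drop undefined values so they do not distort the CDFs.
    a = a[~numpy.isnan(a)]
    b = b[~numpy.isnan(b)]
    # The gap between two step functions is largest at one of the observed points.
    grid = numpy.concatenate([a, b])
    cdf_a = numpy.searchsorted(a, grid, side="right") / a.size
    cdf_b = numpy.searchsorted(b, grid, side="right") / b.size
    return float(numpy.max(numpy.abs(cdf_a - cdf_b)))

# Hypothetical toy samples, not real Ci values.
d_separated = ks_statistic([0.1, 0.2, 0.3], [0.6, 0.7, 0.8])  # no overlap at all
d_identical = ks_statistic([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # same sample twice
d_interleaved = ks_statistic([0.1, 0.4], [0.2, 0.3])
```

Applied to this notebook, the call would be ks_statistic(ci_values_date_hubs, ci_values_party_hubs). A statistic near 0 means the two distributions of Ci values are hard to tell apart, while a value near 1 means they barely overlap.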