In this class session we are going to plot the degree distribution of the undirected human
protein-protein interaction network (PPI), without using igraph
. We'll obtain the interaction data from the Pathway Commons SIF file (in the shared/
folder) and we'll
manually compute the degree of each vertex (protein) in the network. We'll then compute
the count N(k)
of vertices that have a given vertex degree k
, for all k
values.
Finally, we'll plot the degree distribution and discuss whether it is consistent with the
results obtained in the Jeong et al. article for the yeast PPI.
We'll start by loading all of the Python modules that we will need for this notebook. Because we'll be calling a bunch of functions from numpy
and matplotlib.pyplot
, we'll alias them as np
and plt
, respectively.
In [3]:
import pandas
import collections
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
Step 1: load in the SIF file as a pandas
data frame using pandas.read_csv
. Make sure the column names of your data frame are species1
, interaction_type
, and species2
. Save the data frame as the object sif_data
.
In [4]:
Step 2: restrict the interactions to protein-protein undirected ("in-complex-with", "interacts-with"). The restricted data frame should be called interac_ppi
.
In [5]:
Step 3: for each interaction, reorder species1
and species2
(if necessary) so that
species1 < species2
(in terms of the species names, in lexicographic order). Since iterating is reasonably fast in Python, you can do this using a for
loop through all of the rows of the data frame, swapping species1
and species2
entries as needed (and in-place in the data frame) so that in the resulting data frame interac_ppi
satisfies species1 < species2
for all rows.
In [6]:
Step 4: Restrict the data frame to only the columns species1
and species2
. Use the drop_duplicates
method to subset the rows of the resulting two-column data frame to only unique rows. Assign the resulting data frame object to have the name interac_ppi_unique
. This is basically selecting only unique pairs of proteins, regardless of interaction type.
In [7]:
Get a list of all proteins. We'll need this in the next step:
Step 5: compute the degree of each vertex (though we will not associate the vertex degrees with vertex names here, since for this exercise we only need the vector of vertex degree values, not the associated vertex IDs). You'll want to create an object called vertex_degrees_ctr
which is of class collections.Counter
. You'll want to name the final list of vertex degrees, vertex_degrees
.
In [8]:
vertex_degrees_ctr = collections.Counter()
allproteins = interac_ppi_unique["species1"].tolist() + interac_ppi_unique["species2"].tolist()
for proteinname in allproteins:
## add the code HERE to increment the counter for protein [proteinname]
vertex_degrees = list(vertex_degrees_ctr.values())
Let's print out the vertex degrees of the first 10 vertices, in whatever the key order is. Pythonistas -- anyone know of a less convoluted way to do this?
In [27]:
dict(list(dict(vertex_degrees_ctr).items())[0:9])
Out[27]:
Let's print out the first ten entries of the vertex_degrees
list. Note that we don't expect it to be in the same order as the output from the previous command above, since dict
changes the order in the above.
In [38]:
vertex_degrees[0:9]
Out[38]:
Step 6: Calculate the histogram of N(k) vs. k, using 30 bins, using plt.hist
. You'll probably want to start by making a numpy.array
from your vertex_degrees
. Call the resulting object from plt.hist
, hist_res
. Obtain a numpy array of the bin counts as element zero from hist_res
(name this object hist_counts
) and obtain a numpy array of the bin centers (which are k
values) as element one from hist_res
(name this object hist_breaks
). Finally, you want the k
values of the centers of the bins, not the breakpoint values. So you'll have to do some arithmetic to go from the 31 k
values of the bin breakpoints, to a numpy array of the 30 k
values of the centers of the bins. You should call that object kvals
.
In [50]:
Let's print the k values of the bin centers:
In [51]:
kvals
Out[51]:
Let's print the histogram bin counts:
In [52]:
hist_counts
Out[52]:
Step 7: Plot N(k)
vs. k
, on log-log scale (using only the first 14 points, which is plenty sufficient to see the approximatey scale-free degree distribution and where it becomes exponentially suppressed at high k
. For this you'll use plt.loglog
. You'll probably want to adjust the x-axis limits using plt.gca().set_xlim()
. To see the plot, you'll have to do plt.show()
.
In [55]:
Step 8: Do a linear fit to the log10(N(k)) vs. log10(k) data (just over the range in which the relationship appears to be linear, which is the first four points). You'll want to use scipy.stats.linregress
to do the linear regression. Don't forget to log10-transform the data using np.log10
.
In [56]:
Out[56]:
Slope is -1.87 with SE 0.084, i.e., gamma = 1.87 with a 95% CI of about +/- 0.17.
Now let's compute the slope for the degree distribution Fig. 1b in the Jeong et al. article, for the yeast PPI. The change in ordinate over the linear range is about -6.5 in units of natural logarithm. The change in abscissa over the linear range is approximately log(45)-log(2), so we can compute the Jeong et al. slope thus:
In [58]:
jeong_slope = -6.5/(np.log(45)-np.log(2))
print("%.2f" % jeong_slope)
How close was your slope from the human PPI, to the slope for the yeast PPI from the Jeong et al. article?