Analyzing the Pathway Commons 2 (PC2) database SIF file

CS446/546 class session 2

Goal: count the number of different types of biological interactions in PC2

Approach: retrieve the compressed, tab-delimited "edge-list" file and tabulate the "interaction" column

Information you will need:

The URL is: http://www.pathwaycommons.org/archives/PC2/v9/PathwayCommons9.All.hgnc.sif.gz
You'll be using the Python modules gzip, timeit, pandas, urllib.request, collections and operator

Other stuff you should do:

Print the first six lines of the uncompressed data file
Use a timer to time how long your program takes
Count how many rows there are in the data file
Estimate the number of proteins in the database; we'll define them operationally as strings in column 1 or column 3, for which the content of column 2 is one of these interactions: 'interacts-with', 'in-complex-with', 'neighbor-of'
Count the total number of unique pairs of interacting molecules (ignoring interaction type)
Count the number of rows for each type of interaction in the database
Pythonistas: do it using Pandas and without using Pandas

Step-by-step instructions for Python3:

Open a file object representing a stream of the remote, compressed data file, using urlopen
Open a file object representing a stream of the uncompressed data file, using gzip.GzipFile
Start the timer
Read one line at a time, until the end of the file
Split line on "\t" and pull out the tuple of species1, interaction_type, species2 from the line of text
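
The steps above can be tried out without touching the network. The following is a minimal sketch, assuming a tiny in-memory sample (three rows copied from the output shown later in this notebook) in place of the remote file; `io.BytesIO` stands in for the object returned by `urlopen`, and the names `sample_rows` and `parsed` are illustrative only.

```python
import gzip
import io

# Hypothetical stand-in for the remote file: three SIF rows, gzip-compressed in memory.
sample_rows = (
    b"A1BG\tcontrols-expression-of\tA2M\n"
    b"A1BG\tinteracts-with\tABCC6\n"
    b"A1BG\tinteracts-with\tACE2\n"
)
zfd = io.BytesIO(gzip.compress(sample_rows))  # plays the role of urlopen(...)

# Wrap the compressed stream so that iteration yields uncompressed lines (as bytes).
fd = gzip.GzipFile(fileobj=zfd, mode="r")

parsed = []
for line in fd:
    # Decode, drop the trailing newline, and split the three tab-delimited fields.
    species1, interaction_type, species2 = line.decode("utf-8").rstrip("\n").split("\t")
    parsed.append((species1, interaction_type, species2))

print(parsed[0])
```

Because `GzipFile` only needs a file-like object, the same loop should work unchanged when `zfd` comes from `urlopen`.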



In [2]:

    
from collections import defaultdict
from urllib.request import urlopen
import gzip
import timeit

baseURL = "http://www.pathwaycommons.org/archives/PC2/v9/"
filename = "PathwayCommons9.All.hgnc.sif.gz"
outFilePath = "pc.sif"
# interaction types that we treat as protein-protein interactions
interaction_types_ppi = {"interacts-with", "in-complex-with", "neighbor-of"}

start_time = timeit.default_timer()

# stream the remote file and decompress it on the fly
zfd = urlopen(baseURL + filename)
fd = gzip.GzipFile(fileobj=zfd, mode="r")

intctr = 0    # number of protein-protein interaction rows
linectr = 0   # number of rows read so far

interactions = set()           # unique protein pairs, stored order-independently
proteins = set()               # unique protein names seen in protein-protein rows
intnamectr = defaultdict(int)  # row count for each interaction type

for line in fd:
    # echo the first six raw lines (still bytes at this point)
    if linectr < 6:
        print(line)

    linectr += 1

    prot1, interaction_type, prot2 = line.decode("utf-8").rstrip("\n").split("\t")
    intnamectr[interaction_type] += 1
    if interaction_type in interaction_types_ppi:
        intctr += 1
        proteins.update((prot1, prot2))
        # put the smaller name first so that A-B and B-A are counted once
        interactions.add(min(prot1, prot2) + "-" + max(prot1, prot2))

elapsed = timeit.default_timer() - start_time









    



b'A1BG\tcontrols-expression-of\tA2M\n'
b'A1BG\tinteracts-with\tABCC6\n'
b'A1BG\tinteracts-with\tACE2\n'
b'A1BG\tinteracts-with\tADAM10\n'
b'A1BG\tinteracts-with\tADAM17\n'
b'A1BG\tinteracts-with\tADAM9\n'

How long did your program take to run (in seconds)?



In [5]:

    
print(elapsed)









    



5.3486454999947455
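
The timing pattern used here is just two calls to `timeit.default_timer`, which returns a clock reading in seconds as a float. A minimal sketch, with a placeholder workload that is an assumption for illustration and not part of the exercise:

```python
import timeit

start_time = timeit.default_timer()

# Placeholder workload (an assumption for illustration): sum the first million integers.
total = sum(range(1_000_000))

# The difference between two readings is the elapsed wall-clock time in seconds.
elapsed = timeit.default_timer() - start_time
print(f"{elapsed:.3f} seconds")
```

For this notebook, most of the measured time is likely to be the download itself, so the number will vary with the network connection.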

How many protein-protein interactions are there in the data file?



In [10]:

    
print(intctr)

How many unique protein names are there in the data file?



In [11]:

    
len(proteins)









    Out[11]:





17020

How many unique pairs of interacting proteins are there (regardless of interaction type name)?



In [12]:

    
len(interactions)









    Out[12]:





491784
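
Gene symbols can themselves contain hyphens (HLA-A, for example), so a key built by joining two names with "-" could in principle be ambiguous. One possible alternative is to store each pair as a sorted tuple. This is only a sketch on made-up rows, with illustrative names `rows` and `pairs`; it has not been run against the full file, so whether it changes the count above is not known.

```python
# Made-up rows for illustration; the second row repeats the first pair in reverse order.
rows = [
    ("A1BG", "interacts-with", "ABCC6"),
    ("ABCC6", "in-complex-with", "A1BG"),
    ("A1BG", "interacts-with", "ACE2"),
]

pairs = set()
for prot1, _interaction_type, prot2 in rows:
    # A sorted tuple is order-independent and needs no separator character.
    pairs.add(tuple(sorted((prot1, prot2))))

print(len(pairs))
```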

How many interactions of each type are there in PC2?



In [13]:

    
from operator import itemgetter
sorted(intnamectr.items(), key=itemgetter(1), reverse=True)









    Out[13]:





[('interacts-with', 369895),
 ('in-complex-with', 153603),
 ('chemical-affects', 135268),
 ('catalysis-precedes', 120948),
 ('controls-expression-of', 110013),
 ('controls-state-change-of', 106156),
 ('controls-production-of', 18482),
 ('consumption-controlled-by', 16816),
 ('controls-phosphorylation-of', 15636),
 ('used-to-produce', 13705),
 ('controls-transport-of', 6960),
 ('reacts-with', 3607),
 ('controls-transport-of-chemical', 2847)]
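
As an aside, the `collections` module also offers `Counter`, whose `most_common` method does the counting and the descending sort in one step. A minimal sketch on a made-up list of interaction types (the list and the names `type_counts` and `ranked` are illustrative, not taken from the data file):

```python
from collections import Counter

# Made-up interaction-type column for illustration.
interaction_types = [
    "interacts-with",
    "in-complex-with",
    "interacts-with",
    "chemical-affects",
    "interacts-with",
    "in-complex-with",
]

type_counts = Counter(interaction_types)

# most_common() returns (value, count) pairs ordered from most to least frequent.
ranked = type_counts.most_common()
print(ranked)
```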

Pythonistas: do it again, using Pandas:

Read from the uncompressed data stream and parse it into a data frame, using pandas.read_csv



In [7]:

    
import pandas
zfd = urlopen(baseURL + filename)
fd = gzip.GzipFile(fileobj=zfd, mode="r")
df = pandas.read_csv(fd, sep="\t", names=["species1","interaction_type","species2"])
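
The same call can be checked offline. The sketch below assumes a three-row in-memory sample in place of the download, with `io.BytesIO` standing in for the `urlopen` result; `sample_rows` and `sample_df` are illustrative names.

```python
import gzip
import io

import pandas

# Hypothetical stand-in for the remote file: three SIF rows, gzip-compressed in memory.
sample_rows = (
    b"A1BG\tcontrols-expression-of\tA2M\n"
    b"A1BG\tinteracts-with\tABCC6\n"
    b"A1BG\tinteracts-with\tACE2\n"
)
fd = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(sample_rows)), mode="r")

# The SIF file has no header row, so the column names are supplied through `names`.
sample_df = pandas.read_csv(
    fd, sep="\t", names=["species1", "interaction_type", "species2"]
)
print(sample_df.shape)
```

Passing `names` matters here: without it, `read_csv` would treat the first data row as the header.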

Use the head method on the data frame to print out the first lines (head returns five rows by default; pass 6, as in df.head(6), to see six)



In [8]:

    
print(df.head())









    



  species1        interaction_type species2
0     A1BG  controls-expression-of      A2M
1     A1BG          interacts-with    ABCC6
2     A1BG          interacts-with     ACE2
3     A1BG          interacts-with   ADAM10
4     A1BG          interacts-with   ADAM17

Print the unique types of interactions in the data frame, using the unique method:



In [9]:

    
df.interaction_type.unique()









    Out[9]:





array(['controls-expression-of', 'interacts-with',
       'controls-phosphorylation-of', 'controls-state-change-of',
       'in-complex-with', 'controls-production-of', 'catalysis-precedes',
       'controls-transport-of', 'controls-transport-of-chemical',
       'chemical-affects', 'consumption-controlled-by', 'reacts-with',
       'used-to-produce'], dtype=object)

Subset the data frame by interaction type (using the isin method) to include only the protein-protein interactions, then count the matching rows



In [63]:

    
ppirows = df.interaction_type.isin(interaction_types_ppi)
sum(ppirows)









    Out[63]:





523498
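
To see what the boolean mask does, here is a minimal sketch on a made-up four-row data frame (`toy_df`, `mask`, and `n_ppi` are illustrative names). The mask holds one True/False per row, so summing it counts the matching rows, and indexing with it keeps only those rows.

```python
import pandas

interaction_types_ppi = {"interacts-with", "in-complex-with", "neighbor-of"}

# Made-up rows for illustration.
toy_df = pandas.DataFrame(
    {
        "species1": ["A1BG", "A1BG", "A1BG", "A2M"],
        "interaction_type": [
            "controls-expression-of",
            "interacts-with",
            "in-complex-with",
            "chemical-affects",
        ],
        "species2": ["A2M", "ABCC6", "ACE2", "ADAM10"],
    }
)

# True for rows whose interaction type is in the protein-protein set.
mask = toy_df["interaction_type"].isin(interaction_types_ppi)

n_ppi = int(mask.sum())   # True counts as 1, False as 0
ppi_df = toy_df[mask]     # keep only the matching rows
print(n_ppi)
```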

Make a list of all proteins that occur in a protein-protein interaction, and count the unique protein names by putting them in a set and calling len on the set



In [64]:

    
newlist = df["species1"][ppirows].tolist() + df["species2"][ppirows].tolist()
len(set(newlist))









    Out[64]:





17020
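
An alternative that stays inside Pandas is to stack the two name columns with `pandas.concat` and call `nunique`. This is a sketch on made-up rows, with illustrative names `toy_df`, `names`, and `n_proteins`; it is meant to give the same kind of count as the set-based approach, not to replace it.

```python
import pandas

interaction_types_ppi = {"interacts-with", "in-complex-with", "neighbor-of"}

# Made-up rows for illustration.
toy_df = pandas.DataFrame(
    {
        "species1": ["A1BG", "A1BG", "ABCC6", "A2M"],
        "interaction_type": [
            "interacts-with",
            "in-complex-with",
            "interacts-with",
            "chemical-affects",
        ],
        "species2": ["ABCC6", "ACE2", "ACE2", "ADAM10"],
    }
)
mask = toy_df["interaction_type"].isin(interaction_types_ppi)

# Stack both name columns of the protein-protein rows into one Series.
names = pandas.concat([toy_df.loc[mask, "species1"], toy_df.loc[mask, "species2"]])

n_proteins = names.nunique()  # number of distinct names
print(n_proteins)
```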

Count unique protein-protein interaction pairs (specific type of interaction irrelevant), again using set and len



In [68]:

    
len(set(df["species1"][ppirows] + "-" + df["species2"][ppirows]))









    Out[68]:





491784
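
Note that this expression joins the two names in file order, so A-B and B-A would be counted as different pairs. The count matches the non-Pandas result (491784), which suggests that no pair appears in both orientations in this file. If that could not be assumed, one possible order-independent version is sketched below on made-up rows (`toy`, `first`, `second`, and `unique_pairs` are illustrative names).

```python
import pandas

# Made-up rows for illustration; the second row repeats the first pair in reverse order.
toy = pandas.DataFrame(
    {
        "species1": ["A1BG", "ABCC6", "A1BG"],
        "species2": ["ABCC6", "A1BG", "ACE2"],
    }
)

in_order = toy["species1"] <= toy["species2"]

# Series.where keeps values where the condition is True and takes `other` elsewhere,
# so `first` is the alphabetically smaller name and `second` the larger one.
first = toy["species1"].where(in_order, toy["species2"])
second = toy["species2"].where(in_order, toy["species1"])

unique_pairs = pandas.DataFrame({"first": first, "second": second}).drop_duplicates()
print(len(unique_pairs))
```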

Count each type of interaction in the database, by subsetting to the interaction column and using value_counts



In [70]:

    
df["interaction_type"].value_counts()









    Out[70]:





interacts-with                    369895
in-complex-with                   153603
chemical-affects                  135268
catalysis-precedes                120948
controls-expression-of            110013
controls-state-change-of          106156
controls-production-of             18482
consumption-controlled-by          16816
controls-phosphorylation-of        15636
used-to-produce                    13705
controls-transport-of               6960
reacts-with                         3607
controls-transport-of-chemical      2847
Name: interaction_type, dtype: int64