Analyzing the Pathway Commons 2 (PC2) database SIF file

CS446/546 class session 2

Goal: count the number of different types of biological interactions in PC2

Approach: retrieve the compressed, tab-delimited "edge-list" file and tabulate the "interaction" column

Information you will need:

Other stuff you should do:

  • Print the first six lines of the uncompressed data file
  • Use a timer to time how long your program takes
  • Count how many rows there are in the data file
  • Estimate the number of proteins in the database; we'll define them operationally as strings in column 1 or column 3, for which the content of column 2 is one of these interactions: 'interacts-with', 'in-complex-with', 'neighbor-of'
  • Count the total number of unique pairs of interacting molecules (ignoring interaction type)
  • Count the number of rows for each type of interaction in the database
  • Pythonistas: do it using Pandas and without using Pandas

Step-by-step instructions for Python3:

  • Open a file object representing a stream of the remote, compressed data file, using urlopen
  • Open a file object representing a stream of the uncompressed data file, using gzip.GzipFile
  • Start the timer
  • Read one line at a time, until the end of the file
  • Split line on "\t" and pull out the tuple of species1, interaction_type, species2 from the line of text
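
The steps above can be rehearsed offline before touching the network. The following is a minimal sketch, assuming a tiny in-memory gzip payload stands in for the remote file; `sample_rows` and `parse_sif_line` are hypothetical names introduced only for illustration.

```python
import gzip
import io


def parse_sif_line(raw_line):
    """Decode one raw SIF line (bytes) into (species1, interaction_type, species2)."""
    species1, interaction_type, species2 = (
        raw_line.decode("utf-8").rstrip("\r\n").split("\t")
    )
    return species1, interaction_type, species2


# Hypothetical stand-in for the remote file: three SIF rows, gzip-compressed in memory.
sample_rows = (
    b"A1BG\tcontrols-expression-of\tA2M\n"
    b"A1BG\tinteracts-with\tABCC6\n"
    b"A1BG\tinteracts-with\tACE2\n"
)
compressed_stream = io.BytesIO(gzip.compress(sample_rows))

# Same layering as with urlopen: a compressed byte stream wrapped by GzipFile.
parsed = []
with gzip.GzipFile(fileobj=compressed_stream, mode="r") as fd:
    for line in fd:
        parsed.append(parse_sif_line(line))

print(parsed[0])
```

Because GzipFile only needs a file-like object, the same loop should work unchanged when `compressed_stream` is replaced by the object returned from urlopen.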

In [2]:
from collections import defaultdict
from urllib.request import urlopen
import gzip
import timeit

baseURL = "http://www.pathwaycommons.org/archives/PC2/v9/"
filename = "PathwayCommons9.All.hgnc.sif.gz"
outFilePath = "pc.sif"
# interaction types that operationally define a protein-protein interaction
interaction_types_ppi = {"interacts-with", "in-complex-with", "neighbor-of"}

start_time = timeit.default_timer()

# stream the remote compressed file and decompress it on the fly
zfd = urlopen(baseURL + filename)
fd = gzip.GzipFile(fileobj=zfd, mode="r")

intctr = 0    # number of rows that are protein-protein interactions
linectr = 0   # number of rows read so far

interactions = set()            # unique unordered protein pairs
proteins = set()                # unique protein names
intnamectr = defaultdict(int)   # row count per interaction type

for line in fd:
    # echo the first six raw lines (bytes, hence the b'...' in the output)
    if linectr < 6:
        print(line)

    linectr += 1

    prot1, interaction_type, prot2 = line.decode("utf-8").rstrip("\n").split("\t")
    intnamectr[interaction_type] += 1
    if interaction_type in interaction_types_ppi:
        intctr += 1
        proteins |= {prot1, prot2}
        # min/max makes the key independent of the order of the two names
        interactions.add(min(prot1, prot2) + "-" + max(prot1, prot2))

elapsed = timeit.default_timer() - start_time


b'A1BG\tcontrols-expression-of\tA2M\n'
b'A1BG\tinteracts-with\tABCC6\n'
b'A1BG\tinteracts-with\tACE2\n'
b'A1BG\tinteracts-with\tADAM10\n'
b'A1BG\tinteracts-with\tADAM17\n'
b'A1BG\tinteracts-with\tADAM9\n'
5.3486454999947455

How long did your program take to run?


In [5]:
print(elapsed)


5.3486454999947455
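
The elapsed time reported above covers network transfer, decompression and parsing together, because the timer starts before urlopen and the file is streamed. When several stages need to be timed separately, a small helper keeps the bookkeeping in one place. A minimal sketch, assuming a hypothetical `stage_timer` context manager (not part of the original notebook) built on timeit.default_timer:

```python
import timeit
from contextlib import contextmanager

timings = {}


@contextmanager
def stage_timer(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = timeit.default_timer()
    try:
        yield
    finally:
        timings[label] = timeit.default_timer() - start


# Illustrative workload only; in the notebook this would wrap the read loop.
with stage_timer("sum of squares"):
    total = sum(i * i for i in range(100000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```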

How many protein-protein interactions are there in the data file?


In [10]:
print(intctr)


523498

How many unique protein names are there in the data file?


In [11]:
len(proteins)


Out[11]:
17020

How many unique pairs of proteins (regardless of interaction type name) are there that interact?


In [12]:
len(interactions)


Out[12]:
491784
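
Joining two names with "-" gives a safe key only as long as no name contains a hyphen itself, and some HGNC symbols do (HLA-A, for example). A sketch of an alternative that keeps each pair as a sorted tuple, so no delimiter is needed; `pair_key` is a hypothetical helper, and the names below are made up so that the hyphen-joined keys collide:

```python
def pair_key(name_a, name_b):
    """Return an order-independent key for an unordered pair of names."""
    return tuple(sorted((name_a, name_b)))


def joined_key(name_a, name_b):
    """The hyphen-joined string key, kept here for comparison."""
    return min(name_a, name_b) + "-" + max(name_a, name_b)


# Made-up names: two distinct pairs, one of them listed twice in both orders.
pairs = [("X-Y", "Z"), ("X", "Y-Z"), ("Z", "X-Y")]

joined_keys = {joined_key(a, b) for a, b in pairs}
tuple_keys = {pair_key(a, b) for a, b in pairs}

print(len(joined_keys), len(tuple_keys))  # prints: 1 2
```

Whether this changes the count for PC2 depends on the data; the sketch only shows that tuple keys cannot collide in this way.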

How many interactions are there of each type, in PC2?


In [13]:
from operator import itemgetter
sorted(intnamectr.items(), key=itemgetter(1), reverse=True)


Out[13]:
[('interacts-with', 369895),
 ('in-complex-with', 153603),
 ('chemical-affects', 135268),
 ('catalysis-precedes', 120948),
 ('controls-expression-of', 110013),
 ('controls-state-change-of', 106156),
 ('controls-production-of', 18482),
 ('consumption-controlled-by', 16816),
 ('controls-phosphorylation-of', 15636),
 ('used-to-produce', 13705),
 ('controls-transport-of', 6960),
 ('reacts-with', 3607),
 ('controls-transport-of-chemical', 2847)]
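
The same tally can be written with collections.Counter, whose most_common method returns the counts already sorted in descending order. A minimal sketch on three rows taken from the first lines of the file shown earlier (the variable names are illustrative):

```python
from collections import Counter

sample_lines = [
    b"A1BG\tcontrols-expression-of\tA2M\n",
    b"A1BG\tinteracts-with\tABCC6\n",
    b"A1BG\tinteracts-with\tACE2\n",
]

type_counts = Counter()
for line in sample_lines:
    # Column 2 of each tab-delimited row holds the interaction type.
    _, interaction_type, _ = line.decode("utf-8").rstrip("\n").split("\t")
    type_counts[interaction_type] += 1

print(type_counts.most_common())
```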

Pythonistas: do it again, using Pandas:

Read from the uncompressed data stream and parse it into a data frame, using pandas.read_csv


In [7]:
import pandas
zfd = urlopen(baseURL + filename)
fd = gzip.GzipFile(fileobj=zfd, mode="r")
df = pandas.read_csv(fd, sep="\t", names=["species1","interaction_type","species2"])
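
pandas.read_csv can also decompress gzip input itself through its compression parameter, which avoids wrapping the stream in gzip.GzipFile by hand. A minimal sketch, assuming a small local file written to a temporary directory in place of the real download (the file name and its three rows are illustrative):

```python
import gzip
import os
import tempfile

import pandas

sample_rows = (
    b"A1BG\tcontrols-expression-of\tA2M\n"
    b"A1BG\tinteracts-with\tABCC6\n"
    b"A1BG\tinteracts-with\tACE2\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical local copy standing in for the downloaded .sif.gz file.
    sample_path = os.path.join(tmpdir, "sample.sif.gz")
    with gzip.open(sample_path, "wb") as out:
        out.write(sample_rows)

    # compression="gzip" tells read_csv to decompress while reading.
    sample_df = pandas.read_csv(
        sample_path,
        sep="\t",
        names=["species1", "interaction_type", "species2"],
        compression="gzip",
    )

print(sample_df.shape)
```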

Use the head method on the data frame to print the first lines (head() returns five rows by default; call head(6) to see six)


In [8]:
print(df.head())


  species1        interaction_type species2
0     A1BG  controls-expression-of      A2M
1     A1BG          interacts-with    ABCC6
2     A1BG          interacts-with     ACE2
3     A1BG          interacts-with   ADAM10
4     A1BG          interacts-with   ADAM17

Print the unique types of interactions in the data frame, using the unique method:


In [9]:
df.interaction_type.unique()


Out[9]:
array(['controls-expression-of', 'interacts-with',
       'controls-phosphorylation-of', 'controls-state-change-of',
       'in-complex-with', 'controls-production-of', 'catalysis-precedes',
       'controls-transport-of', 'controls-transport-of-chemical',
       'chemical-affects', 'consumption-controlled-by', 'reacts-with',
       'used-to-produce'], dtype=object)

Subset the data frame by interaction type (using the isin method) to include only the protein-protein interactions, then count the rows


In [63]:
ppirows = df.interaction_type.isin(interaction_types_ppi)
sum(ppirows)


Out[63]:
523498

Make a list of all proteins that occur in a protein-protein interaction, and count the unique protein names by putting them in a set and calling len on the set


In [64]:
newlist = df["species1"][ppirows].tolist() + df["species2"][ppirows].tolist()
len(set(newlist))


Out[64]:
17020

Count unique protein-protein interaction pairs (specific type of interaction irrelevant), again using set and len


In [68]:
len(set(df["species1"][ppirows] + "-" + df["species2"][ppirows]))


Out[68]:
491784
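
Note that concatenating species1 and species2 in file order treats A-B and B-A as different pairs, unlike the min/max key used in the plain Python version. The two counts agree here, which suggests the file lists each of these pairs in one order only, but that is an observation about this data set rather than a guarantee. A sketch of an order-independent count, assuming a small made-up frame (`toy` and the protein names P1, P2, P3 are illustrative):

```python
import pandas

toy = pandas.DataFrame(
    {
        "species1": ["P1", "P2", "P1"],
        "interaction_type": ["interacts-with", "in-complex-with", "interacts-with"],
        "species2": ["P2", "P1", "P3"],
    }
)

# Naive count: "P1-P2" and "P2-P1" are treated as different pairs.
naive_pairs = set(toy["species1"] + "-" + toy["species2"])

# Order-independent count: sort the two names within each row first.
ordered_pairs = {
    tuple(sorted(pair))
    for pair in zip(toy["species1"], toy["species2"])
}

print(len(naive_pairs), len(ordered_pairs))  # prints: 3 2
```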

Count each type of interaction in the database, by subsetting to the interaction column and using value_counts


In [70]:
df["interaction_type"].value_counts()


Out[70]:
interacts-with                    369895
in-complex-with                   153603
chemical-affects                  135268
catalysis-precedes                120948
controls-expression-of            110013
controls-state-change-of          106156
controls-production-of             18482
consumption-controlled-by          16816
controls-phosphorylation-of        15636
used-to-produce                    13705
controls-transport-of               6960
reacts-with                         3607
controls-transport-of-chemical      2847
Name: interaction_type, dtype: int64