The HIPPIE database integrates multiple experimental PPI datasets and, for a given protein interaction, produces a confidence value that the interaction exists. One concern with using it here is that the gold standard dataset currently chosen for this project is itself integrated into HIPPIE. As a result, the classifier could rely on this feature alone to predict the training or test set perfectly.
In this case the solution is simply to threshold the HIPPIE dataset at a high confidence value and use it as the gold standard dataset instead. This might be a good idea regardless of the results of the test.
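To illustrate, a thresholding step along these lines might look as follows. This is only a sketch: the 0.73 cutoff is an assumption, as is the column layout (Entrez IDs in columns 1 and 3, confidence score in column 4, matching how the HIPPIE output is parsed later in this notebook).

```python
import csv

def threshold_hippie(path, cutoff=0.73):
    # keep only interaction pairs whose confidence meets the cutoff;
    # assumes tab-separated rows with Entrez IDs in columns 1 and 3
    # and the confidence score in column 4
    gold = set()
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if float(row[4]) >= cutoff:
                gold.add(frozenset([row[1], row[3]]))
    return gold
```

Storing each pair as a frozenset makes the lookup order-independent, which matters because interaction pairs have no natural ordering.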
The aim of this notebook is to obtain HIPPIE confidence scores for the gold standard training pairs and for all possible protein pairs from the pulldown experiments.
The complete list of protein pairs is spread among a few different files, which need to be reformatted before they can be entered into HIPPIE's web service. First, we strip the zero and one labels from the two gold standard training files:
training.negative.Entrez.txt
training.positive.Entrez.txt
The new files will be called:
training.nolabel.negative.Entrez.txt
training.nolabel.positive.Entrez.txt
In [1]:
cd /home/gavin/Documents/MRes/DIP/human/
In [2]:
ls
In [6]:
import csv
In [8]:
rc = csv.reader(open("training.negative.Entrez.txt"), delimiter="\t")
wc = csv.writer(open("training.nolabel.negative.Entrez.txt", "w"), delimiter="\t")
for line in rc:
    # keep only the two protein IDs, dropping the label
    wc.writerow((line[0], line[1]))
In [10]:
rc = csv.reader(open("training.positive.Entrez.txt"), delimiter="\t")
wc = csv.writer(open("training.nolabel.positive.Entrez.txt", "w"), delimiter="\t")
for line in rc:
    # keep only the two protein IDs, dropping the label
    wc.writerow((line[0], line[1]))
In [16]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/BAITS/
In [17]:
ls
In [18]:
# flatten is assumed to come from IPython's pylab environment
# (matplotlib.cbook.flatten); itertools.chain.from_iterable would also work
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"))))
In [19]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/PREYS/
In [20]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"))))
In [21]:
#combine these lists
pulldownids = baitids + preyids
At this point we have a list pulldownids of Entrez IDs for all proteins found in the pulldown experiments. We can now use that list to build a set of all possible combinations using itertools:
In [22]:
import itertools
In [23]:
#initialise list
pulldowncomb = []
#iterate over all possible combinations, adding to the list
for pair in itertools.combinations(pulldownids, 2):
    pulldowncomb.append(frozenset(pair))
#convert list to set
pulldowncomb = set(pulldowncomb)
In [24]:
print "Number of combinations of pulldown protein IDs: %i"%(len(pulldowncomb))
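One caveat worth noting: if the same Entrez ID appears in both the bait and prey lists, itertools.combinations over the raw list can produce repeated pairs and even degenerate single-element frozensets (a pair of identical IDs collapses to one element). Deduplicating the IDs first avoids this; the following sanity-check helper is a hypothetical sketch, not part of the notebook's pipeline.

```python
import itertools

def count_unique_pairs(ids):
    # deduplicate IDs first so identical bait/prey entries cannot
    # produce repeated or degenerate (single-element) frozensets
    unique = set(ids)
    pairs = set(frozenset(p) for p in itertools.combinations(unique, 2))
    # with n unique IDs there are exactly n*(n-1)/2 unordered pairs
    assert len(pairs) == len(unique) * (len(unique) - 1) // 2
    return len(pairs)
```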
Saving this set to a file named pulldown.combinations.Entrez.txt:
In [30]:
csv.writer(open("pulldown.combinations.Entrez.txt", "w"), delimiter="\t").writerows([list(pair) for pair in pulldowncomb])
In [32]:
cd /home/gavin/Documents/MRes/HIPPIE/
To retrieve the most recent version of the HIPPIE dataset, script and readme:
In [35]:
%%bash
wget -q http://cbdm.mdc-berlin.de/tools/hippie/hippie_current.txt
wget -q http://cbdm.mdc-berlin.de/tools/hippie/NC/HIPPIE_NC.jar
wget -q http://cbdm.mdc-berlin.de/tools/hippie/NC/README.txt
The quick start guide gives usage for taking an input file when the HIPPIE database is in the same directory and named hippie_current.txt:
java -jar HIPPIE_NC.jar -i=query.txt
We would also like to specify an output file, which is done using:
java -jar HIPPIE_NC.jar -i=query.txt -o=out.txt
Also, we should probably specify that the proteins will be given in Entrez format:
java -jar HIPPIE_NC.jar -i=query.txt -t=e -o=out.txt
There is also an option to restrict HIPPIE to the proteins that it is supplied, otherwise it searches for interactions between the proteins supplied and all the proteins it knows about. This is the layer option and must be set to zero:
java -jar HIPPIE_NC.jar -i=query.txt -l=0 -t=e -o=out.txt
The files which must be queried using this tool are given below:
training.nolabel.negative.Entrez.txt - negative training examples from the gold standard dataset.
training.nolabel.positive.Entrez.txt - positive training examples from the gold standard dataset.
pulldown.combinations.Entrez.txt - all possible combinations of proteins from the pulldown experiments.
Starting with the smallest file, which is training.nolabel.positive.Entrez.txt:
In [45]:
%%bash
java -jar HIPPIE_NC.jar -i=../DIP/human/training.nolabel.positive.Entrez.txt -t=e -l=0 -o=training.positive.HIPPIE.txt
Looking at the file and checking to see if all pairs were mapped:
In [49]:
%%bash
head training.positive.HIPPIE.txt
wc -l ../DIP/human/training.positive.Entrez.txt
wc -l training.positive.HIPPIE.txt
There are many more pairs after conversion because the HIPPIE script simply takes the proteins as a list and finds all the interactions it knows about between those proteins. As this includes not just the DIP dataset but also many others, there is a larger number of interactions available without setting a cutoff. To deal with this, we require a script to match the confidence values in the file produced with only the interacting pairs we care about.
In [50]:
cd /home/gavin/Documents/MRes/DIP/human/
In [51]:
#initialise csv reader
c = csv.reader(open("training.nolabel.positive.Entrez.txt"), delimiter="\t")
#make dictionary using frozensets as keys:
posids = {}
for line in c:
    line = frozenset(line)
    posids[line] = 1
In [52]:
cd /home/gavin/Documents/MRes/HIPPIE/
In [55]:
#initialise csv reader
c = csv.reader(open("training.positive.HIPPIE.txt"), delimiter="\t")
#make dictionary using frozensets as keys with the confidence scores as values
hippieids = {}
for line in c:
    k = frozenset([line[1], line[3]])
    hippieids[k] = line[4]
In [57]:
%%bash
head training.positive.HIPPIE.txt
wc -l ../DIP/human/training.positive.Entrez.txt
wc -l training.positive.HIPPIE.txt
Strangely, we have lost some pairs in the conversion. It could be that the HIPPIE database is not up to date with the pairs in the DIP database, or that the method HIPPIE used to map DIP protein identifiers to Entrez differs from ours.
Also odd is that some of these protein pairs have a confidence value of zero attached to them. At this point it is unclear why. These values may have been known to be missing from the HIPPIE database, so the HIPPIE script labelled them zero; alternatively, the script may assign no confidence to the evidence for these interactions from DIP.
In any case, the two databases clearly do not match up exactly. Whether to use HIPPIE rather than DIP as the gold standard will have to be discussed; either way, the code will be designed so that the gold standard database can easily be switched.
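The matching step can be sketched as a small function. This is a hypothetical illustration only: the column positions and the 0.0 default score for unmatched pairs are assumptions, and the actual script may differ in these details.

```python
def match_scores(hippie_rows, query_pairs):
    # build a lookup from unordered Entrez ID pairs to confidence scores;
    # assumes HIPPIE output rows carry Entrez IDs in columns 1 and 3
    # and the confidence score in column 4
    scores = {}
    for row in hippie_rows:
        scores[frozenset([row[1], row[3]])] = float(row[4])
    # pairs absent from the HIPPIE output get a default score of 0.0
    return [(pair, scores.get(frozenset(pair), 0.0)) for pair in query_pairs]
```

Using frozensets for the lookup keys means a pair matches regardless of which order the two IDs appear in either file.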
The above code was modified into a simple Python script called hippiematch.py to use on the remaining two files. Repeating the above for the final two files can be done as follows:
In [68]:
%%bash
python2 ../opencast-bio/scripts/hippiematch.py -h
In [73]:
%%bash
java -jar HIPPIE_NC.jar -i=../DIP/human/training.nolabel.negative.Entrez.txt -t=e -l=0 -o=training.negative.HIPPIE.txt
python2 ../opencast-bio/scripts/hippiematch.py training.negative.HIPPIE.txt ../DIP/human/training.nolabel.negative.Entrez.txt training.negative.HIPPIE.txt
In [74]:
%%bash
java -jar HIPPIE_NC.jar -i=../forGAVIN/pulldown_data/PREYS/pulldown.combinations.Entrez.txt -t=e -l=0 -o=pulldown.combinations.HIPPIE.txt
python2 ../opencast-bio/scripts/hippiematch.py pulldown.combinations.HIPPIE.txt ../forGAVIN/pulldown_data/PREYS/pulldown.combinations.Entrez.txt pulldown.combinations.HIPPIE.txt