Using the DIP training set created in [this notebook][dipnotes] alongside a HIPPIE feature is problematic, because the HIPPIE database includes DIP among the sources for its confidence values, so the feature would partly encode the training labels. The HIPPIE database itself therefore looks like a more useful source for a training set. Unfortunately, HIPPIE does not simply state whether a protein interaction exists; it gives a confidence value for each interaction. That leaves us with two options:

1. Approach the problem as regression, predicting HIPPIE confidence values for pairs that HIPPIE does not score.
2. Threshold the HIPPIE database, picking only high-confidence pairs as true interactions.

The first would fundamentally change the task we are attempting, while the second would throw away a large amount of useful data. There is a third way: bin the confidence values into categories and treat the problem as multiclass classification. One-vs-rest may be well suited here, as it reduces the problem to a set of binary classification tasks.
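As a minimal sketch of what categorising the confidence values could look like, the block below bins scores into classes and builds the binary labels for one one-vs-rest sub-problem. The bin edges and the function names `confidence_class` and `one_vs_rest_labels` are hypothetical; the notebook does not fix any of them.

```python
from bisect import bisect_right

# Hypothetical bin edges; these are an assumption, not values chosen in the notebook.
EDGES = [0.5, 0.8, 0.9]

def confidence_class(cval, edges=EDGES):
    """Return the index of the bin that cval falls into (0 = lowest confidence).

    A score equal to an edge is placed in the higher bin.
    """
    return bisect_right(edges, cval)

def one_vs_rest_labels(classes, positive_class):
    """Binary labels for one one-vs-rest sub-problem: 1 for positive_class, 0 otherwise."""
    return [1 if c == positive_class else 0 for c in classes]

# Made-up scores, one per bin.
scores = [0.35, 0.63, 0.85, 0.97]
classes = [confidence_class(s) for s in scores]
top_bin_labels = one_vs_rest_labels(classes, 3)
```

Each class yields one such label vector, so a binary classifier can be trained per class without changing the rest of the pipeline.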

Whether this is a viable strategy depends on the distribution of confidence values in the HIPPIE database. We can inspect this directly:



In [1]:

    
cd ../../HIPPIE/









    



/data/opencast/MRes/HIPPIE



In [2]:

    
ls









    



feature2.HIPPIE.db@   hippie_current.confidencevalues.txt@  HIPPIE_NC.jar@                     README.txt@                    training.positive.HIPPIE.db@
feature.HIPPIE.2.db@  hippie_current.pairs.txt@             prematch.positive.HIPPIE.txt@      testdb@                        training.positive.HIPPIE.txt@
feature.HIPPIE.db     hippie_current.txt@                   pulldown.combinations.HIPPIE.txt@  training.negative.HIPPIE.txt@



In [3]:

    
import csv



In [7]:

    
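# Collect confidence scores (column 5 of hippie_current.txt) above each threshold from 0.5 to 0.9.
# The lists are cumulative: a score above 0.9 is also stored under 50, 60, 70 and 80.
cvals = {}
with open("hippie_current.txt") as f:
    c = csv.reader(f,delimiter="\t")
    for l in c:
        cval = float(l[4])
        for thresh in range(50,100,10):
            if cval > thresh/100.0:
                cvals.setdefault(thresh,[]).append(cval)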



In [13]:

    
# hist and title are assumed to come from pylab (notebook started in pylab mode).
h=hist(cvals[50],bins=50)
t=title("Above 50% threshold.")
print("Number of samples above 50%: {0}".format(len(cvals[50])))









    



Number of samples above 50%: 155183



In [14]:

    
h=hist(cvals[90],bins=50)
t=title("Above 90% threshold.")
print("Number of samples above 90%: {0}".format(len(cvals[90])))









    



Number of samples above 90%: 2451



In [15]:

    
h=hist(cvals[80],bins=50)
t=title("Above 80% threshold.")
print("Number of samples above 80%: {0}".format(len(cvals[80])))









    



Number of samples above 80%: 16394

Extracting protein pairs involved

To test coverage of some features we require a file that simply lists protein pairs in HIPPIE by their Entrez identifier.



In [59]:

    
# Columns 2 and 4 (l[1], l[3]) are written out as the Entrez identifiers of each pair.
with open("hippie_current.txt") as fr, open("hippie_current.pairs.txt","w") as fw:
    cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
    for l in cr:
        cw.writerow([l[1],l[3]])



In [60]:

    
!head hippie_current.pairs.txt



In [61]:

    
!wc -l hippie_current.pairs.txt









    



169626 hippie_current.pairs.txt
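One way a pairs file like this might be used to measure feature coverage is sketched below. The helper names `load_pairs` and `coverage` are hypothetical, the identifiers are made up, and treating a pair as unordered is an assumption about how the features are keyed.

```python
import csv
import io

def load_pairs(handle):
    """Read tab-separated Entrez ID pairs into a set of unordered pairs."""
    return {frozenset(row[:2])
            for row in csv.reader(handle, delimiter="\t")
            if len(row) >= 2}

def coverage(pairs, feature_pairs):
    """Fraction of pairs for which the feature has a value."""
    if not pairs:
        return 0.0
    return len(pairs & feature_pairs) / float(len(pairs))

# Toy data with made-up identifiers, standing in for hippie_current.pairs.txt.
hippie = load_pairs(io.StringIO(u"1\t2\n3\t4\n5\t6\n7\t7\n"))
# Pairs a hypothetical feature has values for; order within a pair does not matter.
feature = {frozenset(["2", "1"]), frozenset(["5", "6"]), frozenset(["8", "9"])}
cov = coverage(hippie, feature)
```

With the real file, `load_pairs(open("hippie_current.pairs.txt"))` would replace the toy input.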

90% confidence value

Thresholding the data at a high confidence value is an acceptable way to use it to train a classifier. Alternative approaches that use the confidence values directly are possible, but would likely make the project too complicated to finish on time, especially given the time already invested in supervised binary classification.
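To put the cost of thresholding in numbers, the sketch below turns the counts reported earlier in this notebook into fractions of the whole database. It assumes that every line of hippie_current.pairs.txt is one interaction, so that the `wc -l` count is the total.

```python
# Counts reported earlier in this notebook.
total_pairs = 169626  # lines in hippie_current.pairs.txt (assumed one interaction per line)
above = {50: 155183, 80: 16394, 90: 2451}

# Fraction of HIPPIE interactions retained at each threshold.
retained = {t: n / float(total_pairs) for t, n in above.items()}

for t in sorted(retained):
    print("above {0}%: {1} pairs, {2:.1%} of HIPPIE".format(t, above[t], retained[t]))
```

Under that assumption, the 90% threshold keeps roughly 1.4% of the interactions and the 80% threshold just under 10%.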



In [16]:

    
# Keep only pairs whose confidence value (l[4]) is strictly above 0.9.
with open("hippie_current.txt") as fr, open("hippie_current.90.pairs.txt","w") as fw:
    cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
    for l in cr:
        if float(l[4]) > 0.9:
            cw.writerow([l[1],l[3]])



In [18]:

    
!head hippie_current.90.pairs.txt

80% confidence value

As above, but with a lower threshold that keeps a larger number of protein pairs, in case 90% turns out to be too stringent.



In [17]:

    
# Keep only pairs whose confidence value (l[4]) is strictly above 0.8.
with open("hippie_current.txt") as fr, open("hippie_current.80.pairs.txt","w") as fw:
    cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
    for l in cr:
        if float(l[4]) > 0.8:
            cw.writerow([l[1],l[3]])
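The 90% and 80% cells differ only in the cutoff and the output filename, so they could be folded into one function. The following is a sketch under that reading; the name `write_thresholded_pairs` is hypothetical, and the column positions (Entrez IDs in columns 2 and 4, confidence in column 5) are taken from the cells above.

```python
import csv

def write_thresholded_pairs(src, dst, threshold):
    """Write Entrez ID pairs whose confidence exceeds threshold; return the number written."""
    written = 0
    with open(src) as fr, open(dst, "w") as fw:
        writer = csv.writer(fw, delimiter="\t", lineterminator="\n")
        for row in csv.reader(fr, delimiter="\t"):
            # Strict inequality, matching the cells above.
            if float(row[4]) > threshold:
                writer.writerow([row[1], row[3]])
                written += 1
    return written
```

A call such as `write_thresholded_pairs("hippie_current.txt", "hippie_current.80.pairs.txt", 0.8)` would then reproduce the cell above, and the return value gives the number of pairs kept.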

Saving corresponding confidence values



In [64]:

    
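# Write one confidence value per line, in the same row order as hippie_current.pairs.txt.
# cvals is a dict keyed by threshold, so iterating over it would only write the keys 50 to 90;
# the values are re-read from the source file instead.
with open("hippie_current.txt") as fr, open("hippie_current.confidencevalues.txt","w") as fw:
    cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
    for l in cr:
        cw.writerow([l[4]])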