Using the DIP training set created in [this notebook][dipnotes] is problematic with a HIPPIE feature, because the HIPPIE database includes DIP in its confidence values. The HIPPIE database itself therefore seems a more useful source for a training set. Unfortunately, HIPPIE does not simply state whether a protein interaction exists; it gives a confidence value for each interaction, which leaves us with two options:
Taking the first would fundamentally change the task we are attempting, while taking the second would throw away a large amount of useful data. Luckily, there is a third way: categorise the confidence values and use multiclass classification. Specifically, one-vs-rest may be well suited to our task, as it reduces the problem to a set of binary classification tasks.
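The categorisation step can be sketched as follows. This is a minimal illustration rather than the approach used in the rest of this notebook: the bin edges and the helper names `categorise` and `one_vs_rest_labels` are assumptions chosen for the example, and the confidence values are made up.

```python
from bisect import bisect_right

# Hypothetical bin edges: confidence values fall into four categories,
# [0, 0.5), [0.5, 0.7), [0.7, 0.9) and [0.9, 1.0].
BIN_EDGES = [0.5, 0.7, 0.9]

def categorise(cval, edges=BIN_EDGES):
    """Return the index of the category a confidence value falls into."""
    return bisect_right(edges, cval)

def one_vs_rest_labels(categories, n_classes):
    """Turn category indices into one binary label vector per class."""
    return [[1 if c == k else 0 for c in categories] for k in range(n_classes)]

# Made-up confidence values, for illustration only.
confidences = [0.42, 0.63, 0.95, 0.73, 0.90]
categories = [categorise(c) for c in confidences]
labels = one_vs_rest_labels(categories, len(BIN_EDGES) + 1)
```

Each entry of `labels` is the target vector for one binary classifier, which is what makes one-vs-rest a set of binary tasks.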
Whether this is a viable strategy depends on the distribution of confidence values in the HIPPIE database. We can inspect this directly:
In [1]:
cd ../../HIPPIE/
In [2]:
ls
In [3]:
import csv
from matplotlib.pyplot import hist, title  # needed unless the notebook runs in pylab mode
In [7]:
f = open("hippie_current.txt")
c = csv.reader(f,delimiter="\t")
# cvals maps each threshold (in percent) to the confidence values above it
cvals = {}
for l in c:
    cval = float(l[4])
    for thresh in range(50,100,10):
        if cval > thresh/100.0:
            try:
                cvals[thresh] += [cval]
            except KeyError:
                cvals[thresh] = [cval]
f.close()
In [13]:
h=hist(cvals[50],bins=50)
t=title("Above 50% threshold.")
print("Number of samples above 50%: {0}".format(len(cvals[50])))
In [14]:
h=hist(cvals[90],bins=50)
t=title("Above 90% threshold.")
print("Number of samples above 90%: {0}".format(len(cvals[90])))
In [15]:
h=hist(cvals[80],bins=50)
t=title("Above 80% threshold.")
print("Number of samples above 80%: {0}".format(len(cvals[80])))
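The three cells above repeat the same count for different thresholds. A compact way to tabulate the count for every threshold at once is sketched below. The function name `count_above` is hypothetical, and the sample values are made up for illustration rather than taken from HIPPIE.

```python
def count_above(values, thresholds):
    """Count how many values lie strictly above each threshold."""
    return {t: sum(1 for v in values if v > t) for t in thresholds}

# Made-up confidence values, for illustration only.
sample = [0.49, 0.55, 0.63, 0.72, 0.81, 0.86, 0.93, 0.97]
counts = count_above(sample, [0.5, 0.6, 0.7, 0.8, 0.9])
for t in sorted(counts):
    print("Number of samples above {0:.0%}: {1}".format(t, counts[t]))
```

The comparison is strict, matching the `cval > thresh/100.0` test used when `cvals` was built.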
In [59]:
fr,fw = open("hippie_current.txt"), open("hippie_current.pairs.txt","w")
cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
for l in cr:
    cw.writerow([l[1],l[3]])
fr.close()
fw.close()
In [60]:
!head hippie_current.pairs.txt
In [61]:
!wc -l hippie_current.pairs.txt
Thresholding this data set at a high confidence value is an acceptable way to use it to train a classifier. Alternative approaches that use the confidence values directly are possible, but they would likely make the project too complicated to finish on time, especially given the time already invested in supervised binary classification.
In [16]:
fr,fw = open("hippie_current.txt"), open("hippie_current.90.pairs.txt","w")
cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
for l in cr:
    if float(l[4]) > 0.9:
        cw.writerow([l[1],l[3]])
fr.close()
fw.close()
In [18]:
!head hippie_current.90.pairs.txt
In [17]:
fr,fw = open("hippie_current.txt"), open("hippie_current.80.pairs.txt","w")
cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
for l in cr:
    if float(l[4]) > 0.8:
        cw.writerow([l[1],l[3]])
fr.close()
fw.close()
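The two thresholding cells above differ only in the cut-off and the output file name, so they could be folded into one function. The sketch below is one way to do that, not the code that produced the files here: `write_pairs_above` is a hypothetical name, the column indices 1, 3 and 4 are copied from the cells above, and the demo rows are made-up placeholders written to a temporary directory.

```python
import csv
import os
import tempfile

def write_pairs_above(in_path, out_path, threshold):
    """Write columns 1 and 3 of every row whose column 4 exceeds threshold.

    Returns the number of rows written.
    """
    written = 0
    with open(in_path) as fr, open(out_path, "w") as fw:
        cw = csv.writer(fw, delimiter="\t", lineterminator="\n")
        for row in csv.reader(fr, delimiter="\t"):
            if float(row[4]) > threshold:
                cw.writerow([row[1], row[3]])
                written += 1
    return written

# Placeholder rows (not real HIPPIE entries), used only to exercise the function.
workdir = tempfile.mkdtemp()
demo_in = os.path.join(workdir, "demo.txt")
with open(demo_in, "w") as f:
    f.write("nameA\t1\tnameB\t2\t0.95\n")
    f.write("nameC\t3\tnameD\t4\t0.85\n")
    f.write("nameE\t5\tnameF\t6\t0.40\n")

out90 = os.path.join(workdir, "demo.90.pairs.txt")
out80 = os.path.join(workdir, "demo.80.pairs.txt")
n90 = write_pairs_above(demo_in, out90, 0.9)
n80 = write_pairs_above(demo_in, out80, 0.8)
```

With the real data this would be called as `write_pairs_above("hippie_current.txt", "hippie_current.90.pairs.txt", 0.9)`. The `with` statement closes both files even if a row fails to parse.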
In [64]:
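# Iterating over cvals itself would only write its keys (the thresholds),
# so re-read the source file and write every confidence value instead.
fr,fw = open("hippie_current.txt"), open("hippie_current.confidencevalues.txt","w")
cr,cw = csv.reader(fr,delimiter="\t"),csv.writer(fw,delimiter="\t")
for l in cr:
    cw.writerow([l[4]])
fr.close()
fw.close()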