Extracting features with very good coverage, such as Gene Ontology, produces extremely large database files, so a new version of the code is required to deal with this.
The new code will store the ProteinPairParser objects as pickle files, and features will be indexed from these objects through a __getitem__ method, with each ProteinPairParser only interacting with its own database, if it has one.
Each ProteinPairParser will have its own generator function, which will either be created using the options handed to it or loaded from another pickle file. The default generator will act as the code currently does, by creating a database and then indexing that database to retrieve values. According to the Python documentation, if the parser opens the database at initialisation and is then pickled, the database will be closed and reopened at unpickling time (see the TextReader example in the pickle documentation).
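That pattern, applied to a parser holding a shelve database, might look roughly like the sketch below. This is illustrative only: the class name and attributes are assumptions, not the actual parser code.
import shelve

class PickleableParser(object):
    '''Illustrative sketch: keep a shelve database usable across pickling.'''
    def __init__(self, dbpath):
        self.dbpath = dbpath
        self.db = shelve.open(dbpath)
    def __getstate__(self):
        # copy the instance dict and drop the unpicklable shelf handle
        state = self.__dict__.copy()
        del state['db']
        return state
    def __setstate__(self, state):
        # restore the attributes, then reopen the database at unpickling time
        self.__dict__.update(state)
        self.db = shelve.open(self.dbpath)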
In [8]:
cd /home/gavin/Documents/MRes/
Going to step through the changes I'm making to the code and list them using git. After making changes I'll run a test case to make sure it's still working.
In [9]:
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
import ocbio.extract
In [65]:
reload(ocbio.extract)
Out[65]:
__getitem__
This involves ensuring that, at initialisation, the parser defines a function that by default returns values from the database it will create when regenerate is run.
There must also be an option to load an arbitrary pickled object to return items.
Also, the databases must now be opened and closed with the parsers and the opened databases stored within the parser objects. The assembler will have to be modified to deal with this and close the databases when it's done.
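A minimal sketch of what this indexing support might look like on the parser is given below. The names are assumptions based on these notes, and it assumes self.db is initialised to None in __init__; the committed changes are shown by the git output that follows.
class ProteinPairParser():
    # ... initialisation and regenerate as before ...
    def __getitem__(self, pair):
        # default behaviour: index the parser's own database, opening it if needed
        if self.db is None:
            self.db = openpairshelf(self.outdir)
        return self.db[pair]
    def close(self):
        # called by the assembler once it has finished querying
        if self.db is not None:
            self.db.close()
            self.db = None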
Showing these changes:
In [51]:
!git -C opencast-bio/ show HEAD
Testing initialisation:
In [46]:
testparser = ocbio.extract.ProteinPairParser("none","none",generator="geneontology/testgen.pickle")
In [47]:
testpair = frozenset(('114787', '114785'))
In [48]:
print testparser[testpair]
The assembler currently looks at the databases the parsers produce and pulls out the values itself. Now that the parsers can do this themselves, there is no reason for it. The assembler just needs to iterate over the parsers and index each one for the pair it is working on at the time:
In [219]:
!git -C opencast-bio/ show HEAD
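In outline, the new assembly loop reduces to something like this sketch (assuming the parsers' __getitem__ support described above; c is a csv writer):
for pair in pairs:
    row = list(pair)
    for parser in self.parserlist:
        # each parser indexes its own database for this pair
        row.append(parser[pair])
    c.writerow(row)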
Regenerating data source table to test with:
In [209]:
f = open("datasource.tab", "w")
# the HIPPIE feature
f.write("HIPPIE/hippie_current.txt"+"\t"+"HIPPIE/feature.HIPPIE.db"+"\t"+"protindexes=(1,3);valindexes=(4)"+"\n")
# the abundance feature
f.write("forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv"+"\t"+"forGAVIN/pulldown_data/dataset/abundance.Entrez.db"+"\t"+"ignoreheader=1"+"\n")
# the affinity feature
f.write("affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv"+"\t"+"affinityresults/results2/affinity.Entrez.db"+"\t"+""+"\n")
# gene ontology features
f.write("Gene_Ontology"+"\t"+"Gene_Ontology"+"\t"+"generator=geneontology/testgen.pickle")
f.close()
In [214]:
reload(ocbio.extract)
Out[214]:
In [215]:
assembler = ocbio.extract.FeatureVectorAssembler("/home/gavin/Documents/MRes/datasource.tab")
In [216]:
assembler.regenerate(force=True, verbose=True)
In [217]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt", verbose=True)
In [218]:
assembler.close()
These are the notes on the development of the code described in the Feature vector assembly page of the wiki. First, the protein pair parser class will be written and tested on a dataset that has already been extracted. This will be the HIPPIE dataset.
Initialisation currently involves loading the three command line options and saving them to the object. It must also parse these options. Testing the initialisation:
In [27]:
import os, time, subprocess, csv, shelve
In [1]:
cd /home/gavin/Documents/MRes/HIPPIE/
In [95]:
#define the class
class ProteinPairParser():
    '''Does simple parsing on data files to produce protein pair files with feature values'''
    def __init__(self, datadir, outdir, protindexes=(1,2), valindex=3, script=None, csvdelim="\t"):
        #first, store the initialisation
        self.datadir = datadir
        self.outdir = outdir
        self.protindexes = protindexes
        self.valindex = valindex
        self.script = script
        self.csvdelim = csvdelim
        return None
    def regenerate(self):
        '''Regenerate the pair file from the data source
        if the data source is newer than the pair file'''
        # so first check the ages of both files
        datamtime = os.stat(self.datadir)[-2]
        if os.path.isfile(self.outdir):
            pairmtime = os.stat(self.outdir)[-2]
        else:
            #bit of a hack
            pairmtime = 0
        #if the data modification time is greater than output modification time
        if datamtime > pairmtime:
            # now regenerate the data file according to the options defined above:
            print "data file is newer than pair file"
            #if there's a script to run
            if self.script != None:
                #then execute the script
                retcode = subprocess.call("python2 %s" % self.script, shell=True)
            #first delete out of date file, if it's there
            if os.path.isfile(self.outdir):
                os.remove(self.outdir)
            #perform simple parsing to make a file of just protein pairs and the value we care about
            #and save these using shelve
            db = openpairshelf(self.outdir)
            #open the data file
            c = csv.reader(open(self.datadir), delimiter=self.csvdelim)
            for line in c:
                #each line uses the protein pair as a key
                #by formatting it as a frozenset
                pair = frozenset([line[self.protindexes[0]], line[self.protindexes[1]]])
                #and the value is indexed by valindex
                db[pair] = line[self.valindex]
            db.close()
        return None
We have to rerun the script to generate a pre-processed HIPPIE file:
In [82]:
%%bash
java -jar HIPPIE_NC.jar -i=../DIP/human/training.nolabel.positive.Entrez.txt -t=e -l=0 -o=prematch.positive.HIPPIE.txt
Then we can use the parser to quickly perform what is done in the notebook previously used to extract the features:
In [113]:
x = ProteinPairParser("prematch.positive.HIPPIE.txt","training.positive.HIPPIE.txt",protindexes=(1,3),valindex=4)
In [114]:
x.regenerate()
The function reports that the data file is newer than the pair file and regenerates the pair file as a persistent dictionary. This is useful because it can later be indexed quickly when building feature vectors.
We can open this database to show this functionality:
In [115]:
test = openpairshelf("training.positive.HIPPIE.txt")
If we look at the keys that this database uses we can see that it is using strings internally. It could be useful to modify the .keys() method of this class so that it produces a list of frozenset keys instead; a possible override is sketched below.
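A sketch of such an override for ProteinPairDB (not implemented in this version; note that self-pairs are stored without a tab, so they would need special handling):
def keys(self):
    # convert the internal tab-delimited strings back into frozensets
    return [frozenset(k.split("\t"))
            for k in shelve.DbfilenameShelf.keys(self)]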
In [116]:
test.keys()[0:10]
Out[116]:
And taking one of these keys and turning it into a frozenset, we can then index the database using that as a key.
In [117]:
testkey = test.keys()[0]
testkey = frozenset(testkey.split("\t"))
print testkey
test[testkey]
Out[117]:
In [118]:
test.close()
The problem is that a shelve database can't take a frozenset as a key. The recommended way to deal with this is to write a wrapper. As the database used by shelve is a class, we can build a child class from it, overriding the methods to deal with protein pairs stored as frozensets. This code will not deal with arbitrary frozensets as keys.
Essentially, it uses a string of the two protein identifiers separated by a tab as the key. To index, though, it takes a frozenset and converts it to two strings, one for each possible ordering of the pair, trying each in turn.
In [110]:
class ProteinPairDB(shelve.DbfilenameShelf):
    '''A simple database for protein pairs using shelve.'''
    def __setitem__(self, key, value):
        #key will be a frozenset so make it a list first
        key = list(key)
        #then make it a string
        if len(key) == 1:
            key = key[0]*2
        else:
            key = key[0] + "\t" + key[1]
        shelve.DbfilenameShelf.__setitem__(self, key, value)
        return None
    def __getitem__(self, key):
        #make two strings from the key, one for each ordering of the pair
        key = list(key)
        if len(key) == 1:
            key1 = key[0]*2
            key2 = key1
        else:
            key1 = key[0] + "\t" + key[1]
            key2 = key[1] + "\t" + key[0]
        #try the first one
        try:
            value = shelve.DbfilenameShelf.__getitem__(self, key1)
        except KeyError:
            #couldn't find the first key, try the second;
            #if this one is also missing then error out as usual
            value = shelve.DbfilenameShelf.__getitem__(self, key2)
        return value
We also define a convenience function that applies default arguments, mirroring shelve.open:
In [57]:
def openpairshelf(filename, flag='c', protocol=None, writeback=False):
    return ProteinPairDB(filename, flag, protocol, writeback)
The next thing to do is write the code for the feature vector assembler. This is another class, with methods to regenerate the data sources and to assemble feature vector files. At initialisation it is expected to parse the data source table and instantiate a ProteinPairParser for each entry. It assumes that the data source table is provided as a full path and that this path is the top directory for the data, i.e. all of the data paths in the data source table are relative to the path of the table itself. The table has the following structure:
Data source directory | Output database directory | Options
--- | --- | ---
/relative/path/to/data | /relative/path/to/output/database | protindexes=(1,3);valindex=4;script=/path/to/script;csvdelim=\t
The available options are given in the options column of the table above.
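For illustration, one options cell parses with a simple split on ";" and then on "=", the same approach used in the class below. Note that the values remain strings at this stage:
opts = "protindexes=(1,3);valindex=4"
d = {}
for field in opts.split(";"):
    key, value = field.split("=")
    d[key] = value
print d  # {'protindexes': '(1,3)', 'valindex': '4'}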
To test initialisation a first draft of the code is given below:
In [147]:
class FeatureVectorAssembler():
    '''Assembles feature vectors from protein pair files, data source lists and gold standard protein pair lists.'''
    def __init__(self, sourcetab, goldpos, goldneg):
        #store the initialisation
        # directories for positive and negative gold standard data files
        self.goldpos = goldpos
        self.goldneg = goldneg
        #check if the gold standard data directories exist
        # throw an error if they don't
        #instantiate protein pair parsers
        # first parse the data source table
        # store the directory of the table and its name
        self.sourcetabdir, self.tabfile = os.path.split(sourcetab)
        # open the table and parse for initialisation options
        c = csv.reader(open(sourcetab), delimiter="\t")
        # iterate over lines adding to list of protein pair parsers
        self.parserinitlist = []
        for line in c:
            #store the information in a dictionary
            d = {}
            d["data path"] = line[0]
            d["output path"] = line[1]
            #store options in a dictionary in the dictionary
            d["options"] = {}
            options = line[2].split(";")
            for x in options:
                #split each option to find out which option it is:
                x = x.split("=")
                #store it in the dictionary
                # if there are invalid options this code WILL NOT DETECT THEM
                d["options"][x[0]] = x[1]
            #copy the dictionary into the list
            self.parserinitlist.append(d.copy())
        #then initialise each of these parsers and keep them in a list
        self.parserlist = []
        for parser in self.parserinitlist:
            self.parserlist.append(ProteinPairParser(parser["data path"],
                                                     parser["output path"],
                                                     **parser["options"]))
        return None
Testing initialisation requires a data source table file so this file is created below at the top directory for the data files.
In [139]:
cd /home/gavin/Documents/MRes/
In [141]:
f = open("datasource.tab", "w")
f.write("HIPPIE/prematch.positive.HIPPIE.txt" + "\t" + "HIPPIE/training.positive.HIPPIE.txt" + "\t" + "protindexes=(1,3);valindex=4")
f.close()
Checking this file has written properly:
In [143]:
%%bash
cat datasource.tab
Then we attempt to initialise the version above from that file. Notice that in this version the FeatureVectorAssembler requires gold standard data sources at initialisation. This is removed in the second version, as it made more sense to allow arbitrary protein pair lists in the feature vector assemble method.
In [148]:
test = FeatureVectorAssembler("datasource.tab",
                              "DIP/human/training.nolabel.positive.Entrez.txt",
                              "DIP/human/training.nolabel.negative.Entrez.txt")
It initialises ok, so the second draft of the code with the required methods is given below:
In [174]:
class FeatureVectorAssembler():
    '''Assembles feature vectors from protein pair files, data source lists and gold standard protein pair lists.'''
    def __init__(self, sourcetab):
        #instantiate protein pair parsers
        # first parse the data source table
        # store the directory of the table and its name
        self.sourcetabdir, self.tabfile = os.path.split(sourcetab)
        # open the table and parse for initialisation options
        c = csv.reader(open(sourcetab), delimiter="\t")
        # iterate over lines adding to list of protein pair parsers
        self.parserinitlist = []
        for line in c:
            #store the information in a dictionary
            d = {}
            d["data path"] = os.path.join(self.sourcetabdir, line[0])
            d["output path"] = os.path.join(self.sourcetabdir, line[1])
            #store options in a dictionary in the dictionary
            d["options"] = {}
            options = line[2].split(";")
            for x in options:
                #split each option to find out which option it is:
                x = x.split("=")
                #store it in the dictionary
                # if there are invalid options this code WILL NOT DETECT THEM
                # note that option values stay as strings here, e.g. "(1,3)"
                d["options"][x[0]] = x[1]
            #update the script directory
            if "script" in d["options"].keys():
                d["options"]["script"] = os.path.join(self.sourcetabdir, d["options"]["script"])
            #copy the dictionary into the list
            self.parserinitlist.append(d.copy())
        #then initialise each of these parsers and keep them in a list
        self.parserlist = []
        for parser in self.parserinitlist:
            self.parserlist.append(ProteinPairParser(parser["data path"],
                                                     parser["output path"],
                                                     **parser["options"]))
        return None
    def regenerate(self):
        '''Calls all known protein parsers and gets them to regenerate their output, if they have to.'''
        for parser in self.parserlist:
            parser.regenerate()
        return None
    def assemble(self, pairfile, outputfile):
        '''Assembles a file of feature vectors for each protein pair in a protein pair file supplied.
        Assumes the pairfile is tab delimited.'''
        # first parse the pairfile into a list of frozensets
        pairs = map(lambda l: frozenset(l), csv.reader(open(pairfile), delimiter="\t"))
        # open the file to put the feature vectors in
        c = csv.writer(open(outputfile, "w"), delimiter="\t")
        #open all the databases and put them in a dictionary
        dbdict = {}
        for parser in self.parserinitlist:
            dbdict[parser["output path"]] = openpairshelf(parser["output path"])
        # then iterate through the pairs, querying all parser databases
        for pair in pairs:
            row = []
            lpair = list(pair)
            row = row + lpair
            for parser in self.parserinitlist:
                row.append(dbdict[parser["output path"]][pair])
            c.writerow(row)
        #close all the databases
        for parser in self.parserinitlist:
            dbdict[parser["output path"]].close()
        return None
Testing initialisation of this code again:
In [175]:
test = FeatureVectorAssembler("/home/gavin/Documents/MRes/datasource.tab")
Regenerating the data sources. It does not report that any of the data sources require regeneration so nothing is done this time:
In [176]:
test.regenerate()
Then testing the assembly of a feature vector file:
In [177]:
test.assemble("DIP/human/training.nolabel.positive.Entrez.txt", "testoutput")
Looking at this file we can see that it has produced a feature vector with only one feature, with the associated protein pair written alongside. The pair identifiers are removed in later versions as they make later classification more complicated.
In [178]:
%%bash
head testoutput