The ocbio.extract code can be used to quickly extract features, executing arbitrary scripts along the way.
It is designed to keep track of, and update, many different data sources.
It does this primarily through the data source table file.
This table has the following form. Each row states where to find the data file to be processed, where to store the output database, and which options to apply.
Data source directory | Output database directory | Options |
---|---|---|
/relative/path/to/data | /relative/path/to/output/database | protindexes=1,3;valindex=4;script=/path/to/script;csvdelim=\t |
... | ... | ... |
The options affect how the data source will be parsed to produce the output file. Note that, if a script is specified, the data source directory must be the directory the script produces, not the directory the script processes. This is because the data source is expected in a form such as:
Spam | Protein 1 | Spam | Protein 2 | Lovely | Value |
---|---|---|---|---|---|
Spam | 1243 | Wonderful | 3214 | Spam | 123.4 |
The parsing done by the actual code is limited to selecting certain columns: two for the protein identifiers and one for the value of interest. The rest of the processing is expected to be done by the chosen script.
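As a rough illustration of the column selection described above, the sketch below parses an options string and picks two protein columns and one value column from a delimited row. The helper names (parse_options, parse_indexes, select_columns) are hypothetical and are not taken from ocbio.extract. The zero-based indexing, and the valindex of 5 chosen to match the example row, are assumptions. It is written for Python 3.

```python
import csv
import io


def parse_options(option_string):
    """Split 'key=value;key=value' into a dict of strings (hypothetical helper)."""
    options = {}
    for item in option_string.split(";"):
        if not item.strip():
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value
    return options


def parse_indexes(value):
    """Turn '1,3' or '(1,3)' into a list of integers."""
    return [int(part) for part in value.strip("()").split(",") if part.strip()]


def select_columns(lines, protindexes, valindex, delimiter="\t"):
    """Yield (protein_a, protein_b, value) for each delimited row.

    Indexes are assumed to be zero-based; the real code may count differently.
    """
    first, second = protindexes
    for row in csv.reader(lines, delimiter=delimiter):
        yield row[first], row[second], row[valindex]


# Option names follow the example table; the values are chosen for this demo.
options = parse_options("protindexes=1,3;valindex=5;csvdelim=\\t")

# The table stores the delimiter as the two characters backslash and t.
delimiter = options.get("csvdelim", "\\t")
if delimiter == "\\t":
    delimiter = "\t"

lines = io.StringIO("Spam\t1243\tWonderful\t3214\tSpam\t123.4\n")
rows = list(select_columns(lines,
                           parse_indexes(options["protindexes"]),
                           int(options["valindex"]),
                           delimiter))
print(rows)
```

Everything beyond picking these three fields (filtering, identifier mapping and so on) would be left to the script named in the options.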
The output database is a ProteinPairDB, which is written as a child class of the database class returned by shelve.open (DbfilenameShelf), modifying the __getitem__ and __setitem__ methods.
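To make the idea of subclassing the shelve database concrete, here is a minimal sketch of a shelf whose __getitem__ and __setitem__ accept a protein pair as the key. The class name PairShelf and the key encoding (sorted identifiers joined by a tab, so that the order of the pair does not matter) are assumptions made for illustration. The real ProteinPairDB may encode keys differently.

```python
import os
import shelve
import tempfile


class PairShelf(shelve.DbfilenameShelf):
    """Shelf indexed by unordered protein pairs (hypothetical stand-in)."""

    @staticmethod
    def _key(pair):
        # Sort the identifiers so (a, b) and (b, a) map to the same string key.
        return "\t".join(sorted(str(p) for p in pair))

    def __getitem__(self, pair):
        return shelve.DbfilenameShelf.__getitem__(self, self._key(pair))

    def __setitem__(self, pair, value):
        shelve.DbfilenameShelf.__setitem__(self, self._key(pair), value)


# Store a value under one ordering of the pair and read it back under the other.
directory = tempfile.mkdtemp()
db = PairShelf(os.path.join(directory, "example.db"))
db[("1243", "3214")] = "123.4"
value = db[("3214", "1243")]
db.close()
print(value)
```

Only indexing is overridden in this sketch, so other mapping methods such as get or the in operator would still expect plain string keys.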
The options are placed in the table as above, separated by ;. The option names that appear in this document are protindexes, valindex (also written valindexes), script, csvdelim and ignoreheader.
The data source table must be placed in the top directory containing the data sources. Paths to data and output files in the table are assumed to be relative to that position. So, first we navigate to this position:
In [124]:
cd /home/gavin/Documents/MRes/
At the time of writing there are three extracted features which we can add to this table. Note that it is tab delimited:
In [125]:
# write one tab-delimited row per data source: data file, output database, options
with open("datasource.tab", "w") as f:
    # the HIPPIE feature
    f.write("HIPPIE/hippie_current.txt"+"\t"+"HIPPIE/feature.HIPPIE.db"+"\t"+"protindexes=(1,3);valindexes=(4)"+"\n")
    # the abundance feature
    f.write("forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv"+"\t"+"forGAVIN/pulldown_data/dataset/abundance.Entrez.db"+"\t"+"ignoreheader=1"+"\n")
    # the affinity feature (no options, so the last field is left empty)
    f.write("affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv"+"\t"+"affinityresults/results2/affinity.Entrez.db"+"\t"+"")
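For reference, a table in this layout can be read back with the standard csv module. The sketch below is only an illustration of the format: the function name read_datasource_table is hypothetical, and resolving each path against the directory holding the table is an assumption based on the statement that paths are relative to that position. It is written for Python 3.

```python
import csv
import os
import tempfile


def read_datasource_table(table_path):
    """Return (data_path, database_path, options) tuples with resolved paths."""
    top = os.path.dirname(os.path.abspath(table_path))
    entries = []
    with open(table_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            options = row[2] if len(row) > 2 else ""
            entries.append((os.path.join(top, row[0]),
                            os.path.join(top, row[1]),
                            options))
    return entries


# Write a small table in a temporary directory and read it back.
top = tempfile.mkdtemp()
table = os.path.join(top, "datasource.tab")
with open(table, "w") as handle:
    handle.write("HIPPIE/hippie_current.txt\tHIPPIE/feature.HIPPIE.db\t"
                 "protindexes=(1,3);valindexes=(4)\n")
    handle.write("affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv\t"
                 "affinityresults/results2/affinity.Entrez.db\t")

entries = read_datasource_table(table)
for data_path, database_path, options in entries:
    print(data_path, database_path, repr(options))
```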
In [126]:
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
In [127]:
import ocbio.extract
In [152]:
reload(ocbio.extract)
In [153]:
assembler = ocbio.extract.FeatureVectorAssembler("/home/gavin/Documents/MRes/datasource.tab")
The protein pair database files can be generated at the locations defined in the data source table in two ways. Soft regeneration checks whether the pre-processed files are newer than the processed files; if they are, the data source is processed again to regenerate the database. This is the default, and is what the regenerate method does when called without arguments:
In [130]:
assembler.regenerate()
In this case the data files are older than the processed files so it does nothing.
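The soft regeneration check can be pictured as a comparison of file modification times. The following sketch is not the library's implementation: the function name needs_regeneration is hypothetical, and treating a missing database as stale is an assumption. It is written for Python 3.

```python
import os
import tempfile


def needs_regeneration(source_path, database_path, force=False):
    """Return True if the database should be rebuilt from the data source."""
    if force or not os.path.exists(database_path):
        return True
    # Rebuild only when the data source was modified after the database.
    return os.path.getmtime(source_path) > os.path.getmtime(database_path)


# Create two empty files and give them explicit modification times.
directory = tempfile.mkdtemp()
source = os.path.join(directory, "data.txt")
database = os.path.join(directory, "feature.db")
for path in (source, database):
    open(path, "w").close()
os.utime(source, (1000, 1000))    # data source is older
os.utime(database, (2000, 2000))  # database is newer

print(needs_regeneration(source, database))
print(needs_regeneration(source, database, force=True))
```

One caveat: some shelve backends append their own suffixes to the database filename, so a real check may need to look for those files rather than the exact path given in the table.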
We can force it to regenerate the databases with the force=True option:
In [131]:
assembler.regenerate(force=True)
However, this can take a long time to complete as it regenerates all databases. To force regeneration of a single database, either delete the current output file or change the path for the output file in the data source table and re-initialise the assembler.
Assembling a feature vector file can be done using the assemble method of the FeatureVectorAssembler instance.
As input this takes a file containing a list of protein pairs, and it produces a file containing the feature vector for each protein pair in that file.
An output file must also be specified.
In [132]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt", "features/training.nolabel.positive.Entrez.vectors.txt")
Looking at this output file we can see that each row is a feature vector.
Missing values are given by the missinglabel option, which defaults to the string "missing".
In [133]:
%%bash
head features/training.nolabel.positive.Entrez.vectors.txt
This can be changed to any string required:
In [134]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt",
                   missinglabel="any string required")
In [135]:
%%bash
head features/training.nolabel.positive.Entrez.vectors.txt
Each row corresponds to a row in the protein pair file supplied (in this case training.nolabel.positive.Entrez.txt).
The protein pair identifiers themselves are not written to the output by default. If you would like them included in the output file, the pairlabels option can be used:
In [136]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt",
                   missinglabel="any string required",
                   pairlabels=True)
In [137]:
%%bash
head features/training.nolabel.positive.Entrez.vectors.txt
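The effect of missinglabel and pairlabels can be sketched with plain dictionaries standing in for the protein pair databases. This is an illustration only: the function name assemble_rows, the frozenset keys and the tab-separated output are assumptions, not the behaviour of FeatureVectorAssembler.assemble.

```python
def assemble_rows(pairs, databases, missinglabel="missing", pairlabels=False):
    """Yield one tab-separated feature vector per protein pair."""
    for pair in pairs:
        key = frozenset(pair)  # unordered, so (a, b) and (b, a) match
        # One feature per database; fall back to missinglabel when absent.
        features = [str(db.get(key, missinglabel)) for db in databases]
        if pairlabels:
            # Prefix the row with the two protein identifiers.
            features = list(pair) + features
        yield "\t".join(features)


# Two toy "databases", each covering only one of the two pairs.
hippie = {frozenset(("1243", "3214")): 0.9}
abundance = {frozenset(("1", "2")): 12.5}
pairs = [("1243", "3214"), ("1", "2")]

plain_rows = list(assemble_rows(pairs, [hippie, abundance]))
labelled_rows = list(assemble_rows(pairs, [hippie, abundance], pairlabels=True))
for row in labelled_rows:
    print(row)
```

The rows come out in the same order as the input pairs, which is what makes it possible to match them up with the protein pair file when pairlabels is left off.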
The assemble and regenerate methods also accept a verbose=True option:
In [154]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt",
                   verbose=True)
In [139]:
assembler.regenerate(force=True, verbose=True)