Gavin Gray 18th June

The ocbio.extract code can be used to quickly extract features and execute arbitrary scripts while doing so. It's designed to keep track of and update many different data sources. The primary way it does this is through the data source table file.

Data source table

This table is of the following form and provides the information about where to find the data file to be processed and where to store the output database.

Data source directory	Output database directory	Options
`/relative/path/to/data`	`/relative/path/to/output/database`	protindexes=(1,3);valindexes=(4);script=`/path/to/script`;csvdelim=`\t`
...	...	...

The options control how the data source is parsed to produce the output file. Note that if a script is specified, the data source path in the first column must point to the file the script produces, not the file the script processes. This is because the data source is expected to be a delimited table such as:

Spam	Protein 1	Spam	Protein 2	Lovely	Value
Spam	1243	Wonderful	3214	Spam	123.4

The parsing done by the code itself is limited to selecting certain columns: two for the protein identifiers and one or more for the values of interest. Any other processing is expected to be done by the chosen script.
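The column selection described above can be sketched as follows. This is a minimal illustration rather than the ocbio.extract parser itself: the function name select_columns and its signature are hypothetical, and only the option names (protindexes, valindexes, csvdelim, ignoreheader) are taken from this document.

```python
import csv
import io


def select_columns(lines, protindexes=(0, 1), valindexes=(2,),
                   csvdelim="\t", ignoreheader=False):
    """Yield ((protein_a, protein_b), [values]) for each row of a delimited file.

    Hypothetical helper: it mirrors the options described in the text,
    not the actual ocbio.extract implementation.
    """
    reader = csv.reader(lines, delimiter=csvdelim)
    if ignoreheader:
        next(reader, None)  # skip the header row, if there is one
    for row in reader:
        if not row:
            continue  # skip blank lines
        pair = (row[protindexes[0]], row[protindexes[1]])
        values = [row[i] for i in valindexes]
        yield pair, values


# The example table from the text, with its header row skipped.
example = io.StringIO(
    "Spam\tProtein 1\tSpam\tProtein 2\tLovely\tValue\n"
    "Spam\t1243\tWonderful\t3214\tSpam\t123.4\n"
)
rows = list(select_columns(example, protindexes=(1, 3), valindexes=(5,),
                           ignoreheader=True))
print(rows)
```

Here the protein identifiers sit in columns 1 and 3 and the value in column 5, counting from zero, so every other column is ignored.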

The output is a ProteinPairDB, which is written as a child class of the database class returned by shelve.open (DbfilenameShelf), overriding the __getitem__ and __setitem__ methods.
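To illustrate what subclassing the shelve database can look like, here is a small sketch. The class name PairShelf is hypothetical, and the key scheme (sorting the two identifiers so that a pair can be looked up in either order) is an assumption made for the example; the text only states that ProteinPairDB overrides __getitem__ and __setitem__, not how.

```python
import os
import shelve
import tempfile


class PairShelf(shelve.DbfilenameShelf):
    """Hypothetical stand-in for ProteinPairDB: a shelf keyed by protein pairs."""

    @staticmethod
    def _key(pair):
        # Assumption: sort the identifiers so (a, b) and (b, a) share one key.
        # shelve requires string keys, so the pair is joined with a tab.
        return "\t".join(sorted(str(p) for p in pair))

    def __getitem__(self, pair):
        # Explicit base-class call rather than super(), which also works
        # where the shelve classes are old-style (Python 2).
        return shelve.DbfilenameShelf.__getitem__(self, self._key(pair))

    def __setitem__(self, pair, value):
        shelve.DbfilenameShelf.__setitem__(self, self._key(pair), value)


# Store a value under one ordering and read it back under the other.
tmpdir = tempfile.mkdtemp()
db = PairShelf(os.path.join(tmpdir, "demo.db"))
db[("1243", "3214")] = ["123.4"]
value_reversed = db[("3214", "1243")]
db.close()
print(value_reversed)
```

Only item access is overridden in this sketch, so membership tests and iteration still see the raw string keys.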

Options

The available options, which are separated by ; in the table as shown above, are:

protindexes
- the column indexes of the protein identifiers
- starting from 0 and given in brackets
- default is (0,1)
valindexes
- the column indexes of the values to be used as features
- as above, starting from 0 and given in brackets
- default is (2)
script
- relative path to script
csvdelim
- delimiter used in data file csv
- default is tab (i.e. \t)
ignoreheader
- ignore the first line of the csv
- either 1 (ignore) or 0 (don't ignore)
- default is 0
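As a rough sketch of how an options string of this form could be turned into Python values, consider the function below. The name parse_options and the returned dictionary layout are hypothetical, not the ocbio.extract API; the defaults are the ones listed above. It assumes the tab delimiter is written as the two characters \t, and it would not cope with a ; used as the csv delimiter.

```python
def parse_options(option_string):
    """Parse 'name=value;name=value' into a dict, applying the listed defaults.

    Hypothetical helper for illustration only.
    """
    options = {
        "protindexes": (0, 1),
        "valindexes": (2,),
        "script": None,
        "csvdelim": "\t",
        "ignoreheader": False,
    }
    for item in option_string.split(";"):
        item = item.strip()
        if not item:
            continue  # an empty options column leaves the defaults untouched
        name, _, raw = item.partition("=")
        if name in ("protindexes", "valindexes"):
            # "(1,3)" -> (1, 3); "(4)" -> (4,)
            indexes = raw.strip("()").split(",")
            options[name] = tuple(int(i) for i in indexes if i.strip())
        elif name == "ignoreheader":
            options[name] = raw == "1"
        elif name == "csvdelim":
            # Accept the escaped form \t as well as a literal character.
            options[name] = "\t" if raw == "\\t" else raw
        elif name == "script":
            options[name] = raw
        else:
            raise ValueError("unknown option: %r" % name)
    return options


print(parse_options("protindexes=(1,3);valindexes=(4)"))
```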

Location

The data source table must be placed in the top-level directory containing the data sources. The code assumes that the paths given to data and output files in the table are relative to that location. So, first we navigate to this location:



In [124]:

    
cd /home/gavin/Documents/MRes/

/home/gavin/Documents/MRes

At the time of writing there are three extracted features which we can add to this table. Note that it is tab delimited:



In [125]:

    
# each row is: data file <tab> output database <tab> options
with open("datasource.tab", "w") as f:
    # the HIPPIE feature
    f.write("HIPPIE/hippie_current.txt"+"\t"+"HIPPIE/feature.HIPPIE.db"+"\t"+"protindexes=(1,3);valindexes=(4)"+"\n")
    # the abundance feature
    f.write("forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv"+"\t"+"forGAVIN/pulldown_data/dataset/abundance.Entrez.db"+"\t"+"ignoreheader=1"+"\n")
    # the affinity feature (no options, so the defaults apply)
    f.write("affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv"+"\t"+"affinityresults/results2/affinity.Entrez.db"+"\t"+"")
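For illustration, a table like this could be read back and its relative paths resolved against the directory holding the table, as described under Location. The function read_source_table is a hypothetical sketch and not part of ocbio.extract; the demonstration writes a one-row table to a temporary directory so that it does not touch any real data.

```python
import csv
import os
import tempfile


def read_source_table(table_path):
    """Return (data_path, db_path, options) tuples with absolute paths.

    Hypothetical helper: paths in the table are taken to be relative to
    the directory that contains the table itself.
    """
    base = os.path.dirname(os.path.abspath(table_path))
    sources = []
    with open(table_path) as table:
        for row in csv.reader(table, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            row = row + [""] * (3 - len(row))  # options column may be absent
            data_path, db_path, options = row[:3]
            sources.append((os.path.join(base, data_path),
                            os.path.join(base, db_path),
                            options))
    return sources


# Demonstration on a throwaway table containing the HIPPIE row.
demo_dir = tempfile.mkdtemp()
demo_table = os.path.join(demo_dir, "datasource.tab")
with open(demo_table, "w") as f:
    f.write("HIPPIE/hippie_current.txt\tHIPPIE/feature.HIPPIE.db\t"
            "protindexes=(1,3);valindexes=(4)\n")
sources = read_source_table(demo_table)
print(sources)
```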

Adding the module to path

To load the module it will have to be in Python's path. So the opencast-bio directory must be added to Python's path:



In [126]:

    
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")



In [127]:

    
import ocbio.extract



In [152]:

    
reload(ocbio.extract)

    Out[152]:

<module 'ocbio.extract' from '/home/gavin/Documents/MRes/opencast-bio/ocbio/extract.py'>

Running the extraction

To run the extraction we first need to initialise the FeatureVectorAssembler with the source table defined above.

Initialisation

The full path to the source table should be supplied:



In [153]:

    
assembler = ocbio.extract.FeatureVectorAssembler("/home/gavin/Documents/MRes/datasource.tab")

Regeneration

Generating the protein pair database files at the locations defined in the data source table can be done in two ways. Soft regeneration checks whether each raw data file is newer than the database generated from it. If it is, the data source is processed again to regenerate the database. This is the default and is what happens when the regenerate method is called without arguments:



In [130]:

    
assembler.regenerate()

In this case the data files are older than the databases, so it does nothing. The second way is forced regeneration, which rebuilds the databases regardless of their age and is selected with the force=True option:



In [131]:

    
assembler.regenerate(force=True)

However, this can take a long time to complete, as it regenerates all of the databases. To force regeneration of a single database, either delete the current output file or change the path for the output file in the data source table and re-initialise the assembler.
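The decision behind soft and forced regeneration can be sketched as a comparison of file modification times. The function needs_regeneration below is a hypothetical illustration, not the library's own check; in particular, it assumes the database lives at exactly the path given in the table, which may not hold for every shelve backend.

```python
import os
import tempfile


def needs_regeneration(data_path, db_path, force=False):
    """Decide whether a database should be rebuilt from its data file.

    Hypothetical helper: rebuild when forced, when the database is
    missing, or when the data file is newer than the database.
    """
    if force or not os.path.exists(db_path):
        return True
    return os.path.getmtime(data_path) > os.path.getmtime(db_path)


# Demonstration with two throwaway files and hand-set timestamps.
tmp = tempfile.mkdtemp()
data_file = os.path.join(tmp, "data.txt")
db_file = os.path.join(tmp, "feature.db")
for path in (data_file, db_file):
    with open(path, "w") as handle:
        handle.write("placeholder\n")

os.utime(data_file, (1000, 1000))  # data file older than the database
os.utime(db_file, (2000, 2000))
soft_when_db_newer = needs_regeneration(data_file, db_file)
forced = needs_regeneration(data_file, db_file, force=True)

os.utime(data_file, (3000, 3000))  # data file now newer
soft_when_data_newer = needs_regeneration(data_file, db_file)

print(soft_when_db_newer, forced, soft_when_data_newer)
```

Treating a missing database as out of date matches the advice above: deleting one output file causes only that database to be rebuilt on the next soft regeneration.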

Assembly

Assembling a feature vector file can be done using the assemble method of the FeatureVectorAssembler instance. As input this takes a file containing a list of protein pairs and produces a file containing feature vectors corresponding to each protein pair in that file. An output file must also be specified.



In [132]:

    
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt", "features/training.nolabel.positive.Entrez.vectors.txt")

Looking at this output file we can see that each row is a feature vector. Missing values are given by the missinglabel option, which defaults to the string "missing".



In [133]:

    
%%bash
head features/training.nolabel.positive.Entrez.vectors.txt

0.86	missing	missing
0.97	missing	missing
0.9	missing	missing
0.62	missing	missing
missing	missing	missing
missing	missing	missing
missing	missing	missing
0.96	missing	missing
missing	missing	-0.70339
0.7	missing	missing

This can be changed to any string required:



In [134]:

    
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt", missinglabel="any string required")



In [135]:

    
%%bash
head features/training.nolabel.positive.Entrez.vectors.txt

0.86	any string required	any string required
0.97	any string required	any string required
0.9	any string required	any string required
0.62	any string required	any string required
any string required	any string required	any string required
any string required	any string required	any string required
any string required	any string required	any string required
0.96	any string required	any string required
any string required	any string required	-0.70339
0.7	any string required	any string required

Each row corresponds to a row in the protein pair file supplied (in this case training.nolabel.positive.Entrez.txt). If you would like the protein pair identifiers to be written to the output file as well, the pairlabels option can be used:



In [136]:

    
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt",
                   missinglabel="any string required",
                   pairlabels=True)



In [137]:

    
%%bash
head features/training.nolabel.positive.Entrez.vectors.txt

4084	207	0.86	any string required	any string required
8360	8356	0.97	any string required	any string required
5914	9612	0.9	any string required	any string required
79833	6634	0.62	any string required	any string required
29102	4090	any string required	any string required	any string required
7074	6382	any string required	any string required	any string required
7159	22059	any string required	any string required	any string required
1869	7029	0.96	any string required	any string required
801	817	any string required	any string required	-0.70339
207	1786	0.7	any string required	any string required
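The assembly step can be pictured as one lookup per database for every protein pair, with the missing label filling any gaps. The sketch below uses plain dictionaries keyed by frozenset in place of the real databases. The function assemble_rows is hypothetical, and the toy values are copied from the output shown above purely for illustration; only the missinglabel and pairlabels names come from the assemble method.

```python
def assemble_rows(pairs, databases, missinglabel="missing", pairlabels=False):
    """Build one tab-delimited feature vector row per protein pair.

    Hypothetical helper: each database maps frozenset({a, b}) to a list
    of string features, and all values in a database have the same length.
    """
    # Feature size per database, needed to pad pairs that are not found.
    sizes = [len(next(iter(db.values()))) if db else 1 for db in databases]
    rows = []
    for pair in pairs:
        fields = list(pair) if pairlabels else []
        for db, size in zip(databases, sizes):
            fields.extend(db.get(frozenset(pair), [missinglabel] * size))
        rows.append("\t".join(fields))
    return rows


# Toy stand-ins for the HIPPIE, abundance and affinity databases.
hippie = {frozenset(("4084", "207")): ["0.86"]}
abundance = {}
affinity = {frozenset(("801", "817")): ["-0.70339"]}

pairs = [("4084", "207"), ("801", "817")]
plain = assemble_rows(pairs, [hippie, abundance, affinity])
labelled = assemble_rows(pairs, [hippie, abundance, affinity], pairlabels=True)
print("\n".join(labelled))
```

Padding with one missing label per feature keeps every row the same width, whichever databases contain the pair.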

Verbose output

Finally, adding the verbose=True option to the methods of FeatureVectorAssembler prints progress messages describing what is happening at each step.



In [154]:

    
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
                   "features/training.nolabel.positive.Entrez.vectors.txt", verbose=True)

Reading pairfile: DIP/human/training.nolabel.positive.Entrez.txt
Opening databases:
	/home/gavin/Documents/MRes/HIPPIE/feature.HIPPIE.db open
	/home/gavin/Documents/MRes/forGAVIN/pulldown_data/dataset/abundance.Entrez.db open
	/home/gavin/Documents/MRes/affinityresults/results2/affinity.Entrez.db open
Checking feature sizes:
	Database /home/gavin/Documents/MRes/HIPPIE/feature.HIPPIE.db contains features of size 1.
	Database /home/gavin/Documents/MRes/forGAVIN/pulldown_data/dataset/abundance.Entrez.db contains features of size 1.
	Database /home/gavin/Documents/MRes/affinityresults/results2/affinity.Entrez.db contains features of size 1.
Writing feature vectors....
Wrote 4857 vectors.
Matched 60.08% of protein pairs in DIP/human/training.nolabel.positive.Entrez.txt to /home/gavin/Documents/MRes/HIPPIE/feature.HIPPIE.db
Matched 0.29% of protein pairs in DIP/human/training.nolabel.positive.Entrez.txt to /home/gavin/Documents/MRes/forGAVIN/pulldown_data/dataset/abundance.Entrez.db
Matched 3.69% of protein pairs in DIP/human/training.nolabel.positive.Entrez.txt to /home/gavin/Documents/MRes/affinityresults/results2/affinity.Entrez.db



In [139]:

    
assembler.regenerate(force=True, verbose=True)

Regenerating parsers:
	parser 0
Forcing regeneration of database /home/gavin/Documents/MRes/HIPPIE/feature.HIPPIE.db from data file /home/gavin/Documents/MRes/HIPPIE/hippie_current.txt.
Filling database.........................................................................................................................................................................
Parsed 169626 lines.
	parser 1
Forcing regeneration of database /home/gavin/Documents/MRes/forGAVIN/pulldown_data/dataset/abundance.Entrez.db from data file /home/gavin/Documents/MRes/forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv.
Ignoring header.
Filling database.................
Parsed 17777 lines.
	parser 2
Forcing regeneration of database /home/gavin/Documents/MRes/affinityresults/results2/affinity.Entrez.db from data file /home/gavin/Documents/MRes/affinityresults/results2/unique_data_ppi_coor_C2S_entrez.csv.
Filling database...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Parsed 1871145 lines.