Using the Gene Ontology to generate features for protein interaction prediction was used by Qi. For each annotation a gene has in the Gene Ontology database it can be compared with the other gene in it's pair and if they share it the feature is 1, otherwise it is 0. To do this processing, however, first the annotations for each gene must be obtained from databases online. It appears that the Entrez gene database's original paper reports that it contains Gene ontology information so all that's required is to use Biopython to query this database.
The following code is taken from the Biopython tutorial and cookbook.
In [2]:
import sys
from Bio import Entrez
# *Always* tell NCBI who you are
Entrez.email = "gavingray1729@gmail.com"
def retrieve_annotation(id_list):
"""Annotates Entrez Gene IDs using Bio.Entrez, in particular epost (to
submit the data to NCBI) and esummary to retrieve the information.
Returns a list of dictionaries with the annotations."""
request = Entrez.epost("gene",id=",".join(id_list))
try:
result = Entrez.read(request)
except RuntimeError as e:
#FIXME: How generate NAs instead of causing an error with invalid IDs?
print "An error occurred while retrieving the annotations."
print "The error returned was %s" % e
sys.exit(-1)
webEnv = result["WebEnv"]
queryKey = result["QueryKey"]
data = Entrez.esummary(db="gene", webenv=webEnv, query_key =
queryKey)
annotations = Entrez.read(data)
print "Retrieved %d annotations for %d genes" % (len(annotations),
len(id_list))
return annotations
Trying this out with some arbitrary gene ID:
In [3]:
cd /home/gavin/Documents/MRes/DIP/human/
In [4]:
import csv
In [5]:
ls
In [6]:
dipids = list(flatten(csv.reader(open("flat.Entrez.txt"))))
Testing for the first of these:
In [7]:
annotation1 = retrieve_annotation(dipids[0])
Ok, so why is it retrieving for annotations for a single gene identifier? Surely it should only retrieve one annotation? Looking at these annotations using another function from the cookbook:
In [8]:
def print_data(annotation):
for gene_data in annotation:
gene_id = gene_data["Id"]
gene_symbol = gene_data["NomenclatureSymbol"]
gene_name = gene_data["Description"]
print "ID: %s - Gene Symbol: %s - Gene Name: %s" % (gene_id, gene_symbol, gene_name)
In [9]:
print_data(annotation1)
Looks like it's not expecting a string, maybe it expects a list?
In [10]:
annotation1 = retrieve_annotation([dipids[0]])
In [11]:
print_data(annotation1)
Writing my own function to print out everything nicely:
In [12]:
def print_data_all(annotation):
for a in annotation:
for k in a.keys():
print k+" : "+"%s"%a[k]
return None
In [13]:
print_data_all(annotation1)
Interesting, but don't see anything called "Gene Ontology" unfortunately. Must be integrated into here somewhere but I don't really understand where.
Using a flat file from NBCI on this ftp server can map an Entrez ID to a number of GO IDs:
In [14]:
cd /home/gavin/Documents/MRes/geneconversion/
In [15]:
# read through this file and make a list of GO IDs for the example gene chosen above
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
examplelist=[]
for l in c:
if l[1]==dipids[0]:
examplelist.append(l[2])
In [16]:
print examplelist
In [17]:
import goatools
Then we can parse the obo file obtained from here.
In [18]:
cd /home/gavin/Documents/MRes/geneontology/
In [19]:
ls
In [25]:
parsedobo = goatools.obo_parser.GODag('gene_ontology.1_2.obo')
In [33]:
for i in examplelist:
print parsedobo[i].namespace
print parsedobo[i].name
print " "
In [29]:
dir(parsedobo[id1])
Out[29]:
So I can retrieve these annotations for arbitrary Entrez IDs. Should then be able to generate the database from this conversion table, using all the gene IDs available in it. I guess the question then is, how many IDs are in it?
In [35]:
cd /home/gavin/Documents/MRes/geneconversion/
In [36]:
# read through this file and make a list of GO IDs for the example gene chosen above
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
#take second column and make it a set
EntrezIDs = set(zip(*list(c))[1])
In [41]:
import scipy.misc
In [43]:
print "Number of Entrez IDs in the gene2go table: %i"%(len(EntrezIDs))
print "Number of binary combinations of these genes: %i"%(scipy.misc.comb(len(EntrezIDs),2))
So the resulting database will have 23 billion keys? Could cause problems, but will continue and see if it does.