This section shows how to calculate annotation coverage as described here:
Annotation coverage of Gene Ontology (GO) terms to individual
gene products is high for human or model organisms:
* 87% of ~20k human protein-coding genes have GO annotations
* 76% of ~14k fly protein-coding genes have GO annotations
(Apr 27, 2016)
NCBI's gene2go file contains annotations of GO terms to Entrez GeneIDs for over 35 different species. We are interested in human and fly which have the taxids 9606 and 7227 repectively.
In [1]:
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
gene2go = download_ncbi_associations()
In [2]:
from goatools.anno.genetogo_reader import Gene2GoReader
geneid2gos_human = Gene2GoReader(gene2go, taxids=[9606])
geneid2gos_fly = Gene2GoReader(gene2go, taxids=[7227])
In [3]:
from collections import defaultdict, namedtuple
geneid2gos_all = Gene2GoReader(gene2go, taxids=[9606, 7227])
In [4]:
from genes_ncbi_9606_proteincoding import GENEID2NT as GeneID2nt_human
from genes_ncbi_7227_proteincoding import GENEID2NT as GeneID2nt_fly
lst = [
(9606, GeneID2nt_human),
(7227, GeneID2nt_fly)
]
print('{N:,} human genes'.format(N=len(GeneID2nt_human)))
print('{N:,} fly genes'.format(N=len(GeneID2nt_fly)))
In [5]:
cov_data = []
NtCov = namedtuple("NtCov", "taxid num_GOs num_covgenes coverage num_allgenes")
for taxid, pcGeneID2nt in lst:
# Get GeneID2GOs association for current species
geneid2gos = geneid2gos_all.get_id2gos_nss(taxid=taxid)
# Restrict GeneID2GOs to only protein-coding genes for this report
pcgene_w_gos = set(geneid2gos.keys()).intersection(set(pcGeneID2nt.keys()))
num_pcgene_w_gos = len(pcgene_w_gos)
num_pc_genes = len(pcGeneID2nt)
# Number of GO terms annotated to protein-coding genes
gos_pcgenes = set()
for geneid in pcgene_w_gos:
gos_pcgenes |= geneid2gos[geneid]
# Print report data
cov_data.append(NtCov(
taxid = taxid,
num_GOs = len(gos_pcgenes),
num_covgenes = num_pcgene_w_gos,
coverage = 100.0*num_pcgene_w_gos/num_pc_genes,
num_allgenes = num_pc_genes))
In [6]:
from __future__ import print_function
print(" taxid GOs GeneIDs Coverage")
print("------ ------ ------- ----------------------")
fmtstr = "{TAXID:>6} {N:>6,} {M:>7,} {COV:2.0f}% GO coverage of {TOT:,} protein-coding genes"
for nt in cov_data:
print(fmtstr.format(
TAXID = nt.taxid,
N = nt.num_GOs,
M = nt.num_covgenes,
COV = nt.coverage,
TOT = nt.num_allgenes))
Copyright (C) 2016-present, DV Klopfenstein, H Tang. All rights reserved.