Calculating Annotation Coverage

This section shows how to calculate annotation coverage as described here:

Annotation coverage of Gene Ontology (GO) terms to individual 
gene products is high for human or model organisms: 
  * 87% of ~20k human protein-coding genes have GO annotations
  * 76% of ~14k fly protein-coding genes have GO annotations
(Apr 27, 2016)

1. Download associations

NCBI's gene2go file contains annotations of GO terms to Entrez GeneIDs for over 35 different species. We are interested in human and fly which have the taxids 9606 and 7227 repectively.


In [1]:
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
gene2go = download_ncbi_associations()


  EXISTS: gene2go

2. Read associations

2a. You can read the associations one species at a time...


In [2]:
from goatools.anno.genetogo_reader import Gene2GoReader

geneid2gos_human = Gene2GoReader(gene2go, taxids=[9606])
geneid2gos_fly = Gene2GoReader(gene2go, taxids=[7227])


HMS:0:00:07.574562 323,107 annotations, 19,649 genes, 18,246 GOs, 1 taxids READ: gene2go 
HMS:0:00:03.391011 101,655 annotations, 13,617 genes,  8,512 GOs, 1 taxids READ: gene2go 

2b. Or you can read 'gene2go' once and load all species...


In [3]:
from collections import defaultdict, namedtuple

geneid2gos_all = Gene2GoReader(gene2go, taxids=[9606, 7227])


HMS:0:00:06.648749 424,762 annotations, 33,266 genes, 19,699 GOs, 2 taxids READ: gene2go 

3. Import protein-coding information for human and fly

In this example, the background is all human and fly protein-codinge genes.

Follow the instructions in the background_genes_ncbi notebook to download a set of background population genes from NCBI.


In [4]:
from genes_ncbi_9606_proteincoding import GENEID2NT as GeneID2nt_human
from genes_ncbi_7227_proteincoding import GENEID2NT as GeneID2nt_fly
lst = [
    (9606, GeneID2nt_human),
    (7227, GeneID2nt_fly)
]
print('{N:,} human genes'.format(N=len(GeneID2nt_human)))
print('{N:,} fly genes'.format(N=len(GeneID2nt_fly)))


19,919 human genes
13,968 fly genes

4. Calculate Gene Ontology coverage

Store GO coverage information for human and fly in the list, cov_data.


In [5]:
cov_data = []
NtCov = namedtuple("NtCov", "taxid num_GOs num_covgenes coverage num_allgenes")
for taxid, pcGeneID2nt in lst:
    # Get GeneID2GOs association for current species
    geneid2gos = geneid2gos_all.get_id2gos_nss(taxid=taxid)
    # Restrict GeneID2GOs to only protein-coding genes for this report
    pcgene_w_gos = set(geneid2gos.keys()).intersection(set(pcGeneID2nt.keys()))
    num_pcgene_w_gos = len(pcgene_w_gos)
    num_pc_genes = len(pcGeneID2nt)
    # Number of GO terms annotated to protein-coding genes
    gos_pcgenes = set()
    for geneid in pcgene_w_gos:
        gos_pcgenes |= geneid2gos[geneid]
    # Print report data
    cov_data.append(NtCov(
        taxid = taxid,
        num_GOs = len(gos_pcgenes),
        num_covgenes = num_pcgene_w_gos,
        coverage = 100.0*num_pcgene_w_gos/num_pc_genes,
        num_allgenes = num_pc_genes))

5 Report Gene Ontology coverage for human and fly

Print the human and fly GO coverage information that is stored in the list, cov_data.


In [6]:
from __future__ import print_function

print(" taxid    GOs GeneIDs Coverage")
print("------ ------ ------- ----------------------")
fmtstr = "{TAXID:>6} {N:>6,} {M:>7,}  {COV:2.0f}% GO coverage of {TOT:,} protein-coding genes"
for nt in cov_data:
    print(fmtstr.format(
        TAXID = nt.taxid,
        N = nt.num_GOs,
        M = nt.num_covgenes,
        COV = nt.coverage,
        TOT = nt.num_allgenes))


 taxid    GOs GeneIDs Coverage
------ ------ ------- ----------------------
  9606 18,103  18,598  93% GO coverage of 19,919 protein-coding genes
  7227  8,417  10,660  76% GO coverage of 13,968 protein-coding genes

Copyright (C) 2016-present, DV Klopfenstein, H Tang. All rights reserved.