Annotations in NCBI's gene2go file

1) Download the NCBI annotations

The NCBI annotations are stored in the file named "gene2go".


In [1]:
from goatools.base import download_ncbi_associations
# fin -> Filename of input file (file to be read)
fin_gene2go = download_ncbi_associations()


  EXISTS: gene2go

2) Read NCBI annotation file, "gene2go"

2a) Read one taxid: human


In [2]:
from goatools.anno.genetogo_reader import Gene2GoReader
objanno_hsa = Gene2GoReader(fin_gene2go, taxids=[9606])


HMS:0:00:05.892614 323,107 annotations, 19,649 genes, 18,246 GOs, 1 taxids READ: gene2go 
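
To make the taxid filter concrete, here is a minimal standard-library sketch of what selecting one taxid from gene2go-style rows amounts to. It is not the goatools implementation. It assumes a tab-separated layout with the columns tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category; the rows, the placeholder GO IDs, and the helper name read_gene2go_rows are all invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented rows in an ASSUMED gene2go-style layout (tab-separated):
# tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category
TOY_GENE2GO = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "9606\t1\tGO:0000001\tIEA\t-\tterm a\t-\tProcess\n"
    "9606\t1\tGO:0000002\tIDA\t-\tterm b\t-\tFunction\n"
    "9606\t2\tGO:0000001\tIEA\t-\tterm a\t-\tProcess\n"
    "10090\t3\tGO:0000003\tIEA\t-\tterm c\t-\tComponent\n"
)

def read_gene2go_rows(stream, taxids):
    """Hypothetical helper: return {GeneID: {GO_ID, ...}} for the given taxids."""
    gene2gos = defaultdict(set)
    for row in csv.reader(stream, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        taxid, geneid, goid = int(row[0]), int(row[1]), row[2]
        if taxid in taxids:
            gene2gos[geneid].add(goid)
    return dict(gene2gos)

# Keep only the rows whose tax_id is 9606 (human)
human = read_gene2go_rows(io.StringIO(TOY_GENE2GO), taxids={9606})
print(human)
```

Passing a set of taxids keeps the membership test cheap and makes the one-species and many-species cases use the same code path.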

2b) Read all taxids


In [3]:
objanno_all = Gene2GoReader(fin_gene2go, taxids=True)


HMS:0:00:20.773931 2,057,323 annotations, 205,158 genes, 26,896 GOs, 46 taxids READ: gene2go 

3) Get associations, split by namespace (Only human annotations loaded)

The namespaces are biological_process (BP), molecular_function (MF), and cellular_component (CC).

The taxid argument to get_ns2assc is ignored if only one taxid is loaded.


In [4]:
ns2assc_hsa1 = objanno_hsa.get_ns2assc()

In [5]:
from itertools import chain

def prt_assc_counts(ns2assc):
    """Print the number of genes and GO IDs in an association"""
    for nspc, gene2goids in sorted(ns2assc.items()):
        print("{NS} {N:6,} genes, {GOs:6,} GOs".format(
            NS=nspc, N=len(gene2goids), GOs=len(set(chain.from_iterable(gene2goids.values())))))

In [6]:
prt_assc_counts(ns2assc_hsa1)


BP 17,541 genes, 12,285 GOs
CC 18,648 genes,  1,737 GOs
MF 17,384 genes,  4,170 GOs
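
The counting helper treats each namespace as a dict that maps a gene ID to a set of GO IDs. The sketch below uses a small invented stand-in for that structure (the gene IDs and GO IDs are placeholders, and count_assc and gos_for_gene are hypothetical helpers, not part of goatools) to show how it can be counted and queried.

```python
# Toy stand-in for the value returned by get_ns2assc(); all IDs are invented.
ns2assc_toy = {
    "BP": {101: {"GO:0000001", "GO:0000002"}, 102: {"GO:0000002"}},
    "MF": {101: {"GO:0000003"}},
    "CC": {},
}

def count_assc(ns2assc):
    """Hypothetical helper: return {namespace: (num_genes, num_unique_GO_IDs)}."""
    counts = {}
    for nspc, gene2goids in ns2assc.items():
        # set().union(...) also works when a namespace has no genes
        goids = set().union(*gene2goids.values())
        counts[nspc] = (len(gene2goids), len(goids))
    return counts

def gos_for_gene(ns2assc, geneid):
    """Hypothetical helper: return {namespace: GO IDs} for one gene."""
    return {ns: g2g[geneid] for ns, g2g in ns2assc.items() if geneid in g2g}

counts = count_assc(ns2assc_toy)
gene101 = gos_for_gene(ns2assc_toy, 101)
print(counts)
```

A gene can appear in one namespace and be absent from another, which is why the per-namespace gene counts in the output above differ.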

4) Get associations, split by namespace (Many taxids loaded)

4a) Get associations for one species (human)


In [7]:
ns2assc_hsa2 = objanno_all.get_ns2assc(9606)

prt_assc_counts(ns2assc_hsa2)


BP 17,541 genes, 12,285 GOs
CC 18,648 genes,  1,737 GOs
MF 17,384 genes,  4,170 GOs

4b) Get associations for one species (mouse)


In [8]:
ns2assc_mmu = objanno_all.get_ns2assc(10090)

prt_assc_counts(ns2assc_mmu)


BP 17,859 genes, 12,282 GOs
CC 18,824 genes,  1,726 GOs
MF 16,721 genes,  4,156 GOs

4c) Combine associations for multiple species (human and mouse)


In [9]:
ns2assc_two = objanno_all.get_ns2assc({9606, 10090})

prt_assc_counts(ns2assc_two)


BP 35,400 genes, 13,017 GOs
CC 37,472 genes,  1,798 GOs
MF 34,105 genes,  4,373 GOs
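
In the combined output, each gene count equals the sum of the human and mouse counts (for BP, 17,541 + 17,859 = 35,400), while each GO count is smaller than the sum because the two species share many GO terms. The sketch below shows one way such a merge could work on invented data. It is an illustration only; merge_ns2assc is a hypothetical helper and not how goatools necessarily combines taxids.

```python
from collections import defaultdict

def merge_ns2assc(*ns2asscs):
    """Hypothetical helper: merge several {ns: {gene: GO-ID set}} dicts."""
    merged = defaultdict(lambda: defaultdict(set))
    for ns2assc in ns2asscs:
        for nspc, gene2goids in ns2assc.items():
            for geneid, goids in gene2goids.items():
                # Union into a fresh set so the input sets are never mutated
                merged[nspc][geneid] |= goids
    return {nspc: dict(g2g) for nspc, g2g in merged.items()}

# Invented associations for two species with non-overlapping gene IDs
species_a = {"BP": {1: {"GO:0000001"}, 2: {"GO:0000002"}}}
species_b = {"BP": {7: {"GO:0000001"}}, "MF": {7: {"GO:0000003"}}}

both = merge_ns2assc(species_a, species_b)
num_bp_genes = len(both["BP"])                      # genes add up
num_bp_gos = len(set().union(*both["BP"].values())) # shared GO IDs collapse
```

With distinct gene IDs per species, the merged gene count is the sum of the inputs, and the merged GO count is the size of the union.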

4d) Combine all associations


In [10]:
ns2assc_all = objanno_all.get_ns2assc(True)
prt_assc_counts(ns2assc_all)


BP 147,644 genes, 17,777 GOs
CC 155,564 genes,  2,546 GOs
MF 140,246 genes,  6,514 GOs

4e) Try getting associations without specifying a taxid

If annotations have been loaded for multiple or all taxids, the taxid argument is required.

If the taxid argument is omitted, an error message is printed and an empty dict is returned.


In [11]:
ns2assc_all = objanno_all.get_ns2assc()


**ERROR: ARG taxid MUST BE AN int, list of ints, OR True

In [12]:
print(ns2assc_all)


{}
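
Because the call above prints a message and returns an empty dict instead of raising an exception, downstream code may want an explicit check before using the result. A minimal sketch follows; require_assc is a hypothetical guard, not a goatools function, and the empty dict stands in for the result shown above.

```python
def require_assc(ns2assc):
    """Hypothetical guard: raise if an association dict came back empty."""
    if not ns2assc:
        raise ValueError(
            "no associations returned; was the taxid argument omitted?")
    return ns2assc

try:
    require_assc({})  # stands in for the empty result shown above
except ValueError as err:
    message = str(err)
    print(message)
```

Raising early keeps an empty association from silently flowing into later analysis steps.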