Reading Annotations from a GO Association File (GAF)

  1. Download a GAF file
  2. Load the GAF file into the GafReader
  3. Get Annotations

Bonus: Each line in the GAF file is stored in a namedtuple:

  • Namedtuple fields
  • Print a subset of the namedtuple fields

1) Download a GAF file


In [1]:
import os
# gunzip replaces the .gz archive with goa_human.gaf, so test for the unzipped file
if not os.path.exists('goa_human.gaf'):
    !wget http://current.geneontology.org/annotations/goa_human.gaf.gz
    !gunzip goa_human.gaf.gz
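
The `!wget` and `!gunzip` lines are IPython shell escapes, so the cell above only runs inside a notebook on a machine with both tools installed. For a plain Python script, the same step could be sketched with the standard library alone. The helper names `gunzip_file` and `fetch_gaf` are hypothetical and are not part of GOATOOLS.

```python
import gzip
import os
import shutil
import urllib.request

GAF_URL = "http://current.geneontology.org/annotations/goa_human.gaf.gz"


def gunzip_file(src, dst):
    """Decompress the gzip file src into dst and return dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def fetch_gaf(url=GAF_URL, gaf="goa_human.gaf"):
    """Download and decompress the GAF file unless it already exists."""
    if os.path.exists(gaf):
        return gaf
    gz_path = gaf + ".gz"
    urllib.request.urlretrieve(url, gz_path)  # network access required
    gunzip_file(gz_path, gaf)
    os.remove(gz_path)  # mirror gunzip, which removes the .gz archive
    return gaf
```

Calling `fetch_gaf()` with no arguments would leave `goa_human.gaf` in the working directory, ready for the GafReader.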

2) Load the GAF file into the GafReader


In [2]:
from goatools.anno.gaf_reader import GafReader

ogaf = GafReader("goa_human.gaf")


HMS:0:00:13.490551 424,966 annotations READ: goa_human.gaf 

3) Get Annotations

The annotations are stored in three dicts, one for each GODAG branch (namespace): BP, MF, and CC. In each dict:

  • the key is the protein ID and
  • the value is the set of GO IDs associated with the protein.
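
As an illustration of that layout, the sketch below builds the same nested structure by hand from a few of the annotations printed later in this notebook. It assumes that each value is a set of GO IDs and that the namespace keys are `BP`, `MF`, and `CC`. The names `toy_ns2assc` and `go_ids_for` are hypothetical; `toy_ns2assc` only stands in for the result of `get_ns2assc()`.

```python
# Hand-built stand-in for the dict returned by get_ns2assc():
# namespace -> protein ID -> set of GO IDs
toy_ns2assc = {
    "BP": {"A0A075B6H9": {"GO:0002250"}},
    "MF": {"A0A024RBG1": {"GO:0003723", "GO:0008486"},
           "A0A075B6H9": {"GO:0003823"}},
    "CC": {"A0A024RBG1": {"GO:0005829"},
           "A0A075B6H9": {"GO:0005886"}},
}


def go_ids_for(ns2assc, protein_id):
    """Return {namespace: sorted GO IDs} for one protein.

    Namespaces in which the protein has no annotation are skipped.
    """
    return {
        ns: sorted(assc[protein_id])
        for ns, assc in ns2assc.items()
        if protein_id in assc
    }


print(go_ids_for(toy_ns2assc, "A0A024RBG1"))
# {'MF': ['GO:0003723', 'GO:0008486'], 'CC': ['GO:0005829']}
```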

In [3]:
ns2assc = ogaf.get_ns2assc()

In [4]:
for namespace, associations in ns2assc.items():
    for protein_id, go_ids in sorted(associations.items())[:3]:
        print("{NS} {PROT:7} : {GOs}".format(
            NS=namespace,
            PROT=protein_id,
            GOs=' '.join(sorted(go_ids))))


BP A0A075B6H9 : GO:0002250
BP A0A075B6I0 : GO:0002250
BP A0A075B6I1 : GO:0002250
MF A0A024RBG1 : GO:0003723 GO:0008486 GO:0046872 GO:0052840 GO:0052842
MF A0A075B6H9 : GO:0003823
MF A0A075B6I0 : GO:0003823
CC A0A024RBG1 : GO:0005829
CC A0A075B6H9 : GO:0005886
CC A0A075B6I0 : GO:0005886
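
A common follow-up is the reverse lookup: which proteins carry a given GO term? The sketch below inverts the association dict of one namespace. The helper `invert_assc` is a hypothetical name, not a GOATOOLS function, and the toy input copies the three MF rows from the output above.

```python
from collections import defaultdict


def invert_assc(assc):
    """Turn {protein ID: GO IDs} into {GO ID: set of protein IDs}."""
    go2prots = defaultdict(set)
    for protein_id, go_ids in assc.items():
        for go_id in go_ids:
            go2prots[go_id].add(protein_id)
    return dict(go2prots)


# Toy MF associations copied from the printed output
mf_assc = {
    "A0A024RBG1": {"GO:0003723", "GO:0008486", "GO:0046872",
                   "GO:0052840", "GO:0052842"},
    "A0A075B6H9": {"GO:0003823"},
    "A0A075B6I0": {"GO:0003823"},
}
go2prots = invert_assc(mf_assc)
print(sorted(go2prots["GO:0003823"]))  # ['A0A075B6H9', 'A0A075B6I0']
```

With the real data, the same call would take one namespace at a time, for example `invert_assc(ns2assc['MF'])`.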

Bonus: The GAF is stored as a list of namedtuples

The list of namedtuples is stored in the GafReader data member named associations.

Each namedtuple stores data for one line in the GAF file.


In [5]:
# Sort the list of GAF namedtuples by ID
nts = sorted(ogaf.associations, key=lambda nt:nt.DB_ID)

# Print one namedtuple
print(nts[0])


ntgafobj(DB='UniProtKB', DB_ID='A0A024RBG1', DB_Symbol='NUDT4B', Qualifier=set(), GO_ID='GO:0003723', DB_Reference={'GO_REF:0000037'}, Evidence_Code='IEA', With_From={'UniProtKB-KW:KW-0694'}, NS='MF', DB_Name={'Diphosphoinositol polyphosphate phosphohydrolase NUDT4B'}, DB_Synonym={'NUDT4B'}, DB_Type='protein', Taxon=[9606], Date=datetime.date(2019, 4, 6), Assigned_By='UniProt', Extension=None, Gene_Product_Form_ID=set())

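Namedtuple fields

Each row gives the field name, the column index, whether the column is required, how many values it may hold, and an example value as it appears in the GAF file. As the namedtuple printed above shows, GafReader names column 8 NS instead of Aspect and column 15 Extension instead of Annotation_Extension, and it converts some raw values when reading (for example, Date becomes a datetime.date and Taxon becomes a list of integers).

DB                   #  0 required 1              UniProtKB
DB_ID                #  1 required 1              P12345
DB_Symbol            #  2 required 1              PHO3
Qualifier            #  3 optional 0 or greater   NOT
GO_ID                #  4 required 1              GO:0003993
DB_Reference         #  5 required 1 or greater   PMID:2676709
Evidence_Code        #  6 required 1              IMP
With_From            #  7 optional 0 or greater   GO:0000346
Aspect               #  8 required 1              F
DB_Name              #  9 optional 0 or 1         Toll-like receptor 4
DB_Synonym           # 10 optional 0 or greater   hToll|Tollbooth
DB_Type              # 11 required 1              protein
Taxon                # 12 required 1 or 2         taxon:9606
Date                 # 13 required 1              20090118
Assigned_By          # 14 required 1              SGD
Annotation_Extension # 15 optional 0 or greater   part_of(CL:0000576)
Gene_Product_Form_ID # 16 optional 0 or 1         UniProtKB:P12345-2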
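
To make the column table concrete, here is a minimal sketch that splits one tab-separated, 17-column GAF line into a namedtuple using the column names listed above. It is not GafReader's implementation: the names `GafLine` and `parse_gaf_line` are hypothetical, every field is kept as a raw string, and multi-valued columns are not split on `|`. The sample line is assembled from the example values in the table and is not a real annotation.

```python
from collections import namedtuple

# Column names in GAF file order (indices 0-16 in the table)
GAF_FIELDS = [
    "DB", "DB_ID", "DB_Symbol", "Qualifier", "GO_ID", "DB_Reference",
    "Evidence_Code", "With_From", "Aspect", "DB_Name", "DB_Synonym",
    "DB_Type", "Taxon", "Date", "Assigned_By", "Annotation_Extension",
    "Gene_Product_Form_ID",
]
GafLine = namedtuple("GafLine", GAF_FIELDS)


def parse_gaf_line(line):
    """Split one GAF data line into a GafLine of raw strings."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(GAF_FIELDS):
        raise ValueError("expected {N} columns, found {M}".format(
            N=len(GAF_FIELDS), M=len(values)))
    return GafLine(*values)


# Illustrative line built from the table's example values; empty strings
# stand for optional columns that are left blank.
sample = "\t".join([
    "UniProtKB", "P12345", "PHO3", "", "GO:0003993", "PMID:2676709",
    "IMP", "", "F", "Toll-like receptor 4", "hToll|Tollbooth",
    "protein", "taxon:9606", "20090118", "SGD", "part_of(CL:0000576)",
    "UniProtKB:P12345-2",
]) + "\n"

nt = parse_gaf_line(sample)
print(nt.DB_ID, nt.GO_ID, nt.Aspect)  # P12345 GO:0003993 F
```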

In [6]:
fmtpat = '{DB_ID} {DB_Symbol:13} {GO_ID} {Evidence_Code} {Date} {Assigned_By}'
for nt_line in nts[:10]:
    print(fmtpat.format(**nt_line._asdict()))


A0A024RBG1 NUDT4B        GO:0003723 IEA 2019-04-06 UniProt
A0A024RBG1 NUDT4B        GO:0005829 IDA 2016-12-04 HPA
A0A024RBG1 NUDT4B        GO:0008486 IEA 2019-04-06 UniProt
A0A024RBG1 NUDT4B        GO:0046872 IEA 2019-04-06 UniProt
A0A024RBG1 NUDT4B        GO:0052840 IEA 2019-04-06 UniProt
A0A024RBG1 NUDT4B        GO:0052842 IEA 2019-04-06 UniProt
A0A075B6H9 IGLV4-69      GO:0002250 IEA 2019-04-06 UniProt
A0A075B6H9 IGLV4-69      GO:0003823 IEA 2019-04-06 UniProt
A0A075B6H9 IGLV4-69      GO:0005886 IEA 2019-04-06 UniProt
A0A075B6I0 IGLV8-61      GO:0002250 IEA 2019-04-06 UniProt
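
Because each annotation is a namedtuple, ordinary Python tools are enough to summarize or filter the list. The sketch below tallies evidence codes and keeps only the annotations whose evidence code is not IEA (Inferred from Electronic Annotation). The `ToyAnno` type and its rows are stand-ins copied from four lines of the output above; with real data the same expressions would iterate over `ogaf.associations`.

```python
from collections import Counter, namedtuple

# Stand-in for the GAF namedtuple, reduced to four fields
ToyAnno = namedtuple("ToyAnno", "DB_ID DB_Symbol GO_ID Evidence_Code")
toy_annos = [
    ToyAnno("A0A024RBG1", "NUDT4B", "GO:0003723", "IEA"),
    ToyAnno("A0A024RBG1", "NUDT4B", "GO:0005829", "IDA"),
    ToyAnno("A0A075B6H9", "IGLV4-69", "GO:0002250", "IEA"),
    ToyAnno("A0A075B6I0", "IGLV8-61", "GO:0002250", "IEA"),
]

# Count how often each evidence code occurs
code_counts = Counter(nt.Evidence_Code for nt in toy_annos)
print(code_counts.most_common())  # [('IEA', 3), ('IDA', 1)]

# Keep only annotations that were not inferred electronically
non_iea = [nt for nt in toy_annos if nt.Evidence_Code != "IEA"]
for nt in non_iea:
    print(nt.DB_ID, nt.GO_ID, nt.Evidence_Code)  # A0A024RBG1 GO:0005829 IDA
```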

Copyright (C) 2010-2019, DV Klopfenstein, Haibao Tang. All rights reserved.