How to get Descendants Count and Information Content

  1. Download Ontologies, if necessary
  2. Choose a set of GO IDs to print
  3. GoSubDag contains descendants count
  4. Print the descendants count (dcnt)
  5. Compare information content (tinfo) for human, mouse, and fly
  6. Print both descendants count and information content: dcnt and tinfo

1. Download Ontologies, if necessary


In [1]:
from goatools.base import get_godag
godag = get_godag("go-basic.obo")


  EXISTS: go-basic.obo
go-basic.obo: fmt(1.2) rel(2019-05-09) 47,407 GO Terms

2. Choose a set of GO IDs to print


In [2]:
# Choose a deep leaf-level GO ID associated with "bacteria"
DESC = 'bacteria'            # GO Term name contains this
NSPC = 'cellular_component'  # Desired namespace

# Create a chooser function which returns True or False
def chooser(goterm):
    """Choose a leaf-level GO term based on its name"""
    b_match = DESC in goterm.name
    # True if GO term is leaf-level (has no children)
    b_leaf = not goterm.children
    # True if GO term is in 'cellular_component' namespace (nspc)
    b_nspc = goterm.namespace == NSPC
    return b_match and b_leaf and b_nspc

# Get GO terms with desired name in desired GO DAG branch
go_ids_selected = set(o.item_id for o in godag.values() if chooser(o))

print('{N} {desc} GO terms'.format(N=len(go_ids_selected), desc=DESC))


24 bacteria GO terms

2b. Get the GO-DAG subset for your GO IDs


In [3]:
from goatools.gosubdag.gosubdag import GoSubDag
gosubdag = GoSubDag(go_ids_selected, godag)


INITIALIZING GoSubDag:  24 sources in  60 GOs rcnt(True). 0 alt GO IDs
             GoSubDag: namedtuple fields: NS level depth GO alt GO_name dcnt D1 id
             GoSubDag: relationships: set()

2c. Get the deepest GO ID in the GO DAG subset


In [4]:
go_id, go_term = max(gosubdag.go2obj.items(), key=lambda t: t[1].depth)

# Print GO ID, using print format in gosubdag
print(go_id, go_term.name)


GO:1990061 bacterial degradosome

2d. Get all parents of the deepest GO ID


In [5]:
go_ids_chosen = go_term.get_all_parents()
print('{N} ancestors for {GO} "{name}"'.format(
    N=len(go_ids_chosen), GO=go_term.item_id, name=go_term.name))


10 ancestors for GO:1990061 "bacterial degradosome"

3. GoSubDag contains descendants count

gosubdag.go2nt

The data member, go2nt, of the class, GoSubDag, is a dict where:

  • go is the GO ID (e.g., GO:1990061)
  • nt is the namedtuple

The namedtuple field, dcnt, contains the descendants count.

Additional namedtuple fields:

field description
NS Namespace: BP, MF, or CC
level Minimum path from the top of the branch
depth Maximum path from the top of the branch
GO GO ID
GO_name GO Name - A short description
dcnt Descendants count

In [6]:
# Add the deep GO ID to its list of ancestors for printing
go_ids_chosen.add(go_id)
nts = [gosubdag.go2nt[go] for go in go_ids_chosen]

4. Print the descendants count (dcnt)


In [7]:
fmt_str = '{I:2}) {NS} {GO:10} {dcnt:11}        D{depth:02}  {GO_name}'

# Print selected GO information
print('IDX NS GO ID      Descendants Count Depth Name')
print('--- -- ---------- ----------------- ----- --------------------')
for idx, nt_go in enumerate(sorted(nts, key=lambda nt: nt.depth), 1):
    print(fmt_str.format(I=idx, **nt_go._asdict()))


IDX NS GO ID      Descendants Count Depth Name
--- -- ---------- ----------------- ----- --------------------
 1) CC GO:0005575        4206        D00  cellular_component
 2) CC GO:0044464        3321        D01  cell part
 3) CC GO:0032991        2116        D01  protein-containing complex
 4) CC GO:0044424        2376        D02  intracellular part
 5) CC GO:1902494         546        D02  catalytic complex
 6) CC GO:0044444        1271        D03  cytoplasmic part
 7) CC GO:1905354           8        D03  exoribonuclease complex
 8) CC GO:0044445          87        D04  cytosolic part
 9) CC GO:0000178           5        D04  exosome (RNase complex)
10) CC GO:0000177           2        D05  cytoplasmic exosome (RNase complex)
11) CC GO:1990061           0        D06  bacterial degradosome

$ ../scripts/go_plot.py GO:1990061

5. Compare information content (tinfo) for human, mouse, and fly


In [8]:
# Short alias, indicating species and the annotation filename
abc2gpad = {
    'hsa': 'goa_human.gpad',  # human
    'mmu': 'mgi.gpad',        # mouse
    'dme': 'fb.gpad'          # fly
}

5a. Download annotations, if necessary


In [9]:
import os
for abc, gpad in abc2gpad.items():
    if not os.path.exists(gpad):
        fin_gz = '{FILE}.gz'.format(FILE=gpad)
        wget_gz = 'current.geneontology.org/annotations/{GZ}'.format(GZ=fin_gz)
        !wget $wget_gz
        !gunzip $fin_gz

5b. Read annotations


In [10]:
from goatools.anno.factory import get_objanno
# Load each annotation into an annotation object
abc2objanno = {s:get_objanno(f, 'gpad', godag=godag) for s, f in abc2gpad.items()}


HMS:0:00:15.276513 479,437 annotations READ: goa_human.gpad 
HMS:0:00:13.508003 409,743 annotations READ: mgi.gpad 
HMS:0:00:04.770149 114,704 annotations READ: fb.gpad 

5c. Get TermCounts object for the annotation of each species


In [11]:
from goatools.semantic import TermCounts
abc2objtcnt = {s:TermCounts(godag, o.get_id2gos_nss()) for s, o in abc2objanno.items()}

5d. Print tinfo for GO IDs for human, mouse, and fly


In [12]:
from goatools.semantic import get_info_content
fmt_str = ('{I:2}) {NS} {GO:10} {dcnt:4}  D{depth:02}   '
           '{hsa:6.3f} {mmu:6.3f} {dme:6.3f} '
           '{GO_name}')

# Print selected GO information
print('                              |<----- tinfo --->|')
print('IDX NS GO ID      dcnt Depth   hsa    mmu    dme  Name')
print('--- -- ---------- ---- ------ ------ ------ ----- --------------------')
for idx, nt_go in enumerate(sorted(nts, key=lambda nt: nt.depth), 1):
    abc2tinfo = {s: get_info_content(nt_go.GO, o) for s, o in abc2objtcnt.items()}
    print(fmt_str.format(I=idx, **abc2tinfo, **nt_go._asdict()))


                              |<----- tinfo --->|
IDX NS GO ID      dcnt Depth   hsa    mmu    dme  Name
--- -- ---------- ---- ------ ------ ------ ----- --------------------
 1) CC GO:0005575 4206  D00    2.846  2.755  2.411 cellular_component
 2) CC GO:0044464 3321  D01    2.937  2.912  2.589 cell part
 3) CC GO:0032991 2116  D01    4.059  3.983  3.543 protein-containing complex
 4) CC GO:0044424 2376  D02    3.096  3.086  2.734 intracellular part
 5) CC GO:1902494  546  D02    5.488  5.428  4.713 catalytic complex
 6) CC GO:0044444 1271  D03    3.501  3.583  3.403 cytoplasmic part
 7) CC GO:1905354    8  D03    9.469  9.437  8.937 exoribonuclease complex
 8) CC GO:0044445   87  D04    7.147  6.904  6.012 cytosolic part
 9) CC GO:0000178    5  D04    9.510  9.437  8.937 exosome (RNase complex)
10) CC GO:0000177    2  D05   10.049  9.976  9.379 cytoplasmic exosome (RNase complex)
11) CC GO:1990061    0  D06    0.000  0.000  0.000 bacterial degradosome

**Note: tinfo is 0.000 for 'bacterial degradosome' in human mouse and fly because GO:1990061 is not found in the annotations downloaded May 29, 2019

6. Print both descendants count and information content: dcnt and tinfo


In [13]:
gpad_filename = abc2gpad['hsa']
print('GPAD ANNOTATION FILENAME: {GPAD}'.format(GPAD=gpad_filename))


GPAD ANNOTATION FILENAME: goa_human.gpad

In [14]:
# Re-read annotation file for clarity in this tutorial
annoobj = get_objanno(gpad_filename, 'gpad', godag=godag)

# Get associations (Gene ID - to - set of GO IDs) for all namespaces (BP, MF, and CC)
id2goids = annoobj.get_id2gos_nss()

# Create TermCounts object
tcntobj = TermCounts(godag, id2goids)

# Manage a subset of GO IDs using GoSubDag
# This time, load the GO term counts
gosubdag = GoSubDag(go_ids_selected, godag, tcntobj=tcntobj)


HMS:0:00:14.915225 479,437 annotations READ: goa_human.gpad 
INITIALIZING GoSubDag:  24 sources in  60 GOs rcnt(True). 0 alt GO IDs
             GoSubDag: namedtuple fields: NS level depth GO alt GO_name dcnt D1 tcnt tfreq tinfo id
             GoSubDag: relationships: set()

In [15]:
# Print both descendants count and information content: dcnt and tinfo

fmt_str = '{NS} {GO:10} {dcnt:4}  {tinfo:6.3f}  D{depth:02}  {GO_name}'
print('NS GO         dcnt   tinfo depth name')
print('-- ---------- ----- ------  ---  -----------------------')
nts = gosubdag.prt_goids(go_ids_chosen, prtfmt=fmt_str)


NS GO         dcnt   tinfo depth name
-- ---------- ----- ------  ---  -----------------------
CC GO:0005575 4206   2.846  D00  cellular_component
CC GO:0044464 3321   2.937  D01  cell part
CC GO:0032991 2116   4.059  D01  protein-containing complex
CC GO:0044424 2376   3.096  D02  intracellular part
CC GO:1902494  546   5.488  D02  catalytic complex
CC GO:0044444 1271   3.501  D03  cytoplasmic part
CC GO:1905354    8   9.469  D03  exoribonuclease complex
CC GO:0044445   87   7.147  D04  cytosolic part
CC GO:0000178    5   9.510  D04  exosome (RNase complex)
CC GO:0000177    2  10.049  D05  cytoplasmic exosome (RNase complex)
CC GO:1990061    0   0.000  D06  bacterial degradosome

Copyright (C) 2016-2019, DV Klopfenstein et al. All rights reserved