obonet tutorial: import and analyze the Gene Ontology in Python

This tutorial shows:

  1. How to read the Gene Ontology OBO export into networkx using the obonet package.
  2. Simple tasks you can do with the networkx.MultiDiGraph data structure.

The notebook is written for Python 3.6, but obonet itself works with Python 3.4+.


In [1]:
import networkx
import obonet

Read the Gene Ontology

Learn more about the Gene Ontology (GO) downloads here. Note how we can read the OBO file directly from a URL: obonet.read_obo automatically detects whether it's passed a local path, URL, or open file. In addition, obonet.read_obo will automatically decompress files ending in .gz, .bz2, or .xz.


In [2]:
%%time
url = 'http://purl.obolibrary.org/obo/go/go-basic.obo'
graph = obonet.read_obo(url)


CPU times: user 7.52 s, sys: 280 ms, total: 7.8 s
Wall time: 12.2 s
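
To make the extension-based decompression concrete, here is a minimal standard-library sketch of the idea. It is not obonet's actual implementation: the OPENERS mapping and the open_text helper are hypothetical names, and the sketch assumes the supported extensions are .gz, .bz2, and .xz and handles local paths only.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical mapping for illustration; obonet's internals may differ.
OPENERS = {'.gz': gzip.open, '.bz2': bz2.open, '.xz': lzma.open}


def open_text(path):
    """Open a possibly compressed local file in text mode, chosen by extension."""
    extension = os.path.splitext(path)[1]
    opener = OPENERS.get(extension, open)
    return opener(path, mode='rt', encoding='utf-8')


# Round trip: write a one-line gzip file, then read it back as plain text.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'toy.obo.gz')
    with gzip.open(path, mode='wt', encoding='utf-8') as handle:
        handle.write('format-version: 1.2\n')
    with open_text(path) as handle:
        first_line = handle.readline().strip()

print(first_line)
```

When using obonet itself none of this is needed: the path or URL is passed to obonet.read_obo as in the cell above.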

In [3]:
# Number of nodes
len(graph)


Out[3]:
44560

In [4]:
# Number of edges
graph.number_of_edges()


Out[4]:
92680

In [5]:
# Check if the ontology is a DAG
networkx.is_directed_acyclic_graph(graph)


Out[5]:
True

Lookup node properties

Looking up a node by its identifier returns a dictionary of that term's properties.


In [6]:
# Retrieve properties of phagocytosis
# (graph.nodes[...] needs networkx 2.0+; older releases used graph.node[...])
graph.nodes['GO:0006909']


Out[6]:
{'def': '"An endocytosis process that results in the engulfment of external particulate material by phagocytes. The particles are initially contained within phagocytic vacuoles (phagosomes), which then fuse with primary lysosomes to effect digestion of the particles." [ISBN:0198506732]',
 'name': 'phagocytosis',
 'namespace': 'biological_process',
 'xref': ['Wikipedia:Phagocytosis']}

In [7]:
# Retrieve properties of pilus shaft
# (graph.nodes[...] needs networkx 2.0+; older releases used graph.node[...])
graph.nodes['GO:0009418']


Out[7]:
{'def': '"The long, slender, mid section of a pilus." [GOC:jl]',
 'name': 'pilus shaft',
 'namespace': 'cellular_component',
 'subset': ['gosubset_prok'],
 'synonym': ['"fimbrial shaft" EXACT []']}
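
The def value keeps its OBO formatting: a quoted definition followed by a bracketed list of sources. Below is a minimal sketch for separating the two, assuming every definition follows that "text" [sources] layout and that sources are separated by commas. split_definition is a hypothetical helper, not part of obonet.

```python
import re

# Hypothetical helper for illustration; not part of obonet.
DEFINITION_PATTERN = re.compile(r'^"(?P<text>.*)"\s*\[(?P<sources>.*)\]$')


def split_definition(definition):
    """Split an OBO def string into (text, list of source references)."""
    match = DEFINITION_PATTERN.match(definition)
    if match is None:
        # Unexpected layout: return the raw string and no sources.
        return definition, []
    sources = [s.strip() for s in match.group('sources').split(',') if s.strip()]
    return match.group('text'), sources


# Definition string copied from the pilus shaft output above.
text, sources = split_definition('"The long, slender, mid section of a pilus." [GOC:jl]')
print(text)
print(sources)
```

Source lists that carry descriptions or escaped commas would need a more careful parser than this comma split.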

Create name mappings

Note that for some OBO ontologies, some nodes only have an id and not a name (see issue).


In [8]:
id_to_name = {id_: data.get('name') for id_, data in graph.nodes(data=True)}
name_to_id = {data['name']: id_ for id_, data in graph.nodes(data=True) if 'name' in data}

In [9]:
# Get the name for GO:0042552
id_to_name['GO:0042552']


Out[9]:
'myelination'

In [10]:
# Get the id for myelination
name_to_id['myelination']


Out[10]:
'GO:0042552'
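
The two comprehensions treat nameless nodes differently, which a toy example makes visible. The sketch below uses a plain list of (id, data) pairs as a stand-in for graph.nodes(data=True); the TOY: identifiers and the toy_ variable names are made up for illustration.

```python
# Toy stand-in for graph.nodes(data=True); the TOY: ids are made up.
toy_nodes = [
    ('TOY:1', {'name': 'root term'}),
    ('TOY:2', {}),  # a node with an id but no name
]

# data.get('name') maps nameless nodes to None instead of raising KeyError.
toy_id_to_name = {id_: data.get('name') for id_, data in toy_nodes}
# The "if 'name' in data" filter leaves nameless nodes out of the reverse map.
toy_name_to_id = {data['name']: id_ for id_, data in toy_nodes if 'name' in data}

print(toy_id_to_name)
print(toy_name_to_id)
```

Code that formats names from id_to_name should therefore be prepared for None values when working with such ontologies.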

Find parent or child relationships


In [11]:
# Find edges to parent terms
node = name_to_id['pilus part']
for child, parent, key in graph.out_edges(node, keys=True):
    print(f'• {id_to_name[child]} ⟶ {key} ⟶ {id_to_name[parent]}')


• pilus part ⟶ is_a ⟶ intracellular organelle part
• pilus part ⟶ is_a ⟶ cell projection part
• pilus part ⟶ part_of ⟶ pilus

In [12]:
# Find edges from child terms
node = name_to_id['pilus part']
for child, parent, key in graph.in_edges(node, keys=True):
    print(f'• {id_to_name[parent]} ⟵ {key} ⟵ {id_to_name[child]}')


• pilus part ⟵ is_a ⟵ pilus shaft
• pilus part ⟵ is_a ⟵ pilus tip
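
Because the edge key records the relationship type, traversals over the full graph mix is_a edges with others such as part_of. One possible way to restrict an analysis to is_a edges is to copy them into a plain DiGraph. The sketch below does this on a small toy graph whose edges are chosen for illustration rather than copied from GO; the variable names are hypothetical.

```python
import networkx

# Toy graph: edges point from child to parent, and the edge key
# holds the relationship type, as in the cells above.
toy = networkx.MultiDiGraph()
toy.add_edge('pilus tip', 'pilus part', key='is_a')
toy.add_edge('pilus part', 'cell projection part', key='is_a')
toy.add_edge('pilus part', 'pilus', key='part_of')

# Keep only the is_a edges in a plain DiGraph.
is_a_graph = networkx.DiGraph()
is_a_graph.add_edges_from(
    (child, parent)
    for child, parent, key in toy.edges(keys=True)
    if key == 'is_a'
)

all_superterms = sorted(networkx.descendants(toy, 'pilus tip'))
is_a_superterms = sorted(networkx.descendants(is_a_graph, 'pilus tip'))
print(all_superterms)   # includes 'pilus', reached through part_of
print(is_a_superterms)  # only terms reached through is_a
```

Note that terms without any is_a edge do not appear in is_a_graph, and node attributes are not copied in this sketch.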

Find all superterms of myelination


In [13]:
# Edges point from child to parent, so the superterms are the graph descendants
sorted(id_to_name[superterm] for superterm in networkx.descendants(graph, 'GO:0042552'))


Out[13]:
['anatomical structure development',
 'axon ensheathment',
 'biological_process',
 'cellular process',
 'developmental process',
 'ensheathment of neurons',
 'multicellular organism development',
 'multicellular organismal process',
 'nervous system development',
 'single-multicellular organism process',
 'single-organism cellular process',
 'single-organism developmental process',
 'single-organism process',
 'system development']

Find all subterms of myelination


In [14]:
# Edges point from child to parent, so the subterms are the graph ancestors
sorted(id_to_name[subterm] for subterm in networkx.ancestors(graph, 'GO:0042552'))


Out[14]:
['central nervous system myelin formation',
 'central nervous system myelin maintenance',
 'central nervous system myelination',
 'myelin assembly',
 'myelin maintenance',
 'myelination in peripheral nervous system',
 'myelination of anterior lateral line nerve axons',
 'myelination of lateral line nerve axons',
 'myelination of posterior lateral line nerve axons',
 'negative regulation of myelination',
 'paranodal junction assembly',
 'peripheral nervous system myelin formation',
 'peripheral nervous system myelin maintenance',
 'positive regulation of myelination',
 'regulation of myelination']
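
The pairing of descendants with superterms and ancestors with subterms can look inverted at first. It follows from the edge direction: each edge runs from the more specific term to the more general one. The toy graph below, with edges chosen for illustration rather than copied from GO, shows the effect on a scale that can be checked by eye.

```python
import networkx

# Toy ontology: each edge points from the more specific term
# to the more general one, matching the graph obonet builds.
toy = networkx.MultiDiGraph()
toy.add_edge('myelination', 'axon ensheathment', key='is_a')
toy.add_edge('axon ensheathment', 'cellular process', key='is_a')
toy.add_edge('myelin assembly', 'myelination', key='part_of')

# Nodes reachable from the term: its superterms.
superterms = sorted(networkx.descendants(toy, 'myelination'))
# Nodes that can reach the term: its subterms.
subterms = sorted(networkx.ancestors(toy, 'myelination'))

print(superterms)  # ['axon ensheathment', 'cellular process']
print(subterms)    # ['myelin assembly']
```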

Find all paths to the root


In [15]:
paths = networkx.all_simple_paths(
    graph,
    source=name_to_id['starch binding'],
    target=name_to_id['molecular_function']
)
for path in paths:
    print('•', ' ⟶ '.join(id_to_name[node] for node in path))


• starch binding ⟶ polysaccharide binding ⟶ pattern binding ⟶ binding ⟶ molecular_function
• starch binding ⟶ polysaccharide binding ⟶ carbohydrate binding ⟶ binding ⟶ molecular_function

See the ontology metadata


In [16]:
graph.graph


Out[16]:
{'auto-generated-by': 'TermGenie 1.0',
 'data-version': 'releases/2017-03-26',
 'date': '23:02:2017 10:01',
 'default-namespace': ['gene_ontology'],
 'format-version': '1.2',
 'instances': [],
 'name': 'go',
 'ontology': 'go',
 'remark': ['cvs version: use data-version',
  'Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]'],
 'saved-by': 'slaulederkind',
 'subsetdef': ['goantislim_grouping "Grouping classes that can be excluded"',
  'gocheck_do_not_annotate "Term not to be used for direct annotation"',
  'gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"',
  'goslim_agr "AGR slim"',
  'goslim_aspergillus "Aspergillus GO slim"',
  'goslim_candida "Candida GO slim"',
  'goslim_chembl "ChEMBL protein targets summary"',
  'goslim_generic "Generic GO slim"',
  'goslim_goa "GOA and proteome slim"',
  'goslim_metagenomics "Metagenomics GO slim"',
  'goslim_mouse "Mouse GO slim"',
  'goslim_pir "PIR GO slim"',
  'goslim_plant "Plant GO slim"',
  'goslim_pombe "Fission yeast GO slim"',
  'goslim_synapse "synapse GO slim"',
  'goslim_virus "Viral GO slim"',
  'goslim_yeast "Yeast GO slim"',
  'gosubset_prok "Prokaryotic GO subset"',
  'mf_needs_review "Catalytic activity terms in need of attention"',
  'termgenie_unvetted "Terms created by TermGenie that do not follow a template and require additional vetting by editors"',
  'virus_checked "Viral overhaul terms"'],
 'synonymtypedef': ['syngo_official_label "label approved by the SynGO project"',
  'systematic_synonym "Systematic synonym" EXACT'],
 'typedefs': [{'id': 'negatively_regulates',
   'is_a': ['regulates'],
   'name': 'negatively regulates',
   'namespace': 'external',
   'transitive_over': ['part_of'],
   'xref': ['RO:0002212']},
  {'id': 'never_in_taxon',
   'is_class_level': 'true',
   'is_metadata_tag': 'true',
   'name': 'never_in_taxon',
   'namespace': 'external',
   'xref': ['RO:0002161']},
  {'id': 'part_of',
   'is_transitive': 'true',
   'name': 'part of',
   'namespace': 'external',
   'xref': ['BFO:0000050']},
  {'holds_over_chain': ['negatively_regulates negatively_regulates'],
   'id': 'positively_regulates',
   'is_a': ['regulates'],
   'name': 'positively regulates',
   'namespace': 'external',
   'transitive_over': ['part_of'],
   'xref': ['RO:0002213']},
  {'id': 'regulates',
   'is_transitive': 'true',
   'name': 'regulates',
   'namespace': 'external',
   'transitive_over': ['part_of'],
   'xref': ['RO:0002211']}]}
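
The subsetdef entries are kept as raw strings of the form identifier "description". To look up what a subset such as gosubset_prok (seen on the pilus shaft node) means, one option is to split each entry into a dictionary. The sketch below assumes every entry follows that layout; parse_subsetdef is a hypothetical helper, not part of obonet.

```python
def parse_subsetdef(entry):
    """Split 'identifier "description"' into (identifier, description)."""
    identifier, _, description = entry.partition(' ')
    return identifier, description.strip().strip('"')


# Two entries copied from the metadata output above.
subsetdefs = [
    'goslim_agr "AGR slim"',
    'gosubset_prok "Prokaryotic GO subset"',
]
subset_descriptions = dict(parse_subsetdef(entry) for entry in subsetdefs)
print(subset_descriptions['gosubset_prok'])  # Prokaryotic GO subset
```

With the real ontology loaded, the list literal would be replaced by graph.graph['subsetdef'].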