GOparser Demo

Author: Florian Wagner
Email: florian.wagner@duke.edu

This notebook demonstrates how to use the GOparser package to parse and query Gene Ontology data.



In [1]:

    
# get package versions
from pkg_resources import require

print 'Package versions'
print '----------------'
print require('genometools')[0]
print require('goparser')[0]









    



Package versions
----------------
genometools 1.1.0
goparser 1.1.0



In [2]:

    
gene_annotation_file = 'Homo_sapiens.GRCh38.82.gtf.gz'
protein_coding_gene_file = 'protein_coding_genes_human.tsv'
go_annotation_file = 'gene_association.goa_human.149.gz'
go_ontology_file = 'go-basic_2015-10-12.obo'

Download all required data



In [3]:

    
# download gene annotations
!curl -o "$gene_annotation_file" \
        "ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz"









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.2M  100 44.2M    0     0  1079k      0  0:00:41  0:00:41 --:--:-- 1158k



In [4]:

    
# download UniProt-GOA GO annotation file
!curl -o "$go_annotation_file" \
        "ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/HUMAN/gene_association.goa_human.149.gz"









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6573k  100 6573k    0     0   934k      0  0:00:07  0:00:07 --:--:-- 1287k



In [5]:

    
# look at which version of the Gene Ontology was used for the GO annotation file
# (answer: 2015-10-12)
!gunzip -c "$go_annotation_file" | head -n 12









    



!gaf-version: 2.1
!
!The set of protein accessions included in this file is based on UniProt complete proteomes, which may provide more than one protein per gene.
!They include all Swiss-Prot entries for the species plus any TrEMBL entries that have an Ensembl DR line. The TrEMBL entries are likely to overlap with the Swiss-Prot entries or their isoforms.
!If a particular protein accession is not annotated with GO, then it will not appear in this file.
!
!Note that the annotation set in this file is filtered in order to reduce redundancy; the full, unfiltered set can be found in
!ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
!
!Generated: 2015-10-12 09:17
!GO-version: http://purl.obolibrary.org/obo/go/releases/2015-10-09/go.owl
!

gzip: stdout: Broken pipe



In [6]:

    
# download gene ontology file
!curl -o "$go_ontology_file" \
        http://viewvc.geneontology.org/viewvc/GO-SVN/ontology-releases/2015-10-12/go-basic.obo?revision=29122









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29.7M    0 29.7M    0     0  1145k      0 --:--:--  0:00:26 --:--:-- 1111k

Extract list of human protein-coding genes



In [7]:

    
# generate list of human protein-coding genes
species = 'human'

!gunzip -c $gene_annotation_file | \
        extract_protein_coding_genes.py -s $species -o $protein_coding_gene_file









    



[2015-11-21 13:12:52] INFO: Regular expression used for filtering chromosome names: "(?:\d\d?|MT|X|Y)$"
[2015-11-21 13:12:52] INFO: Parsing data...
[2015-11-21 13:13:01] INFO: done (parsed 2561568 lines).
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: Gene chromosomes (25):
[2015-11-21 13:13:01] INFO: 	1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: Excluded chromosomes (15):
[2015-11-21 13:13:01] INFO: 	GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1, GL000218.1, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: Gene sources:
[2015-11-21 13:13:01] INFO: 	ensembl_havana: 18864
[2015-11-21 13:13:01] INFO: 	havana: 738
[2015-11-21 13:13:01] INFO: 	ensembl: 223
[2015-11-21 13:13:01] INFO: 	insdc: 13
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: Gene types:
[2015-11-21 13:13:01] INFO: 	protein_coding: 19779
[2015-11-21 13:13:01] INFO: 	polymorphic_pseudogene: 59
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: # Genes with redundant annotations: 76
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: Polymorphic pseudogenes (59): AKR7L (1), C6orf183 (1), CASP12 (1), CYP2D7 (1), FCGR2C (1), GBA3 (1), GSTT2 (1), IFNL4 (1), KIR2DS4 (1), MROH5 (1), NAT8B (1), OR10A6 (1), OR10AC1 (1), OR10C1 (1), OR10J4 (1), OR10X1 (1), OR11H7 (1), OR12D1 (1), OR12D2 (1), OR13C7 (1), OR1B1 (1), OR1S1 (1), OR2AG1 (1), OR2F1 (1), OR2J1 (1), OR2L8 (1), OR2S2 (1), OR2T11 (1), OR4A8 (1), OR4C16 (1), OR4Q2 (1), OR4X1 (1), OR4X2 (1), OR51B2 (1), OR51F1 (1), OR51G1 (1), OR52B4 (1), OR52E1 (1), OR52R1 (1), OR52Z1 (1), OR5AC1 (1), OR5AL1 (1), OR5AR1 (1), OR5D13 (1), OR5G3 (1), OR5H6 (1), OR5H8 (1), OR5L1 (1), OR5R1 (1), OR6Q1 (1), OR8B4 (1), OR8D2 (1), OR8J2 (1), OR8K3 (1), PKD1L2 (1), PNLIPRP2 (1), SERPINA2 (1), TUBB8P7 (1), UBE2NL (1)
[2015-11-21 13:13:01] INFO: 
[2015-11-21 13:13:01] INFO: Total protein-coding genes: 19761

Parse human GO annotations



In [8]:

    
import sys
from genometools import misc
from goparser import GOParser

# configure a logger that prints to stdout
misc.configure_logger('goparser',log_stream=sys.stdout)

# instantiate a GOParser object
P = GOParser()

# parse the ontology (.obo) file from the Gene Ontology Consortium
P.parse_ontology(go_ontology_file)

# parse the GO annotation (.gaf) file from the UniProt-GOA database
# --- only include annotations with certain evidence codes
select_evidence = ['IDA','IGI','IMP','ISO','ISS','IC','NAS','TAS']
P.parse_annotations(go_annotation_file,protein_coding_gene_file,\
                    select_evidence=select_evidence)









    



[2015-11-21 13:13:01] INFO: Parsed 43799 GO term definitions.
[2015-11-21 13:13:01] INFO: Adding child and part relationships...
[2015-11-21 13:13:01] INFO: Flattening ancestors...
[2015-11-21 13:13:06] INFO: Flattening descendants...
[2015-11-21 13:13:10] INFO: Read 19761 genes.
[2015-11-21 13:13:11] INFO: Parsing annotations...
[2015-11-21 13:13:15] INFO: Parsed 499839 positive GO annotations (274178 = 54.9% excluded based on evidence type).
[2015-11-21 13:13:15] WARNING: Warning: 7375 annotations with 329 unkonwn gene names.
[2015-11-21 13:13:15] INFO: Found a total of 218286 valid annotations.
[2015-11-21 13:13:15] INFO: 142152 unique Gene-Term associations.

Get information about a specific GO term



In [9]:

    
term = P.get_term_by_name('canonical Wnt signaling pathway')
print 'Term name:'
print '----------'
print term.get_pretty_format()
print

term_id = term.id
term_name = term.name
annotated_genes = P.get_goterm_genes(term_id)
print 'Genes annotated with GO term "%s", %s:' %(term_id,term_name)
print '------------------------------------------------------------------------'
print 'Total number of genes: %d' %(len(annotated_genes))
print ', '.join(sorted(annotated_genes))









    



Term name:
----------
BP: canonical Wnt signal. pathway (GO:0060070)

Genes annotated with GO term "GO:0060070", canonical Wnt signaling pathway:
------------------------------------------------------------------------
Total number of genes: 48
APC, AXIN1, CAV1, CDH3, CHD8, CTNNB1, DVL1, DVL2, DVL3, FZD1, FZD2, FZD3, FZD4, FZD5, FZD7, FZD8, GATA3, GSK3B, HOXB9, LEF1, LRP5, LRP6, MYC, NDP, OTULIN, PPAP2B, PTEN, PTK7, PTPRU, RAB5A, RARG, SCYL2, SFRP1, SNAI2, SOX4, TBL1X, TBL1XR1, TCF7L2, UBE2B, WNT1, WNT10B, WNT2, WNT3, WNT3A, WNT4, WNT7A, WNT7B, WNT8A

Find all GO terms that the MYC gene is annotated with



In [10]:

    
associated_terms = P.get_gene_goterms('MYC')
print 'GO terms associated with MYC:'
print '-----------------------------'
print '\n'.join(sorted([t.get_pretty_format() for t in associated_terms]))









    



GO terms associated with MYC:
-----------------------------
BP: MAPK cascade (GO:0000165)
BP: Notch signal. pathway (GO:0007219)
BP: branching involved in ureteric bud morphogenesis (GO:0001658)
BP: canonical Wnt signal. pathway (GO:0060070)
BP: cell cycle arrest (GO:0007050)
BP: cellular iron ion homeostasis (GO:0006879)
BP: cellular response to DNA damage stimulus (GO:0006974)
BP: cellular response to drug (GO:0035690)
BP: chromatin remodeling (GO:0006338)
BP: chromosome organization (GO:0051276)
BP: energy reserve metabolic process (GO:0006112)
BP: fibroblast apoptotic process (GO:0044346)
BP: gene expression (GO:0010467)
BP: neg. regulation of apoptotic process (GO:0043066)
BP: neg. regulation of cell division (GO:0051782)
BP: neg. regulation of fibroblast prolif. (GO:0048147)
BP: neg. regulation of monocyte differentiation (GO:0045656)
BP: neg. regulation of stress-activated MAPK cascade (GO:0032873)
BP: neg. regulation of transcription from RNA polymerase II promoter (GO:0000122)
BP: oxygen transport (GO:0015671)
BP: pos. regulation of DNA biosynthetic process (GO:2000573)
BP: pos. regulation of cell prolif. (GO:0008284)
BP: pos. regulation of cysteine-type endopeptidase activity involved in apoptotic process (GO:0043280)
BP: pos. regulation of epithelial cell prolif. (GO:0050679)
BP: pos. regulation of fibroblast prolif. (GO:0048146)
BP: pos. regulation of mesenchymal cell prolif. (GO:0002053)
BP: pos. regulation of metanephric cap mesenchymal cell prolif. (GO:0090096)
BP: pos. regulation of response to DNA damage stimulus (GO:2001022)
BP: pos. regulation of transcription from RNA polymerase II promoter (GO:0045944)
BP: pos. regulation of transcription, DNA-templated (GO:0045893)
BP: regulation of gene expression (GO:0010468)
BP: regulation of telomere maintenance (GO:0032204)
BP: response to gamma radiation (GO:0010332)
BP: response to growth factor (GO:0070848)
BP: transcription initiation from RNA polymerase II promoter (GO:0006367)
BP: transcription, DNA-templated (GO:0006351)
BP: transforming growth factor beta receptor signal. pathway (GO:0007179)
CC: cytosol (GO:0005829)
CC: nucleolus (GO:0005730)
CC: nucleoplasm (GO:0005654)
CC: nucleus (GO:0005634)
CC: protein complex (GO:0043234)
MF: DNA binding (GO:0003677)
MF: E-box binding (GO:0070888)
MF: RNA polymerase II core promoter proximal region sequence-specific DNA binding (GO:0000978)
MF: protein complex binding (GO:0032403)
MF: transcription factor activity, sequence-specific DNA binding (GO:0003700)
MF: transcriptional activator activity, RNA polymerase II core promoter proximal region sequence-specific binding (GO:0001077)