Run a Gene Ontology Enrichment Analysis (GOEA)

We use data from a 2014 Nature paper:
Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells

Note: you must have the Python package, xlrd, installed to run this example.

Note: To create plots, you must have:

Python packages: pyparsing, pydot
Graphviz loaded and your PATH environmental variable pointing to the Graphviz bin directory.

1. Download Ontologies and Associations

1a. Download Ontologies, if necessary



In [1]:

    
# Get http://geneontology.org/ontology/go-basic.obo
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()









    



  EXISTS: go-basic.obo

1b. Download Associations, if necessary



In [2]:

    
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
gene2go = download_ncbi_associations()









    



  EXISTS: gene2go

2. Load Ontologies, Associations and Background gene set

2a. Load Ontologies



In [3]:

    
from goatools.obo_parser import GODag

obodag = GODag("go-basic.obo")









    



load obo file go-basic.obo
46518





    



go-basic.obo: format-version(1.2) data-version(releases/2016-04-27)






    



 nodes imported

2b. Load Associations



In [4]:

    
from __future__ import print_function
from goatools.associations import read_ncbi_gene2go

geneid2gos_mouse = read_ncbi_gene2go("gene2go", taxids=[10090])

print("{N:,} annotated mouse genes".format(N=len(geneid2gos_mouse)))









    



19,030 annotated mouse genes

2c. Load Background gene set

In this example, the background is all mouse protein-codinge genes



In [5]:

    
from goatools.test_data.genes_NCBI_10090_ProteinCoding import GeneID2nt as GeneID2nt_mus

3. Initialize a GOEA object

The GOEA object holds the Ontologies, Associations, and background.
Numerous studies can then be run withough needing to re-load the above items.
In this case, we only run one GOEA.



In [6]:

    
from goatools.go_enrichment import GOEnrichmentStudy

goeaobj = GOEnrichmentStudy(
        GeneID2nt_mus.keys(), # List of mouse protein-coding genes
        geneid2gos_mouse, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # defult multipletest correction method









    



fisher module not installed.  Falling back on scipy.stats.fisher_exact
18,437 out of 28,212 population items found in association

4. Read study genes

~400 genes from the Nature paper supplemental table 4



In [7]:

    
# Data will be stored in this variable
import os
geneid2symbol = {}
# Get xlsx filename where data is stored
ROOT = os.path.dirname(os.getcwd()) # go up 1 level from current working directory
din_xlsx = os.path.join(ROOT, "goatools/test_data/nbt_3102/nbt.3102-S4_GeneIDs.xlsx")
# Read data
if os.path.isfile(din_xlsx):  
    import xlrd
    book = xlrd.open_workbook(din_xlsx)
    pg = book.sheet_by_index(0)
    for r in range(pg.nrows):
        symbol, geneid, pval = [pg.cell_value(r, c) for c in range(pg.ncols)]
        if geneid:
            geneid2symbol[int(geneid)] = symbol

5. Run Gene Ontology Enrichment Analysis (GOEA)

You may choose to keep all results or just the significant results. In this example, we choose to keep only the significant results.



In [8]:

    
# 'p_' means "pvalue". 'fdr_bh' is the multipletest method we are currently using.
geneids_study = geneid2symbol.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]









    



Calculating uncorrected p-values using fisher_scipy_stats
   376 out of    400 study items found in association
Running multitest correction: statsmodels fdr_bh
16,953 GO terms are associated with 376 of 400 study items in a population of 28,212

6. Write results to an Excel file and to a text file



In [9]:

    
goeaobj.wr_xlsx("nbt3102.xlsx", goea_results_sig)
goeaobj.wr_txt("nbt3102.txt", goea_results_sig)









    



    164 items WROTE: nbt3102.xlsx
    164 items WROTE: nbt3102.txt

7. Plot all significant GO terms

Plotting all significant GO terms produces a messy spaghetti plot. Such a plot can be useful sometimes because you can open it and zoom and scroll around. But sometimes it is just too messy to be of use.

The "{NS}" in "nbt3102_{NS}.png" indicates that you will see three plots, one for "biological_process"(BP), "molecular_function"(MF), and "cellular_component"(CC)



In [10]:

    
from goatools.godag_plot import plot_gos, plot_results, plot_goid2goobj

plot_results("nbt3102_{NS}.png", goea_results_sig)









    



  WROTE: nbt3102_CC.png
  WROTE: nbt3102_MF.png
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.591505 to fit

  WROTE: nbt3102_BP.png

7a. These plots are likely to messy

The Cellular Component plot is the smallest plot...

7b. So make a smaller sub-plot

This plot contains GOEA results:

GO terms colored by P-value:
- pval < 0.005 (light red)
- pval < 0.01 (light orange)
- pval < 0.05 (yellow)
- pval > 0.05 (grey) Study terms that are not statistically significant
GO terms with study gene counts printed. e.g., "32 genes"



In [11]:

    
# Plot subset starting from these significant GO terms
goid_subset = [
    'GO:0003723', # MF D04 RNA binding (32 genes)
    'GO:0044822', # MF D05 poly(A) RNA binding (86 genes)
    'GO:0003729', # MF D06 mRNA binding (11 genes)
    'GO:0019843', # MF D05 rRNA binding (6 genes)
    'GO:0003746', # MF D06 translation elongation factor activity (5 genes)
]
plot_gos("nbt3102_MF_RNA_genecnt.png", 
    goid_subset, # Source GO ids
    obodag, 
    goea_results=goea_results_all) # Use pvals for coloring









    



  WROTE: nbt3102_MF_RNA_genecnt.png

7c. Add study gene Symbols to plot

e.g., 11 genes: Calr, Eef1a1, Pabpc1



In [12]:

    
plot_gos("nbt3102_MF_RNA_Symbols.png", 
    goid_subset, # Source GO ids
    obodag,
    goea_results=goea_results_all, # use pvals for coloring
    # We can further configure the plot...
    id2symbol=geneid2symbol, # Print study gene Symbols, not Entrez GeneIDs
    study_items=6, # Only only 6 gene Symbols max on GO terms
    items_p_line=3, # Print 3 genes per line
    )









    



  WROTE: nbt3102_MF_RNA_Symbols.png