Run a GOEA. Print study genes as either IDs symbols

We use data from a 2014 Nature paper:
Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells

Note: you must have the Python package, xlrd, installed to run this example.

1. Download Ontologies and Associations

1a. Download Ontologies, if necessary



In [1]:

    
# Get http://geneontology.org/ontology/go-basic.obo
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()









    



  EXISTS: go-basic.obo

1b. Download Associations, if necessary



In [2]:

    
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
file_gene2go = download_ncbi_associations()









    



  EXISTS: gene2go

2. Load Ontologies, Associations and Background gene set

2a. Load Ontologies



In [3]:

    
from goatools.obo_parser import GODag

obodag = GODag("go-basic.obo")









    



go-basic.obo: fmt(1.2) rel(2020-01-01) 47,337 GO Terms

2b. Load Associations



In [4]:

    
from __future__ import print_function
from goatools.anno.genetogo_reader import Gene2GoReader

# Read NCBI's gene2go. Store annotations in a list of namedtuples
objanno = Gene2GoReader(file_gene2go, taxids=[10090])

# Get associations for each branch of the GO DAG (BP, MF, CC)
ns2assoc = objanno.get_ns2assc()

for nspc, id2gos in ns2assoc.items():
    print("{NS} {N:,} annotated mouse genes".format(NS=nspc, N=len(id2gos)))









    



HMS:0:00:06.334969 367,335 annotations, 24,267 genes, 18,190 GOs, 1 taxids READ: gene2go 
CC 18,824 annotated mouse genes
BP 17,859 annotated mouse genes
MF 16,721 annotated mouse genes

2c. Load Background gene set

In this example, the background is all mouse protein-codinge genes.

Follow the instructions in the background_genes_ncbi notebook to download a set of background population genes from NCBI.



In [5]:

    
from genes_ncbi_10090_proteincoding import GENEID2NT as GeneID2nt_mus

3. Initialize a GOEA object

The GOEA object holds the Ontologies, Associations, and background.
Numerous studies can then be run withough needing to re-load the above items.
In this case, we only run one GOEA.



In [6]:

    
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS

goeaobj = GOEnrichmentStudyNS(
        GeneID2nt_mus, # List of mouse protein-coding genes
        ns2assoc, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # defult multipletest correction method









    



Load BP Gene Ontology Analysis ...
fisher module not installed.  Falling back on scipy.stats.fisher_exact
 65% 16,811 of 26,033 population items found in association

Load CC Gene Ontology Analysis ...
fisher module not installed.  Falling back on scipy.stats.fisher_exact
 70% 18,198 of 26,033 population items found in association

Load MF Gene Ontology Analysis ...
fisher module not installed.  Falling back on scipy.stats.fisher_exact
 63% 16,331 of 26,033 population items found in association

4. Read study genes

~400 genes from the Nature paper supplemental table 4



In [7]:

    
# Data will be stored in this variable
import os
geneid2symbol = {}
# Get xlsx filename where data is stored
ROOT = os.path.dirname(os.getcwd()) # go up 1 level from current working directory
din_xlsx = os.path.join(ROOT, "goatools/test_data/nbt_3102/nbt.3102-S4_GeneIDs.xlsx")
# Read data
if os.path.isfile(din_xlsx):  
    import xlrd
    book = xlrd.open_workbook(din_xlsx)
    pg = book.sheet_by_index(0)
    for r in range(pg.nrows):
        symbol, geneid, pval = [pg.cell_value(r, c) for c in range(pg.ncols)]
        if geneid:
            geneid2symbol[int(geneid)] = symbol
    print('READ: {XLSX}'.format(XLSX=din_xlsx))
else:
    raise RuntimeError('CANNOT READ: {XLSX}'.format(XLSX=fin_xlsx))









    



READ: /mnt/c/Users/note2/Data/git/goatools/goatools/test_data/nbt_3102/nbt.3102-S4_GeneIDs.xlsx

5. Run Gene Ontology Enrichment Analysis (GOEA)

You may choose to keep all results or just the significant results. In this example, we choose to keep only the significant results.



In [8]:

    
# 'p_' means "pvalue". 'fdr_bh' is the multipletest method we are currently using.
geneids_study = geneid2symbol.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]









    



Run BP Gene Ontology Analysis: current study set of 400 IDs ... 94%    357 of    381 study items found in association
 95%    381 of    400 study items found in population(26033)
Calculating 12,254 uncorrected p-values using fisher_scipy_stats
  12,254 GO terms are associated with 16,811 of 26,033 population items
   2,085 GO terms are associated with    357 of    400 study items
  METHOD fdr_bh:
      55 GO terms found significant (< 0.05=alpha) ( 53 enriched +   2 purified): statsmodels fdr_bh
     209 study items associated with significant GO IDs (enriched)
       4 study items associated with significant GO IDs (purified)

Run CC Gene Ontology Analysis: current study set of 400 IDs ... 99%    376 of    381 study items found in association
 95%    381 of    400 study items found in population(26033)
Calculating 1,724 uncorrected p-values using fisher_scipy_stats
   1,724 GO terms are associated with 18,198 of 26,033 population items
     449 GO terms are associated with    376 of    400 study items
  METHOD fdr_bh:
      83 GO terms found significant (< 0.05=alpha) ( 83 enriched +   0 purified): statsmodels fdr_bh
     373 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

Run MF Gene Ontology Analysis: current study set of 400 IDs ... 89%    339 of    381 study items found in association
 95%    381 of    400 study items found in population(26033)
Calculating 4,145 uncorrected p-values using fisher_scipy_stats
   4,145 GO terms are associated with 16,331 of 26,033 population items
     580 GO terms are associated with    339 of    400 study items
  METHOD fdr_bh:
      54 GO terms found significant (< 0.05=alpha) ( 52 enriched +   2 purified): statsmodels fdr_bh
     276 study items associated with significant GO IDs (enriched)
       0 study items associated with significant GO IDs (purified)

6. Write results to an Excel file and to a text file



In [9]:

    
goeaobj.wr_xlsx("nbt3102_symbols.xlsx", goea_results_sig, itemid2name=geneid2symbol)
goeaobj.wr_xlsx("nbt3102_geneids.xlsx", goea_results_sig)









    



    192 items WROTE: nbt3102_symbols.xlsx
    192 items WROTE: nbt3102_geneids.xlsx