Run a GOEA. Print study genes as either IDs symbols

We use data from a 2014 Nature paper:
Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells

Note: you must have the Python package, xlrd, installed to run this example.

1. Download Ontologies and Associations

1a. Download Ontologies, if necessary



In [2]:

    
# Get http://geneontology.org/ontology/go-basic.obo
from goatools.base import download_go_basic_obo
obo_fname = download_go_basic_obo()









    



  EXISTS: go-basic.obo

1b. Download Associations, if necessary



In [3]:

    
# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
from goatools.base import download_ncbi_associations
gene2go = download_ncbi_associations()









    



  EXISTS: gene2go

2. Load Ontologies, Associations and Background gene set

2a. Load Ontologies



In [4]:

    
from goatools.obo_parser import GODag

obodag = GODag("go-basic.obo")









    



load obo file go-basic.obo
go-basic.obo: fmt(1.2) rel(2017-06-30) 48,928 GO Terms

2b. Load Associations



In [5]:

    
from __future__ import print_function
from goatools.associations import read_ncbi_gene2go

geneid2gos_mouse = read_ncbi_gene2go("gene2go", taxids=[10090])

print("{N:,} annotated mouse genes".format(N=len(geneid2gos_mouse)))









    



19,710 annotated mouse genes

2c. Load Background gene set

In this example, the background is all mouse protein-codinge genes



In [6]:

    
from goatools.test_data.genes_NCBI_10090_ProteinCoding import GeneID2nt as GeneID2nt_mus

3. Initialize a GOEA object

The GOEA object holds the Ontologies, Associations, and background.
Numerous studies can then be run withough needing to re-load the above items.
In this case, we only run one GOEA.



In [7]:

    
from goatools.go_enrichment import GOEnrichmentStudy

goeaobj = GOEnrichmentStudy(
        GeneID2nt_mus.keys(), # List of mouse protein-coding genes
        geneid2gos_mouse, # geneid/GO associations
        obodag, # Ontologies
        propagate_counts = False,
        alpha = 0.05, # default significance cut-off
        methods = ['fdr_bh']) # defult multipletest correction method









    



fisher module not installed.  Falling back on scipy.stats.fisher_exact
18,807 out of 28,212 population items found in association

4. Read study genes

~400 genes from the Nature paper supplemental table 4



In [8]:

    
# Data will be stored in this variable
import os
geneid2symbol = {}
# Get xlsx filename where data is stored
ROOT = os.path.dirname(os.getcwd()) # go up 1 level from current working directory
din_xlsx = os.path.join(ROOT, "goatools/test_data/nbt_3102/nbt.3102-S4_GeneIDs.xlsx")
# Read data
if os.path.isfile(din_xlsx):  
    import xlrd
    book = xlrd.open_workbook(din_xlsx)
    pg = book.sheet_by_index(0)
    for r in range(pg.nrows):
        symbol, geneid, pval = [pg.cell_value(r, c) for c in range(pg.ncols)]
        if geneid:
            geneid2symbol[int(geneid)] = symbol

5. Run Gene Ontology Enrichment Analysis (GOEA)

You may choose to keep all results or just the significant results. In this example, we choose to keep only the significant results.



In [9]:

    
# 'p_' means "pvalue". 'fdr_bh' is the multipletest method we are currently using.
geneids_study = geneid2symbol.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]









    



Calculating uncorrected p-values using fisher_scipy_stats
   379 out of    400 study items found in association
Running multitest correction: statsmodels fdr_bh
  2,944 GO terms are associated with 379 of 400 study items
  17,345 GO terms are associated with 18,807 of 28,212 population items

6. Write results to an Excel file and to a text file



In [10]:

    
goeaobj.wr_xlsx("nbt3102_symbols.xlsx", goea_results_sig, itemid2name=geneid2symbol)
goeaobj.wr_xlsx("nbt3102_geneids.xlsx", goea_results_sig)









    



    188 items WROTE: nbt3102_symbols.xlsx
    188 items WROTE: nbt3102_geneids.xlsx