Group Significant GO terms by Frequently Seen Words

We use data from a 2014 Nature paper:
Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells

2016 GOATOOLS Manuscript

This iPython notebook demonstrates one approach to explore Gene Ontology Enrichment Analysis (GOEA) results:

Create sub-plots containing significant GO terms which share a common word, like RNA.
Create detailed reports showing all significant GO terms and all study gene symbols for the common word.

2016 GOATOOLS Manuscript text, summary

The code in this notebook generates the data used in this statement in the GOATOOLS manuscript:

We observed:
    93 genes associated with RNA,
    47 genes associated with translation,
    70 genes associated with mitochondrial or mitochondrian, and
    37 genes associated with ribosomal, as reported by GOATOOLS.

2016 GOATOOLS Manuscript text, details

Details summarized here are also found in the file, nbt3102_GO_word_genes.txt, which is generated by this iPython notebook.

RNA: 93 study genes, 6 GOs:
0) BP GO:0006364 rRNA processing (8 genes)
1) MF GO:0003723 RNA binding (32 genes)
2) MF GO:0003729 mRNA binding (11 genes)
3) MF GO:0008097 5S rRNA binding (4 genes)
4) MF GO:0019843 rRNA binding (6 genes)
5) MF GO:0044822 poly(A) RNA binding (86 genes)
translation: 47 study genes, 5 GOs:
0) BP GO:0006412 translation (41 genes)
1) BP GO:0006414 translational elongation (7 genes)
2) BP GO:0006417 regulation of translation (9 genes)
3) MF GO:0003746 translation elongation factor activity (5 genes)
4) MF GO:0031369 translation initiation factor binding (4 genes)
mitochond: 70 study genes, 8 GOs:
0) BP GO:0051881 regulation of mitochondrial membrane potential (7 genes)
1) CC GO:0000275 mitochondrial proton-transporting ATP synthase complex, catalytic core F(1) (3 genes)
2) CC GO:0005739 mitochondrion (68 genes)
3) CC GO:0005743 mitochondrial inner membrane (28 genes)
4) CC GO:0005747 mitochondrial respiratory chain complex I (5 genes)
5) CC GO:0005753 mitochondrial proton-transporting ATP synthase complex (4 genes)
6) CC GO:0005758 mitochondrial intermembrane space (7 genes)
7) CC GO:0031966 mitochondrial membrane (6 gene0) CC GO:0005925 focal adhesion (53 genes)
ribosomal: 37 study genes, 6 GOs:
0) BP GO:0000028 ribosomal small subunit assembly (9 genes)
1) BP GO:0042274 ribosomal small subunit biogenesis (6 genes)
2) CC GO:0015934 large ribosomal subunit (4 genes)
3) CC GO:0015935 small ribosomal subunit (13 genes)
4) CC GO:0022625 cytosolic large ribosomal subunit (16 genes)
5) CC GO:0022627 cytosolic small ribosomal subunit (19 genes)

Also seen, but not reported in the manuscript:

ribosome: 41 study genes, 2 GOs:
0) CC GO:0005840 ribosome (35 genes)
1) MF GO:0003735 structural constituent of ribosome (38 genes)
adhesion: 53 study genes, 1 GOs:
0) CC GO:0005925 focal adhesion (53 genes)
endoplasmic: 49 study genes, 3 GOs:
0) CC GO:0005783 endoplasmic reticulum (48 genes)
1) CC GO:0005790 smooth endoplasmic reticulum (5 genes)
2) CC GO:0070971 endoplasmic reticulum exit site (4 genes)
nucleotide: 46 study genes, 1 GOs:
0) MF GO:0000166 nucleotide binding (46 genes)
apoptotic: 42 study genes, 2 GOs:
0) BP GO:0006915 apoptotic process (26 genes)
1) BP GO:0043066 negative regulation of apoptotic process (28 genes)

Methodology

For this exploration, we choose specific sets of GO terms for each plot based on frequently seen words in the GO term name. Examples of GO term names include "rRNA processing", "poly(A) RNA binding", and "5S rRNA binding". The common word for these GO terms is "RNA".

Steps:

Run a Gene Ontology Enrichment Analysis.
Count all words in the significant GO term names.
Inspect word-count list from step 2.
Create curated list of words based on frequently seen GO term words.
Get significant GO terms which contain the words of interest.
Plot GO terms seen for each word of interest.
Print a report with full details

1. Run GOEA. Save results.



In [13]:

    
%run goea_nbt3102_fncs.ipynb
goeaobj = get_goeaobj_nbt3102('fdr_bh')
# Read Nature data from Excel file (~400 study genes)
studygeneid2symbol = read_data_nbt3102()
# Run Gene Ontology Enrichment Analysis using Benjamini/Hochberg FDR correction
geneids_study = studygeneid2symbol.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]









    



  EXISTS: go-basic.obo
go-basic.obo: format-version(1.2) data-version(releases/2016-04-27)





    



load obo file go-basic.obo
46518





    



  EXISTS: gene2go
18,437 out of 28,212 population items found in association
Calculating uncorrected p-values using Fisher's exact test
   376 out of    400 study items found in association
Running multitest correction: statsmodels fdr_bh
16,953 GO terms are associated with 376 of 400 study items in a population of 28,212






    



 nodes imported

2. Count all words in the significant GO term names.

2a. Get list of significant GO term names



In [14]:

    
from __future__ import print_function
go_names = [r.name for r in goea_results_sig]
print(len(go_names)) # Includes ONLY signficant results

2b. Get word count in significant GO term names



In [15]:

    
import collections as cx
word2cnt = cx.Counter([word for name in go_names for word in name.split()])

3. Inspect word-count list generated at step 2.

Words like "mitochondrial" can be interesting. Some words will not be interesting, such as "of".



In [16]:

    
# Print 10 most common words found in significant GO term names
print(word2cnt.most_common(10))









    



[('binding', 29), ('of', 25), ('regulation', 18), ('protein', 17), ('cell', 14), ('complex', 13), ('process', 10), ('positive', 9), ('actin', 8), ('activity', 8)]

4. Create curated list of words based on frequently seen GO term words.



In [17]:

    
freq_seen = ['RNA', 'translation', 'mitochond', 'ribosomal', 'ribosome',
             'adhesion', 'endoplasmic', 'nucleotide', 'apoptotic']

5. For each word of interest, create a list of significant GOs whose name contains the word.



In [18]:

    
# Collect significant GOs for words in freq_seen (unordered)
word2siggos = cx.defaultdict(set)
# Loop through manually curated words of interest
for word in freq_seen:
    # Check each significant GOEA result for the word of interest
    for rec in goea_results_sig:
        if word in rec.name:
            word2siggos[word].add(rec.GO)
# Sort word2gos to have the same order as words in freq_seen
word2siggos = cx.OrderedDict([(w, word2siggos[w]) for w in freq_seen])

6. Plot GO terms seen for each word of interest.

6a. Create a convenient goid-to-goobject dictionary



In [19]:

    
goid2goobj_all = {nt.GO:nt.goterm for nt in goea_results_all}
print(len(goid2goobj_all))

6b. Create plots formed by a shared word in the significant GO term's name



In [20]:

    
# Plot set of GOs for each frequently seen word
from goatools.godag_plot import plot_goid2goobj
for word, gos in word2siggos.items():
    goid2goobj = {go:goid2goobj_all[go] for go in gos}
    plot_goid2goobj(
        "nbt3102_word_{WORD}.png".format(WORD=word),
        goid2goobj, # source GOs to plot and their GOTerm object
        study_items=15, # Max number of gene symbols to print in each GO term
        id2symbol=studygeneid2symbol, # Contains GeneID-to-Symbol from Step 1
        goea_results=goea_results_all, # pvals used for GO Term coloring
        dpi=150)









    



  WROTE: nbt3102_word_RNA.png
  WROTE: nbt3102_word_translation.png
  WROTE: nbt3102_word_mitochond.png
  WROTE: nbt3102_word_ribosomal.png
  WROTE: nbt3102_word_ribosome.png
  WROTE: nbt3102_word_adhesion.png
  WROTE: nbt3102_word_endoplasmic.png
  WROTE: nbt3102_word_nucleotide.png
  WROTE: nbt3102_word_apoptotic.png

6c. Example plot for "apoptotic"

Colors: Please note that to have colors related to GOEA significance, you must provide the GOEA results, as shown here with the "goea_results=goea_results_all" argument.

Levels of Statistical Significance:
1. light red => extremely significant fdr_bh values (p<0.005)
2. orange => very significant fdr_bh values (p<0.01)
3. yellow => significant fdr_bh values (p<0.05)
4. grey => study terms which are not statistically significant (p>0.05)
High-level GO terms:
1. Cyan => Level-01 GO terms

Please note that the variable, goea_results_all, contains gene ids and fdr_bh alpha values for all study GO terms, significant or not. If the argument had only included the significant results, "goea_results=goea_results_sig", the currently colored grey GO terms would be white and would not have study genes annotated inside.

Gene Symbol Names
Please notice that the study gene symbol names are written in thier associated GO term box. Symbol names and not gene count nor gene ids are used because of the argument, "id2symbol=studygeneid2symbol", to the function, "plot_goid2goobj".

7. Print a report with full details

7a. Create detailed report



In [21]:

    
fout = "nbt3102_GO_word_genes.txt"
go2res = {nt.GO:nt for nt in goea_results_all}
with open(fout, "w") as prt:
    prt.write("""This file is generated by test_nbt3102.py and is intended to confirm
this statement in the GOATOOLS manuscript:

        We observed:
            93 genes associated with RNA,
            47 genes associated with translation,
            70 genes associated with mitochondrial or mitochondrian, and
            37 genes associated with ribosomal, as reported by GOATOOLS.

""")
    for word, gos in word2siggos.items():
        # Sort first by BP, MF, CC. Sort second by GO id.
        gos = sorted(gos, key=lambda go: [go2res[go].NS, go])
        genes = set()
        for go in gos:
            genes |= go2res[go].study_items
        genes = sorted([studygeneid2symbol[g] for g in genes])
        prt.write("\n{WD}: {N} study genes, {M} GOs\n".format(WD=word, N=len(genes), M=len(gos)))
        prt.write("{WD} GOs: {GOs}\n".format(WD=word, GOs=", ".join(gos)))
        for i, go in enumerate(gos):
            res = go2res[go]
            prt.write("{I}) {NS} {GO} {NAME} ({N} genes)\n".format(
                I=i, NS=res.NS, GO=go, NAME=res.name, N=res.study_count))
        prt.write("{N} study genes:\n".format(N=len(genes)))
        N = 10 # Number of genes per line
        mult = [genes[i:i+N] for i in range(0, len(genes), N)]
        prt.write("  {}\n".format("\n  ".join([", ".join(str(g) for g in sl) for sl in mult])))
    print("  WROTE: {F}\n".format(F=fout))









    



  WROTE: nbt3102_GO_word_genes.txt

7b Snippet of Report

Shows the word, RNA, and all significant GO terms which contain RNA and all study genes associated with the RNA GO terms.

...

RNA: 93 study genes, 6 GOs
RNA GOs: GO:0006364, GO:0003723, GO:0003729, GO:0008097, GO:0019843, GO:0044822
0) BP GO:0006364 rRNA processing (8 genes)
1) MF GO:0003723 RNA binding (32 genes)
2) MF GO:0003729 mRNA binding (11 genes)
3) MF GO:0008097 5S rRNA binding (4 genes)
4) MF GO:0019843 rRNA binding (6 genes)
5) MF GO:0044822 poly(A) RNA binding (86 genes)
93 study genes:
  Ahnak, Aldoa, Anxa2, Arf1, Atp5a1, Auh, Btf3, Calr, Canx, Coro1a
  Csde1, Edf1, Eef1a1, Eef2, Eif3h, Eif4e2, Eif5a, Eno1, Fcf1, Fdps
  Gdi2, Gltscr2, Gm5506, Gnb2l1, H2-D1, H2-K1, H2-Q10, H2-Q7, Hdlbp, Hnrnpl
  Hsp90ab1, Hspa8, Immt, Lgals1, Mdh2, Mrpl45, Ndufv3, Npm1, P4hb, Pabpc1
  Park7, Pdia3, Pfn1, Pkm, Poldip3, Psma6, R3hdm1, Rbm39, Rbm5, Rpl13
  Rpl13a, Rpl14, Rpl18, Rpl18a, Rpl23, Rpl24, Rpl34, Rpl4, Rpl6, Rpl7
  Rpl7a, Rpl8, Rplp0, Rps10, Rps11, Rps14, Rps15, Rps15a, Rps17, Rps19
  Rps2, Rps20, Rps24, Rps25, Rps26, Rps3, Rps4x, Rps5, Rps7, Rps9
  Rpsa, Samhd1, Slc25a5, Son, Sptbn1, Srp72, Srpk1, Sub1, Suclg1, Sumo1
  Syf2, Ywhae, Zc3hav1

...