We use data from a 2014 Nature paper:
Computational analysis of cell-to-cell heterogeneity
in single-cell RNA-sequencing data reveals hidden
subpopulations of cells
This iPython notebook demonstrates one approach to explore Gene Ontology Enrichment Analysis (GOEA) results:
The code in this notebook generates the data used in this statement in the GOATOOLS manuscript:
We observed:
93 genes associated with RNA,
47 genes associated with translation,
70 genes associated with mitochondrial or mitochondrian, and
37 genes associated with ribosomal, as reported by GOATOOLS.
Details summarized here are also found in the file, nbt3102_GO_word_genes.txt, which is generated by this iPython notebook.
RNA: 93 study genes, 6 GOs:
0) BP GO:0006364 rRNA processing (8 genes)
1) MF GO:0003723 RNA binding (32 genes)
2) MF GO:0003729 mRNA binding (11 genes)
3) MF GO:0008097 5S rRNA binding (4 genes)
4) MF GO:0019843 rRNA binding (6 genes)
5) MF GO:0044822 poly(A) RNA binding (86 genes)
translation: 47 study genes, 5 GOs:
0) BP GO:0006412 translation (41 genes)
1) BP GO:0006414 translational elongation (7 genes)
2) BP GO:0006417 regulation of translation (9 genes)
3) MF GO:0003746 translation elongation factor activity (5 genes)
4) MF GO:0031369 translation initiation factor binding (4 genes)
mitochond: 70 study genes, 8 GOs:
0) BP GO:0051881 regulation of mitochondrial membrane potential (7 genes)
1) CC GO:0000275 mitochondrial proton-transporting ATP synthase complex, catalytic core F(1) (3 genes)
2) CC GO:0005739 mitochondrion (68 genes)
3) CC GO:0005743 mitochondrial inner membrane (28 genes)
4) CC GO:0005747 mitochondrial respiratory chain complex I (5 genes)
5) CC GO:0005753 mitochondrial proton-transporting ATP synthase complex (4 genes)
6) CC GO:0005758 mitochondrial intermembrane space (7 genes)
7) CC GO:0031966 mitochondrial membrane (6 gene0) CC GO:0005925 focal adhesion (53 genes)
ribosomal: 37 study genes, 6 GOs:
0) BP GO:0000028 ribosomal small subunit assembly (9 genes)
1) BP GO:0042274 ribosomal small subunit biogenesis (6 genes)
2) CC GO:0015934 large ribosomal subunit (4 genes)
3) CC GO:0015935 small ribosomal subunit (13 genes)
4) CC GO:0022625 cytosolic large ribosomal subunit (16 genes)
5) CC GO:0022627 cytosolic small ribosomal subunit (19 genes)
Also seen, but not reported in the manuscript:
ribosome: 41 study genes, 2 GOs:
0) CC GO:0005840 ribosome (35 genes)
1) MF GO:0003735 structural constituent of ribosome (38 genes)
adhesion: 53 study genes, 1 GOs:
0) CC GO:0005925 focal adhesion (53 genes)
endoplasmic: 49 study genes, 3 GOs:
0) CC GO:0005783 endoplasmic reticulum (48 genes)
1) CC GO:0005790 smooth endoplasmic reticulum (5 genes)
2) CC GO:0070971 endoplasmic reticulum exit site (4 genes)
nucleotide: 46 study genes, 1 GOs:
0) MF GO:0000166 nucleotide binding (46 genes)
apoptotic: 42 study genes, 2 GOs:
0) BP GO:0006915 apoptotic process (26 genes)
1) BP GO:0043066 negative regulation of apoptotic process (28 genes)
For this exploration, we choose specific sets of GO terms for each plot based on frequently seen words in the GO term name. Examples of GO term names include "rRNA processing", "poly(A) RNA binding", and "5S rRNA binding". The common word for these GO terms is "RNA".
Steps:
In [13]:
%run goea_nbt3102_fncs.ipynb
goeaobj = get_goeaobj_nbt3102('fdr_bh')
# Read Nature data from Excel file (~400 study genes)
studygeneid2symbol = read_data_nbt3102()
# Run Gene Ontology Enrichment Analysis using Benjamini/Hochberg FDR correction
geneids_study = studygeneid2symbol.keys()
goea_results_all = goeaobj.run_study(geneids_study)
goea_results_sig = [r for r in goea_results_all if r.p_fdr_bh < 0.05]
In [14]:
from __future__ import print_function
go_names = [r.name for r in goea_results_sig]
print(len(go_names)) # Includes ONLY signficant results
In [15]:
import collections as cx
word2cnt = cx.Counter([word for name in go_names for word in name.split()])
In [16]:
# Print 10 most common words found in significant GO term names
print(word2cnt.most_common(10))
In [17]:
freq_seen = ['RNA', 'translation', 'mitochond', 'ribosomal', 'ribosome',
'adhesion', 'endoplasmic', 'nucleotide', 'apoptotic']
In [18]:
# Collect significant GOs for words in freq_seen (unordered)
word2siggos = cx.defaultdict(set)
# Loop through manually curated words of interest
for word in freq_seen:
# Check each significant GOEA result for the word of interest
for rec in goea_results_sig:
if word in rec.name:
word2siggos[word].add(rec.GO)
# Sort word2gos to have the same order as words in freq_seen
word2siggos = cx.OrderedDict([(w, word2siggos[w]) for w in freq_seen])
In [19]:
goid2goobj_all = {nt.GO:nt.goterm for nt in goea_results_all}
print(len(goid2goobj_all))
In [20]:
# Plot set of GOs for each frequently seen word
from goatools.godag_plot import plot_goid2goobj
for word, gos in word2siggos.items():
goid2goobj = {go:goid2goobj_all[go] for go in gos}
plot_goid2goobj(
"nbt3102_word_{WORD}.png".format(WORD=word),
goid2goobj, # source GOs to plot and their GOTerm object
study_items=15, # Max number of gene symbols to print in each GO term
id2symbol=studygeneid2symbol, # Contains GeneID-to-Symbol from Step 1
goea_results=goea_results_all, # pvals used for GO Term coloring
dpi=150)
Colors: Please note that to have colors related to GOEA significance, you must provide the GOEA results, as shown here with the "goea_results=goea_results_all" argument.
Please note that the variable, goea_results_all, contains gene ids and fdr_bh alpha values for all study GO terms, significant or not. If the argument had only included the significant results, "goea_results=goea_results_sig", the currently colored grey GO terms would be white and would not have study genes annotated inside.
Gene Symbol Names
Please notice that the study gene symbol names are written in thier associated GO term box. Symbol names and not gene count nor gene ids are used because of the argument, "id2symbol=studygeneid2symbol", to the function, "plot_goid2goobj".
In [21]:
fout = "nbt3102_GO_word_genes.txt"
go2res = {nt.GO:nt for nt in goea_results_all}
with open(fout, "w") as prt:
prt.write("""This file is generated by test_nbt3102.py and is intended to confirm
this statement in the GOATOOLS manuscript:
We observed:
93 genes associated with RNA,
47 genes associated with translation,
70 genes associated with mitochondrial or mitochondrian, and
37 genes associated with ribosomal, as reported by GOATOOLS.
""")
for word, gos in word2siggos.items():
# Sort first by BP, MF, CC. Sort second by GO id.
gos = sorted(gos, key=lambda go: [go2res[go].NS, go])
genes = set()
for go in gos:
genes |= go2res[go].study_items
genes = sorted([studygeneid2symbol[g] for g in genes])
prt.write("\n{WD}: {N} study genes, {M} GOs\n".format(WD=word, N=len(genes), M=len(gos)))
prt.write("{WD} GOs: {GOs}\n".format(WD=word, GOs=", ".join(gos)))
for i, go in enumerate(gos):
res = go2res[go]
prt.write("{I}) {NS} {GO} {NAME} ({N} genes)\n".format(
I=i, NS=res.NS, GO=go, NAME=res.name, N=res.study_count))
prt.write("{N} study genes:\n".format(N=len(genes)))
N = 10 # Number of genes per line
mult = [genes[i:i+N] for i in range(0, len(genes), N)]
prt.write(" {}\n".format("\n ".join([", ".join(str(g) for g in sl) for sl in mult])))
print(" WROTE: {F}\n".format(F=fout))
Shows the word, RNA, and all significant GO terms which contain RNA and all study genes associated with the RNA GO terms.
...
RNA: 93 study genes, 6 GOs
RNA GOs: GO:0006364, GO:0003723, GO:0003729, GO:0008097, GO:0019843, GO:0044822
0) BP GO:0006364 rRNA processing (8 genes)
1) MF GO:0003723 RNA binding (32 genes)
2) MF GO:0003729 mRNA binding (11 genes)
3) MF GO:0008097 5S rRNA binding (4 genes)
4) MF GO:0019843 rRNA binding (6 genes)
5) MF GO:0044822 poly(A) RNA binding (86 genes)
93 study genes:
Ahnak, Aldoa, Anxa2, Arf1, Atp5a1, Auh, Btf3, Calr, Canx, Coro1a
Csde1, Edf1, Eef1a1, Eef2, Eif3h, Eif4e2, Eif5a, Eno1, Fcf1, Fdps
Gdi2, Gltscr2, Gm5506, Gnb2l1, H2-D1, H2-K1, H2-Q10, H2-Q7, Hdlbp, Hnrnpl
Hsp90ab1, Hspa8, Immt, Lgals1, Mdh2, Mrpl45, Ndufv3, Npm1, P4hb, Pabpc1
Park7, Pdia3, Pfn1, Pkm, Poldip3, Psma6, R3hdm1, Rbm39, Rbm5, Rpl13
Rpl13a, Rpl14, Rpl18, Rpl18a, Rpl23, Rpl24, Rpl34, Rpl4, Rpl6, Rpl7
Rpl7a, Rpl8, Rplp0, Rps10, Rps11, Rps14, Rps15, Rps15a, Rps17, Rps19
Rps2, Rps20, Rps24, Rps25, Rps26, Rps3, Rps4x, Rps5, Rps7, Rps9
Rpsa, Samhd1, Slc25a5, Son, Sptbn1, Srp72, Srpk1, Sub1, Suclg1, Sumo1
Syf2, Ywhae, Zc3hav1
...
Copyright (C) 2016, DV Klopfenstein, H Tang. All rights reserved.