Running a gene ontology (GO) term enrichment analysis

Introduction

Once you have your list of differentially expressed (DE) genes, the next step is often to ask whether the DE genes associated with a particular biological process or function. To do this, we can perform a gene ontology (GO) term enrichment analysis. A gene ontology (GO) is a controlled vocabulary which describes gene products from in any organism (i.e. species agnostic). These descriptions are broken down into three categories:

  • molecular function (MF) what the gene product does

  • biological process (BP) why the gene product does this

  • cell component (CC) where the gene product acts

DEAGO uses topGO to compare a target gene list (e.g. your DE genes) to the background (all genes) and identifies GO terms which are significantly enriched or over-represented in your gene list.

The objectives of this part of the tutorial are:

  • run a GO analysis with DEAGO
  • interpret the output GO report from DEAGO

Input files

We will need to give DEAGO three bits of information:

  • the name/location of the directory containing our gene count files (counts)
  • the name/location of our sample/condition mapping file (targets.txt)
  • the name/location of our formatted annotation file (ensembl_mm10_deago_formatted.tsv)

Running a GO analysis with DEAGO

To simplest command to run a GO analysis with DEAGO the command would be:

deago -c <counts_directory> -t <targets file> -a <annotation file> --go

To indicate that we want to run a GO analysis, we use the --go option. Notice here that we must provide a formatted annotation file using the -a option as we need to know the GO terms assoicated with each gene for the analysis.

However, as before, our count files were generated by featureCounts for this tutorial, so we need to also tell DEAGO the count format with the --count_type option:

deago -c <counts_directory> -t <targets file> -a <annotation file> --go \
    --count_type featurecounts

We're also using the --control option, as before, which tells DEAGO the condition you want to use as your reference or control, in this case WT_Ctrl.

deago -c <counts_directory> -t <targets file> -a <annotation file> --go\
    --count_type featurecounts --control <control>

Output files

Once your GO analysis has finished, you should see several new files and directories:

  • deago.config
    config file with key/value parameters defining the analysis
  • deago.rlog
    log of the R output generated when converting the R markdown to HTML
  • deago_markdown.Rmd
    R markdown used to run the analysis
  • deago_markdown.html
    HTML report generated from the R markdown
  • results_<timestammp>
    directory containing unfiltered DE analysis results and normalised counts for all genes, one file per contrast

Results directory

DEAGO also writes the GO term enrichment analysis results tables containing the top 30 significantly enriched GO terms to individual files, split by GO term level (MF and BP), in your timestamped results directory. If possible, there will be three GO tables for each GO term level containing the results for analyses using all genes, up-regulated genes only and down-regulated genes

  • [contrast]_BP.tsv
  • [contrast]_BP_up.tsv
  • [contrast]_BP_down.tsv
  • [contrast]_MF.tsv
  • [contrast]_MF_up.tsv
  • [contrast]_MF_down.tsv

GO analysis report

The output file we're interested in is deago_markdown.html which is your GO analysis report. Go ahead and open it in a web browser (e.g. Chrome, Firefox, IE, Safari...). You can do this by going to "File -> Open" in the top navigation or (if you have Firefox installed, use the command:


In [ ]:
firefox deago_markdown.html

You'll already be familiar with the QC plots and Pairwise contrasts sections. Click on Pairwise contrasts and then go to the contrast subsection for wt_il22_vs_ko_il22. You should now see six new subsections, 6.7.2 - 6.7.7, which contain your GO term enrichment analysis results for that contrast.

The first three subsections have the prefix GO term enrichment - BP and contain the results for the Biological Processes (BP). The remaining three subsections have the prefix GO term enrichment - MF and contain the results for the Molecular Functions (MF).

For each GO level (BP or MF) there is a subsection containing the results from GO term enrichment analyses which were run using all DE genes, up-regulated genes only and down-regulated genes only.

Let's take a look at the BP (upregulated genes only) table for the contrast wt_il22_vs_ko_il22 (subsection 6.7.3). All of the results tables are interactive. Despite performing a single-factor analyis, we can still find the top GO description for up-regulated genes inflammatory response in our results table (GO:0006954).

The paper also mentions Prdm1 which is upregulated by IL-22RA1 signalling and is a susceptibility gene in to inflammatory bowel disease. Let's see whether it's associated with any of the enriched GO terms. Type "prdm1" in the top right search box to search the whole table.

We can see that one GO term matches: GO:0031663 (lipopolysaccharide-mediated signaling pathway).

Scrolling to the right, we can see how the search found this match. For each GO term, the DE genes which are associated with that term are also shown in the table.

Exercise 5

First, let's make sure we're in the data directory.


In [ ]:
cd data

Each DEAGO analysis should be self-contained, so let's create a new directory for our GO analysis.


In [ ]:
mkdir go_analysis

In [ ]:
cd go_analysis

Now, let's get our GO report.


In [ ]:
deago --build_config -c ../counts -t ../targets.txt \
    --count_type featurecounts \
    -a ../ensembl_mm10_deago_formatted.tsv \
    --control WT_Ctrl -go

Questions

In Figure 5C (below), the authors have highlighted four genes: Fut2, Sec1, Fut8 and B4galt1 which are associated with glycosylation.

Answer the following question for all four genes:

Q1: Which biological processes is this upregulated gene associated with?

What's next?

If you want a recap of input file preparation, head back to running a differential expression (DE) analysis.

Otherwise, let's continue on to troubleshooting.