Running a differential expression (DE) analysis

Introduction

By default, DEAGO will run both a quality control (QC) and a differential expression (DE) analysis. DE analyses try to identify genes whose expression levels differ between experimental conditions. We don’t normally have enough replicates to do traditional tests of significance for RNA-Seq data. So, most methods look for outliers in the relationship between average abundance and fold change and assume most genes are not differentially expressed.

Rather than just using a fold change threshold to determine which genes are differentially expressed, DEAs use a variety of statistical tests for significance. These tests give us a p-value which is an estimate of how often your observations would occur by chance.

However, we perform these comparisons for each one of the thousands of genes/transcripts in our dataset. A p-value of 0.01 estimates a probability of 1% for seeing our observation just by chance. In an experiment like ours with 5,000 genes we would expect 5 genes to be significantly differentially expressed by chance (i.e. even if there were no difference between our conditions). Instead of using a p-value we can use an adjusted p-value, also known as the q-value, which accounts for the multiple testing and adjusts the p-value accordingly.

The objectives of this part of the tutorial are:

  • run a DE analysis with DEAGO
  • interpret the output DE report from DEAGO

Input files

We will need to give DEAGO two bits of information:

  • the name/location of the directory containing our gene count files (counts)
  • the name/location of our sample/condition mapping file (targets.txt)

We can optionally give DEAGO a formatted annotation file which contains gene names. These are often more recognisable than the unique gene identifiers found in the counts files. To do this, we use the -a option.

Running a DE analysis with DEAGO

To run a quick, DE analysis with DEAGO the command would be:

deago -c <counts_directory> -t <targets file>

As our count files were generated by featureCounts for this tutorial, we need to also tell DEAGO the count format with the --count_type option:

deago -c <counts_directory> -t <targets file> --count_type featurecounts

As we want to have the gene names in our output tables and plots, we need to provide our formatted annotation file using the -a option.

Finally, we will be using the --control option which tells DEAGO the condition you want to use as your reference or control, in this case WT_Ctrl. We use --control to define our reference condition because, by default, R chooses the reference condition based on alphabetical order. It would assume that from our four conditions (KO_Ctrl, KO_IL22, WT_Ctrl and WT_IL22) that KO_Ctrl is our reference condition because it is first alphabetically. The value you use must be in the condition column in your targets file and is case insensitive.

deago -c <counts directory> -t <targets file> --count_type featurecounts \
    -a <annotation file> --control <control>

DEAGO also makes an assumption that you want the FDR cutoff (alpha) to be 0.05 (default). If you are expecting to use a different cutoff in your downstream filtering, use the -q option to define the FDR cutoff (e.g. -q 0.01).

Output files/directories

Once your DE analysis has finished, you should see several new files and directories:

  • deago.config
    config file with key/value parameters defining the analysis
  • deago.rlog
    log of the R output generated when converting the R markdown to HTML
  • deago_markdown.Rmd
    R markdown used to run the analysis
  • deago_markdown.html
    HTML report generated from the R markdown
  • results_<timestammp>
    directory containing unfiltered DE analysis results and normalised counts for all genes, one file per contrast

Results directory

The report tables are limited to genes with an adjusted p-value < 0.01 and a log2 fold change >= 2 or <= -2. However, you are likely to want to explore and filter these results using different thresholds. So, DEAGO also writes the unfiltered results table containing all genes to individual files, one per contrast in your timestamped results directory.

So, for the full results of the contrast between WT and KO cells treated with IL22 you would look at:

results_<timestamp>/wt_il22_vs_ko_il22_q0.05.txt

DE analysis report

The output file we're interested in is deago_markdown.html which is your DE analysis report. Go ahead and open it in a web browser (e.g. Chrome, Firefox, IE, Safari...). You can do this by going to "File -> Open" in the top navigation or (if you have Firefox installed, use the command:


In [ ]:
firefox deago_markdown.html

In addition to the QC sections we saw before, you should now see a new option in the left-hand sidebar called Pairwise contrasts. Click on it and it will take you to your DE analysis results.

First there is a Contrast summary section which contains a summary table showing how many genes are up-regulated or down-regulated in each contrast (comparison between two sample groups). We can see that there were no differentially expressed (DE) genes between the knockout (KO) samples induced with IL22 (ko_il22) and the control knockout samples. However, there were 860 DE genes between wildtype (WT) and knockout (KO) samples induced with IL22, 510 up-regulated in the WT samples compared to the KO samples and 350 down-regulated.

If there are 2-4 contrasts in the analysis, there will also be a Venn diagram showing the overlap/differences in total DE genes between contrasts. We have 6 contrasts in this analysis, so no Venn diagram was generated, but an example would be:

The DE analysis report then has a series of subsections, one per contrast. Each contrast section has an MA plot and a volcano plot. The top 5 up- and down-regulated gene identifiers are labelled on plots. If an annotation with gene symbols was used then the point labels will be the gene symbols and not the gene identifiers.

Each contrast section will also have a DE results table which contains genes with an adjusted p-value < 0.01 and a log2 fold change >= 2 or <= -2. This is to reduce the number of genes in the table so that the HTML report is compact and sharable.

These tables contain the DESeq2 results and are where you will find your adjusted p-values and log2 fold change values. Also, as our gene identifiers are Ensembl identifiers, they have been converted to a link and if you click on one, it will take you to the current Ensembl page for that gene stable ID. In this example, we had include an annotation in the analysis and so the gene symbols are also shown.

All of the tables are interactive and can be searched or filtered. The paper describes the up-regulation of Fut2 by IL-22RA1 signalling, so let's take a look. The search box at the top right searches the whole table, so we can use it to search for any Fut genes.

This gives us two genes: Fut2 and Fut9.

Now, say we wanted to only see the up-regulated Fut genes. We can limit searches and filters to a single column by using the search/filter boxes at the top of each column. Use the selector at the top of the log2FoldChange column to only include values greater than 0 (i.e. drag the left selector).

We are now left with the only up-regulated Fut gene, Fut2.

Exercise 4

First, let's make sure we're in the data directory.


In [5]:
cd data



Each DEAGO analysis should be self-contained, so let's create a new directory for our DE analysis.


In [ ]:
mkdir de_analysis

In [6]:
cd de_analysis



Now, let's get our DE report.


In [ ]:
deago --build_config -c ../counts -t ../targets.txt \
    --count_type featurecounts \
    -a ../ensembl_mm10_deago_formatted.tsv \
    --control WT_Ctrl

Questions

In Figure 5C (below), the authors have highlighted four genes: Fut2, Sec1, Fut8 and B4galt1 which are associated with glycosylation.

We have already found Fut2, answer the following questions for the remaining 3 genes using the contrast results for WT and KO cells treated with IL22.

Q1: What is the gene identifier (geneID)?

Q2: What is the log2 fold change?

Q3: What is the adjusted p-value?

Hint: you may want to use awk to look at columns 36 and 40 in the unfiltered results files for that contrast if the genes are not found in the report tables

What's next?

If you want a recap of input file preparation, head back to running a quality control (QC) analysis.

Otherwise, let's continue on to running a GO term enrichment analysis.