Identifying differentially expressed genes with Sleuth

Introduction

In the previous sections we have quantified our transcript abundance and looked at why counts are normalised. In this section you will be using sleuth to do some simple quality checks and get a first look at the results.

The objectives of this part of the tutorial are:

  • use sleuth to perform quality control checks
  • use sleuth to identify differentially expressed (DE) transcripts
  • use sleuth to investigate DE transcripts

Differential expression analysis (DEA)

Differential expression analysis tries to identify genes whose expression levels differ between experimental conditions. We don’t normally have enough replicates to do traditional tests of significance for RNA-Seq data. So, most methods look for outliers in the relationship between average abundance and fold change and assume most genes are not differentially expressed.

Rather than just using a fold change threshold to determine which genes are differentially expressed, DEAs use a variety of statistical tests for significance. These tests give us a p-value which is an estimate of how often your observations would occur by chance.

However, we perform these comparisons for each one of the thousands of genes/transcripts in our dataset. A p-value of 0.01 estimates a probability of 1% for seeing our observation just by chance. In an experiment like ours with 5,000 genes we would expect 5 genes to be significantly differentially expressed by chance (i.e. even if there were no difference between our conditions). Instead of using a p-value we can use a q-value which accounts for the multiple testing and adjusts the p-value accordingly.

sleuth

sleuth is a companion tool for Kallisto. Unlike most other tools, sleuth can utilize the technical variation information generated by Kallisto so that you can look at both the technical and biological variation in your dataset.

For the DEA, sleuth essentially tests two models, one which assumes that the abundances are equal between the two conditions (reduced) and one that does not (full). To identify DE transcripts it identifies those with a significantly better fit to the “full” model. For more information on sleuth and how it works, see Lior Pachter's blog post A sleuth for RNA-Seq.

sleuth is written in the R statistical programming language, as is almost all RNA-Seq analysis software. Helpfully, it produces a web page that allows interactive graphical analysis of the data. However, we strongly recommend learning R for anyone doing a significant amount of RNA-seq analysis. It is nowhere near as hard to get started with as full-blown programming languages such as Perl or Python!


Exercise 5

For this tutorial, we've provided a series of R commands as an R script that will get sleuth running.

Make sure you are in the data directory with the tutorial files.


In [ ]:
cd data

Running sleuth

The commands we need to run sleuth are in the file sleuth.R. There's a great overview of the commands and what they do by the developers of sleuth here. Using R is not as hard as it seems, most of this script was copied from the manual!

Open sleuth.R and have a quick look at the commands.


In [ ]:
cat sleuth.R

You may also want to have a look at hiseq_info.txt which is where we define which condition each sample is associated with.


In [ ]:
cat hiseq_info.txt

You can run scripts containing R commands using Rscript followed by the script name. Run sleuth.R.


In [ ]:
Rscript sleuth.R

You won't see any output from this script in the notebook, just a * next to the command input ([*]) to let you know it's running.

If you were to run the script directly on the command line, sleuth will return a link which you can follow (http://127.0.0.1:42427). This will take you to a web page where you can navigate and explore the sleuth results.

Click the link below or type the URL your a web browser (e.g. chrome or firefox) to open the sleuth results.

http://127.0.0.1:42427

You should now see a page with the heading "sleuth live". If not, just give the script a little longer and then refresh the page.

Using sleuth to quality check (QC) transcript quanification

Quality control checks are absolutely vital at every step of the experimental process. We can use sleuth to perform simple quality checks (QC) on our dataset.

At the top of the page, sleuth provides several tabs which we can use to determine whether the data is of good quality and whether we should trust the results we get.

First, lets take a look at a summary of our dataset.

In the web page that has been launched, click on "Summaries -> processed data".

Notice that the number of reads mapping differs quite a bit between MT and SBP samples? This is why we QC our data. In the MT samples >95% of the reads mapped to the genome, but only 15-30% are assigned to the transcriptome compared to >75% for the SBP samples. This suggests that there may be some residual ribosomal RNA left over from the RNA preparation. It's not a problem as we have enough reads and replicates for our analysis.

In some cases, we can identify samples which don't agree with other replicates (outliers) and samples which are related by experimental bias (batch effects). If we don’t have many replicates, it's hard to detect outliers and batch effects meaning our power to detect DE genes is reduced.

Principal component analysis (PCA) plots can be used to look at variation and strong patterns within the dataset. Batch effects and outliers often stand out quite clearly in the PCA plot and mean that you can account for them in any downstream analysis.

Our samples form two condition-related clusters with the two MT samples (red) on the left and the three SBP samples on the right (blue). If we look at the variance bar plot, we can see that the first principal component (PC1) accounts for >90% of the variation in our dataset. As the samples are clearly clustered on the x-axis (PC1) this suggests that most of the variation in the dataset is related to our experimental condition (Mt vs SBP).

Using sleuth to look at DE transcripts

We used the output from Kallisto to identify DE transcripts using sleuth. Let's take a look and see if we found any.

To see the results of the sleuth DEA, go to "analyses -> test table".

The important columns here are the q-value and the beta value (analagous to fold change). By default, the table is sorted by the q-value. We can see that our top transcript is PCHAS_0420800, a hypothetical protein/pseudogene. Now let's take a closer look at that transcript.

Go to "analyses -> transcript view". Enter "PCHAS_0420800" into the "transcript" search box. Click "view".

On the left you have the abundances for the MT replicates and on the right, the SBP replicates. We can see that this transcript is more highly expressed in the MT samples than in the SBP samples. This is also reflected by the fold change in the test table (b = -4.5). The b value is negative as it represents the fold change in SBP samples relative to those in the MT samples.

Finally, let's take a look at the gene level.

To see the results of the sleuth DEA, go to "_analyses -> testtable". Under "table type" select "gene table". Click on the column header "qval" in the table to sort the rows by ascending q-value.

The transcripts have now been grouped by their descriptions. Let's take a closer look at the CIR proteins.

Go to "analyses -> gene view". In the "gene" search box enter "CIR protein" (without the quotes).

Here we can see the individual CIR protein transcript abundances. We can see that PCHAS_1100300 is more highly expressed in the SBP samples and PCHAS_0302100 is more highly expressed in the MT samples.


Questions

Q1: Is our gene from earlier, PCHAS_1402500, significantly differentially expressed?


What's next?

You can head back to transcript quantification with Kallisto or continue on to interpreting the results.