Running a quality control (QC) analysis

Introduction

One of the most important steps in any RNA-Seq analysis is to quality control (QC) your data. By default, DEAGO runs both a QC and a differential expression (DE) analysis. To build only a quick QC report, use the --qc option. This is a great way to get a first look at your data and identify issues such as batch effects and outliers.

The objectives of this part of the tutorial are:

  • run a QC analysis with DEAGO
  • interpret the output QC report from DEAGO

Input files

We will need to give DEAGO two bits of information:

  • the name/location of the directory containing our gene count files (counts)
  • the name/location of our sample/condition mapping file (targets.txt)
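
Before starting a run, it can help to confirm that both inputs are where you expect them to be. The following is a minimal sketch only: check_deago_inputs is a hypothetical helper written for this tutorial, not part of DEAGO, and it checks only that the files exist, not that their contents are valid.

```bash
# Hypothetical helper (not part of DEAGO): basic sanity check of the two inputs.
check_deago_inputs() {
    local counts_dir=$1 targets_file=$2
    if [ ! -d "$counts_dir" ]; then
        echo "counts directory not found: $counts_dir" >&2
        return 1
    fi
    # The counts directory should contain at least one regular file.
    if [ -z "$(find "$counts_dir" -maxdepth 1 -type f | head -n 1)" ]; then
        echo "no count files in: $counts_dir" >&2
        return 1
    fi
    # The targets file should exist and be non-empty.
    if [ ! -s "$targets_file" ]; then
        echo "targets file missing or empty: $targets_file" >&2
        return 1
    fi
    echo "inputs look OK"
}
```

For example, check_deago_inputs counts targets.txt prints "inputs look OK" when both inputs are present, and otherwise reports which one is missing.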

Running a QC analysis with DEAGO

To run a quick QC analysis with DEAGO, the command would be:

deago -c <counts_directory> -t <targets file> --qc

As our count files for this tutorial were generated by featureCounts, we also need to tell DEAGO the count format with the --count_type option:

deago -c <counts_directory> -t <targets file> --count_type featurecounts --qc

QC output files

Once your QC analysis has finished, you should see several new files:

  • deago.config
    config file with key/value parameters defining the analysis
  • deago.rlog
    log of the R output generated when converting the R markdown to HTML
  • deago_markdown.Rmd
    R markdown used to run the analysis
  • deago_markdown.html
    HTML report generated from the R markdown

See our user guide for more information about the output files from DEAGO.
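
If you want to confirm that a run produced everything it should, a small check like the one below can be used. This is a sketch under assumptions: check_deago_outputs is a hypothetical helper (not a DEAGO command), and it looks only for the four files listed above in the directory you give it.

```bash
# Hypothetical helper (not part of DEAGO): report which expected output
# files are present and non-empty in a run directory (default: current).
check_deago_outputs() {
    local run_dir=${1:-.} missing=0 f
    for f in deago.config deago.rlog deago_markdown.Rmd deago_markdown.html; do
        if [ -s "$run_dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status if any file was missing.
    return "$missing"
}
```

Running check_deago_outputs in the analysis directory lists each file as found or missing, and the exit status is non-zero if any are missing.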

QC analysis report

The output file we're interested in is deago_markdown.html, which is your QC analysis report. Go ahead and open it in a web browser (e.g. Chrome, Firefox, IE, Safari...). You can do this by going to "File -> Open" in the top navigation or, if you have Firefox installed, by using the command:


In [ ]:
firefox deago_markdown.html

All DEAGO reports have a sidebar on the left so you can quickly navigate the report.

If you click on Pipeline configuration you will see that the report contains the commands used to run the analysis (grey boxes) and their output. The Pipeline configuration section shows the parameters that were used for the analysis. This can be useful for finding input/output data, troubleshooting and debugging.

To take a look at the QC plots, you can click on QC Plots in the left-hand panel.

First, check to see if there were any problems with the sequencing. The Total read counts per sample plot does what it says on the tin and shows you the number of reads overlapping features in each sample. If one sample has a particularly low count compared to the others, it may indicate an issue and so you should take a closer look at that sample in some of the other QC plots.
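
To cross-check the plot against the raw data, you can total the counts in each file yourself. The sketch below is an approximation, not DEAGO's own code: sum_featurecounts is a hypothetical helper, and it assumes each count file is a single-sample featureCounts table with a comment line starting with #, a header row starting with Geneid, and the counts in the last column.

```bash
# Hypothetical helper (not part of DEAGO): total the counts in one
# featureCounts file. Assumes tab-delimited, counts in the last column.
sum_featurecounts() {
    awk -F'\t' '
        /^#/ { next }             # skip the featureCounts comment line
        $1 == "Geneid" { next }   # skip the column header row
        { total += $NF }          # add the count for this gene
        END { print total + 0 }
    ' "$1"
}

# Example usage, one line per count file:
#   for f in counts/*; do
#       printf "%s\t%s\n" "$(basename "$f")" "$(sum_featurecounts "$f")"
#   done
```

A sample whose total is much lower than the rest is the one to follow up in the other QC plots.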

The principal component analysis (PCA) and sample-to-sample distances plots are indicators of variation and show how well the samples cluster together. They are a quick way of spotting outliers and potential batch effects.

In the PCA plot, samples are coloured by condition and the sample labels are a combination of the sample condition and the replicate number. Here you will see whether your samples cluster reasonably well with distinct groups for each condition: wt_ctrl, wt_il22 and ko_ctrl/ko_il22.

To complement the PCA plot, we also have a scree plot, which is a bar chart of the percentage contribution of each of the principal components (PCs). The points indicate the cumulative total and the lines represent broad cutoffs of 70% and 90%. Ideally, in a two-factor analysis we'd hope to see the cumulative total of PC1 and PC2 reaching 70%, although this may not always be the case.
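
The cumulative total is simple arithmetic, and a worked example may make the cutoffs easier to read. The sketch below uses a hypothetical helper, pcs_to_reach, and made-up percentages (55, 20 and 10); neither comes from DEAGO or from the tutorial data.

```bash
# Hypothetical helper (not part of DEAGO): given a cutoff and the percentage
# contribution of each PC in order, print how many PCs are needed for the
# cumulative total to reach the cutoff.
pcs_to_reach() {
    local threshold=$1
    shift
    printf '%s\n' "$@" | awk -v t="$threshold" '
        { cum += $1 }                                  # running cumulative total
        cum >= t && !found { print NR; found = 1 }     # first PC reaching cutoff
        END { if (!found) print "not reached" }
    '
}

# Made-up contributions for PC1, PC2 and PC3:
pcs_to_reach 70 55 20 10
pcs_to_reach 90 55 20 10
```

With these illustrative values, PC1 and PC2 together give 75%, so the 70% cutoff is reached at PC2, while the 90% cutoff is not reached by the first three components.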

There is also a sample-to-sample distances plot shaded by the distance (or variability) between samples. The darker the box, the more similar the samples (or the smaller the distance between them).

Exercise 3

First, let's make sure we're in the data directory.


In [ ]:
cd data

Each DEAGO analysis should be self-contained, so let's create a new directory for our QC analysis.


In [ ]:
mkdir qc

In [ ]:
cd qc

As this is the first time we're running deago in this tutorial, let's take a look at the usage.


In [ ]:
deago -h

Here's a brief explanation of the options we want to use and what they mean:

  • --build_config
    tells DEAGO that we want to build a new config file using the command line parameters
  • -c
    tells DEAGO the location of the folder containing our count files
  • -t
    tells DEAGO the location of our sample/condition mapping file
  • --count_type
    tells DEAGO the format of the count data (e.g. featurecounts), which sets the values for count_column, skip_lines, gene_ids and count_delim
  • --qc
    tells DEAGO to only run the QC analysis

Now, let's get our QC report.


In [ ]:
deago --build_config -c ../counts -t ../targets.txt \
    --count_type featurecounts --qc

Note: because we are running the analysis from our new qc directory, we need to use ../ to say that our counts directory and targets file are one level above where we are now.

Questions

Q1: Do all the samples have similar total read counts?

Q2: Look at the PCA plot. How many clusters have the samples grouped into?

Q3: Do you notice anything in the PCA and sample-to-sample distances plots that you might want to look closer at?
Hint: look at the groupings in both plots. Are there sub-groupings? Do they relate to anything other than the condition?

Q4: What does the --keep_images option do?
Hint: look at the DEAGO usage with deago -h

What's next?

If you want a recap of input file preparation, head back to preparing input data.

Otherwise, let's continue on to running a differential expression (DE) analysis.