RNA-Seq pipeline results

Introduction

Once your sample data is in the Pathogen Informatics databases, it is available to the automated analysis pipelines. After a pipeline has been requested and run, you can use the pf scripts to retrieve its results.

The RNA-Seq expression pipeline maps your raw sequence reads to a reference and counts the number of reads associated with each gene. We can use pf rnaseq to return the locations of the count files produced by the RNA-Seq expression pipeline.

In this section of the tutorial we will cover:

  • using pf rnaseq to get count files generated by the RNA-Seq expression pipeline
  • filtering pf rnaseq results by mapper and reference
  • using pf rnaseq to symlink count files generated by the RNA-Seq expression pipeline
  • using pf rnaseq to summarise the relationship between lane, sample and file names
  • using pf rnaseq to get mapping statistics

Exercise 10

First, let's tell the system the location of our tutorial configuration file.


In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf

Let's take a look at the pf rnaseq usage.


In [ ]:
pf rnaseq -h

Now, let's get the RNA-Seq expression pipeline results for lane 8479_4#17.


In [ ]:
pf rnaseq -t lane -i 8479_4#17

By default, this returns the locations of the expression count files which were produced by the RNA-Seq expression pipeline.

For human and mouse data, featurecounts files produced by featureCounts are also available. You can get these files using the --filetype or -f option.


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -f featurecounts

The mapping pipeline is run before the RNA-Seq expression pipeline. A quick way to get information about which mapper and reference were used by the mapping pipeline is to use the --details or -d option.

Let's get the mapping details for lane 8479_4#17.


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -d

Here we can see that "bwa" and "tophat" were used as the mapper and "Mus_musculus_mm9" and "Mus_musculus_mm10" were used as references.

You can request that the RNA-Seq expression pipeline be run more than once using different mappers or references. To filter the output by mapper, use the --mapper or -M option; to filter by reference, use the --reference or -R option.

Let's look for RNA-Seq expression pipeline results for lane 8479_4#17 which used the mapper "bwa".


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -M bwa

Let's look for RNA-Seq expression pipeline results for lane 8479_4#17 which used the reference "Mus_musculus_mm9".


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -R "Mus_musculus_mm9"
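If a lane has results from more than one mapper, it can be handy to run the mapper filter once per mapper and compare the output. The following is a minimal sketch: rnaseq_by_mapper is a hypothetical helper name (not part of pf), and it only uses the -t, -i and -M options shown above. When pf is not on your PATH, it prints the command it would have run instead.

```shell
# Hypothetical helper (not part of pf): run one mapper-filtered query
# for each mapper name given after the lane ID.
rnaseq_by_mapper() {
  lane="$1"; shift
  for mapper in "$@"; do
    echo "== mapper: $mapper =="
    if command -v pf >/dev/null 2>&1; then
      # Quote the lane ID so the '#' is always passed through literally
      pf rnaseq -t lane -i "$lane" -M "$mapper"
    else
      echo "pf not found; would run: pf rnaseq -t lane -i $lane -M $mapper"
    fi
  done
}

# The two mappers reported by the details output for this lane
rnaseq_by_mapper '8479_4#17' bwa tophat
```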

We can symlink our count files into a directory using the --symlink or -l option.

Let's symlink our count files for lane 8479_4#17 to "my_count_files".


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -l my_count_files

In [ ]:
ls my_count_files
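Symlinks only point at the original files, so it can be worth checking that each one still resolves. The sketch below assumes the directory name "my_count_files" used above; check_links is a hypothetical helper name, not a pf command.

```shell
# Hypothetical helper: report whether each entry in a directory of
# symlinks resolves to an existing file.
check_links() {
  dir="${1:-my_count_files}"
  for link in "$dir"/*; do
    # Skip the unexpanded glob when the directory is empty or missing
    [ -e "$link" ] || [ -L "$link" ] || continue
    if [ -L "$link" ] && [ ! -e "$link" ]; then
      echo "BROKEN $link"   # symlink whose target no longer exists
    else
      echo "OK $link"
    fi
  done
}

check_links my_count_files
```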

You may want to know the relationship between the lane name, sample name and file name. We can get this relationship using the --summary or -S option.


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -S

This generates a new file called "8479_4_17.rnaseqfind_summary.tsv" which contains the output from pf info for this lane (Lane, Sample, Supplier_Name, Public_Name, Strain), the name of the symlinked file (Filename) and the full path to that file (File_Path). This may also be useful as a starting point for your DEAGO targets file.


In [ ]:
cat 8479_4_17.rnaseqfind_summary.tsv
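Notice that the "#" in the lane name appears as "_" in the summary file name. If you are scripting this step, you could build the expected file name from the lane ID. This is a sketch that assumes the naming pattern seen above (8479_4#17 giving 8479_4_17.rnaseqfind_summary.tsv) holds for other lanes too; the variable names are illustrative.

```shell
# Assumption: the output name is the lane ID with '#' replaced by '_',
# as observed for lane 8479_4#17 in this tutorial.
lane='8479_4#17'
prefix=$(printf '%s' "$lane" | tr '#' '_')
summary_file="${prefix}.rnaseqfind_summary.tsv"
echo "$summary_file"

# Show the summary only if it exists in the current directory
if [ -f "$summary_file" ]; then
  cat "$summary_file"
fi
```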

We can also get some mapping statistics from our RNA-Seq results using the --stats or -s option.

Let's get some mapping statistics for lane 8479_4#17.


In [ ]:
pf rnaseq -t lane -i 8479_4#17 -s

This generates a new file called "8479_4_17.rnaseqfind_stats.csv" which contains our mapping statistics.


In [ ]:
cat 8479_4_17.rnaseqfind_stats.csv

Notice that there are two rows for lane 8479_4#17. This is because the mapping pipeline was run twice on this lane using different references and mappers.
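The statistics file has many columns, so it helps to know each column's position before picking columns out with a tool such as awk. The sketch below numbers the header fields of a delimited file. list_columns is a hypothetical helper name, and it assumes a simple comma-separated header with no quoted fields that contain commas.

```shell
# Hypothetical helper: list the column names of a delimited file with
# their 1-based positions.
# Usage: list_columns FILE [DELIMITER]   (delimiter defaults to a comma)
list_columns() {
  awk -F"${2:-,}" 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# Run it on the stats file if it exists in the current directory
if [ -f 8479_4_17.rnaseqfind_stats.csv ]; then
  list_columns 8479_4_17.rnaseqfind_stats.csv
fi
```

The number printed next to each column name is the field number you would use in awk, for example $2 for the second column.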

Questions

Q1: How many count files are returned by default for run 8479_4?


In [ ]:
# Enter your answer here

Q2: Which mapper was used with the mapping pipeline for lane 8479_4#18?
Hint: the mapper is in the 3rd column of the details


In [ ]:
# Enter your answer here

Q3: Which reference was used with the mapping pipeline for lane 8479_4#18?
Hint: the reference is in the 2nd column of the details


In [ ]:
# Enter your answer here

Q4: What is the location or path of the featurecounts file for lane 8479_4#18?


In [ ]:
# Enter your answer here

Q5: Which of the lanes in run 8479_4 has the lowest percentage of mapped reads?
Hint: you can use awk to print out column 2 (lane name) and column 12 (mapped %) of your run statistics


In [ ]:
# Enter your answer here

In [ ]:
# Enter your answer here

Q6: What is the sample name and symlinked file name associated with lane 8479_4#18?
Hint: you might want to summarise the results

What's next?

For a quick recap of how to get annotation pipeline results, head back to annotation pipeline results.

Otherwise, let's move on to how to find a reference.