When your sample data is in the Pathogen Informatics databases, it becomes available to the automated analysis pipelines. After the analysis pipelines have been requested and run, you can use the pf scripts to return the results of each of the automated analysis pipelines.
The RNA-Seq expression pipeline maps your raw sequence reads to a reference and counts the number of reads associated with each gene. We can use pf rnaseq to return the location of the count files that were produced by the RNA-Seq expression pipeline.
In this section of the tutorial we will cover:
- using pf rnaseq to get count files generated by the RNA-Seq expression pipeline
- filtering pf rnaseq results by mapper and reference
- using pf rnaseq to symlink count files generated by the RNA-Seq expression pipeline
- using pf rnaseq to summarise the relationship between lane, sample and file names
- using pf rnaseq to get mapping statistics
First, let's tell the system the location of our tutorial configuration file.
In [ ]:
export PF_CONFIG_FILE=$PWD/data/pathfind.conf
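Before moving on, it can be worth confirming that the variable points at a file that can be read. The following is a minimal sketch rather than part of the pf tooling; the variable names config and config_status are only for illustration.

```shell
# Read the variable without failing if it has not been set yet.
config="${PF_CONFIG_FILE:-}"

# -n checks the variable is non-empty, -r checks the file is readable.
if [ -n "$config" ] && [ -r "$config" ]; then
    config_status="readable"
else
    config_status="missing"
fi

echo "PF_CONFIG_FILE is $config_status: ${config:-<unset>}"
```

If this reports "missing", check that you ran the export command from the tutorial directory, since the path is built from $PWD.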
Let's take a look at the pf rnaseq usage.
In [ ]:
pf rnaseq -h
Now, let's get the RNA-Seq expression pipeline results for lane 8479_4#17.
In [ ]:
pf rnaseq -t lane -i 8479_4#17
By default, this returns the locations of the expression count files which were produced by the RNA-Seq expression pipeline.
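Lane names such as 8479_4#17 contain a # character. In bash, # only starts a comment when it begins a word, so the lane name is passed through unchanged even without quotes. The sketch below does not call pf at all; it only illustrates the shell behaviour, and the variable names are chosen for illustration.

```shell
# '#' in the middle of a word is an ordinary character,
# so the unquoted lane name is passed through intact.
unquoted=$(echo 8479_4#17)

# Quoting is the safer habit, especially when the lane name
# is stored in a variable and reused.
lane='8479_4#17'
quoted=$(echo "$lane")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Both lines should show the full lane name. Quoting becomes important if a value could ever start with # or contain spaces.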
For human and mouse data, featurecounts files produced by featureCounts are also available. You can get these files using the --filetype or -f option.
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -f featurecounts
The mapping pipeline is run before the RNA-Seq expression pipeline. A quick way to get information about which mapper and reference were used by the mapping pipeline is to use the --details or -d option.
Let's get the mapping details for lane 8479_4#17.
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -d
Here we can see that "bwa" and "tophat" were used as the mapper and "Mus_musculus_mm9" and "Mus_musculus_mm10" were used as references.
You can request that the RNA-Seq expression pipeline be run more than once using different mappers or references. To filter the output by mapper, we can use the --mapper or -M option. To filter by reference, we can use the --reference or -R option.
Let's look for RNA-Seq expression pipeline results for lane 8479_4#17 which used the mapper "bwa".
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -M bwa
Let's look for RNA-Seq expression pipeline results for lane 8479_4#17 which used the reference "Mus_musculus_mm9".
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -R "Mus_musculus_mm9"
We can symlink our count files into a directory using the --symlink or -l option.
Let's symlink our count files for lane 8479_4#17 to "my_count_files".
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -l my_count_files
In [ ]:
ls my_count_files
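Because the directory holds symlinks rather than copies, the links stop working if the original files move. The sketch below shows one way to list link targets and count broken links. It builds its own temporary stand-in directory so it can run anywhere; the file name example.expression.csv is hypothetical and is not real pf rnaseq output.

```shell
# Stand-in setup: a temporary "results" file and a directory of symlinks.
workdir=$(mktemp -d)
mkdir "$workdir/results" "$workdir/my_count_files"
echo "placeholder" > "$workdir/results/example.expression.csv"
ln -s "$workdir/results/example.expression.csv" \
      "$workdir/my_count_files/example.expression.csv"

# ls -l shows each link together with the path it points to.
ls -l "$workdir/my_count_files"

# Count links whose target no longer exists ([ -e ] follows the link).
broken=0
for link in "$workdir/my_count_files"/*; do
    [ -L "$link" ] || continue
    [ -e "$link" ] || broken=$((broken + 1))
done
echo "broken links: $broken"

# Remove the stand-in directory again.
rm -rf "$workdir"
```

To check your own links, the same loop can be pointed at my_count_files instead of the temporary directory.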
You may want to know the relationship between the lane name, sample name and file name. We can get this relationship using the --summary or -S option.
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -S
This generates a new file called "8479_4_17.rnaseqfind_summary.tsv" which contains the output from pf info for this lane (Lane, Sample, Supplier_Name, Public_Name, Strain), the name of the symlinked file (Filename) and the full path to that file (File_Path). This may also be useful as a starting point for your DEAGO targets file.
In [ ]:
cat 8479_4_17.rnaseqfind_summary.tsv
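When only a few of the summary columns are of interest, awk can select them by header name instead of by position. This is a sketch under assumptions: the header matches the column names listed above, the file is tab-separated, and every value in the data row is invented for illustration.

```shell
# Hypothetical stand-in for a summary file. Only the header reflects
# the column names described in the tutorial; the row is made up.
summary=$(mktemp)
printf 'Lane\tSample\tSupplier_Name\tPublic_Name\tStrain\tFilename\tFile_Path\n' > "$summary"
printf 'LANE_A\tSAMPLE_A\tNA\tNA\tNA\tlane_a.counts.csv\t/hypothetical/path/lane_a.counts.csv\n' >> "$summary"

# Map header names to column numbers on the first line,
# then print the chosen columns for every remaining line.
selected=$(awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { print $(col["Lane"]), $(col["Sample"]), $(col["Filename"]) }
' "$summary")

echo "$selected"
rm -f "$summary"
```

Looking columns up by name keeps the command working even if the column order differs from what is assumed here.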
We can also get some mapping statistics from our RNA-Seq results using the --stats or -s option.
Let's get some mapping statistics for lane 8479_4#17.
In [ ]:
pf rnaseq -t lane -i 8479_4#17 -s
This generated a new file called "8479_4_17.rnaseqfind_stats.csv" which contains our mapping statistics.
In [ ]:
cat 8479_4_17.rnaseqfind_stats.csv
Notice that there are two rows for lane 8479_4#17. This is because the mapping pipeline was run twice on this lane using different references and mappers.
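Statistics files like this are convenient to explore with awk and sort. The sketch below uses a hypothetical 12-column CSV in which column 2 holds a lane name and column 12 a mapped percentage; the header names and all values are invented, and it assumes no field contains a quoted comma.

```shell
# Hypothetical stand-in for a stats file: only columns 2 and 12 matter
# here, and every value is made up.
stats=$(mktemp)
printf 'c1,Lane,c3,c4,c5,c6,c7,c8,c9,c10,c11,Mapped\n' > "$stats"
printf 'x,LANE_A,x,x,x,x,x,x,x,x,x,90.5\n' >> "$stats"
printf 'x,LANE_B,x,x,x,x,x,x,x,x,x,72.1\n' >> "$stats"

# Skip the header, print columns 2 and 12, sort numerically on the
# second printed field, and keep the first (smallest) row.
lowest=$(awk -F',' 'NR > 1 { print $2, $12 }' "$stats" | sort -k2,2n | head -n 1)

echo "$lowest"
rm -f "$stats"
```

With the invented values above this prints LANE_B 72.1. A real stats file may have a different layout, so check the header line before relying on column numbers.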
In [ ]:
# Enter your answer here
Q2: Which mapper was used with the mapping pipeline for lane 8479_4#18?
Hint: the mapper is in the 3rd column of the details
In [ ]:
# Enter your answer here
Q3: Which reference was used with the mapping pipeline for lane 8479_4#18?
Hint: the reference is in the 2nd column of the details
In [ ]:
# Enter your answer here
Q4: What is the location or path of the featurecounts file for lane 8479_4#18?
In [ ]:
# Enter your answer here
Q5: Which of the lanes in run 8479_4 has the lowest percentage of mapped reads?
Hint: you can use awk to print out column 2 (lane name) and column 12 (mapped %) of your run statistics
In [ ]:
# Enter your answer here
In [ ]:
# Enter your answer here
Q6: What is the sample name and symlinked file name associated with lane 8479_4#18?
Hint: you might want to summarise the results
For a quick recap of how to get annotation pipeline results, head back to annotation pipeline results.
Otherwise, let's move on to how to find a reference.